From makaplan at us.ibm.com Thu Sep 1 00:40:13 2016 From: makaplan at us.ibm.com (Marc A Kaplan) Date: Wed, 31 Aug 2016 19:40:13 -0400 Subject: [gpfsug-discuss] Data Replication In-Reply-To: References: Message-ID: You can leave out the WHERE ... AND POOL_NAME LIKE 'deep' - that is redundant with the FROM POOL 'deep' clause. In fact at a slight additional overhead in mmapplypolicy processing due to begin checked a little later in the game, you can leave out MISC_ATTRIBUTES NOT LIKE '%2%' since the code is smart enough to not operate on files already marked as replicate(2). I believe mmapplypolicy .... -I yes means do any necessary data movement and/or replication "now" Alternatively you can say -I defer, which will leave the files "ill-replicated" and then fix them up with mmrestripefs later. The -I yes vs -I defer choice is the same as for mmchattr. Think of mmapplypolicy as a fast, parallel way to do find ... | xargs mmchattr ... Advert: see also samples/ilm/mmfind -- the latest version should have an -xargs option From: Jan-Frode Myklebust To: gpfsug main discussion list Date: 08/31/2016 04:44 PM Subject: Re: [gpfsug-discuss] Data Replication Sent by: gpfsug-discuss-bounces at spectrumscale.org Assuming your DeepFlash pool is named "deep", something like the following should work: RULE 'deepreplicate' migrate from pool 'deep' to pool 'deep' replicate(2) where MISC_ATTRIBUTES NOT LIKE '%2%' and POOL_NAME LIKE 'deep' "mmapplypolicy gpfs0 -P replicate-policy.pol -I yes" and possibly "mmrestripefs gpfs0 -r" afterwards. -jf On Wed, Aug 31, 2016 at 8:01 PM, Brian Marshall wrote: Daniel, So here's my use case: I have a Sandisk IF150 (branded as DeepFlash recently) with 128TB of flash acting as a "fast tier" storage pool in our HPC scratch file system. Can I set the filesystem replication level to 1 then write a policy engine rule to send small and/or recent files to the IF150 with a replication of 2? Any other comments on the proposed usage strategy are helpful. Thank you, Brian Marshall On Wed, Aug 31, 2016 at 10:32 AM, Daniel Kidger wrote: The other 'Exception' is when a rule is used to convert a 1 way replicated file to 2 way, or when only one failure group is up due to HW problems. It that case the (re-replication) is done by whatever nodes are used for the rule or command-line, which may include an NSD server. Daniel IBM Spectrum Storage Software +44 (0)7818 522266 Sent from my iPad using IBM Verse On 30 Aug 2016, 19:53:31, mimarsh2 at vt.edu wrote: From: mimarsh2 at vt.edu To: gpfsug-discuss at spectrumscale.org Cc: Date: 30 Aug 2016 19:53:31 Subject: Re: [gpfsug-discuss] Data Replication Thanks. This confirms the numbers that I am seeing. Brian On Tue, Aug 30, 2016 at 2:50 PM, Laurence Horrocks-Barlow < laurence at qsplace.co.uk> wrote: Its the client that does all the synchronous replication, this way the cluster is able to scale as the clients do the leg work (so to speak). The somewhat "exception" is if a GPFS NSD server (or client with direct NSD) access uses a server bases protocol such as SMB, in this case the SMB server will do the replication as the SMB client doesn't know about GPFS or its replication; essentially the SMB server is the GPFS client. 
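Putting Marc's simplification together with Jan-Frode's original commands, a minimal sketch of the whole sequence (reusing the same 'gpfs0' file system, 'deep' pool and policy file name that appear above in this thread) would be roughly:

RULE 'deepreplicate'
  MIGRATE FROM POOL 'deep' TO POOL 'deep' REPLICATE(2)

# mmapplypolicy gpfs0 -P replicate-policy.pol -I yes

or, to defer the data movement and fix up the ill-replicated files afterwards:

# mmapplypolicy gpfs0 -P replicate-policy.pol -I defer
# mmrestripefs gpfs0 -r
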
-- Lauz On 30 August 2016 17:03:38 CEST, Bryan Banister wrote: The NSD Client handles the replication and will, as you stated, write one copy to one NSD (using the primary server for this NSD) and one to a different NSD in a different GPFS failure group (using quite likely, but not necessarily, a different NSD server that is the primary server for this alternate NSD). Cheers, -Bryan From: gpfsug-discuss-bounces at spectrumscale.org [mailto: gpfsug-discuss-bounces at spectrumscale.org] On Behalf Of Brian Marshall Sent: Tuesday, August 30, 2016 9:59 AM To: gpfsug main discussion list Subject: [gpfsug-discuss] Data Replication All, If I setup a filesystem to have data replication of 2 (2 copies of data), does the data get replicated at the NSD Server or at the client? i.e. Does the client send 2 copies over the network or does the NSD Server get a single copy and then replicate on storage NSDs? I couldn't find a place in the docs that talked about this specific point. Thank you, Brian Marshall Note: This email is for the confidential use of the named addressee(s) only and may contain proprietary, confidential or privileged information. If you are not the intended recipient, you are hereby notified that any review, dissemination or copying of this email is strictly prohibited, and to please notify the sender immediately and destroy this email and any attachments. Email transmission cannot be guaranteed to be secure or error-free. The Company, therefore, does not make any guarantees as to the completeness or accuracy of this email or any attachments. This email is for informational purposes only and does not constitute a recommendation, offer, request or solicitation of any kind to buy, sell, subscribe, redeem or perform any type of transaction of a financial product. gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss -- Sent from my Android device with K-9 Mail. Please excuse my brevity. _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss Unless stated otherwise above: IBM United Kingdom Limited - Registered in England and Wales with number 741598. Registered office: PO Box 41, North Harbour, Portsmouth, Hampshire PO6 3AU _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss -------------- next part -------------- An HTML attachment was scrubbed... URL: From daniel.kidger at uk.ibm.com Thu Sep 1 11:29:48 2016 From: daniel.kidger at uk.ibm.com (Daniel Kidger) Date: Thu, 1 Sep 2016 10:29:48 +0000 Subject: [gpfsug-discuss] gpfs native raid In-Reply-To: <96282850-6bfa-73ae-8502-9e8df3a56390@nasa.gov> Message-ID: Aaron, GNR is a key differentiator for IBM's (and Lenovo's) Storage hardware appliance. ESS and GSS are otherwise commodity storage arrays connected to commodity NSD servers, albeit with a high degree of tuning and rigorous testing and validation. This competes with equivalent DDN and Seagate appliances as well other non s/w Raid offerings from other IBM partners. 
GNR only works for a small number of disk arrays and then only in certain configurations. GNR then might be thought of as 'firmware' for the hardware rather than part of a software defined products at is Spectrum Scale. If you beleive the viewpoint that hardware Raid 'is dead' then GNR will not be the only s/w Raid that will be used to underly Spectrum Scale. As well as vendor specific offerings from DDN, Seagate, etc. ZFS is likely to be a popular choice but is today not well understood or tested. This will change as more 3rd parties publish their experiences and tuning optimisations, and also as storage solution vendors bidding Spectrum Scale find they can't compete without a software Raid component in their offering. Disclaimer: the above are my own views and not necessarily an IBM official viewpoint. Daniel IBM Spectrum Storage Software +44 (0)7818 522266 Sent from my iPad using IBM Verse On 30 Aug 2016, 18:17:01, aaron.s.knister at nasa.gov wrote: From: aaron.s.knister at nasa.gov To: gpfsug-discuss at spectrumscale.org Cc: Date: 30 Aug 2016 18:17:01 Subject: Re: [gpfsug-discuss] gpfs native raid Thanks Christopher. I've tried GPFS on zvols a couple times and the write throughput I get is terrible because of the required sync=always parameter. Perhaps a couple of SSD's could help get the number up, though. -Aaron On 8/30/16 12:47 PM, Christopher Maestas wrote: > Interestingly enough, Spectrum Scale can run on zvols. Check out: > > http://files.gpfsug.org/presentations/2016/anl-june/LANL_GPFS_ZFS.pdf > > -cdm > > ------------------------------------------------------------------------ > On Aug 30, 2016, 9:17:05 AM, aaron.s.knister at nasa.gov wrote: > > From: aaron.s.knister at nasa.gov > To: gpfsug-discuss at spectrumscale.org > Cc: > Date: Aug 30, 2016 9:17:05 AM > Subject: [gpfsug-discuss] gpfs native raid > > Does anyone know if/when we might see gpfs native raid opened up for the > masses on non-IBM hardware? It's hard to answer the question of "why > can't GPFS do this? Lustre can" in regards to Lustre's integration with > ZFS and support for RAID on commodity hardware. > -Aaron > -- > Aaron Knister > NASA Center for Climate Simulation (Code 606.2) > Goddard Space Flight Center > (301) 286-2776 > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > -- Aaron Knister NASA Center for Climate Simulation (Code 606.2) Goddard Space Flight Center (301) 286-2776 _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discussUnless stated otherwise above: IBM United Kingdom Limited - Registered in England and Wales with number 741598. Registered office: PO Box 41, North Harbour, Portsmouth, Hampshire PO6 3AU -------------- next part -------------- An HTML attachment was scrubbed... URL: From daniel.kidger at uk.ibm.com Thu Sep 1 12:22:47 2016 From: daniel.kidger at uk.ibm.com (Daniel Kidger) Date: Thu, 1 Sep 2016 11:22:47 +0000 Subject: [gpfsug-discuss] Maximum value for data replication? In-Reply-To: References: , Message-ID: An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... 
Name: Image.1__=0ABB0AB3DFD67DBA8f9e8a93df938 at us.ibm.com.gif Type: image/gif Size: 105 bytes Desc: not available URL: From bauer at cesnet.cz Thu Sep 1 14:30:23 2016 From: bauer at cesnet.cz (Miroslav Bauer) Date: Thu, 1 Sep 2016 15:30:23 +0200 Subject: [gpfsug-discuss] Migration to separate metadata and data disks Message-ID: <7927f34a-28e5-6fc2-a55d-62b2066a08da@cesnet.cz> Hello, I have a GPFS 3.5 filesystem (fs1) and I'm trying to migrate the filesystem metadata from state: -m = 2 (default metadata replicas) - SATA disks (dataAndMetadata, failGroup=1) - SSDs (metadataOnly, failGroup=3) to the desired state: -m = 1 - SATA disks (dataOnly, failGroup=1) - SSDs (metadataOnly, failGroup=3) I have done the following steps in the following order: 1) change SATA disks to dataOnly (stanza file modifies the 'usage' attribute only): # mmchdisk fs1 change -F dataOnly_disks.stanza Attention: Disk parameters were changed. Use the mmrestripefs command with the -r option to relocate data and metadata. Verifying file system configuration information ... mmchdisk: Propagating the cluster configuration data to all affected nodes. This is an asynchronous process. 2) change default metadata replicas number 2->1 # mmchfs fs1 -m 1 3) run mmrestripefs as suggested by output of 1) # mmrestripefs fs1 -r Scanning file system metadata, phase 1 ... Error processing inodes. No space left on device mmrestripefs: Command failed. Examine previous error messages to determine cause. It is, however, still possible to create new files on the filesystem. When I return one of the SATA disks as a dataAndMetadata disk, the mmrestripefs command stops complaining about No space left on device. Both df and mmdf say that there is enough space both for data (SATA) and metadata (SSDs). Does anyone have an idea why is it complaining? Thanks, -- Miroslav Bauer -------------- next part -------------- A non-text attachment was scrubbed... Name: smime.p7s Type: application/pkcs7-signature Size: 3716 bytes Desc: S/MIME Cryptographic Signature URL: From aaron.s.knister at nasa.gov Thu Sep 1 14:36:32 2016 From: aaron.s.knister at nasa.gov (Aaron Knister) Date: Thu, 1 Sep 2016 09:36:32 -0400 Subject: [gpfsug-discuss] Migration to separate metadata and data disks In-Reply-To: <7927f34a-28e5-6fc2-a55d-62b2066a08da@cesnet.cz> References: <7927f34a-28e5-6fc2-a55d-62b2066a08da@cesnet.cz> Message-ID: I must admit, I'm curious as to the reason you're dropping the replication factor from 2 down to 1. There are some serious advantages we've seen to having multiple metadata replicas, as far as error recovery is concerned. Could you paste an output of mmlsdisk for the filesystem? -Aaron On 9/1/16 9:30 AM, Miroslav Bauer wrote: > Hello, > > I have a GPFS 3.5 filesystem (fs1) and I'm trying to migrate the > filesystem metadata from state: > -m = 2 (default metadata replicas) > - SATA disks (dataAndMetadata, failGroup=1) > - SSDs (metadataOnly, failGroup=3) > to the desired state: > -m = 1 > - SATA disks (dataOnly, failGroup=1) > - SSDs (metadataOnly, failGroup=3) > > I have done the following steps in the following order: > 1) change SATA disks to dataOnly (stanza file modifies the 'usage' > attribute only): > # mmchdisk fs1 change -F dataOnly_disks.stanza > Attention: Disk parameters were changed. > Use the mmrestripefs command with the -r option to relocate data and > metadata. > Verifying file system configuration information ... > mmchdisk: Propagating the cluster configuration data to all > affected nodes. This is an asynchronous process. 
> > 2) change default metadata replicas number 2->1 > # mmchfs fs1 -m 1 > > 3) run mmrestripefs as suggested by output of 1) > # mmrestripefs fs1 -r > Scanning file system metadata, phase 1 ... > Error processing inodes. > No space left on device > mmrestripefs: Command failed. Examine previous error messages to > determine cause. > > It is, however, still possible to create new files on the filesystem. > When I return one of the SATA disks as a dataAndMetadata disk, the > mmrestripefs > command stops complaining about No space left on device. Both df and mmdf > say that there is enough space both for data (SATA) and metadata (SSDs). > Does anyone have an idea why is it complaining? > > Thanks, > > -- > Miroslav Bauer > > > > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > -- Aaron Knister NASA Center for Climate Simulation (Code 606.2) Goddard Space Flight Center (301) 286-2776 From aaron.s.knister at nasa.gov Thu Sep 1 14:39:17 2016 From: aaron.s.knister at nasa.gov (Aaron Knister) Date: Thu, 1 Sep 2016 09:39:17 -0400 Subject: [gpfsug-discuss] Migration to separate metadata and data disks In-Reply-To: References: <7927f34a-28e5-6fc2-a55d-62b2066a08da@cesnet.cz> Message-ID: By the way, I suspect the no space on device errors are because GPFS believes for some reason that it is unable to maintain the metadata replication factor of 2 that's likely set on all previously created inodes. On 9/1/16 9:36 AM, Aaron Knister wrote: > I must admit, I'm curious as to the reason you're dropping the > replication factor from 2 down to 1. There are some serious advantages > we've seen to having multiple metadata replicas, as far as error > recovery is concerned. > > Could you paste an output of mmlsdisk for the filesystem? > > -Aaron > > On 9/1/16 9:30 AM, Miroslav Bauer wrote: >> Hello, >> >> I have a GPFS 3.5 filesystem (fs1) and I'm trying to migrate the >> filesystem metadata from state: >> -m = 2 (default metadata replicas) >> - SATA disks (dataAndMetadata, failGroup=1) >> - SSDs (metadataOnly, failGroup=3) >> to the desired state: >> -m = 1 >> - SATA disks (dataOnly, failGroup=1) >> - SSDs (metadataOnly, failGroup=3) >> >> I have done the following steps in the following order: >> 1) change SATA disks to dataOnly (stanza file modifies the 'usage' >> attribute only): >> # mmchdisk fs1 change -F dataOnly_disks.stanza >> Attention: Disk parameters were changed. >> Use the mmrestripefs command with the -r option to relocate data and >> metadata. >> Verifying file system configuration information ... >> mmchdisk: Propagating the cluster configuration data to all >> affected nodes. This is an asynchronous process. >> >> 2) change default metadata replicas number 2->1 >> # mmchfs fs1 -m 1 >> >> 3) run mmrestripefs as suggested by output of 1) >> # mmrestripefs fs1 -r >> Scanning file system metadata, phase 1 ... >> Error processing inodes. >> No space left on device >> mmrestripefs: Command failed. Examine previous error messages to >> determine cause. >> >> It is, however, still possible to create new files on the filesystem. >> When I return one of the SATA disks as a dataAndMetadata disk, the >> mmrestripefs >> command stops complaining about No space left on device. Both df and mmdf >> say that there is enough space both for data (SATA) and metadata (SSDs). >> Does anyone have an idea why is it complaining? 
>> >> Thanks, >> >> -- >> Miroslav Bauer >> >> >> >> >> _______________________________________________ >> gpfsug-discuss mailing list >> gpfsug-discuss at spectrumscale.org >> http://gpfsug.org/mailman/listinfo/gpfsug-discuss >> > -- Aaron Knister NASA Center for Climate Simulation (Code 606.2) Goddard Space Flight Center (301) 286-2776 From jonathan at buzzard.me.uk Thu Sep 1 14:49:11 2016 From: jonathan at buzzard.me.uk (Jonathan Buzzard) Date: Thu, 01 Sep 2016 14:49:11 +0100 Subject: [gpfsug-discuss] Migration to separate metadata and data disks In-Reply-To: References: <7927f34a-28e5-6fc2-a55d-62b2066a08da@cesnet.cz> Message-ID: <1472737751.25479.22.camel@buzzard.phy.strath.ac.uk> On Thu, 2016-09-01 at 09:39 -0400, Aaron Knister wrote: > By the way, I suspect the no space on device errors are because GPFS > believes for some reason that it is unable to maintain the metadata > replication factor of 2 that's likely set on all previously created inodes. > Hazarding a guess, but there is only one SSD NSD, so if all the metadata is going to go on SSD there is no point in replicating. It would also explain why it would believe it can't maintain the metadata replication factor. Though it could just be a simple metadata size is larger than the available SSD size. JAB. -- Jonathan A. Buzzard Email: jonathan (at) buzzard.me.uk Fife, United Kingdom. From makaplan at us.ibm.com Thu Sep 1 14:59:28 2016 From: makaplan at us.ibm.com (Marc A Kaplan) Date: Thu, 1 Sep 2016 09:59:28 -0400 Subject: [gpfsug-discuss] gpfs native raid In-Reply-To: References: <96282850-6bfa-73ae-8502-9e8df3a56390@nasa.gov> Message-ID: I've been told that it is a big leap to go from supporting GSS and ESS to allowing and supporting native raid for customers who may throw together "any" combination of hardware they might choose. In particular the GNR "disk hospital" functions... https://www.ibm.com/support/knowledgecenter/SSFKCN_3.5.0/com.ibm.cluster.gpfs.v3r5.gpfs200.doc/bl1adv_introdiskhospital.htm will be tricky to support on umpteen different vendor boxes -- and keep in mind, those will be from IBM competitors! That said, ESS and GSS show that IBM has some good tech in this area and IBM has shown with the Spectrum Scale product (sans GNR) it can support just about any semi-reasonable hardware configuration and a good slew of OS versions and architectures... Heck I have a demo/test version of GPFS running on a 5 year old Thinkpad laptop.... And we have some GSSs in the lab... Not to mention Power hardware and mainframe System Z (think 360, 370, 290, Z) -------------- next part -------------- An HTML attachment was scrubbed... URL: From aaron.s.knister at nasa.gov Thu Sep 1 15:02:50 2016 From: aaron.s.knister at nasa.gov (Aaron Knister) Date: Thu, 1 Sep 2016 10:02:50 -0400 Subject: [gpfsug-discuss] Migration to separate metadata and data disks In-Reply-To: References: <7927f34a-28e5-6fc2-a55d-62b2066a08da@cesnet.cz> Message-ID: <2ce7334b-28c1-7a14-814a-fbcf99d8049e@nasa.gov> Oh! I think you've already provided the info I was looking for :) I thought that failGroup=3 meant there were 3 failure groups within the SSDs. I suspect that's not at all what you meant and that actually is the failure group of all of those disks. That I think explains what's going on-- there's only one failure group's worth of metadata-capable disks available and as such GPFS can't place the 2nd replica for existing files. 
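A quick way to confirm that layout before changing anything (both commands are already referenced in this thread; 'fs1' is the file system name used above):

# mmlsdisk fs1
(check the 'failure group' and 'holds metadata' columns -- a second metadata replica needs metadata-capable disks in at least two different failure groups)
# mmdf fs1
(check how much free space is left on the metadataOnly disks)
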
Here's what I would suggest: - Create at least 2 failure groups within the SSDs - Put the default metadata replication factor back to 2 - Run a restripefs -R to shuffle files around and restore the metadata replication factor of 2 to any files created while it was set to 1 If you're not interested in replication for metadata then perhaps all you need to do is the mmrestripefs -R. I think that should un-replicate the file from the SATA disks leaving the copy on the SSDs. Hope that helps. -Aaron On 9/1/16 9:39 AM, Aaron Knister wrote: > By the way, I suspect the no space on device errors are because GPFS > believes for some reason that it is unable to maintain the metadata > replication factor of 2 that's likely set on all previously created inodes. > > On 9/1/16 9:36 AM, Aaron Knister wrote: >> I must admit, I'm curious as to the reason you're dropping the >> replication factor from 2 down to 1. There are some serious advantages >> we've seen to having multiple metadata replicas, as far as error >> recovery is concerned. >> >> Could you paste an output of mmlsdisk for the filesystem? >> >> -Aaron >> >> On 9/1/16 9:30 AM, Miroslav Bauer wrote: >>> Hello, >>> >>> I have a GPFS 3.5 filesystem (fs1) and I'm trying to migrate the >>> filesystem metadata from state: >>> -m = 2 (default metadata replicas) >>> - SATA disks (dataAndMetadata, failGroup=1) >>> - SSDs (metadataOnly, failGroup=3) >>> to the desired state: >>> -m = 1 >>> - SATA disks (dataOnly, failGroup=1) >>> - SSDs (metadataOnly, failGroup=3) >>> >>> I have done the following steps in the following order: >>> 1) change SATA disks to dataOnly (stanza file modifies the 'usage' >>> attribute only): >>> # mmchdisk fs1 change -F dataOnly_disks.stanza >>> Attention: Disk parameters were changed. >>> Use the mmrestripefs command with the -r option to relocate data and >>> metadata. >>> Verifying file system configuration information ... >>> mmchdisk: Propagating the cluster configuration data to all >>> affected nodes. This is an asynchronous process. >>> >>> 2) change default metadata replicas number 2->1 >>> # mmchfs fs1 -m 1 >>> >>> 3) run mmrestripefs as suggested by output of 1) >>> # mmrestripefs fs1 -r >>> Scanning file system metadata, phase 1 ... >>> Error processing inodes. >>> No space left on device >>> mmrestripefs: Command failed. Examine previous error messages to >>> determine cause. >>> >>> It is, however, still possible to create new files on the filesystem. >>> When I return one of the SATA disks as a dataAndMetadata disk, the >>> mmrestripefs >>> command stops complaining about No space left on device. Both df and >>> mmdf >>> say that there is enough space both for data (SATA) and metadata (SSDs). >>> Does anyone have an idea why is it complaining? >>> >>> Thanks, >>> >>> -- >>> Miroslav Bauer >>> >>> >>> >>> >>> _______________________________________________ >>> gpfsug-discuss mailing list >>> gpfsug-discuss at spectrumscale.org >>> http://gpfsug.org/mailman/listinfo/gpfsug-discuss >>> >> > -- Aaron Knister NASA Center for Climate Simulation (Code 606.2) Goddard Space Flight Center (301) 286-2776 From makaplan at us.ibm.com Thu Sep 1 15:14:18 2016 From: makaplan at us.ibm.com (Marc A Kaplan) Date: Thu, 1 Sep 2016 10:14:18 -0400 Subject: [gpfsug-discuss] Migration to separate metadata and data disks In-Reply-To: <2ce7334b-28c1-7a14-814a-fbcf99d8049e@nasa.gov> References: <7927f34a-28e5-6fc2-a55d-62b2066a08da@cesnet.cz> <2ce7334b-28c1-7a14-814a-fbcf99d8049e@nasa.gov> Message-ID: I believe the OP left out a step. 
I am not saying this is a good idea, but ... One must change the replication factors marked in each inode for each file... This could be done using an mmapplypolicy rule: RULE 'one' MIGRATE FROM POOL 'yourdatapool' TO POOL 'yourdatapool' REPLICATE(1,1) (repeat rule for each POOL you have) Put that (those) rules in a file and do a "one time" run like mmapplypolicy yourfilesystem -P /path/to/rule -N nodelist-to-do-this-work -g /filesystem/bigtemp -I defer Then try your restripe again. -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: not available Type: image/gif Size: 21994 bytes Desc: not available URL: From bauer at cesnet.cz Thu Sep 1 15:28:36 2016 From: bauer at cesnet.cz (Miroslav Bauer) Date: Thu, 1 Sep 2016 16:28:36 +0200 Subject: [gpfsug-discuss] Migration to separate metadata and data disks In-Reply-To: <2ce7334b-28c1-7a14-814a-fbcf99d8049e@nasa.gov> References: <7927f34a-28e5-6fc2-a55d-62b2066a08da@cesnet.cz> <2ce7334b-28c1-7a14-814a-fbcf99d8049e@nasa.gov> Message-ID: <505ff859-d49a-04cc-bd9d-50f7b2a8df0b@cesnet.cz> Yes, failure group id is exactly what I meant :). Unfortunately, mmrestripefs with -R behaves the same as with -r. I also believed that mmrestripefs -R is the correct tool for fixing the replication settings on inodes (according to manpages), but I will try possible solutions you and Marc suggested and let you know how it went. Thank you, -- Miroslav Bauer On 09/01/2016 04:02 PM, Aaron Knister wrote: > Oh! I think you've already provided the info I was looking for :) I > thought that failGroup=3 meant there were 3 failure groups within the > SSDs. I suspect that's not at all what you meant and that actually is > the failure group of all of those disks. That I think explains what's > going on-- there's only one failure group's worth of metadata-capable > disks available and as such GPFS can't place the 2nd replica for > existing files. > > Here's what I would suggest: > > - Create at least 2 failure groups within the SSDs > - Put the default metadata replication factor back to 2 > - Run a restripefs -R to shuffle files around and restore the metadata > replication factor of 2 to any files created while it was set to 1 > > If you're not interested in replication for metadata then perhaps all > you need to do is the mmrestripefs -R. I think that should > un-replicate the file from the SATA disks leaving the copy on the SSDs. > > Hope that helps. > > -Aaron > > On 9/1/16 9:39 AM, Aaron Knister wrote: >> By the way, I suspect the no space on device errors are because GPFS >> believes for some reason that it is unable to maintain the metadata >> replication factor of 2 that's likely set on all previously created >> inodes. >> >> On 9/1/16 9:36 AM, Aaron Knister wrote: >>> I must admit, I'm curious as to the reason you're dropping the >>> replication factor from 2 down to 1. There are some serious advantages >>> we've seen to having multiple metadata replicas, as far as error >>> recovery is concerned. >>> >>> Could you paste an output of mmlsdisk for the filesystem? 
>>> >>> -Aaron >>> >>> On 9/1/16 9:30 AM, Miroslav Bauer wrote: >>>> Hello, >>>> >>>> I have a GPFS 3.5 filesystem (fs1) and I'm trying to migrate the >>>> filesystem metadata from state: >>>> -m = 2 (default metadata replicas) >>>> - SATA disks (dataAndMetadata, failGroup=1) >>>> - SSDs (metadataOnly, failGroup=3) >>>> to the desired state: >>>> -m = 1 >>>> - SATA disks (dataOnly, failGroup=1) >>>> - SSDs (metadataOnly, failGroup=3) >>>> >>>> I have done the following steps in the following order: >>>> 1) change SATA disks to dataOnly (stanza file modifies the 'usage' >>>> attribute only): >>>> # mmchdisk fs1 change -F dataOnly_disks.stanza >>>> Attention: Disk parameters were changed. >>>> Use the mmrestripefs command with the -r option to relocate data and >>>> metadata. >>>> Verifying file system configuration information ... >>>> mmchdisk: Propagating the cluster configuration data to all >>>> affected nodes. This is an asynchronous process. >>>> >>>> 2) change default metadata replicas number 2->1 >>>> # mmchfs fs1 -m 1 >>>> >>>> 3) run mmrestripefs as suggested by output of 1) >>>> # mmrestripefs fs1 -r >>>> Scanning file system metadata, phase 1 ... >>>> Error processing inodes. >>>> No space left on device >>>> mmrestripefs: Command failed. Examine previous error messages to >>>> determine cause. >>>> >>>> It is, however, still possible to create new files on the filesystem. >>>> When I return one of the SATA disks as a dataAndMetadata disk, the >>>> mmrestripefs >>>> command stops complaining about No space left on device. Both df and >>>> mmdf >>>> say that there is enough space both for data (SATA) and metadata >>>> (SSDs). >>>> Does anyone have an idea why is it complaining? >>>> >>>> Thanks, >>>> >>>> -- >>>> Miroslav Bauer >>>> >>>> >>>> >>>> >>>> _______________________________________________ >>>> gpfsug-discuss mailing list >>>> gpfsug-discuss at spectrumscale.org >>>> http://gpfsug.org/mailman/listinfo/gpfsug-discuss >>>> >>> >> > -------------- next part -------------- A non-text attachment was scrubbed... Name: smime.p7s Type: application/pkcs7-signature Size: 3716 bytes Desc: S/MIME Cryptographic Signature URL: From S.J.Thompson at bham.ac.uk Thu Sep 1 22:06:44 2016 From: S.J.Thompson at bham.ac.uk (Simon Thompson (Research Computing - IT Services)) Date: Thu, 1 Sep 2016 21:06:44 +0000 Subject: [gpfsug-discuss] Maximum value for data replication? In-Reply-To: References: , , Message-ID: I have two protocol node in each of two data centres. So four protocol nodes in the cluster. Plus I also have a quorum vm which is lockstep/ha so guaranteed to survive in one of the data centres should we lose power. The protocol servers being protocol servers don't have access to the fibre channel storage. And we've seen ces do bad things when the storage cluster it is remotely mounting (and the ces root is on) fails/is under load etc. So the four full copies is about guaranteeing there are two full copies in both data centres. And remember this is only for the cesroot, so lock data for the ces ips, the smb registry I think as well. I was hoping that by making the cesroot in the protocol node cluster rather than a fileset on a remote mounted filesysyem, that it would fix the ces weirdness we see as it would become a local gpfs file system. I guess three copies would maybe work. But also in another cluster, we have been thinking about adding NVMe into NSD servers for metadata and system.log and so I can se there are cases there where having higher numbers of copies would be useful. 
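As a very rough sketch of the kind of small, local, heavily replicated cesroot file system being described here (device names, NSD names and the mount path are made up for illustration; note Steve's answer further down that replication tops out at 3 copies rather than 4, and the exact procedure for repointing cesSharedRoot is worth checking against the docs):

%nsd: nsd=ces_nsd1 device=/dev/sdb servers=proto1 usage=dataAndMetadata failureGroup=1 pool=system
%nsd: nsd=ces_nsd2 device=/dev/sdb servers=proto2 usage=dataAndMetadata failureGroup=2 pool=system
%nsd: nsd=ces_nsd3 device=/dev/sdb servers=proto3 usage=dataAndMetadata failureGroup=3 pool=system
(one stanza per protocol node, each node's local SSD in its own failure group)

# mmcrnsd -F ces_nsd.stanza
# mmcrfs cesfs -F ces_nsd.stanza -M 3 -R 3 -m 3 -r 3 -T /gpfs/cesfs

then stop the protocol services, rsync the existing ces root across, and repoint it:

# mmchconfig cesSharedRoot=/gpfs/cesfs/ces
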
Yes I take the point that more copies means more load for the client, but in these cases, we aren't thinking about gpfs as the fastest possible hpc file system, but for other infrastructure purposes, which is one of the ways the product seems to be moving. Simon ________________________________________ From: gpfsug-discuss-bounces at spectrumscale.org [gpfsug-discuss-bounces at spectrumscale.org] on behalf of Daniel Kidger [daniel.kidger at uk.ibm.com] Sent: 01 September 2016 12:22 To: gpfsug-discuss at spectrumscale.org Cc: gpfsug-discuss at spectrumscale.org Subject: Re: [gpfsug-discuss] Maximum value for data replication? Simon, Hi. Can you explain why you would like a full copy of all blocks on all 4 NSD servers ? Is there a particular use case, and hence an interest from product development? Otherwise remember that with 4 NSD servers, with one failure group per (storage rich) NSD server, then all 4 disk arrays will be loaded equally, as new files will get written to any 3 (or 2 or 1) of the 4 failure groups. Also remember that as you add more replication then there is more network load on the gpfs client as it has to perform all the writes itself. Perhaps someone technical can comment on the logic that determines which '3' out of 4 failure groups, a particular block is written to. Daniel [/spectrum_storage-banne] [Spectrum Scale Logo] Dr Daniel Kidger IBM Technical Sales Specialist Software Defined Solution Sales +44-07818 522 266 daniel.kidger at uk.ibm.com ----- Original message ----- From: Steve Duersch Sent by: gpfsug-discuss-bounces at spectrumscale.org To: gpfsug-discuss at spectrumscale.org Cc: Subject: Re: [gpfsug-discuss] Maximum value for data replication? Date: Wed, Aug 31, 2016 1:45 PM >>Is there a maximum value for data replication in Spectrum Scale? The maximum value for replication is 3. Steve Duersch Spectrum Scale RAID 845-433-7902 IBM Poughkeepsie, New York [Inactive hide details for gpfsug-discuss-request---08/30/2016 07:25:24 PM---Send gpfsug-discuss mailing list submissions to gp]gpfsug-discuss-request---08/30/2016 07:25:24 PM---Send gpfsug-discuss mailing list submissions to gpfsug-discuss at spectrumscale.org From: gpfsug-discuss-request at spectrumscale.org To: gpfsug-discuss at spectrumscale.org Date: 08/30/2016 07:25 PM Subject: gpfsug-discuss Digest, Vol 55, Issue 55 Sent by: gpfsug-discuss-bounces at spectrumscale.org ________________________________ Send gpfsug-discuss mailing list submissions to gpfsug-discuss at spectrumscale.org To subscribe or unsubscribe via the World Wide Web, visit http://gpfsug.org/mailman/listinfo/gpfsug-discuss or, via email, send a message with subject or body 'help' to gpfsug-discuss-request at spectrumscale.org You can reach the person managing the list at gpfsug-discuss-owner at spectrumscale.org When replying, please edit your Subject line so it is more specific than "Re: Contents of gpfsug-discuss digest..." Today's Topics: 1. Maximum value for data replication? (Simon Thompson (Research Computing - IT Services)) 2. greetings (Kevin D Johnson) 3. GPFS 3.5.0 on RHEL 6.8 (Lukas Hejtmanek) 4. Re: GPFS 3.5.0 on RHEL 6.8 (Kevin D Johnson) 5. Re: GPFS 3.5.0 on RHEL 6.8 (mark.bergman at uphs.upenn.edu) 6. Re: *New* IBM Spectrum Protect Whitepaper "Petascale Data Protection" (Lukas Hejtmanek) 7. 
Re: *New* IBM Spectrum Protect Whitepaper "Petascale Data Protection" (Sven Oehme) ---------------------------------------------------------------------- Message: 1 Date: Tue, 30 Aug 2016 19:09:05 +0000 From: "Simon Thompson (Research Computing - IT Services)" To: "gpfsug-discuss at spectrumscale.org" Subject: [gpfsug-discuss] Maximum value for data replication? Message-ID: Content-Type: text/plain; charset="us-ascii" Is there a maximum value for data replication in Spectrum Scale? I have a number of nsd servers which have local storage and Id like each node to have a full copy of all the data in the file-system, say this value is 4, can I set replication to 4 for data and metadata and have each server have a full copy? These are protocol nodes and multi cluster mount another file system (yes I know not supported) and the cesroot is in the remote file system. On several occasions where GPFS has wibbled a bit, this has caused issues with ces locks, so I was thinking of moving the cesroot to a local filesysyem which is replicated on the local ssds in the protocol nodes. I.e. Its a generally quiet file system as its only ces cluster config. I assume if I stop protocols, rsync the data and then change to the new ces root, I should be able to get this working? Thanks Simon ------------------------------ Message: 2 Date: Tue, 30 Aug 2016 19:43:39 +0000 From: "Kevin D Johnson" To: gpfsug-discuss at spectrumscale.org Subject: [gpfsug-discuss] greetings Message-ID: Content-Type: text/plain; charset="us-ascii" An HTML attachment was scrubbed... URL: ------------------------------ Message: 3 Date: Tue, 30 Aug 2016 22:39:18 +0200 From: Lukas Hejtmanek To: gpfsug-discuss at spectrumscale.org Subject: [gpfsug-discuss] GPFS 3.5.0 on RHEL 6.8 Message-ID: <20160830203917.qptfgqvlmdbzu6wr at ics.muni.cz> Content-Type: text/plain; charset=iso-8859-2 Hello, does it work for anyone? As of kernel 2.6.32-642, GPFS 3.5.0 (including the latest patch 32) does start but does not mount and file system. The internal mount cmd gets stucked. -- Luk?? Hejtm?nek ------------------------------ Message: 4 Date: Tue, 30 Aug 2016 20:51:39 +0000 From: "Kevin D Johnson" To: gpfsug-discuss at spectrumscale.org Subject: Re: [gpfsug-discuss] GPFS 3.5.0 on RHEL 6.8 Message-ID: Content-Type: text/plain; charset="us-ascii" An HTML attachment was scrubbed... URL: ------------------------------ Message: 5 Date: Tue, 30 Aug 2016 17:07:21 -0400 From: mark.bergman at uphs.upenn.edu To: gpfsug main discussion list Subject: Re: [gpfsug-discuss] GPFS 3.5.0 on RHEL 6.8 Message-ID: <24437-1472591241.445832 at bR6O.TofS.917u> Content-Type: text/plain; charset="UTF-8" In the message dated: Tue, 30 Aug 2016 22:39:18 +0200, The pithy ruminations from Lukas Hejtmanek on <[gpfsug-discuss] GPFS 3.5.0 on RHEL 6.8> were: => Hello, GPFS 3.5.0.[23..3-0] work for me under [CentOS|ScientificLinux] 6.8, but at kernel 2.6.32-573 and lower. I've found kernel bugs in blk_cloned_rq_check_limits() in later kernel revs that caused multipath errors, resulting in GPFS being unable to find all NSDs and mount the filesystem. I am not updating to a newer kernel until I'm certain this is resolved. I opened a bug with CentOS: https://bugs.centos.org/view.php?id=10997 and began an extended discussion with the (RH & SUSE) developers of that chunk of kernel code. I don't know if an upstream bug has been opened by RH, but see: https://patchwork.kernel.org/patch/9140337/ => => does it work for anyone? 
As of kernel 2.6.32-642, GPFS 3.5.0 (including the => latest patch 32) does start but does not mount and file system. The internal => mount cmd gets stucked. => => -- => Luk?? Hejtm?nek -- Mark Bergman voice: 215-746-4061 mark.bergman at uphs.upenn.edu fax: 215-614-0266 http://www.cbica.upenn.edu/ IT Technical Director, Center for Biomedical Image Computing and Analytics Department of Radiology University of Pennsylvania PGP Key: http://www.cbica.upenn.edu/sbia/bergman ------------------------------ Message: 6 Date: Wed, 31 Aug 2016 00:02:50 +0200 From: Lukas Hejtmanek To: gpfsug main discussion list Subject: Re: [gpfsug-discuss] *New* IBM Spectrum Protect Whitepaper "Petascale Data Protection" Message-ID: <20160830220250.yt6r7gvfq7rlvtcs at ics.muni.cz> Content-Type: text/plain; charset=iso-8859-2 Hello, On Mon, Aug 29, 2016 at 09:20:46AM +0200, Frank Kraemer wrote: > Find the paper here: > > https://www.ibm.com/developerworks/community/wikis/home?lang=en#!/wiki/Tivoli%20Storage%20Manager/page/Petascale%20Data%20Protection thank you for the paper, I appreciate it. However, I wonder whether it could be extended a little. As it has the title Petascale Data Protection, I think that in Peta scale, you have to deal with millions (well rather hundreds of millions) of files you store in and this is something where TSM does not scale well. Could you give some hints: On the backup site: mmbackup takes ages for: a) scan (try to scan 500M files even in parallel) b) backup - what if 10 % of files get changed - backup process can be blocked several days as mmbackup cannot run in several instances on the same file system, so you have to wait until one run of mmbackup finishes. How long could it take at petascale? On the restore site: how can I restore e.g. 40 millions of file efficiently? dsmc restore '/path/*' runs into serious troubles after say 20M files (maybe wrong internal structures used), however, scanning 1000 more files takes several minutes resulting the dsmc restore never reaches that 40M files. using filelists the situation is even worse. I run dsmc restore -filelist with a filelist consisting of 2.4M files. Running for *two* days without restoring even a single file. dsmc is consuming 100 % CPU. So any hints addressing these issues with really large number of files would be even more appreciated. -- Luk?? Hejtm?nek ------------------------------ Message: 7 Date: Tue, 30 Aug 2016 16:24:59 -0700 From: Sven Oehme To: gpfsug main discussion list Subject: Re: [gpfsug-discuss] *New* IBM Spectrum Protect Whitepaper "Petascale Data Protection" Message-ID: Content-Type: text/plain; charset="utf-8" so lets start with some simple questions. when you say mmbackup takes ages, what version of gpfs code are you running ? how do you execute the mmbackup command ? exact parameters would be useful . what HW are you using for the metadata disks ? how much capacity (df -h) and how many inodes (df -i) do you have in the filesystem you try to backup ? sven On Tue, Aug 30, 2016 at 3:02 PM, Lukas Hejtmanek wrote: > Hello, > > On Mon, Aug 29, 2016 at 09:20:46AM +0200, Frank Kraemer wrote: > > Find the paper here: > > > > https://www.ibm.com/developerworks/community/wikis/home?lang=en#!/wiki/ > Tivoli%20Storage%20Manager/page/Petascale%20Data%20Protection > > thank you for the paper, I appreciate it. > > However, I wonder whether it could be extended a little. 
As it has the > title > Petascale Data Protection, I think that in Peta scale, you have to deal > with > millions (well rather hundreds of millions) of files you store in and this > is > something where TSM does not scale well. > > Could you give some hints: > > On the backup site: > mmbackup takes ages for: > a) scan (try to scan 500M files even in parallel) > b) backup - what if 10 % of files get changed - backup process can be > blocked > several days as mmbackup cannot run in several instances on the same file > system, so you have to wait until one run of mmbackup finishes. How long > could > it take at petascale? > > On the restore site: > how can I restore e.g. 40 millions of file efficiently? dsmc restore > '/path/*' > runs into serious troubles after say 20M files (maybe wrong internal > structures used), however, scanning 1000 more files takes several minutes > resulting the dsmc restore never reaches that 40M files. > > using filelists the situation is even worse. I run dsmc restore -filelist > with a filelist consisting of 2.4M files. Running for *two* days without > restoring even a single file. dsmc is consuming 100 % CPU. > > So any hints addressing these issues with really large number of files > would > be even more appreciated. > > -- > Luk?? Hejtm?nek > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > -------------- next part -------------- An HTML attachment was scrubbed... URL: ------------------------------ _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss End of gpfsug-discuss Digest, Vol 55, Issue 55 ********************************************** _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss Unless stated otherwise above: IBM United Kingdom Limited - Registered in England and Wales with number 741598. Registered office: PO Box 41, North Harbour, Portsmouth, Hampshire PO6 3AU -------------- next part -------------- A non-text attachment was scrubbed... Name: Image.1__=0ABB0AB3DFD67DBA8f9e8a93df938 at us.ibm.com.gif Type: image/gif Size: 105 bytes Desc: Image.1__=0ABB0AB3DFD67DBA8f9e8a93df938 at us.ibm.com.gif URL: From r.sobey at imperial.ac.uk Fri Sep 2 14:37:26 2016 From: r.sobey at imperial.ac.uk (Sobey, Richard A) Date: Fri, 2 Sep 2016 13:37:26 +0000 Subject: [gpfsug-discuss] CES node responding on system IP address Message-ID: Hi all, *Should* a CES node, 4.2.0 OR 4.2.1, be responding on its system IP address? The nodes in my cluster, seemingly randomly, either give me a list of shares, or prompt me to enter a username and password. For example, Start > Run \\cesnode.fqdn I get a prompt for a username and password. If I add the system IP into my hosts file and call it clustername.fqdn it responds normally i.e. no prompt for username or password. Should I be worried about the inconsistencies here? Richard Sobey Storage Area Network (SAN) Analyst Technical Operations, ICT Imperial College London South Kensington 403, City & Guilds Building London SW7 2AZ Tel: +44 (0)20 7594 6915 Email: r.sobey at imperial.ac.uk http://www.imperial.ac.uk/admin-services/ict/ -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From r.sobey at imperial.ac.uk Fri Sep 2 16:10:59 2016 From: r.sobey at imperial.ac.uk (Sobey, Richard A) Date: Fri, 2 Sep 2016 15:10:59 +0000 Subject: [gpfsug-discuss] Cannot stop SMB... stop NFS first? In-Reply-To: References: Message-ID: I?ve verified the upgrade has fixed this issue so thanks again. However I?ve noticed that stopping SMB doesn?t trigger an IP address failover. I think it should. mmces node suspend (or rebooting, or mmces address move, or etc..) seems to trigger the failover. Richard From: gpfsug-discuss-bounces at spectrumscale.org [mailto:gpfsug-discuss-bounces at spectrumscale.org] On Behalf Of Danny Alexander Calderon Rodriguez Sent: 27 August 2016 13:53 To: gpfsug-discuss at spectrumscale.org Subject: Re: [gpfsug-discuss] Cannot stop SMB... stop NFS first? Hi Richard This is fixed in release 4.2.1, if you cant upgrade now, you can fix this manuallly Just do this. edit file /usr/lpp/mmfs/lib/mmcesmon/SMBService.py Change if authType == 'ad' and not nodeState.nfsStopped: to nfsEnabled = utils.isProtocolEnabled("NFS", self.logger) if authType == 'ad' and not nodeState.nfsStopped and nfsEnabled: You need to stop the gpfs service in each node where you apply the change " after change the lines please use tap key" Enviado desde mi iPhone El 27/08/2016, a las 6:00 a.m., gpfsug-discuss-request at spectrumscale.org escribi?: Send gpfsug-discuss mailing list submissions to gpfsug-discuss at spectrumscale.org To subscribe or unsubscribe via the World Wide Web, visit http://gpfsug.org/mailman/listinfo/gpfsug-discuss or, via email, send a message with subject or body 'help' to gpfsug-discuss-request at spectrumscale.org You can reach the person managing the list at gpfsug-discuss-owner at spectrumscale.org When replying, please edit your Subject line so it is more specific than "Re: Contents of gpfsug-discuss digest..." Today's Topics: 1. Re: Cannot stop SMB... stop NFS first?(Christof Schmitt) 2. Re: CES and mmuserauth command (Christof Schmitt) ---------------------------------------------------------------------- Message: 1 Date: Fri, 26 Aug 2016 12:29:31 -0400 From: "Christof Schmitt" > To: gpfsug main discussion list > Subject: Re: [gpfsug-discuss] Cannot stop SMB... stop NFS first? Message-ID: > Content-Type: text/plain; charset="UTF-8" That would be the case when Active Directory is configured for authentication. In that case the SMB service includes two aspects: One is the actual SMB file server, and the second one is the service for the Active Directory integration. Since NFS depends on authentication and id mapping services, it requires SMB to be running. Regards, Christof Schmitt || IBM || Spectrum Scale Development || Tucson, AZ christof.schmitt at us.ibm.com || +1-520-799-2469 (T/L: 321-2469) From: "Sobey, Richard A" > To: "'gpfsug-discuss at spectrumscale.org'" > Date: 08/26/2016 04:48 AM Subject: [gpfsug-discuss] Cannot stop SMB... stop NFS first? Sent by: gpfsug-discuss-bounces at spectrumscale.org Sorry all, prepare for a deluge of emails like this, hopefully it?ll help other people implementing CES in the future. I?m trying to stop SMB on a node, but getting the following output: [root at cesnode ~]# mmces service stop smb smb: Request denied. Please stop NFS first [root at cesnode ~]# mmces service list Enabled services: SMB SMB is running As you can see there is no way to stop NFS when it?s not running but it seems to be blocking me. It?s happening on all the nodes in the cluster. SS version is 4.2.0 running on a fully up to date RHEL 7.1 server. 
Richard_______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss ------------------------------ Message: 2 Date: Fri, 26 Aug 2016 12:29:31 -0400 From: "Christof Schmitt" > To: gpfsug main discussion list > Subject: Re: [gpfsug-discuss] CES and mmuserauth command Message-ID: > Content-Type: text/plain; charset="ISO-2022-JP" The --user-name option applies to both, AD and LDAP authentication. In the LDAP case, this information is correct. I will try to get some clarification added for the AD case. The same applies to the information shown in "service list". There is a common field that holds the information and the parameter from the initial "service create" is stored there. The meaning is different for AD and LDAP: For LDAP it is the username being used to access the LDAP server, while in the AD case it was only the user initially used until the machine account was created. Regards, Christof Schmitt || IBM || Spectrum Scale Development || Tucson, AZ christof.schmitt at us.ibm.com || +1-520-799-2469 (T/L: 321-2469) From: Jan-Frode Myklebust > To: gpfsug main discussion list > Date: 08/26/2016 05:59 AM Subject: Re: [gpfsug-discuss] CES and mmuserauth command Sent by: gpfsug-discuss-bounces at spectrumscale.org On Fri, Aug 26, 2016 at 1:49 AM, Christof Schmitt < christof.schmitt at us.ibm.com> wrote: When joinging the AD domain, --user-name, --password and --server are only used to initially identify and logon to the AD and to create the machine account for the cluster. Once that is done, that information is no longer used, and e.g. the account from --user-name could be deleted, the password changed or the specified DC could be removed from the domain (as long as other DCs are remaining). That was my initial understanding of the --user-name, but when reading the man-page I get the impression that it's also used to do connect to AD to do user and group lookups: ------------------------------------------------------------------------------------------------------ ??user?name userName Specifies the user name to be used to perform operations against the authentication server. The specified user name must have sufficient permissions to read user and group attributes from the authentication server. ------------------------------------------------------------------------------------------------------- Also it's strange that "mmuserauth service list" would list the USER_NAME if it was only somthing that was used at configuration time..? -jf_______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss ------------------------------ _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss End of gpfsug-discuss Digest, Vol 55, Issue 44 ********************************************** -------------- next part -------------- An HTML attachment was scrubbed... URL: From S.J.Thompson at bham.ac.uk Fri Sep 2 16:15:30 2016 From: S.J.Thompson at bham.ac.uk (Simon Thompson (Research Computing - IT Services)) Date: Fri, 2 Sep 2016 15:15:30 +0000 Subject: [gpfsug-discuss] Cannot stop SMB... stop NFS first? In-Reply-To: References: , Message-ID: Should it? If you were running nfs and smb, would you necessarily want to fail the ip over? 
Simon ________________________________________ From: gpfsug-discuss-bounces at spectrumscale.org [gpfsug-discuss-bounces at spectrumscale.org] on behalf of Sobey, Richard A [r.sobey at imperial.ac.uk] Sent: 02 September 2016 16:10 To: gpfsug main discussion list Subject: Re: [gpfsug-discuss] Cannot stop SMB... stop NFS first? I?ve verified the upgrade has fixed this issue so thanks again. However I?ve noticed that stopping SMB doesn?t trigger an IP address failover. I think it should. mmces node suspend (or rebooting, or mmces address move, or etc..) seems to trigger the failover. Richard From: gpfsug-discuss-bounces at spectrumscale.org [mailto:gpfsug-discuss-bounces at spectrumscale.org] On Behalf Of Danny Alexander Calderon Rodriguez Sent: 27 August 2016 13:53 To: gpfsug-discuss at spectrumscale.org Subject: Re: [gpfsug-discuss] Cannot stop SMB... stop NFS first? Hi Richard This is fixed in release 4.2.1, if you cant upgrade now, you can fix this manuallly Just do this. edit file /usr/lpp/mmfs/lib/mmcesmon/SMBService.py Change if authType == 'ad' and not nodeState.nfsStopped: to nfsEnabled = utils.isProtocolEnabled("NFS", self.logger) if authType == 'ad' and not nodeState.nfsStopped and nfsEnabled: You need to stop the gpfs service in each node where you apply the change " after change the lines please use tap key" Enviado desde mi iPhone El 27/08/2016, a las 6:00 a.m., gpfsug-discuss-request at spectrumscale.org escribi?: Send gpfsug-discuss mailing list submissions to gpfsug-discuss at spectrumscale.org To subscribe or unsubscribe via the World Wide Web, visit http://gpfsug.org/mailman/listinfo/gpfsug-discuss or, via email, send a message with subject or body 'help' to gpfsug-discuss-request at spectrumscale.org You can reach the person managing the list at gpfsug-discuss-owner at spectrumscale.org When replying, please edit your Subject line so it is more specific than "Re: Contents of gpfsug-discuss digest..." Today's Topics: 1. Re: Cannot stop SMB... stop NFS first?(Christof Schmitt) 2. Re: CES and mmuserauth command (Christof Schmitt) ---------------------------------------------------------------------- Message: 1 Date: Fri, 26 Aug 2016 12:29:31 -0400 From: "Christof Schmitt" > To: gpfsug main discussion list > Subject: Re: [gpfsug-discuss] Cannot stop SMB... stop NFS first? Message-ID: > Content-Type: text/plain; charset="UTF-8" That would be the case when Active Directory is configured for authentication. In that case the SMB service includes two aspects: One is the actual SMB file server, and the second one is the service for the Active Directory integration. Since NFS depends on authentication and id mapping services, it requires SMB to be running. Regards, Christof Schmitt || IBM || Spectrum Scale Development || Tucson, AZ christof.schmitt at us.ibm.com || +1-520-799-2469 (T/L: 321-2469) From: "Sobey, Richard A" > To: "'gpfsug-discuss at spectrumscale.org'" > Date: 08/26/2016 04:48 AM Subject: [gpfsug-discuss] Cannot stop SMB... stop NFS first? Sent by: gpfsug-discuss-bounces at spectrumscale.org Sorry all, prepare for a deluge of emails like this, hopefully it?ll help other people implementing CES in the future. I?m trying to stop SMB on a node, but getting the following output: [root at cesnode ~]# mmces service stop smb smb: Request denied. Please stop NFS first [root at cesnode ~]# mmces service list Enabled services: SMB SMB is running As you can see there is no way to stop NFS when it?s not running but it seems to be blocking me. 
It?s happening on all the nodes in the cluster. SS version is 4.2.0 running on a fully up to date RHEL 7.1 server. Richard_______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss ------------------------------ Message: 2 Date: Fri, 26 Aug 2016 12:29:31 -0400 From: "Christof Schmitt" > To: gpfsug main discussion list > Subject: Re: [gpfsug-discuss] CES and mmuserauth command Message-ID: > Content-Type: text/plain; charset="ISO-2022-JP" The --user-name option applies to both, AD and LDAP authentication. In the LDAP case, this information is correct. I will try to get some clarification added for the AD case. The same applies to the information shown in "service list". There is a common field that holds the information and the parameter from the initial "service create" is stored there. The meaning is different for AD and LDAP: For LDAP it is the username being used to access the LDAP server, while in the AD case it was only the user initially used until the machine account was created. Regards, Christof Schmitt || IBM || Spectrum Scale Development || Tucson, AZ christof.schmitt at us.ibm.com || +1-520-799-2469 (T/L: 321-2469) From: Jan-Frode Myklebust > To: gpfsug main discussion list > Date: 08/26/2016 05:59 AM Subject: Re: [gpfsug-discuss] CES and mmuserauth command Sent by: gpfsug-discuss-bounces at spectrumscale.org On Fri, Aug 26, 2016 at 1:49 AM, Christof Schmitt < christof.schmitt at us.ibm.com> wrote: When joinging the AD domain, --user-name, --password and --server are only used to initially identify and logon to the AD and to create the machine account for the cluster. Once that is done, that information is no longer used, and e.g. the account from --user-name could be deleted, the password changed or the specified DC could be removed from the domain (as long as other DCs are remaining). That was my initial understanding of the --user-name, but when reading the man-page I get the impression that it's also used to do connect to AD to do user and group lookups: ------------------------------------------------------------------------------------------------------ ??user?name userName Specifies the user name to be used to perform operations against the authentication server. The specified user name must have sufficient permissions to read user and group attributes from the authentication server. ------------------------------------------------------------------------------------------------------- Also it's strange that "mmuserauth service list" would list the USER_NAME if it was only somthing that was used at configuration time..? -jf_______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss ------------------------------ _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss End of gpfsug-discuss Digest, Vol 55, Issue 44 ********************************************** From r.sobey at imperial.ac.uk Fri Sep 2 16:23:28 2016 From: r.sobey at imperial.ac.uk (Sobey, Richard A) Date: Fri, 2 Sep 2016 15:23:28 +0000 Subject: [gpfsug-discuss] Cannot stop SMB... stop NFS first? 
In-Reply-To: References: , Message-ID: A fair point, but since we're not running NFS, a failure of the only other service [SMB], whether it stops through user input or some other means, should cause the node to go unhealthy (in CTDB parlance) and trigger a failover. That would be my preference. Otoh if you were running NFS and SMB and one of those services crashed, do you still want a node in the cluster that could potentially respond and fails to do so? I guess it's a question for each organisation to answer themselves. -----Original Message----- From: gpfsug-discuss-bounces at spectrumscale.org [mailto:gpfsug-discuss-bounces at spectrumscale.org] On Behalf Of Simon Thompson (Research Computing - IT Services) Sent: 02 September 2016 16:16 To: gpfsug main discussion list Subject: Re: [gpfsug-discuss] Cannot stop SMB... stop NFS first? Should it? If you were running nfs and smb, would you necessarily want to fail the ip over? Simon ________________________________________ From: gpfsug-discuss-bounces at spectrumscale.org [gpfsug-discuss-bounces at spectrumscale.org] on behalf of Sobey, Richard A [r.sobey at imperial.ac.uk] Sent: 02 September 2016 16:10 To: gpfsug main discussion list Subject: Re: [gpfsug-discuss] Cannot stop SMB... stop NFS first? I've verified the upgrade has fixed this issue so thanks again. However I've noticed that stopping SMB doesn't trigger an IP address failover. I think it should. mmces node suspend (or rebooting, or mmces address move, or etc..) seems to trigger the failover. Richard From: gpfsug-discuss-bounces at spectrumscale.org [mailto:gpfsug-discuss-bounces at spectrumscale.org] On Behalf Of Danny Alexander Calderon Rodriguez Sent: 27 August 2016 13:53 To: gpfsug-discuss at spectrumscale.org Subject: Re: [gpfsug-discuss] Cannot stop SMB... stop NFS first? Hi Richard This is fixed in release 4.2.1, if you cant upgrade now, you can fix this manuallly Just do this. edit file /usr/lpp/mmfs/lib/mmcesmon/SMBService.py Change if authType == 'ad' and not nodeState.nfsStopped: to nfsEnabled = utils.isProtocolEnabled("NFS", self.logger) if authType == 'ad' and not nodeState.nfsStopped and nfsEnabled: You need to stop the gpfs service in each node where you apply the change " after change the lines please use tap key" Enviado desde mi iPhone El 27/08/2016, a las 6:00 a.m., gpfsug-discuss-request at spectrumscale.org escribi?: Send gpfsug-discuss mailing list submissions to gpfsug-discuss at spectrumscale.org To subscribe or unsubscribe via the World Wide Web, visit http://gpfsug.org/mailman/listinfo/gpfsug-discuss or, via email, send a message with subject or body 'help' to gpfsug-discuss-request at spectrumscale.org You can reach the person managing the list at gpfsug-discuss-owner at spectrumscale.org When replying, please edit your Subject line so it is more specific than "Re: Contents of gpfsug-discuss digest..." Today's Topics: 1. Re: Cannot stop SMB... stop NFS first?(Christof Schmitt) 2. Re: CES and mmuserauth command (Christof Schmitt) ---------------------------------------------------------------------- Message: 1 Date: Fri, 26 Aug 2016 12:29:31 -0400 From: "Christof Schmitt" > To: gpfsug main discussion list > Subject: Re: [gpfsug-discuss] Cannot stop SMB... stop NFS first? Message-ID: > Content-Type: text/plain; charset="UTF-8" That would be the case when Active Directory is configured for authentication. 
In that case the SMB service includes two aspects: One is the actual SMB file server, and the second one is the service for the Active Directory integration. Since NFS depends on authentication and id mapping services, it requires SMB to be running. Regards, Christof Schmitt || IBM || Spectrum Scale Development || Tucson, AZ christof.schmitt at us.ibm.com || +1-520-799-2469 (T/L: 321-2469) From: "Sobey, Richard A" > To: "'gpfsug-discuss at spectrumscale.org'" > Date: 08/26/2016 04:48 AM Subject: [gpfsug-discuss] Cannot stop SMB... stop NFS first? Sent by: gpfsug-discuss-bounces at spectrumscale.org Sorry all, prepare for a deluge of emails like this, hopefully it?ll help other people implementing CES in the future. I?m trying to stop SMB on a node, but getting the following output: [root at cesnode ~]# mmces service stop smb smb: Request denied. Please stop NFS first [root at cesnode ~]# mmces service list Enabled services: SMB SMB is running As you can see there is no way to stop NFS when it?s not running but it seems to be blocking me. It?s happening on all the nodes in the cluster. SS version is 4.2.0 running on a fully up to date RHEL 7.1 server. Richard_______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss ------------------------------ Message: 2 Date: Fri, 26 Aug 2016 12:29:31 -0400 From: "Christof Schmitt" > To: gpfsug main discussion list > Subject: Re: [gpfsug-discuss] CES and mmuserauth command Message-ID: > Content-Type: text/plain; charset="ISO-2022-JP" The --user-name option applies to both, AD and LDAP authentication. In the LDAP case, this information is correct. I will try to get some clarification added for the AD case. The same applies to the information shown in "service list". There is a common field that holds the information and the parameter from the initial "service create" is stored there. The meaning is different for AD and LDAP: For LDAP it is the username being used to access the LDAP server, while in the AD case it was only the user initially used until the machine account was created. Regards, Christof Schmitt || IBM || Spectrum Scale Development || Tucson, AZ christof.schmitt at us.ibm.com || +1-520-799-2469 (T/L: 321-2469) From: Jan-Frode Myklebust > To: gpfsug main discussion list > Date: 08/26/2016 05:59 AM Subject: Re: [gpfsug-discuss] CES and mmuserauth command Sent by: gpfsug-discuss-bounces at spectrumscale.org On Fri, Aug 26, 2016 at 1:49 AM, Christof Schmitt < christof.schmitt at us.ibm.com> wrote: When joinging the AD domain, --user-name, --password and --server are only used to initially identify and logon to the AD and to create the machine account for the cluster. Once that is done, that information is no longer used, and e.g. the account from --user-name could be deleted, the password changed or the specified DC could be removed from the domain (as long as other DCs are remaining). That was my initial understanding of the --user-name, but when reading the man-page I get the impression that it's also used to do connect to AD to do user and group lookups: ------------------------------------------------------------------------------------------------------ ??user?name userName Specifies the user name to be used to perform operations against the authentication server. The specified user name must have sufficient permissions to read user and group attributes from the authentication server. 
------------------------------------------------------------------------------------------------------- Also it's strange that "mmuserauth service list" would list the USER_NAME if it was only somthing that was used at configuration time..? -jf_______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss ------------------------------ _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss End of gpfsug-discuss Digest, Vol 55, Issue 44 ********************************************** _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss From ulmer at ulmer.org Fri Sep 2 17:02:44 2016 From: ulmer at ulmer.org (Stephen Ulmer) Date: Fri, 2 Sep 2016 12:02:44 -0400 Subject: [gpfsug-discuss] Cannot stop SMB... stop NFS first? In-Reply-To: References: Message-ID: I think that stopping SMB is an explicitly different assertion than suspending the node, et cetera. When you ask the service to stop, it should stop -- not start a game of whack-a-mole. If you wanted to move the service there are other other ways. If it fails, clearly it the IP address should move. Liberty, -- Stephen > On Sep 2, 2016, at 11:23 AM, Sobey, Richard A wrote: > > A fair point, but since we're not running NFS, a failure of the only other service [SMB], whether it stops through user input or some other means, should cause the node to go unhealthy (in CTDB parlance) and trigger a failover. That would be my preference. > > Otoh if you were running NFS and SMB and one of those services crashed, do you still want a node in the cluster that could potentially respond and fails to do so? I guess it's a question for each organisation to answer themselves. > > -----Original Message----- > From: gpfsug-discuss-bounces at spectrumscale.org [mailto:gpfsug-discuss-bounces at spectrumscale.org] On Behalf Of Simon Thompson (Research Computing - IT Services) > Sent: 02 September 2016 16:16 > To: gpfsug main discussion list > Subject: Re: [gpfsug-discuss] Cannot stop SMB... stop NFS first? > > > Should it? > > If you were running nfs and smb, would you necessarily want to fail the ip over? > > Simon > ________________________________________ > From: gpfsug-discuss-bounces at spectrumscale.org [gpfsug-discuss-bounces at spectrumscale.org] on behalf of Sobey, Richard A [r.sobey at imperial.ac.uk] > Sent: 02 September 2016 16:10 > To: gpfsug main discussion list > Subject: Re: [gpfsug-discuss] Cannot stop SMB... stop NFS first? > > I've verified the upgrade has fixed this issue so thanks again. > > However I've noticed that stopping SMB doesn't trigger an IP address failover. I think it should. mmces node suspend (or rebooting, or mmces address move, or etc..) seems to trigger the failover. > > Richard > > From: gpfsug-discuss-bounces at spectrumscale.org [mailto:gpfsug-discuss-bounces at spectrumscale.org] On Behalf Of Danny Alexander Calderon Rodriguez > Sent: 27 August 2016 13:53 > To: gpfsug-discuss at spectrumscale.org > Subject: Re: [gpfsug-discuss] Cannot stop SMB... stop NFS first? > > Hi Richard > > This is fixed in release 4.2.1, if you cant upgrade now, you can fix this manuallly > > > Just do this. 
> > edit file /usr/lpp/mmfs/lib/mmcesmon/SMBService.py > > > > Change > > if authType == 'ad' and not nodeState.nfsStopped: > > to > > > > nfsEnabled = utils.isProtocolEnabled("NFS", self.logger) > if authType == 'ad' and not nodeState.nfsStopped and nfsEnabled: > > > You need to stop the gpfs service in each node where you apply the change > > > " after change the lines please use tap key" > > > > Enviado desde mi iPhone > > El 27/08/2016, a las 6:00 a.m., gpfsug-discuss-request at spectrumscale.org escribi?: > Send gpfsug-discuss mailing list submissions to > gpfsug-discuss at spectrumscale.org > > To subscribe or unsubscribe via the World Wide Web, visit > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > or, via email, send a message with subject or body 'help' to > gpfsug-discuss-request at spectrumscale.org > > You can reach the person managing the list at > gpfsug-discuss-owner at spectrumscale.org > > When replying, please edit your Subject line so it is more specific than "Re: Contents of gpfsug-discuss digest..." > > > Today's Topics: > > 1. Re: Cannot stop SMB... stop NFS first?(Christof Schmitt) > 2. Re: CES and mmuserauth command (Christof Schmitt) > > > ---------------------------------------------------------------------- > > Message: 1 > Date: Fri, 26 Aug 2016 12:29:31 -0400 > From: "Christof Schmitt" > > To: gpfsug main discussion list > > Subject: Re: [gpfsug-discuss] Cannot stop SMB... stop NFS first? > Message-ID: > > > > Content-Type: text/plain; charset="UTF-8" > > That would be the case when Active Directory is configured for authentication. In that case the SMB service includes two aspects: One is the actual SMB file server, and the second one is the service for the Active Directory integration. Since NFS depends on authentication and id mapping services, it requires SMB to be running. > > Regards, > > Christof Schmitt || IBM || Spectrum Scale Development || Tucson, AZ > christof.schmitt at us.ibm.com || +1-520-799-2469 (T/L: 321-2469) > > > > From: "Sobey, Richard A" > > To: "'gpfsug-discuss at spectrumscale.org'" > > > Date: 08/26/2016 04:48 AM > Subject: [gpfsug-discuss] Cannot stop SMB... stop NFS first? > Sent by: gpfsug-discuss-bounces at spectrumscale.org > > > > Sorry all, prepare for a deluge of emails like this, hopefully it?ll help other people implementing CES in the future. > > I?m trying to stop SMB on a node, but getting the following output: > > [root at cesnode ~]# mmces service stop smb > smb: Request denied. Please stop NFS first > > [root at cesnode ~]# mmces service list > Enabled services: SMB > SMB is running > > As you can see there is no way to stop NFS when it?s not running but it seems to be blocking me. It?s happening on all the nodes in the cluster. > > SS version is 4.2.0 running on a fully up to date RHEL 7.1 server. > > Richard_______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > > > > > ------------------------------ > > Message: 2 > Date: Fri, 26 Aug 2016 12:29:31 -0400 > From: "Christof Schmitt" > > To: gpfsug main discussion list > > Subject: Re: [gpfsug-discuss] CES and mmuserauth command > Message-ID: > > > > Content-Type: text/plain; charset="ISO-2022-JP" > > The --user-name option applies to both, AD and LDAP authentication. In the LDAP case, this information is correct. I will try to get some clarification added for the AD case. > > The same applies to the information shown in "service list". 
There is a common field that holds the information and the parameter from the initial "service create" is stored there. The meaning is different for AD and > LDAP: For LDAP it is the username being used to access the LDAP server, while in the AD case it was only the user initially used until the machine account was created. > > Regards, > > Christof Schmitt || IBM || Spectrum Scale Development || Tucson, AZ > christof.schmitt at us.ibm.com || +1-520-799-2469 (T/L: 321-2469) > > > > From: Jan-Frode Myklebust > > To: gpfsug main discussion list > > Date: 08/26/2016 05:59 AM > Subject: Re: [gpfsug-discuss] CES and mmuserauth command > Sent by: gpfsug-discuss-bounces at spectrumscale.org > > > > > On Fri, Aug 26, 2016 at 1:49 AM, Christof Schmitt < christof.schmitt at us.ibm.com> wrote: > > When joinging the AD domain, --user-name, --password and --server are only used to initially identify and logon to the AD and to create the machine account for the cluster. Once that is done, that information is no longer used, and e.g. the account from --user-name could be deleted, the password changed or the specified DC could be removed from the domain (as long as other DCs are remaining). > > > That was my initial understanding of the --user-name, but when reading the man-page I get the impression that it's also used to do connect to AD to do user and group lookups: > > ------------------------------------------------------------------------------------------------------ > ??user?name userName > Specifies the user name to be used to perform operations > against the authentication server. The specified user > name must have sufficient permissions to read user and > group attributes from the authentication server. > ------------------------------------------------------------------------------------------------------- > > Also it's strange that "mmuserauth service list" would list the USER_NAME if it was only somthing that was used at configuration time..? > > > > -jf_______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > > > > > > ------------------------------ > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > > End of gpfsug-discuss Digest, Vol 55, Issue 44 > ********************************************** > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss From laurence at qsplace.co.uk Fri Sep 2 18:54:02 2016 From: laurence at qsplace.co.uk (Laurence Horrors-Barlow) Date: Fri, 2 Sep 2016 19:54:02 +0200 Subject: [gpfsug-discuss] Cannot stop SMB... stop NFS first? In-Reply-To: References: Message-ID: <721250E5-767B-4C44-A9E1-5DD255FD4F7D@qsplace.co.uk> I believe the services auto restart on a crash (or kill), a change I noticed between 4.1.1 and 4.2 hence no IP fail over. Suspending a node to force a fail over is possible the most sensible approach. -- Lauz Sent from my iPad > On 2 Sep 2016, at 18:02, Stephen Ulmer wrote: > > I think that stopping SMB is an explicitly different assertion than suspending the node, et cetera. 
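To make the suspend/move route concrete, a short sketch using the commands already mentioned in this thread. The node names and the address are placeholders, and the option spellings are recalled from the 4.2 mmces documentation rather than tested here:

# See which CES addresses each protocol node currently hosts
mmces address list

# Suspending a node fails its addresses over to the remaining CES nodes
mmces node suspend -N cesnode1
# ...do the maintenance, then bring it back...
mmces node resume -N cesnode1

# Alternatively, move a single address explicitly instead of suspending the whole node
mmces address move --ces-ip 10.0.0.42 --ces-node cesnode2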
When you ask the service to stop, it should stop -- not start a game of whack-a-mole. > > If you wanted to move the service there are other other ways. If it fails, clearly it the IP address should move. > > Liberty, > > -- > Stephen > > > >> On Sep 2, 2016, at 11:23 AM, Sobey, Richard A wrote: >> >> A fair point, but since we're not running NFS, a failure of the only other service [SMB], whether it stops through user input or some other means, should cause the node to go unhealthy (in CTDB parlance) and trigger a failover. That would be my preference. >> >> Otoh if you were running NFS and SMB and one of those services crashed, do you still want a node in the cluster that could potentially respond and fails to do so? I guess it's a question for each organisation to answer themselves. >> >> -----Original Message----- >> From: gpfsug-discuss-bounces at spectrumscale.org [mailto:gpfsug-discuss-bounces at spectrumscale.org] On Behalf Of Simon Thompson (Research Computing - IT Services) >> Sent: 02 September 2016 16:16 >> To: gpfsug main discussion list >> Subject: Re: [gpfsug-discuss] Cannot stop SMB... stop NFS first? >> >> >> Should it? >> >> If you were running nfs and smb, would you necessarily want to fail the ip over? >> >> Simon >> ________________________________________ >> From: gpfsug-discuss-bounces at spectrumscale.org [gpfsug-discuss-bounces at spectrumscale.org] on behalf of Sobey, Richard A [r.sobey at imperial.ac.uk] >> Sent: 02 September 2016 16:10 >> To: gpfsug main discussion list >> Subject: Re: [gpfsug-discuss] Cannot stop SMB... stop NFS first? >> >> I've verified the upgrade has fixed this issue so thanks again. >> >> However I've noticed that stopping SMB doesn't trigger an IP address failover. I think it should. mmces node suspend (or rebooting, or mmces address move, or etc..) seems to trigger the failover. >> >> Richard >> >> From: gpfsug-discuss-bounces at spectrumscale.org [mailto:gpfsug-discuss-bounces at spectrumscale.org] On Behalf Of Danny Alexander Calderon Rodriguez >> Sent: 27 August 2016 13:53 >> To: gpfsug-discuss at spectrumscale.org >> Subject: Re: [gpfsug-discuss] Cannot stop SMB... stop NFS first? >> >> Hi Richard >> >> This is fixed in release 4.2.1, if you cant upgrade now, you can fix this manuallly >> >> >> Just do this. >> >> edit file /usr/lpp/mmfs/lib/mmcesmon/SMBService.py >> >> >> >> Change >> >> if authType == 'ad' and not nodeState.nfsStopped: >> >> to >> >> >> >> nfsEnabled = utils.isProtocolEnabled("NFS", self.logger) >> if authType == 'ad' and not nodeState.nfsStopped and nfsEnabled: >> >> >> You need to stop the gpfs service in each node where you apply the change >> >> >> " after change the lines please use tap key" >> >> >> >> Enviado desde mi iPhone >> >> El 27/08/2016, a las 6:00 a.m., gpfsug-discuss-request at spectrumscale.org escribi?: >> Send gpfsug-discuss mailing list submissions to >> gpfsug-discuss at spectrumscale.org >> >> To subscribe or unsubscribe via the World Wide Web, visit >> http://gpfsug.org/mailman/listinfo/gpfsug-discuss >> or, via email, send a message with subject or body 'help' to >> gpfsug-discuss-request at spectrumscale.org >> >> You can reach the person managing the list at >> gpfsug-discuss-owner at spectrumscale.org >> >> When replying, please edit your Subject line so it is more specific than "Re: Contents of gpfsug-discuss digest..." >> >> >> Today's Topics: >> >> 1. Re: Cannot stop SMB... stop NFS first?(Christof Schmitt) >> 2. 
Re: CES and mmuserauth command (Christof Schmitt) >> >> >> ---------------------------------------------------------------------- >> >> Message: 1 >> Date: Fri, 26 Aug 2016 12:29:31 -0400 >> From: "Christof Schmitt" > >> To: gpfsug main discussion list > >> Subject: Re: [gpfsug-discuss] Cannot stop SMB... stop NFS first? >> Message-ID: >> > >> >> Content-Type: text/plain; charset="UTF-8" >> >> That would be the case when Active Directory is configured for authentication. In that case the SMB service includes two aspects: One is the actual SMB file server, and the second one is the service for the Active Directory integration. Since NFS depends on authentication and id mapping services, it requires SMB to be running. >> >> Regards, >> >> Christof Schmitt || IBM || Spectrum Scale Development || Tucson, AZ >> christof.schmitt at us.ibm.com || +1-520-799-2469 (T/L: 321-2469) >> >> >> >> From: "Sobey, Richard A" > >> To: "'gpfsug-discuss at spectrumscale.org'" >> > >> Date: 08/26/2016 04:48 AM >> Subject: [gpfsug-discuss] Cannot stop SMB... stop NFS first? >> Sent by: gpfsug-discuss-bounces at spectrumscale.org >> >> >> >> Sorry all, prepare for a deluge of emails like this, hopefully it?ll help other people implementing CES in the future. >> >> I?m trying to stop SMB on a node, but getting the following output: >> >> [root at cesnode ~]# mmces service stop smb >> smb: Request denied. Please stop NFS first >> >> [root at cesnode ~]# mmces service list >> Enabled services: SMB >> SMB is running >> >> As you can see there is no way to stop NFS when it?s not running but it seems to be blocking me. It?s happening on all the nodes in the cluster. >> >> SS version is 4.2.0 running on a fully up to date RHEL 7.1 server. >> >> Richard_______________________________________________ >> gpfsug-discuss mailing list >> gpfsug-discuss at spectrumscale.org >> http://gpfsug.org/mailman/listinfo/gpfsug-discuss >> >> >> >> >> >> ------------------------------ >> >> Message: 2 >> Date: Fri, 26 Aug 2016 12:29:31 -0400 >> From: "Christof Schmitt" > >> To: gpfsug main discussion list > >> Subject: Re: [gpfsug-discuss] CES and mmuserauth command >> Message-ID: >> > >> >> Content-Type: text/plain; charset="ISO-2022-JP" >> >> The --user-name option applies to both, AD and LDAP authentication. In the LDAP case, this information is correct. I will try to get some clarification added for the AD case. >> >> The same applies to the information shown in "service list". There is a common field that holds the information and the parameter from the initial "service create" is stored there. The meaning is different for AD and >> LDAP: For LDAP it is the username being used to access the LDAP server, while in the AD case it was only the user initially used until the machine account was created. >> >> Regards, >> >> Christof Schmitt || IBM || Spectrum Scale Development || Tucson, AZ >> christof.schmitt at us.ibm.com || +1-520-799-2469 (T/L: 321-2469) >> >> >> >> From: Jan-Frode Myklebust > >> To: gpfsug main discussion list > >> Date: 08/26/2016 05:59 AM >> Subject: Re: [gpfsug-discuss] CES and mmuserauth command >> Sent by: gpfsug-discuss-bounces at spectrumscale.org >> >> >> >> >> On Fri, Aug 26, 2016 at 1:49 AM, Christof Schmitt < christof.schmitt at us.ibm.com> wrote: >> >> When joinging the AD domain, --user-name, --password and --server are only used to initially identify and logon to the AD and to create the machine account for the cluster. Once that is done, that information is no longer used, and e.g. 
the account from --user-name could be deleted, the password changed or the specified DC could be removed from the domain (as long as other DCs are remaining). >> >> >> That was my initial understanding of the --user-name, but when reading the man-page I get the impression that it's also used to do connect to AD to do user and group lookups: >> >> ------------------------------------------------------------------------------------------------------ >> ??user?name userName >> Specifies the user name to be used to perform operations >> against the authentication server. The specified user >> name must have sufficient permissions to read user and >> group attributes from the authentication server. >> ------------------------------------------------------------------------------------------------------- >> >> Also it's strange that "mmuserauth service list" would list the USER_NAME if it was only somthing that was used at configuration time..? >> >> >> >> -jf_______________________________________________ >> gpfsug-discuss mailing list >> gpfsug-discuss at spectrumscale.org >> http://gpfsug.org/mailman/listinfo/gpfsug-discuss >> >> >> >> >> >> >> ------------------------------ >> >> _______________________________________________ >> gpfsug-discuss mailing list >> gpfsug-discuss at spectrumscale.org >> http://gpfsug.org/mailman/listinfo/gpfsug-discuss >> >> >> End of gpfsug-discuss Digest, Vol 55, Issue 44 >> ********************************************** >> >> _______________________________________________ >> gpfsug-discuss mailing list >> gpfsug-discuss at spectrumscale.org >> http://gpfsug.org/mailman/listinfo/gpfsug-discuss >> _______________________________________________ >> gpfsug-discuss mailing list >> gpfsug-discuss at spectrumscale.org >> http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > From christof.schmitt at us.ibm.com Fri Sep 2 19:20:45 2016 From: christof.schmitt at us.ibm.com (Christof Schmitt) Date: Fri, 2 Sep 2016 11:20:45 -0700 Subject: [gpfsug-discuss] CES and mmuserauth command In-Reply-To: References: Message-ID: After looking into this again, the source of confusion is probably from the fact that there are three different authentication schemes present here: When configuring a LDAP server for file or object authentication, then the specified server, user and password are used during normal operations for querying user data. The same applies for configuring object authentication with AD; AD is here treated as a LDAP server. Configuring AD for file authentication is different in that during the "mmuserauth service create", the machine account is created, and then that account is used to connect to a DC that is chosen from the DCs discovered through DNS and not necessarily the one used for the initial configuration. I submitted an internal request to explain this better in the mmuserauth manpage. Regards, Christof Schmitt || IBM || Spectrum Scale Development || Tucson, AZ christof.schmitt at us.ibm.com || +1-520-799-2469 (T/L: 321-2469) From: Christof Schmitt/Tucson/IBM at IBMUS To: gpfsug main discussion list Date: 08/26/2016 09:30 AM Subject: Re: [gpfsug-discuss] CES and mmuserauth command Sent by: gpfsug-discuss-bounces at spectrumscale.org The --user-name option applies to both, AD and LDAP authentication. In the LDAP case, this information is correct. 
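For completeness, the stored value Jan-Frode is asking about can be inspected with the command he quotes; the fields shown differ between AD and LDAP setups:

# Shows the configured authentication, including the USER_NAME field discussed here.
# For LDAP this is the bind user used at run time; for AD it is only the account
# that was used when the machine account was created, as Christof explains.
mmuserauth service list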
I will try to get some clarification added for the AD case. The same applies to the information shown in "service list". There is a common field that holds the information and the parameter from the initial "service create" is stored there. The meaning is different for AD and LDAP: For LDAP it is the username being used to access the LDAP server, while in the AD case it was only the user initially used until the machine account was created. Regards, Christof Schmitt || IBM || Spectrum Scale Development || Tucson, AZ christof.schmitt at us.ibm.com || +1-520-799-2469 (T/L: 321-2469) From: Jan-Frode Myklebust To: gpfsug main discussion list Date: 08/26/2016 05:59 AM Subject: Re: [gpfsug-discuss] CES and mmuserauth command Sent by: gpfsug-discuss-bounces at spectrumscale.org On Fri, Aug 26, 2016 at 1:49 AM, Christof Schmitt < christof.schmitt at us.ibm.com> wrote: When joinging the AD domain, --user-name, --password and --server are only used to initially identify and logon to the AD and to create the machine account for the cluster. Once that is done, that information is no longer used, and e.g. the account from --user-name could be deleted, the password changed or the specified DC could be removed from the domain (as long as other DCs are remaining). That was my initial understanding of the --user-name, but when reading the man-page I get the impression that it's also used to do connect to AD to do user and group lookups: ------------------------------------------------------------------------------------------------------ ??user?name userName Specifies the user name to be used to perform operations against the authentication server. The specified user name must have sufficient permissions to read user and group attributes from the authentication server. ------------------------------------------------------------------------------------------------------- Also it's strange that "mmuserauth service list" would list the USER_NAME if it was only somthing that was used at configuration time..? -jf_______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss From r.sobey at imperial.ac.uk Fri Sep 2 22:02:03 2016 From: r.sobey at imperial.ac.uk (Sobey, Richard A) Date: Fri, 2 Sep 2016 21:02:03 +0000 Subject: [gpfsug-discuss] Cannot stop SMB... stop NFS first? In-Reply-To: References: , Message-ID: That makes more sense putting it that way. Cheers Richard Get Outlook for Android On Fri, Sep 2, 2016 at 5:04 PM +0100, "Stephen Ulmer" > wrote: I think that stopping SMB is an explicitly different assertion than suspending the node, et cetera. When you ask the service to stop, it should stop -- not start a game of whack-a-mole. If you wanted to move the service there are other other ways. If it fails, clearly it the IP address should move. Liberty, -- Stephen > On Sep 2, 2016, at 11:23 AM, Sobey, Richard A wrote: > > A fair point, but since we're not running NFS, a failure of the only other service [SMB], whether it stops through user input or some other means, should cause the node to go unhealthy (in CTDB parlance) and trigger a failover. That would be my preference. > > Otoh if you were running NFS and SMB and one of those services crashed, do you still want a node in the cluster that could potentially respond and fails to do so? 
I guess it's a question for each organisation to answer themselves. > > -----Original Message----- > From: gpfsug-discuss-bounces at spectrumscale.org [mailto:gpfsug-discuss-bounces at spectrumscale.org] On Behalf Of Simon Thompson (Research Computing - IT Services) > Sent: 02 September 2016 16:16 > To: gpfsug main discussion list > Subject: Re: [gpfsug-discuss] Cannot stop SMB... stop NFS first? > > > Should it? > > If you were running nfs and smb, would you necessarily want to fail the ip over? > > Simon > ________________________________________ > From: gpfsug-discuss-bounces at spectrumscale.org [gpfsug-discuss-bounces at spectrumscale.org] on behalf of Sobey, Richard A [r.sobey at imperial.ac.uk] > Sent: 02 September 2016 16:10 > To: gpfsug main discussion list > Subject: Re: [gpfsug-discuss] Cannot stop SMB... stop NFS first? > > I've verified the upgrade has fixed this issue so thanks again. > > However I've noticed that stopping SMB doesn't trigger an IP address failover. I think it should. mmces node suspend (or rebooting, or mmces address move, or etc..) seems to trigger the failover. > > Richard > > From: gpfsug-discuss-bounces at spectrumscale.org [mailto:gpfsug-discuss-bounces at spectrumscale.org] On Behalf Of Danny Alexander Calderon Rodriguez > Sent: 27 August 2016 13:53 > To: gpfsug-discuss at spectrumscale.org > Subject: Re: [gpfsug-discuss] Cannot stop SMB... stop NFS first? > > Hi Richard > > This is fixed in release 4.2.1, if you cant upgrade now, you can fix this manuallly > > > Just do this. > > edit file /usr/lpp/mmfs/lib/mmcesmon/SMBService.py > > > > Change > > if authType == 'ad' and not nodeState.nfsStopped: > > to > > > > nfsEnabled = utils.isProtocolEnabled("NFS", self.logger) > if authType == 'ad' and not nodeState.nfsStopped and nfsEnabled: > > > You need to stop the gpfs service in each node where you apply the change > > > " after change the lines please use tap key" > > > > Enviado desde mi iPhone > > El 27/08/2016, a las 6:00 a.m., gpfsug-discuss-request at spectrumscale.org escribi?: > Send gpfsug-discuss mailing list submissions to > gpfsug-discuss at spectrumscale.org > > To subscribe or unsubscribe via the World Wide Web, visit > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > or, via email, send a message with subject or body 'help' to > gpfsug-discuss-request at spectrumscale.org > > You can reach the person managing the list at > gpfsug-discuss-owner at spectrumscale.org > > When replying, please edit your Subject line so it is more specific than "Re: Contents of gpfsug-discuss digest..." > > > Today's Topics: > > 1. Re: Cannot stop SMB... stop NFS first?(Christof Schmitt) > 2. Re: CES and mmuserauth command (Christof Schmitt) > > > ---------------------------------------------------------------------- > > Message: 1 > Date: Fri, 26 Aug 2016 12:29:31 -0400 > From: "Christof Schmitt" > > To: gpfsug main discussion list > > Subject: Re: [gpfsug-discuss] Cannot stop SMB... stop NFS first? > Message-ID: > > > > Content-Type: text/plain; charset="UTF-8" > > That would be the case when Active Directory is configured for authentication. In that case the SMB service includes two aspects: One is the actual SMB file server, and the second one is the service for the Active Directory integration. Since NFS depends on authentication and id mapping services, it requires SMB to be running. 
> > Regards, > > Christof Schmitt || IBM || Spectrum Scale Development || Tucson, AZ > christof.schmitt at us.ibm.com || +1-520-799-2469 (T/L: 321-2469) > > > > From: "Sobey, Richard A" > > To: "'gpfsug-discuss at spectrumscale.org'" > > > Date: 08/26/2016 04:48 AM > Subject: [gpfsug-discuss] Cannot stop SMB... stop NFS first? > Sent by: gpfsug-discuss-bounces at spectrumscale.org > > > > Sorry all, prepare for a deluge of emails like this, hopefully it?ll help other people implementing CES in the future. > > I?m trying to stop SMB on a node, but getting the following output: > > [root at cesnode ~]# mmces service stop smb > smb: Request denied. Please stop NFS first > > [root at cesnode ~]# mmces service list > Enabled services: SMB > SMB is running > > As you can see there is no way to stop NFS when it?s not running but it seems to be blocking me. It?s happening on all the nodes in the cluster. > > SS version is 4.2.0 running on a fully up to date RHEL 7.1 server. > > Richard_______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > > > > > ------------------------------ > > Message: 2 > Date: Fri, 26 Aug 2016 12:29:31 -0400 > From: "Christof Schmitt" > > To: gpfsug main discussion list > > Subject: Re: [gpfsug-discuss] CES and mmuserauth command > Message-ID: > > > > Content-Type: text/plain; charset="ISO-2022-JP" > > The --user-name option applies to both, AD and LDAP authentication. In the LDAP case, this information is correct. I will try to get some clarification added for the AD case. > > The same applies to the information shown in "service list". There is a common field that holds the information and the parameter from the initial "service create" is stored there. The meaning is different for AD and > LDAP: For LDAP it is the username being used to access the LDAP server, while in the AD case it was only the user initially used until the machine account was created. > > Regards, > > Christof Schmitt || IBM || Spectrum Scale Development || Tucson, AZ > christof.schmitt at us.ibm.com || +1-520-799-2469 (T/L: 321-2469) > > > > From: Jan-Frode Myklebust > > To: gpfsug main discussion list > > Date: 08/26/2016 05:59 AM > Subject: Re: [gpfsug-discuss] CES and mmuserauth command > Sent by: gpfsug-discuss-bounces at spectrumscale.org > > > > > On Fri, Aug 26, 2016 at 1:49 AM, Christof Schmitt < christof.schmitt at us.ibm.com> wrote: > > When joinging the AD domain, --user-name, --password and --server are only used to initially identify and logon to the AD and to create the machine account for the cluster. Once that is done, that information is no longer used, and e.g. the account from --user-name could be deleted, the password changed or the specified DC could be removed from the domain (as long as other DCs are remaining). > > > That was my initial understanding of the --user-name, but when reading the man-page I get the impression that it's also used to do connect to AD to do user and group lookups: > > ------------------------------------------------------------------------------------------------------ > ??user?name userName > Specifies the user name to be used to perform operations > against the authentication server. The specified user > name must have sufficient permissions to read user and > group attributes from the authentication server. 
> ------------------------------------------------------------------------------------------------------- > > Also it's strange that "mmuserauth service list" would list the USER_NAME if it was only somthing that was used at configuration time..? > > > > -jf_______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > > > > > > ------------------------------ > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > > End of gpfsug-discuss Digest, Vol 55, Issue 44 > ********************************************** > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss -------------- next part -------------- An HTML attachment was scrubbed... URL: From bauer at cesnet.cz Mon Sep 5 14:30:54 2016 From: bauer at cesnet.cz (Miroslav Bauer) Date: Mon, 5 Sep 2016 15:30:54 +0200 Subject: [gpfsug-discuss] DMAPI - Unmigrate file to Regular state Message-ID: <9bfb2882-10de-5f2c-98c1-35ac2ac958a2@cesnet.cz> Hello, is there any way to recall a migrated file back to a regular state (other than renaming a file)? I would like to free some space on an external pool (TSM), that is being used by migrated files. And it would be desirable to prevent repeated backups of an already backed-up data (due to changed ctime/inode). I guess that you can acheive only premigrated state with dsmrecall tool (two copies of file data - one on GPFS pool and one on external pool). Maybe deleting 'dmapi.IBMPMig' xattr will do the trick but I don't think it's safe, nor clean :). Thank you in advance, -- Miroslav Bauer -------------- next part -------------- A non-text attachment was scrubbed... Name: smime.p7s Type: application/pkcs7-signature Size: 3716 bytes Desc: S/MIME Cryptographic Signature URL: From janfrode at tanso.net Mon Sep 5 14:51:44 2016 From: janfrode at tanso.net (Jan-Frode Myklebust) Date: Mon, 05 Sep 2016 13:51:44 +0000 Subject: [gpfsug-discuss] DMAPI - Unmigrate file to Regular state In-Reply-To: <9bfb2882-10de-5f2c-98c1-35ac2ac958a2@cesnet.cz> References: <9bfb2882-10de-5f2c-98c1-35ac2ac958a2@cesnet.cz> Message-ID: I believe what you're looking for is dsmrecall -RESident. Plus reconcile on tsm-server to free up the space. Ref: http://www.ibm.com/support/knowledgecenter/SSSR2R_7.1.2/com.ibm.itsm.hsmul.doc/r_cmd_dsmrecall.html -jf man. 5. sep. 2016 kl. 15.30 skrev Miroslav Bauer : > Hello, > > is there any way to recall a migrated file back to a regular state > (other than renaming a file)? I would like to free some space > on an external pool (TSM), that is being used by migrated files. > And it would be desirable to prevent repeated backups of an > already backed-up data (due to changed ctime/inode). > > I guess that you can acheive only premigrated state with dsmrecall tool > (two copies of file data - one on GPFS pool and one on external pool). > Maybe deleting 'dmapi.IBMPMig' xattr will do the trick but I don't think > it's safe, nor clean :). 
> > Thank you in advance, > > -- > Miroslav Bauer > > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > -------------- next part -------------- An HTML attachment was scrubbed... URL: From bauer at cesnet.cz Mon Sep 5 15:13:42 2016 From: bauer at cesnet.cz (Miroslav Bauer) Date: Mon, 5 Sep 2016 16:13:42 +0200 Subject: [gpfsug-discuss] DMAPI - Unmigrate file to Regular state In-Reply-To: References: <9bfb2882-10de-5f2c-98c1-35ac2ac958a2@cesnet.cz> Message-ID: <55108548-0515-49c8-0e76-fca9b247d337@cesnet.cz> That's right, I must have totally overlooked that! Many thanks! :) -- Miroslav Bauer On 09/05/2016 03:51 PM, Jan-Frode Myklebust wrote: > I believe what you're looking for is dsmrecall -RESident. Plus > reconcile on tsm-server to free up the space. > > Ref: > > http://www.ibm.com/support/knowledgecenter/SSSR2R_7.1.2/com.ibm.itsm.hsmul.doc/r_cmd_dsmrecall.html > > > -jf > man. 5. sep. 2016 kl. 15.30 skrev Miroslav Bauer >: > > Hello, > > is there any way to recall a migrated file back to a regular state > (other than renaming a file)? I would like to free some space > on an external pool (TSM), that is being used by migrated files. > And it would be desirable to prevent repeated backups of an > already backed-up data (due to changed ctime/inode). > > I guess that you can acheive only premigrated state with dsmrecall > tool > (two copies of file data - one on GPFS pool and one on external pool). > Maybe deleting 'dmapi.IBMPMig' xattr will do the trick but I don't > think > it's safe, nor clean :). > > Thank you in advance, > > -- > Miroslav Bauer > > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: smime.p7s Type: application/pkcs7-signature Size: 3716 bytes Desc: S/MIME Cryptographic Signature URL: From mark.birmingham at stfc.ac.uk Mon Sep 5 15:27:29 2016 From: mark.birmingham at stfc.ac.uk (mark.birmingham at stfc.ac.uk) Date: Mon, 5 Sep 2016 14:27:29 +0000 Subject: [gpfsug-discuss] DMAPI - Unmigrate file to Regular state In-Reply-To: <55108548-0515-49c8-0e76-fca9b247d337@cesnet.cz> References: <9bfb2882-10de-5f2c-98c1-35ac2ac958a2@cesnet.cz> <55108548-0515-49c8-0e76-fca9b247d337@cesnet.cz> Message-ID: <47B8D67E32CC2D44A587CD18636BECC82BB3A610@exchmbx01> Yes, that's fine. Just submit the request through SBS. Mark Mark Birmingham Development Team Leader High Performance Systems Group STFC Daresbury Laboratory Phone: +44 (0)1925 603381 Email: mark.birmingham at stfc.ac.uk From: gpfsug-discuss-bounces at spectrumscale.org [mailto:gpfsug-discuss-bounces at spectrumscale.org] On Behalf Of Miroslav Bauer Sent: 05 September 2016 15:14 To: gpfsug main discussion list Subject: Re: [gpfsug-discuss] DMAPI - Unmigrate file to Regular state That's right, I must have totally overlooked that! Many thanks! :) -- Miroslav Bauer On 09/05/2016 03:51 PM, Jan-Frode Myklebust wrote: I believe what you're looking for is dsmrecall -RESident. Plus reconcile on tsm-server to free up the space. 
Ref: http://www.ibm.com/support/knowledgecenter/SSSR2R_7.1.2/com.ibm.itsm.hsmul.doc/r_cmd_dsmrecall.html -jf man. 5. sep. 2016 kl. 15.30 skrev Miroslav Bauer >: Hello, is there any way to recall a migrated file back to a regular state (other than renaming a file)? I would like to free some space on an external pool (TSM), that is being used by migrated files. And it would be desirable to prevent repeated backups of an already backed-up data (due to changed ctime/inode). I guess that you can acheive only premigrated state with dsmrecall tool (two copies of file data - one on GPFS pool and one on external pool). Maybe deleting 'dmapi.IBMPMig' xattr will do the trick but I don't think it's safe, nor clean :). Thank you in advance, -- Miroslav Bauer _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss -------------- next part -------------- An HTML attachment was scrubbed... URL: From mark.birmingham at stfc.ac.uk Mon Sep 5 15:30:53 2016 From: mark.birmingham at stfc.ac.uk (mark.birmingham at stfc.ac.uk) Date: Mon, 5 Sep 2016 14:30:53 +0000 Subject: [gpfsug-discuss] DMAPI - Unmigrate file to Regular state In-Reply-To: <47B8D67E32CC2D44A587CD18636BECC82BB3A610@exchmbx01> References: <9bfb2882-10de-5f2c-98c1-35ac2ac958a2@cesnet.cz> <55108548-0515-49c8-0e76-fca9b247d337@cesnet.cz> <47B8D67E32CC2D44A587CD18636BECC82BB3A610@exchmbx01> Message-ID: <47B8D67E32CC2D44A587CD18636BECC82BB3A62A@exchmbx01> Sorry All! Noob error - replied to the wrong email!!! Mark Mark Birmingham Development Team Leader High Performance Systems Group STFC Daresbury Laboratory Phone: +44 (0)1925 603381 Email: mark.birmingham at stfc.ac.uk From: gpfsug-discuss-bounces at spectrumscale.org [mailto:gpfsug-discuss-bounces at spectrumscale.org] On Behalf Of mark.birmingham at stfc.ac.uk Sent: 05 September 2016 15:27 To: gpfsug-discuss at spectrumscale.org Subject: Re: [gpfsug-discuss] DMAPI - Unmigrate file to Regular state Yes, that's fine. Just submit the request through SBS. Mark Mark Birmingham Development Team Leader High Performance Systems Group STFC Daresbury Laboratory Phone: +44 (0)1925 603381 Email: mark.birmingham at stfc.ac.uk From: gpfsug-discuss-bounces at spectrumscale.org [mailto:gpfsug-discuss-bounces at spectrumscale.org] On Behalf Of Miroslav Bauer Sent: 05 September 2016 15:14 To: gpfsug main discussion list Subject: Re: [gpfsug-discuss] DMAPI - Unmigrate file to Regular state That's right, I must have totally overlooked that! Many thanks! :) -- Miroslav Bauer On 09/05/2016 03:51 PM, Jan-Frode Myklebust wrote: I believe what you're looking for is dsmrecall -RESident. Plus reconcile on tsm-server to free up the space. Ref: http://www.ibm.com/support/knowledgecenter/SSSR2R_7.1.2/com.ibm.itsm.hsmul.doc/r_cmd_dsmrecall.html -jf man. 5. sep. 2016 kl. 15.30 skrev Miroslav Bauer >: Hello, is there any way to recall a migrated file back to a regular state (other than renaming a file)? I would like to free some space on an external pool (TSM), that is being used by migrated files. And it would be desirable to prevent repeated backups of an already backed-up data (due to changed ctime/inode). I guess that you can acheive only premigrated state with dsmrecall tool (two copies of file data - one on GPFS pool and one on external pool). 
Maybe deleting 'dmapi.IBMPMig' xattr will do the trick but I don't think it's safe, nor clean :). Thank you in advance, -- Miroslav Bauer _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss -------------- next part -------------- An HTML attachment was scrubbed... URL: From dominic.mueller at de.ibm.com Tue Sep 6 13:04:36 2016 From: dominic.mueller at de.ibm.com (Dominic Mueller-Wicke01) Date: Tue, 6 Sep 2016 14:04:36 +0200 Subject: [gpfsug-discuss] DMAPI - Unmigrate file to Regular state In-Reply-To: References: Message-ID: Hi Miroslav, please use the command: > dsmrecall -resident -detail or use it with file lists Greetings, Dominic. From: gpfsug-discuss-request at spectrumscale.org To: gpfsug-discuss at spectrumscale.org Date: 06.09.2016 13:00 Subject: gpfsug-discuss Digest, Vol 56, Issue 10 Sent by: gpfsug-discuss-bounces at spectrumscale.org Send gpfsug-discuss mailing list submissions to gpfsug-discuss at spectrumscale.org To subscribe or unsubscribe via the World Wide Web, visit http://gpfsug.org/mailman/listinfo/gpfsug-discuss or, via email, send a message with subject or body 'help' to gpfsug-discuss-request at spectrumscale.org You can reach the person managing the list at gpfsug-discuss-owner at spectrumscale.org When replying, please edit your Subject line so it is more specific than "Re: Contents of gpfsug-discuss digest..." Today's Topics: 1. Re: DMAPI - Unmigrate file to Regular state (mark.birmingham at stfc.ac.uk) ----- Message from on Mon, 5 Sep 2016 14:30:53 +0000 ----- To: Subject: Re: [gpfsug-discuss] DMAPI - Unmigrate file to Regular state Sorry All! Noob error ? replied to the wrong email!!! Mark Mark Birmingham Development Team Leader High Performance Systems Group STFC Daresbury Laboratory Phone: +44 (0)1925 603381 Email: mark.birmingham at stfc.ac.uk From: gpfsug-discuss-bounces at spectrumscale.org [ mailto:gpfsug-discuss-bounces at spectrumscale.org] On Behalf Of mark.birmingham at stfc.ac.uk Sent: 05 September 2016 15:27 To: gpfsug-discuss at spectrumscale.org Subject: Re: [gpfsug-discuss] DMAPI - Unmigrate file to Regular state Yes, that?s fine. Just submit the request through SBS. Mark Mark Birmingham Development Team Leader High Performance Systems Group STFC Daresbury Laboratory Phone: +44 (0)1925 603381 Email: mark.birmingham at stfc.ac.uk From: gpfsug-discuss-bounces at spectrumscale.org [ mailto:gpfsug-discuss-bounces at spectrumscale.org] On Behalf Of Miroslav Bauer Sent: 05 September 2016 15:14 To: gpfsug main discussion list Subject: Re: [gpfsug-discuss] DMAPI - Unmigrate file to Regular state That's right, I must have totally overlooked that! Many thanks! :) -- Miroslav Bauer On 09/05/2016 03:51 PM, Jan-Frode Myklebust wrote: I believe what you're looking for is dsmrecall -RESident. Plus reconcile on tsm-server to free up the space. Ref: http://www.ibm.com/support/knowledgecenter/SSSR2R_7.1.2/com.ibm.itsm.hsmul.doc/r_cmd_dsmrecall.html -jf man. 5. sep. 2016 kl. 15.30 skrev Miroslav Bauer : Hello, is there any way to recall a migrated file back to a regular state (other than renaming a file)? I would like to free some space on an external pool (TSM), that is being used by migrated files. 
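To make Dominic's and Jan-Frode's answers concrete, a small sketch follows. The file paths are placeholders, and the dsmls check, the -filelist form and the dsmreconcile step are recalled from the HSM client documentation rather than taken from this thread:

# Check the current HSM state of a file (resident / premigrated / migrated)
dsmls /gpfs/fs1/archive/bigfile.dat

# Recall it all the way back to resident, dropping the migrated/premigrated state
dsmrecall -resident -detail /gpfs/fs1/archive/bigfile.dat

# Or drive the recall from a file list
dsmrecall -resident -filelist=/tmp/recall.list

# Reconcile afterwards so the space on the TSM side is actually released
dsmreconcile /gpfs/fs1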
And it would be desirable to prevent repeated backups of an already backed-up data (due to changed ctime/inode). I guess that you can acheive only premigrated state with dsmrecall tool (two copies of file data - one on GPFS pool and one on external pool). Maybe deleting 'dmapi.IBMPMig' xattr will do the trick but I don't think it's safe, nor clean :). Thank you in advance, -- Miroslav Bauer _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: graycol.gif Type: image/gif Size: 105 bytes Desc: not available URL: From volobuev at us.ibm.com Tue Sep 6 20:06:32 2016 From: volobuev at us.ibm.com (Yuri L Volobuev) Date: Tue, 6 Sep 2016 12:06:32 -0700 Subject: [gpfsug-discuss] Migration to separate metadata and data disks In-Reply-To: <505ff859-d49a-04cc-bd9d-50f7b2a8df0b@cesnet.cz> References: <7927f34a-28e5-6fc2-a55d-62b2066a08da@cesnet.cz><2ce7334b-28c1-7a14-814a-fbcf99d8049e@nasa.gov> <505ff859-d49a-04cc-bd9d-50f7b2a8df0b@cesnet.cz> Message-ID: The correct way to accomplish what you're looking for (in particular, changing the fs-wide level of replication) is mmrestripefs -R. This command also takes care of moving data off disks now marked metadataOnly. The restripe job hits an error trying to move blocks of the inode file, i.e. before it gets to actual user data blocks. Note that at this point the metadata replication factor is still 2. This suggests one of two possibilities: (1) there isn't enough actual free space on the remaining metadataOnly disks, (2) there isn't enough space in some failure groups to allocate two replicas. All of this assumes you're operating within a single storage pool. If multiple storage pools are in play, there are other possibilities. 'mmdf' output would be helpful in providing more helpful advice. With the information at hand, I can only suggest trying to accomplish the task in two phases: (a) deallocated extra metadata replicas, by doing mmchfs -m 1 + mmrestripefs -R (b) move metadata off SATA disks. I do want to point out that metadata replication is a highly recommended insurance policy to have for your file system. As with other kinds of insurance, you may or may not need it, but if you do end up needing it, you'll be very glad you have it. The costs, in terms of extra metadata space and performance overhead, are very reasonable. yuri From: Miroslav Bauer To: gpfsug-discuss at spectrumscale.org, Date: 09/01/2016 07:29 AM Subject: Re: [gpfsug-discuss] Migration to separate metadata and data disks Sent by: gpfsug-discuss-bounces at spectrumscale.org Yes, failure group id is exactly what I meant :). Unfortunately, mmrestripefs with -R behaves the same as with -r. I also believed that mmrestripefs -R is the correct tool for fixing the replication settings on inodes (according to manpages), but I will try possible solutions you and Marc suggested and let you know how it went. Thank you, -- Miroslav Bauer On 09/01/2016 04:02 PM, Aaron Knister wrote: > Oh! 
I think you've already provided the info I was looking for :) I > thought that failGroup=3 meant there were 3 failure groups within the > SSDs. I suspect that's not at all what you meant and that actually is > the failure group of all of those disks. That I think explains what's > going on-- there's only one failure group's worth of metadata-capable > disks available and as such GPFS can't place the 2nd replica for > existing files. > > Here's what I would suggest: > > - Create at least 2 failure groups within the SSDs > - Put the default metadata replication factor back to 2 > - Run a restripefs -R to shuffle files around and restore the metadata > replication factor of 2 to any files created while it was set to 1 > > If you're not interested in replication for metadata then perhaps all > you need to do is the mmrestripefs -R. I think that should > un-replicate the file from the SATA disks leaving the copy on the SSDs. > > Hope that helps. > > -Aaron > > On 9/1/16 9:39 AM, Aaron Knister wrote: >> By the way, I suspect the no space on device errors are because GPFS >> believes for some reason that it is unable to maintain the metadata >> replication factor of 2 that's likely set on all previously created >> inodes. >> >> On 9/1/16 9:36 AM, Aaron Knister wrote: >>> I must admit, I'm curious as to the reason you're dropping the >>> replication factor from 2 down to 1. There are some serious advantages >>> we've seen to having multiple metadata replicas, as far as error >>> recovery is concerned. >>> >>> Could you paste an output of mmlsdisk for the filesystem? >>> >>> -Aaron >>> >>> On 9/1/16 9:30 AM, Miroslav Bauer wrote: >>>> Hello, >>>> >>>> I have a GPFS 3.5 filesystem (fs1) and I'm trying to migrate the >>>> filesystem metadata from state: >>>> -m = 2 (default metadata replicas) >>>> - SATA disks (dataAndMetadata, failGroup=1) >>>> - SSDs (metadataOnly, failGroup=3) >>>> to the desired state: >>>> -m = 1 >>>> - SATA disks (dataOnly, failGroup=1) >>>> - SSDs (metadataOnly, failGroup=3) >>>> >>>> I have done the following steps in the following order: >>>> 1) change SATA disks to dataOnly (stanza file modifies the 'usage' >>>> attribute only): >>>> # mmchdisk fs1 change -F dataOnly_disks.stanza >>>> Attention: Disk parameters were changed. >>>> Use the mmrestripefs command with the -r option to relocate data and >>>> metadata. >>>> Verifying file system configuration information ... >>>> mmchdisk: Propagating the cluster configuration data to all >>>> affected nodes. This is an asynchronous process. >>>> >>>> 2) change default metadata replicas number 2->1 >>>> # mmchfs fs1 -m 1 >>>> >>>> 3) run mmrestripefs as suggested by output of 1) >>>> # mmrestripefs fs1 -r >>>> Scanning file system metadata, phase 1 ... >>>> Error processing inodes. >>>> No space left on device >>>> mmrestripefs: Command failed. Examine previous error messages to >>>> determine cause. >>>> >>>> It is, however, still possible to create new files on the filesystem. >>>> When I return one of the SATA disks as a dataAndMetadata disk, the >>>> mmrestripefs >>>> command stops complaining about No space left on device. Both df and >>>> mmdf >>>> say that there is enough space both for data (SATA) and metadata >>>> (SSDs). >>>> Does anyone have an idea why is it complaining? 
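Pulling Yuri's two-phase suggestion together as commands, using the filesystem and stanza names from earlier in this thread; the mmlsattr spot-check at the end is an assumption:

# Phase (a): drop the default metadata replication and rewrite existing files/inodes
mmchfs fs1 -m 1
mmrestripefs fs1 -R

# Phase (b): mark the SATA NSDs dataOnly, then relocate data and metadata as
# the mmchdisk message suggests
mmchdisk fs1 change -F dataOnly_disks.stanza
mmrestripefs fs1 -r

# Verify placement and free space afterwards
mmlsdisk fs1
mmdf fs1

# Spot-check the replication settings of an individual file (placeholder path)
mmlsattr -L /gpfs/fs1/some/file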
>>>> >>>> Thanks, >>>> >>>> -- >>>> Miroslav Bauer >>>> >>>> >>>> >>>> >>>> _______________________________________________ >>>> gpfsug-discuss mailing list >>>> gpfsug-discuss at spectrumscale.org >>>> http://gpfsug.org/mailman/listinfo/gpfsug-discuss >>>> >>> >> > [attachment "smime.p7s" deleted by Yuri L Volobuev/Austin/IBM] _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: graycol.gif Type: image/gif Size: 105 bytes Desc: not available URL: From bauer at cesnet.cz Wed Sep 7 10:40:19 2016 From: bauer at cesnet.cz (Miroslav Bauer) Date: Wed, 7 Sep 2016 11:40:19 +0200 Subject: [gpfsug-discuss] Migration to separate metadata and data disks In-Reply-To: References: <7927f34a-28e5-6fc2-a55d-62b2066a08da@cesnet.cz> <2ce7334b-28c1-7a14-814a-fbcf99d8049e@nasa.gov> <505ff859-d49a-04cc-bd9d-50f7b2a8df0b@cesnet.cz> Message-ID: Hello Yuri, here goes the actual mmdf output of filesystem in question: disk disk size failure holds holds free free name group metadata data in full blocks in fragments --------------- ------------- -------- -------- ----- -------------------- ------------------- Disks in storage pool: system (Maximum disk size allowed is 40 TB) dcsh_10C 5T 1 Yes Yes 1.661T ( 33%) 68.48G ( 1%) dcsh_10D 6.828T 1 Yes Yes 2.809T ( 41%) 83.82G ( 1%) dcsh_11C 5T 1 Yes Yes 1.659T ( 33%) 69.01G ( 1%) dcsh_11D 6.828T 1 Yes Yes 2.81T ( 41%) 83.33G ( 1%) dcsh_12C 5T 1 Yes Yes 1.659T ( 33%) 69.48G ( 1%) dcsh_12D 6.828T 1 Yes Yes 2.807T ( 41%) 83.14G ( 1%) dcsh_13C 5T 1 Yes Yes 1.659T ( 33%) 69.35G ( 1%) dcsh_13D 6.828T 1 Yes Yes 2.81T ( 41%) 82.97G ( 1%) dcsh_14C 5T 1 Yes Yes 1.66T ( 33%) 69.06G ( 1%) dcsh_14D 6.828T 1 Yes Yes 2.811T ( 41%) 83.61G ( 1%) dcsh_15C 5T 1 Yes Yes 1.658T ( 33%) 69.38G ( 1%) dcsh_15D 6.828T 1 Yes Yes 2.814T ( 41%) 83.69G ( 1%) dcsd_15D 6.828T 1 Yes Yes 2.811T ( 41%) 83.98G ( 1%) dcsd_15C 5T 1 Yes Yes 1.66T ( 33%) 68.66G ( 1%) dcsd_14D 6.828T 1 Yes Yes 2.81T ( 41%) 84.18G ( 1%) dcsd_14C 5T 1 Yes Yes 1.659T ( 33%) 69.43G ( 1%) dcsd_13D 6.828T 1 Yes Yes 2.81T ( 41%) 83.27G ( 1%) dcsd_13C 5T 1 Yes Yes 1.66T ( 33%) 69.1G ( 1%) dcsd_12D 6.828T 1 Yes Yes 2.81T ( 41%) 83.61G ( 1%) dcsd_12C 5T 1 Yes Yes 1.66T ( 33%) 69.42G ( 1%) dcsd_11D 6.828T 1 Yes Yes 2.811T ( 41%) 83.59G ( 1%) dcsh_10B 5T 1 Yes Yes 1.633T ( 33%) 76.97G ( 2%) dcsh_11A 5T 1 Yes Yes 1.632T ( 33%) 77.29G ( 2%) dcsh_11B 5T 1 Yes Yes 1.633T ( 33%) 76.73G ( 1%) dcsh_12A 5T 1 Yes Yes 1.634T ( 33%) 76.49G ( 1%) dcsd_11C 5T 1 Yes Yes 1.66T ( 33%) 69.25G ( 1%) dcsd_10D 6.828T 1 Yes Yes 2.811T ( 41%) 83.39G ( 1%) dcsh_10A 5T 1 Yes Yes 1.633T ( 33%) 77.06G ( 2%) dcsd_10C 5T 1 Yes Yes 1.66T ( 33%) 69.83G ( 1%) dcsd_15B 5T 1 Yes Yes 1.635T ( 33%) 76.52G ( 1%) dcsd_15A 5T 1 Yes Yes 1.634T ( 33%) 76.24G ( 1%) dcsd_14B 5T 1 Yes Yes 1.634T ( 33%) 76.31G ( 1%) dcsd_14A 5T 1 Yes Yes 1.634T ( 33%) 76.23G ( 1%) dcsd_13B 5T 1 Yes Yes 1.634T ( 33%) 76.13G ( 1%) dcsd_13A 5T 1 Yes Yes 1.634T ( 33%) 76.22G ( 1%) dcsd_12B 5T 1 Yes Yes 1.635T ( 33%) 77.49G ( 2%) dcsd_12A 5T 1 Yes Yes 1.633T ( 33%) 77.13G ( 2%) dcsd_11B 5T 1 Yes Yes 1.633T ( 33%) 76.86G ( 2%) dcsd_11A 5T 1 Yes Yes 1.632T ( 33%) 76.22G ( 1%) dcsd_10B 5T 1 Yes Yes 1.633T ( 33%) 76.79G ( 1%) dcsd_10A 5T 1 Yes Yes 1.633T ( 33%) 77.21G ( 2%) dcsh_15B 5T 1 Yes Yes 1.635T ( 33%) 76.04G ( 1%) dcsh_15A 5T 1 Yes Yes 
1.634T ( 33%) 76.84G ( 2%) dcsh_14B 5T 1 Yes Yes 1.635T ( 33%) 76.75G ( 1%) dcsh_14A 5T 1 Yes Yes 1.633T ( 33%) 76.05G ( 1%) dcsh_13B 5T 1 Yes Yes 1.634T ( 33%) 76.35G ( 1%) dcsh_13A 5T 1 Yes Yes 1.634T ( 33%) 76.68G ( 1%) dcsh_12B 5T 1 Yes Yes 1.635T ( 33%) 76.74G ( 1%) ssd_5_5 80G 3 Yes No 22.31G ( 28%) 7.155G ( 9%) ssd_4_4 80G 3 Yes No 22.21G ( 28%) 7.196G ( 9%) ssd_3_3 80G 3 Yes No 22.2G ( 28%) 7.239G ( 9%) ssd_2_2 80G 3 Yes No 22.24G ( 28%) 7.146G ( 9%) ssd_1_1 80G 3 Yes No 22.29G ( 28%) 7.134G ( 9%) ------------- -------------------- ------------------- (pool total) 262.3T 92.96T ( 35%) 3.621T ( 1%) Disks in storage pool: maid4 (Maximum disk size allowed is 466 TB) ...... ------------- -------------------- ------------------- (pool total) 291T 126.5T ( 43%) 562.6G ( 0%) Disks in storage pool: maid5 (Maximum disk size allowed is 466 TB) ...... ------------- -------------------- ------------------- (pool total) 436.6T 120.8T ( 28%) 25.23G ( 0%) Disks in storage pool: maid6 (Maximum disk size allowed is 466 TB) ....... ------------- -------------------- ------------------- (pool total) 582.1T 358.7T ( 62%) 9.458G ( 0%) ============= ==================== =================== (data) 1.535P 698.9T ( 44%) 4.17T ( 0%) (metadata) 262.3T 92.96T ( 35%) 3.621T ( 1%) ============= ==================== =================== (total) 1.535P 699T ( 44%) 4.205T ( 0%) Inode Information ----------------- Number of used inodes: 79607225 Number of free inodes: 82340423 Number of allocated inodes: 161947648 Maximum number of inodes: 1342177280 I have a smaller testing FS with the same setup (with plenty of free space), and the actual sequence of commands that worked for me was: mmchfs fs1 -m1 mmrestripefs fs1 -R mmrestripefs fs1 -b mmchdisk fs1 change -F ~/nsd_metadata_test (dataAndMetadata -> dataOnly) mmrestripefs fs1 -r Could you please evaluate more on the performance overhead with having metadata on SSD+SATA? Are the read operations automatically directed to faster disks by GPFS? Is each write operation waiting for write to be finished by SATA disks? Thank you, -- Miroslav Bauer On 09/06/2016 09:06 PM, Yuri L Volobuev wrote: > > The correct way to accomplish what you're looking for (in particular, > changing the fs-wide level of replication) is mmrestripefs -R. This > command also takes care of moving data off disks now marked metadataOnly. > > The restripe job hits an error trying to move blocks of the inode > file, i.e. before it gets to actual user data blocks. Note that at > this point the metadata replication factor is still 2. This suggests > one of two possibilities: (1) there isn't enough actual free space on > the remaining metadataOnly disks, (2) there isn't enough space in some > failure groups to allocate two replicas. > > All of this assumes you're operating within a single storage pool. If > multiple storage pools are in play, there are other possibilities. > > 'mmdf' output would be helpful in providing more helpful advice. With > the information at hand, I can only suggest trying to accomplish the > task in two phases: (a) deallocated extra metadata replicas, by doing > mmchfs -m 1 + mmrestripefs -R (b) move metadata off SATA disks. I do > want to point out that metadata replication is a highly recommended > insurance policy to have for your file system. As with other kinds of > insurance, you may or may not need it, but if you do end up needing > it, you'll be very glad you have it. The costs, in terms of extra > metadata space and performance overhead, are very reasonable. 
> > yuri > > > Miroslav Bauer ---09/01/2016 07:29:06 AM---Yes, failure group id is > exactly what I meant :). Unfortunately, mmrestripefs with -R > > From: Miroslav Bauer > To: gpfsug-discuss at spectrumscale.org, > Date: 09/01/2016 07:29 AM > Subject: Re: [gpfsug-discuss] Migration to separate metadata and data > disks > Sent by: gpfsug-discuss-bounces at spectrumscale.org > > ------------------------------------------------------------------------ > > > > Yes, failure group id is exactly what I meant :). Unfortunately, > mmrestripefs with -R > behaves the same as with -r. I also believed that mmrestripefs -R is the > correct tool for > fixing the replication settings on inodes (according to manpages), but I > will try possible > solutions you and Marc suggested and let you know how it went. > > Thank you, > -- > Miroslav Bauer > > On 09/01/2016 04:02 PM, Aaron Knister wrote: > > Oh! I think you've already provided the info I was looking for :) I > > thought that failGroup=3 meant there were 3 failure groups within the > > SSDs. I suspect that's not at all what you meant and that actually is > > the failure group of all of those disks. That I think explains what's > > going on-- there's only one failure group's worth of metadata-capable > > disks available and as such GPFS can't place the 2nd replica for > > existing files. > > > > Here's what I would suggest: > > > > - Create at least 2 failure groups within the SSDs > > - Put the default metadata replication factor back to 2 > > - Run a restripefs -R to shuffle files around and restore the metadata > > replication factor of 2 to any files created while it was set to 1 > > > > If you're not interested in replication for metadata then perhaps all > > you need to do is the mmrestripefs -R. I think that should > > un-replicate the file from the SATA disks leaving the copy on the SSDs. > > > > Hope that helps. > > > > -Aaron > > > > On 9/1/16 9:39 AM, Aaron Knister wrote: > >> By the way, I suspect the no space on device errors are because GPFS > >> believes for some reason that it is unable to maintain the metadata > >> replication factor of 2 that's likely set on all previously created > >> inodes. > >> > >> On 9/1/16 9:36 AM, Aaron Knister wrote: > >>> I must admit, I'm curious as to the reason you're dropping the > >>> replication factor from 2 down to 1. There are some serious advantages > >>> we've seen to having multiple metadata replicas, as far as error > >>> recovery is concerned. > >>> > >>> Could you paste an output of mmlsdisk for the filesystem? > >>> > >>> -Aaron > >>> > >>> On 9/1/16 9:30 AM, Miroslav Bauer wrote: > >>>> Hello, > >>>> > >>>> I have a GPFS 3.5 filesystem (fs1) and I'm trying to migrate the > >>>> filesystem metadata from state: > >>>> -m = 2 (default metadata replicas) > >>>> - SATA disks (dataAndMetadata, failGroup=1) > >>>> - SSDs (metadataOnly, failGroup=3) > >>>> to the desired state: > >>>> -m = 1 > >>>> - SATA disks (dataOnly, failGroup=1) > >>>> - SSDs (metadataOnly, failGroup=3) > >>>> > >>>> I have done the following steps in the following order: > >>>> 1) change SATA disks to dataOnly (stanza file modifies the 'usage' > >>>> attribute only): > >>>> # mmchdisk fs1 change -F dataOnly_disks.stanza > >>>> Attention: Disk parameters were changed. > >>>> Use the mmrestripefs command with the -r option to relocate > data and > >>>> metadata. > >>>> Verifying file system configuration information ... > >>>> mmchdisk: Propagating the cluster configuration data to all > >>>> affected nodes. 
This is an asynchronous process. > >>>> > >>>> 2) change default metadata replicas number 2->1 > >>>> # mmchfs fs1 -m 1 > >>>> > >>>> 3) run mmrestripefs as suggested by output of 1) > >>>> # mmrestripefs fs1 -r > >>>> Scanning file system metadata, phase 1 ... > >>>> Error processing inodes. > >>>> No space left on device > >>>> mmrestripefs: Command failed. Examine previous error messages to > >>>> determine cause. > >>>> > >>>> It is, however, still possible to create new files on the filesystem. > >>>> When I return one of the SATA disks as a dataAndMetadata disk, the > >>>> mmrestripefs > >>>> command stops complaining about No space left on device. Both df and > >>>> mmdf > >>>> say that there is enough space both for data (SATA) and metadata > >>>> (SSDs). > >>>> Does anyone have an idea why is it complaining? > >>>> > >>>> Thanks, > >>>> > >>>> -- > >>>> Miroslav Bauer > >>>> > >>>> > >>>> > >>>> > >>>> _______________________________________________ > >>>> gpfsug-discuss mailing list > >>>> gpfsug-discuss at spectrumscale.org > >>>> http://gpfsug.org/mailman/listinfo/gpfsug-discuss > >>>> > >>> > >> > > > > > [attachment "smime.p7s" deleted by Yuri L Volobuev/Austin/IBM] > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > > > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: smime.p7s Type: application/pkcs7-signature Size: 3716 bytes Desc: S/MIME Cryptographic Signature URL: From S.J.Thompson at bham.ac.uk Wed Sep 7 13:36:48 2016 From: S.J.Thompson at bham.ac.uk (Simon Thompson (Research Computing - IT Services)) Date: Wed, 7 Sep 2016 12:36:48 +0000 Subject: [gpfsug-discuss] Remote cluster mount failing Message-ID: Hi All, I'm trying to get some multi cluster thing working between two of our GPFS clusters. In the "client" cluster, when trying to mount the "remote" cluster, I get: # mmmount gpfs Wed 7 Sep 13:33:06 BST 2016: mmmount: Mounting file systems ... mount: mount /dev/gpfs on /gpfs failed: Connection timed out mmmount: Command failed. Examine previous error messages to determine cause. And in the log file: Wed Sep 7 13:33:07.481 2016: [N] The client side TLS handshake with node 10.0.0.182 was cancelled: connection reset by peer (return code 420). Wed Sep 7 13:33:07.486 2016: [N] The client side TLS handshake with node 10.0.0.181 was cancelled: connection reset by peer (return code 420). Wed Sep 7 13:33:07.487 2016: [E] Failed to join remote cluster GPFS_STORAGE.CLUSTER Wed Sep 7 13:33:07.488 2016: [W] Command: err 78: mount GPFS_STORAGE.CLUSTER:gpfs Wed Sep 7 13:33:07.489 2016: Connection timed out In the remote cluster, I see: Wed Sep 7 13:33:07.487 2016: [W] The TLS handshake with node 10.0.0.222 failed with error 447 (server side). Wed Sep 7 13:33:07.488 2016: [X] Connection from 10.10.0.35 refused, authentication failed Wed Sep 7 13:33:07.489 2016: [E] Killing connection from 10.10.0.35, err 703 Wed Sep 7 13:33:07.490 2016: Operation not permitted Weirdly though on other nodes in the client cluster this succeeds fine and can mount, so I think I got all the bits in the mmauth and mmremotecluster configured correctly. Any suggestions? 
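In case it's useful, the bits I've double-checked so far (and which look sane to me) are roughly:

# mmauth show all            (on each cluster, comparing keys and cipherList)
# mmremotecluster show all   (on the client cluster)
# mmremotefs show all        (on the client cluster)

though I may well be missing something subtle there.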
Thanks

Simon

From volobuev at us.ibm.com Wed Sep 7 17:38:03 2016
From: volobuev at us.ibm.com (Yuri L Volobuev)
Date: Wed, 7 Sep 2016 09:38:03 -0700
Subject: [gpfsug-discuss] Migration to separate metadata and data disks
In-Reply-To:
References: <7927f34a-28e5-6fc2-a55d-62b2066a08da@cesnet.cz> <2ce7334b-28c1-7a14-814a-fbcf99d8049e@nasa.gov> <505ff859-d49a-04cc-bd9d-50f7b2a8df0b@cesnet.cz>
Message-ID:

Hi Miroslav,

The mmdf output is very helpful. It suggests very strongly what the problem is:

> ssd_5_5                    80G        3 Yes      No           22.31G ( 28%)        7.155G ( 9%)
> ssd_4_4                    80G        3 Yes      No           22.21G ( 28%)        7.196G ( 9%)
> ssd_3_3                    80G        3 Yes      No            22.2G ( 28%)        7.239G ( 9%)
> ssd_2_2                    80G        3 Yes      No           22.24G ( 28%)        7.146G ( 9%)
> ssd_1_1                    80G        3 Yes      No           22.29G ( 28%)        7.134G ( 9%)
> ...
>                         ==================== ===================
> (data)            1.535P               698.9T ( 44%)         4.17T ( 0%)
> (metadata)        262.3T               92.96T ( 35%)        3.621T ( 1%)
> ...
> Number of allocated inodes:  161947648
> Maximum number of inodes:   1342177280

You have 5 80G SSDs. That's not enough. Even with metadata spread across a couple dozen more SATA disks, the SSDs are over 3/4 full. There's no way to accurately estimate the amount of metadata in this file system with the data at hand, but if we (very conservatively) assume that each SATA disk has only as much metadata as each SSD, i.e. ~57G, that would greatly exceed the amount of free space available on your SSDs. You need more free metadata space.

Another way to look at this: you've got 1.5PB of data under management. A reasonable rule-of-thumb estimate for the amount of metadata is 1-2% of the data (this is a typical ratio, but of course every file system is different, and large deviations are possible. A degenerate case is an fs containing nothing but directories, and in that case metadata usage is 100%). So you have to have at least a few TB of metadata storage; even at the 1% end, 1.535P of data works out to roughly 15T of metadata, versus 5 x 80G = 400G of total SSD capacity. 5 80G SSDs aren't enough for an fs of this size.

> Could you please evaluate more on the performance overhead with having metadata
> on SSD+SATA? Are the read operations automatically directed to faster disks by GPFS?
> Is each write operation waiting for write to be finished by SATA disks?

Mixing disks with sharply different performance characteristics within a single storage pool is detrimental to performance. GPFS stripes blocks across all disks in a storage pool, expecting all of them to be equally suitable. If SSDs are mixed with SATA disks, the overall metadata write performance is going to be bottlenecked by SATA drives. On reads, given a choice of two replicas, GPFS V4.1.1+ picks the replica residing on the fastest disk, but given that SSDs represent only a small fraction of your total metadata usage, this likely doesn't help a whole lot. You're on the right track in trying to shift all metadata to SSDs and away from SATA; the overall file system performance will improve as a result.

yuri

> Thank you,
> --
> Miroslav Bauer
>
> On 09/06/2016 09:06 PM, Yuri L Volobuev wrote:
> The correct way to accomplish what you're looking for (in
> particular, changing the fs-wide level of replication) is
> mmrestripefs -R. This command also takes care of moving data off
> disks now marked metadataOnly.
> > The restripe job hits an error trying to move blocks of the inode > file, i.e. before it gets to actual user data blocks. Note that at > this point the metadata replication factor is still 2. This suggests > one of two possibilities: (1) there isn't enough actual free space > on the remaining metadataOnly disks, (2) there isn't enough space in > some failure groups to allocate two replicas. > > All of this assumes you're operating within a single storage pool. > If multiple storage pools are in play, there are other possibilities. > > 'mmdf' output would be helpful in providing more helpful advice. > With the information at hand, I can only suggest trying to > accomplish the task in two phases: (a) deallocated extra metadata > replicas, by doing mmchfs -m 1 + mmrestripefs -R (b) move metadata > off SATA disks. I do want to point out that metadata replication is > a highly recommended insurance policy to have for your file system. > As with other kinds of insurance, you may or may not need it, but if > you do end up needing it, you'll be very glad you have it. The > costs, in terms of extra metadata space and performance overhead, > are very reasonable. > > yuri > > > Miroslav Bauer ---09/01/2016 07:29:06 AM---Yes, failure group id is > exactly what I meant :). Unfortunately, mmrestripefs with -R > > From: Miroslav Bauer > To: gpfsug-discuss at spectrumscale.org, > Date: 09/01/2016 07:29 AM > Subject: Re: [gpfsug-discuss] Migration to separate metadata and data disks > Sent by: gpfsug-discuss-bounces at spectrumscale.org > > > > Yes, failure group id is exactly what I meant :). Unfortunately, > mmrestripefs with -R > behaves the same as with -r. I also believed that mmrestripefs -R is the > correct tool for > fixing the replication settings on inodes (according to manpages), but I > will try possible > solutions you and Marc suggested and let you know how it went. > > Thank you, > -- > Miroslav Bauer > > On 09/01/2016 04:02 PM, Aaron Knister wrote: > > Oh! I think you've already provided the info I was looking for :) I > > thought that failGroup=3 meant there were 3 failure groups within the > > SSDs. I suspect that's not at all what you meant and that actually is > > the failure group of all of those disks. That I think explains what's > > going on-- there's only one failure group's worth of metadata-capable > > disks available and as such GPFS can't place the 2nd replica for > > existing files. > > > > Here's what I would suggest: > > > > - Create at least 2 failure groups within the SSDs > > - Put the default metadata replication factor back to 2 > > - Run a restripefs -R to shuffle files around and restore the metadata > > replication factor of 2 to any files created while it was set to 1 > > > > If you're not interested in replication for metadata then perhaps all > > you need to do is the mmrestripefs -R. I think that should > > un-replicate the file from the SATA disks leaving the copy on the SSDs. > > > > Hope that helps. > > > > -Aaron > > > > On 9/1/16 9:39 AM, Aaron Knister wrote: > >> By the way, I suspect the no space on device errors are because GPFS > >> believes for some reason that it is unable to maintain the metadata > >> replication factor of 2 that's likely set on all previously created > >> inodes. > >> > >> On 9/1/16 9:36 AM, Aaron Knister wrote: > >>> I must admit, I'm curious as to the reason you're dropping the > >>> replication factor from 2 down to 1. 
There are some serious advantages > >>> we've seen to having multiple metadata replicas, as far as error > >>> recovery is concerned. > >>> > >>> Could you paste an output of mmlsdisk for the filesystem? > >>> > >>> -Aaron > >>> > >>> On 9/1/16 9:30 AM, Miroslav Bauer wrote: > >>>> Hello, > >>>> > >>>> I have a GPFS 3.5 filesystem (fs1) and I'm trying to migrate the > >>>> filesystem metadata from state: > >>>> -m = 2 (default metadata replicas) > >>>> - SATA disks (dataAndMetadata, failGroup=1) > >>>> - SSDs (metadataOnly, failGroup=3) > >>>> to the desired state: > >>>> -m = 1 > >>>> - SATA disks (dataOnly, failGroup=1) > >>>> - SSDs (metadataOnly, failGroup=3) > >>>> > >>>> I have done the following steps in the following order: > >>>> 1) change SATA disks to dataOnly (stanza file modifies the 'usage' > >>>> attribute only): > >>>> # mmchdisk fs1 change -F dataOnly_disks.stanza > >>>> Attention: Disk parameters were changed. > >>>> ? Use the mmrestripefs command with the -r option to relocate data and > >>>> metadata. > >>>> Verifying file system configuration information ... > >>>> mmchdisk: Propagating the cluster configuration data to all > >>>> ? affected nodes. ?This is an asynchronous process. > >>>> > >>>> 2) change default metadata replicas number 2->1 > >>>> # mmchfs fs1 -m 1 > >>>> > >>>> 3) run mmrestripefs as suggested by output of 1) > >>>> # mmrestripefs fs1 -r > >>>> Scanning file system metadata, phase 1 ... > >>>> Error processing inodes. > >>>> No space left on device > >>>> mmrestripefs: Command failed. ?Examine previous error messages to > >>>> determine cause. > >>>> > >>>> It is, however, still possible to create new files on the filesystem. > >>>> When I return one of the SATA disks as a dataAndMetadata disk, the > >>>> mmrestripefs > >>>> command stops complaining about No space left on device. Both df and > >>>> mmdf > >>>> say that there is enough space both for data (SATA) and metadata > >>>> (SSDs). > >>>> Does anyone have an idea why is it complaining? > >>>> > >>>> Thanks, > >>>> > >>>> -- > >>>> Miroslav Bauer > >>>> > >>>> > >>>> > >>>> > >>>> _______________________________________________ > >>>> gpfsug-discuss mailing list > >>>> gpfsug-discuss at spectrumscale.org > >>>> http://gpfsug.org/mailman/listinfo/gpfsug-discuss > >>>> > >>> > >> > > > > > [attachment "smime.p7s" deleted by Yuri L Volobuev/Austin/IBM] > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > [attachment "smime.p7s" deleted by Yuri L Volobuev/Austin/IBM] > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss -------------- next part -------------- An HTML attachment was scrubbed... URL: From volobuev at us.ibm.com Wed Sep 7 17:58:07 2016 From: volobuev at us.ibm.com (Yuri L Volobuev) Date: Wed, 7 Sep 2016 09:58:07 -0700 Subject: [gpfsug-discuss] Remote cluster mount failing In-Reply-To: References: Message-ID: It's unclear what's wrong. I'd have two main suspects: (1) TLS protocol version confusion, due to a difference in GSKit version and/or configuration (e.g. NIST SP800 compliance) on two sides (2) firewall. TLS issues are usually messy and tedious to work though. 
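One quick sanity check that sometimes narrows this down is to compare the security-related settings and the GSKit level on a node in each cluster, something along the lines of:

# mmlsconfig cipherList
# mmlsconfig nistCompliance
# rpm -qa | grep gpfs.gskit

and to confirm that the client node can actually reach the contact nodes on the GPFS daemon port (1191/tcp by default).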
I'd recommend opening a PMR to facilitate debug data collection and analysis. A lot of gory detail may be needed to figure out what's going on. yuri From: "Simon Thompson (Research Computing - IT Services)" To: "gpfsug-discuss at spectrumscale.org" , Date: 09/07/2016 05:37 AM Subject: [gpfsug-discuss] Remote cluster mount failing Sent by: gpfsug-discuss-bounces at spectrumscale.org Hi All, I'm trying to get some multi cluster thing working between two of our GPFS clusters. In the "client" cluster, when trying to mount the "remote" cluster, I get: # mmmount gpfs Wed 7 Sep 13:33:06 BST 2016: mmmount: Mounting file systems ... mount: mount /dev/gpfs on /gpfs failed: Connection timed out mmmount: Command failed. Examine previous error messages to determine cause. And in the log file: Wed Sep 7 13:33:07.481 2016: [N] The client side TLS handshake with node 10.0.0.182 was cancelled: connection reset by peer (return code 420). Wed Sep 7 13:33:07.486 2016: [N] The client side TLS handshake with node 10.0.0.181 was cancelled: connection reset by peer (return code 420). Wed Sep 7 13:33:07.487 2016: [E] Failed to join remote cluster GPFS_STORAGE.CLUSTER Wed Sep 7 13:33:07.488 2016: [W] Command: err 78: mount GPFS_STORAGE.CLUSTER:gpfs Wed Sep 7 13:33:07.489 2016: Connection timed out In the remote cluster, I see: Wed Sep 7 13:33:07.487 2016: [W] The TLS handshake with node 10.0.0.222 failed with error 447 (server side). Wed Sep 7 13:33:07.488 2016: [X] Connection from 10.10.0.35 refused, authentication failed Wed Sep 7 13:33:07.489 2016: [E] Killing connection from 10.10.0.35, err 703 Wed Sep 7 13:33:07.490 2016: Operation not permitted Weirdly though on other nodes in the client cluster this succeeds fine and can mount, so I think I got all the bits in the mmauth and mmremotecluster configured correctly. Any suggestions? Thanks Simon _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: graycol.gif Type: image/gif Size: 105 bytes Desc: not available URL: From Valdis.Kletnieks at vt.edu Wed Sep 7 19:45:43 2016 From: Valdis.Kletnieks at vt.edu (Valdis Kletnieks) Date: Wed, 07 Sep 2016 14:45:43 -0400 Subject: [gpfsug-discuss] Weirdness with 'mmces address add' Message-ID: <27691.1473273943@turing-police.cc.vt.edu> We're in the middle of deploying Spectrum Archive, and I've hit a snag. We assigned some floating IP addresses, which now need to be changed. So I look at the mmces manpage, and it looks like I need to add the new addresses, and delete the old ones. We're on GPFS 4.2.1.0, if that matters... What 'man mmces' says: 1. To add an address to a specified node, issue this command: mmces address add --ces-node node1 --ces-ip 10.1.2.3 (and at least 6 or 8 more uses of an IP address). What happens when I try it: (And yes, we have an 'isb' ces-group defined with addresses in it already) # mmces address add --ces-group isb --ces-ip 172.28.45.72 Cannot resolve 172.28.45.72; Name or service not known mmces address add: Incorrect value for --ces-ip option Usage: mmces address add [--ces-node Node] [--attribute Attribute] [--ces-group Group] {--ces-ip {IP[,IP...]} Am I missing some special sauce? 
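For what it's worth, a reverse lookup of that address from the node comes up empty, e.g.:

# getent hosts 172.28.45.72
(returns nothing)

so that may or may not be what the command is objecting to.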
(My first guess is that it's complaining because there's no PTR in the DNS for that address yet - but if it was going to do DNS lookups, it should be valid to give a hostname rather than an IP address (and nowhere in the manpage does it even *hint* that --ces-ip can be anything other than a list of IP addresses). Or is it time for me to file a PMR? From xhejtman at ics.muni.cz Wed Sep 7 21:11:11 2016 From: xhejtman at ics.muni.cz (Lukas Hejtmanek) Date: Wed, 7 Sep 2016 22:11:11 +0200 Subject: [gpfsug-discuss] DMAPI - Unmigrate file to Regular state In-Reply-To: References: Message-ID: <20160907201111.xmksazqjekk2ihsy@ics.muni.cz> On Tue, Sep 06, 2016 at 02:04:36PM +0200, Dominic Mueller-Wicke01 wrote: > Hi Miroslav, > > please use the command: > dsmrecall -resident -detail > or use it with file lists well, it looks like Client Version 7, Release 1, Level 4.4 leaks file descriptors: 09/07/2016 21:03:07 ANS1587W Unable to read extended attributes for object /exports/tape_tape/VO_metacentrum/home/jfeit/atlases/atlases/novo3/atlases/images/.svn/prop-base due to errno: 24, reason: Too many open files after about 15 minutes of run, I can see 88 opened files in /proc/$PID/fd when using: dsmrecall -R -RESid -D /path/* is it something known fixed in newer versions? -- Luk?? Hejtm?nek From taylorm at us.ibm.com Wed Sep 7 21:40:13 2016 From: taylorm at us.ibm.com (Michael L Taylor) Date: Wed, 7 Sep 2016 13:40:13 -0700 Subject: [gpfsug-discuss] Weirdness with 'mmces address add' In-Reply-To: References: Message-ID: Can't be for certain this is what you're hitting but reverse DNS lookup is documented the KC: http://www.ibm.com/support/knowledgecenter/STXKQY_4.2.1/com.ibm.spectrum.scale.v4r21.doc/bl1ins_protocolnodeipfurtherconfig.htm Note: All CES IPs must have an associated hostname and reverse DNS lookup must be configured for each. For more information, see Adding export IPs in Deploying protocols. http://www.ibm.com/support/knowledgecenter/STXKQY_4.2.1/com.ibm.spectrum.scale.v4r21.doc/bl1ins_deployingprotocolstasks.htm Note: Export IPs must have an associated hostname and reverse DNS lookup must be configured for each. Can you make sure the IPs have reverse DNS lookup and try again? Will get the mmces man page updated for address add -------------- next part -------------- An HTML attachment was scrubbed... URL: From Valdis.Kletnieks at vt.edu Wed Sep 7 22:23:30 2016 From: Valdis.Kletnieks at vt.edu (Valdis.Kletnieks at vt.edu) Date: Wed, 07 Sep 2016 17:23:30 -0400 Subject: [gpfsug-discuss] Weirdness with 'mmces address add' In-Reply-To: References: Message-ID: <41089.1473283410@turing-police.cc.vt.edu> On Wed, 07 Sep 2016 13:40:13 -0700, "Michael L Taylor" said: > Can't be for certain this is what you're hitting but reverse DNS lookup is > documented the KC: > Note: All CES IPs must have an associated hostname and reverse DNS lookup > must be configured for each. For more information, see Adding export IPs in > Deploying protocols. Bingo. That was it. Since the DNS will take a while to fix, I fed the appropriate entries to /etc/hosts and it worked fine. I got thrown for a loop because if there is enough code to do that checking, it should be able to accept a hostname as well (RFE time? 
:) From ulmer at ulmer.org Wed Sep 7 22:34:07 2016 From: ulmer at ulmer.org (Stephen Ulmer) Date: Wed, 7 Sep 2016 17:34:07 -0400 Subject: [gpfsug-discuss] Weirdness with 'mmces address add' In-Reply-To: <41089.1473283410@turing-police.cc.vt.edu> References: <41089.1473283410@turing-police.cc.vt.edu> Message-ID: Hostnames can have many A records. IPs *generally* only have one PTR (though it?s not restricted, multiple PTRs is not recommended). Just knowing that you can see why allowing names would create more questions than it answers. So if it did take names instead of IP addresses, it would usually only do what you meant part of the time -- and sometimes none of the time. :) -- Stephen > On Sep 7, 2016, at 5:23 PM, Valdis.Kletnieks at vt.edu wrote: > > On Wed, 07 Sep 2016 13:40:13 -0700, "Michael L Taylor" said: > >> Can't be for certain this is what you're hitting but reverse DNS lookup is >> documented the KC: > >> Note: All CES IPs must have an associated hostname and reverse DNS lookup >> must be configured for each. For more information, see Adding export IPs in >> Deploying protocols. > > Bingo. That was it. Since the DNS will take a while to fix, I fed > the appropriate entries to /etc/hosts and it worked fine. > > I got thrown for a loop because if there is enough code to do that checking, > it should be able to accept a hostname as well (RFE time? :) > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss From Valdis.Kletnieks at vt.edu Wed Sep 7 22:54:05 2016 From: Valdis.Kletnieks at vt.edu (Valdis.Kletnieks at vt.edu) Date: Wed, 07 Sep 2016 17:54:05 -0400 Subject: [gpfsug-discuss] Weirdness with 'mmces address add' In-Reply-To: References: <41089.1473283410@turing-police.cc.vt.edu> Message-ID: <43934.1473285245@turing-police.cc.vt.edu> On Wed, 07 Sep 2016 17:34:07 -0400, Stephen Ulmer said: > Hostnames can have many A records. And quad-A records. :) (Despite our best efforts, we're still one of the 100 biggest IPv6 deployments according to http://www.worldipv6launch.org/measurements/ - were's sitting at 84th in traffic volume and 18th by percent penetration, mostly because we deployed it in production literally last century...) From janfrode at tanso.net Thu Sep 8 06:08:47 2016 From: janfrode at tanso.net (Jan-Frode Myklebust) Date: Thu, 08 Sep 2016 05:08:47 +0000 Subject: [gpfsug-discuss] Weirdness with 'mmces address add' In-Reply-To: <27691.1473273943@turing-police.cc.vt.edu> References: <27691.1473273943@turing-police.cc.vt.edu> Message-ID: I believe your first guess is correct. The ces-ip needs to be resolvable for some reason... Just put a name for it in /etc/hosts, if you can't add it to your dns. -jf ons. 7. sep. 2016 kl. 20.45 skrev Valdis Kletnieks : > We're in the middle of deploying Spectrum Archive, and I've hit a > snag. We assigned some floating IP addresses, which now need to > be changed. So I look at the mmces manpage, and it looks like I need > to add the new addresses, and delete the old ones. > > We're on GPFS 4.2.1.0, if that matters... > > What 'man mmces' says: > > 1. To add an address to a specified node, issue this command: > > mmces address add --ces-node node1 --ces-ip 10.1.2.3 > > (and at least 6 or 8 more uses of an IP address). 
> > What happens when I try it: (And yes, we have an 'isb' ces-group defined > with > addresses in it already) > > # mmces address add --ces-group isb --ces-ip 172.28.45.72 > Cannot resolve 172.28.45.72; Name or service not known > mmces address add: Incorrect value for --ces-ip option > Usage: > mmces address add [--ces-node Node] [--attribute Attribute] [--ces-group > Group] > {--ces-ip {IP[,IP...]} > > Am I missing some special sauce? (My first guess is that it's complaining > because there's no PTR in the DNS for that address yet - but if it was > going > to do DNS lookups, it should be valid to give a hostname rather than an IP > address (and nowhere in the manpage does it even *hint* that --ces-ip can > be anything other than a list of IP addresses). > > Or is it time for me to file a PMR? > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > -------------- next part -------------- An HTML attachment was scrubbed... URL: From dominic.mueller at de.ibm.com Thu Sep 8 06:35:55 2016 From: dominic.mueller at de.ibm.com (Dominic Mueller-Wicke01) Date: Thu, 8 Sep 2016 07:35:55 +0200 Subject: [gpfsug-discuss] DMAPI - Unmigrate file to Regular state In-Reply-To: References: Message-ID: Please open a PMR for the not working "recall to resident". Some investigation is needed here. Thanks. Greetings, Dominic. From: gpfsug-discuss-request at spectrumscale.org To: gpfsug-discuss at spectrumscale.org Date: 07.09.2016 23:23 Subject: gpfsug-discuss Digest, Vol 56, Issue 14 Sent by: gpfsug-discuss-bounces at spectrumscale.org Send gpfsug-discuss mailing list submissions to gpfsug-discuss at spectrumscale.org To subscribe or unsubscribe via the World Wide Web, visit http://gpfsug.org/mailman/listinfo/gpfsug-discuss or, via email, send a message with subject or body 'help' to gpfsug-discuss-request at spectrumscale.org You can reach the person managing the list at gpfsug-discuss-owner at spectrumscale.org When replying, please edit your Subject line so it is more specific than "Re: Contents of gpfsug-discuss digest..." Today's Topics: 1. Re: Remote cluster mount failing (Yuri L Volobuev) 2. Weirdness with 'mmces address add' (Valdis Kletnieks) 3. Re: DMAPI - Unmigrate file to Regular state (Lukas Hejtmanek) 4. Weirdness with 'mmces address add' (Michael L Taylor) 5. Re: Weirdness with 'mmces address add' (Valdis.Kletnieks at vt.edu) ----- Message from "Yuri L Volobuev" on Wed, 7 Sep 2016 09:58:07 -0700 ----- To: gpfsug main discussion list Subject: Re: [gpfsug-discuss] Remote cluster mount failing It's unclear what's wrong. I'd have two main suspects: (1) TLS protocol version confusion, due to a difference in GSKit version and/or configuration (e.g. NIST SP800 compliance) on two sides (2) firewall. TLS issues are usually messy and tedious to work though. I'd recommend opening a PMR to facilitate debug data collection and analysis. A lot of gory detail may be needed to figure out what's going on. 
yuri Inactive hide details for "Simon Thompson (Research Computing - IT Services)" ---09/07/2016 05:37:11 AM---Hi All, I'm trying to"Simon Thompson (Research Computing - IT Services)" ---09/07/2016 05:37:11 AM---Hi All, I'm trying to get some multi cluster thing working between two of our GPFS From: "Simon Thompson (Research Computing - IT Services)" To: "gpfsug-discuss at spectrumscale.org" , Date: 09/07/2016 05:37 AM Subject: [gpfsug-discuss] Remote cluster mount failing Sent by: gpfsug-discuss-bounces at spectrumscale.org Hi All, I'm trying to get some multi cluster thing working between two of our GPFS clusters. In the "client" cluster, when trying to mount the "remote" cluster, I get: # mmmount gpfs Wed 7 Sep 13:33:06 BST 2016: mmmount: Mounting file systems ... mount: mount /dev/gpfs on /gpfs failed: Connection timed out mmmount: Command failed. Examine previous error messages to determine cause. And in the log file: Wed Sep 7 13:33:07.481 2016: [N] The client side TLS handshake with node 10.0.0.182 was cancelled: connection reset by peer (return code 420). Wed Sep 7 13:33:07.486 2016: [N] The client side TLS handshake with node 10.0.0.181 was cancelled: connection reset by peer (return code 420). Wed Sep 7 13:33:07.487 2016: [E] Failed to join remote cluster GPFS_STORAGE.CLUSTER Wed Sep 7 13:33:07.488 2016: [W] Command: err 78: mount GPFS_STORAGE.CLUSTER:gpfs Wed Sep 7 13:33:07.489 2016: Connection timed out In the remote cluster, I see: Wed Sep 7 13:33:07.487 2016: [W] The TLS handshake with node 10.0.0.222 failed with error 447 (server side). Wed Sep 7 13:33:07.488 2016: [X] Connection from 10.10.0.35 refused, authentication failed Wed Sep 7 13:33:07.489 2016: [E] Killing connection from 10.10.0.35, err 703 Wed Sep 7 13:33:07.490 2016: Operation not permitted Weirdly though on other nodes in the client cluster this succeeds fine and can mount, so I think I got all the bits in the mmauth and mmremotecluster configured correctly. Any suggestions? Thanks Simon _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss ----- Message from Valdis Kletnieks on Wed, 07 Sep 2016 14:45:43 -0400 ----- To: gpfsug-discuss at spectrumscale.org Subject: [gpfsug-discuss] Weirdness with 'mmces address add' We're in the middle of deploying Spectrum Archive, and I've hit a snag. We assigned some floating IP addresses, which now need to be changed. So I look at the mmces manpage, and it looks like I need to add the new addresses, and delete the old ones. We're on GPFS 4.2.1.0, if that matters... What 'man mmces' says: 1. To add an address to a specified node, issue this command: mmces address add --ces-node node1 --ces-ip 10.1.2.3 (and at least 6 or 8 more uses of an IP address). What happens when I try it: (And yes, we have an 'isb' ces-group defined with addresses in it already) # mmces address add --ces-group isb --ces-ip 172.28.45.72 Cannot resolve 172.28.45.72; Name or service not known mmces address add: Incorrect value for --ces-ip option Usage: mmces address add [--ces-node Node] [--attribute Attribute] [--ces-group Group] {--ces-ip {IP[,IP...]} Am I missing some special sauce? (My first guess is that it's complaining because there's no PTR in the DNS for that address yet - but if it was going to do DNS lookups, it should be valid to give a hostname rather than an IP address (and nowhere in the manpage does it even *hint* that --ces-ip can be anything other than a list of IP addresses). 
Or is it time for me to file a PMR? ----- Message from Lukas Hejtmanek on Wed, 7 Sep 2016 22:11:11 +0200 ----- To: gpfsug main discussion list Subject: Re: [gpfsug-discuss] DMAPI - Unmigrate file to Regular state On Tue, Sep 06, 2016 at 02:04:36PM +0200, Dominic Mueller-Wicke01 wrote: > Hi Miroslav, > > please use the command: > dsmrecall -resident -detail > or use it with file lists well, it looks like Client Version 7, Release 1, Level 4.4 leaks file descriptors: 09/07/2016 21:03:07 ANS1587W Unable to read extended attributes for object /exports/tape_tape/VO_metacentrum/home/jfeit/atlases/atlases/novo3/atlases/images/.svn/prop-base due to errno: 24, reason: Too many open files after about 15 minutes of run, I can see 88 opened files in /proc/$PID/fd when using: dsmrecall -R -RESid -D /path/* is it something known fixed in newer versions? -- Luk?? Hejtm?nek ----- Message from "Michael L Taylor" on Wed, 7 Sep 2016 13:40:13 -0700 ----- To: gpfsug-discuss at spectrumscale.org Subject: [gpfsug-discuss] Weirdness with 'mmces address add' Can't be for certain this is what you're hitting but reverse DNS lookup is documented the KC: http://www.ibm.com/support/knowledgecenter/STXKQY_4.2.1/com.ibm.spectrum.scale.v4r21.doc/bl1ins_protocolnodeipfurtherconfig.htm Note: All CES IPs must have an associated hostname and reverse DNS lookup must be configured for each. For more information, see Adding export IPs in Deploying protocols. http://www.ibm.com/support/knowledgecenter/STXKQY_4.2.1/com.ibm.spectrum.scale.v4r21.doc/bl1ins_deployingprotocolstasks.htm Note: Export IPs must have an associated hostname and reverse DNS lookup must be configured for each. Can you make sure the IPs have reverse DNS lookup and try again? Will get the mmces man page updated for address add ----- Message from Valdis.Kletnieks at vt.edu on Wed, 07 Sep 2016 17:23:30 -0400 ----- To: gpfsug main discussion list Subject: Re: [gpfsug-discuss] Weirdness with 'mmces address add' On Wed, 07 Sep 2016 13:40:13 -0700, "Michael L Taylor" said: > Can't be for certain this is what you're hitting but reverse DNS lookup is > documented the KC: > Note: All CES IPs must have an associated hostname and reverse DNS lookup > must be configured for each. For more information, see Adding export IPs in > Deploying protocols. Bingo. That was it. Since the DNS will take a while to fix, I fed the appropriate entries to /etc/hosts and it worked fine. I got thrown for a loop because if there is enough code to do that checking, it should be able to accept a hostname as well (RFE time? :) _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: graycol.gif Type: image/gif Size: 105 bytes Desc: not available URL: From S.J.Thompson at bham.ac.uk Fri Sep 9 15:37:28 2016 From: S.J.Thompson at bham.ac.uk (Simon Thompson (Research Computing - IT Services)) Date: Fri, 9 Sep 2016 14:37:28 +0000 Subject: [gpfsug-discuss] Remote cluster mount failing In-Reply-To: References: Message-ID: That?s sorta what I was expecting. Though I was hoping someone might have said 'oh just run mmchconfig ....' or something easy. PMR on its way in. Thanks! 
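(If it does turn out to be the NIST/GSKit angle, I'm guessing the eventual fix is something like mmchconfig nistCompliance=... to bring the two sides in line, or a fresh mmauth genkey new / mmauth update exchange - but I'll let the PMR confirm that rather than poke at it blind.)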
Simon From: > on behalf of Yuri L Volobuev > Reply-To: "gpfsug-discuss at spectrumscale.org" > Date: Wednesday, 7 September 2016 at 17:58 To: "gpfsug-discuss at spectrumscale.org" > Subject: Re: [gpfsug-discuss] Remote cluster mount failing It's unclear what's wrong. I'd have two main suspects: (1) TLS protocol version confusion, due to a difference in GSKit version and/or configuration (e.g. NIST SP800 compliance) on two sides (2) firewall. TLS issues are usually messy and tedious to work though. I'd recommend opening a PMR to facilitate debug data collection and analysis. A lot of gory detail may be needed to figure out what's going on. yuri [Inactive hide details for "Simon Thompson (Research Computing - IT Services)" ---09/07/2016 05:37:11 AM---Hi All, I'm trying to]"Simon Thompson (Research Computing - IT Services)" ---09/07/2016 05:37:11 AM---Hi All, I'm trying to get some multi cluster thing working between two of our GPFS From: "Simon Thompson (Research Computing - IT Services)" > To: "gpfsug-discuss at spectrumscale.org" >, Date: 09/07/2016 05:37 AM Subject: [gpfsug-discuss] Remote cluster mount failing Sent by: gpfsug-discuss-bounces at spectrumscale.org ________________________________ Hi All, I'm trying to get some multi cluster thing working between two of our GPFS clusters. In the "client" cluster, when trying to mount the "remote" cluster, I get: # mmmount gpfs Wed 7 Sep 13:33:06 BST 2016: mmmount: Mounting file systems ... mount: mount /dev/gpfs on /gpfs failed: Connection timed out mmmount: Command failed. Examine previous error messages to determine cause. And in the log file: Wed Sep 7 13:33:07.481 2016: [N] The client side TLS handshake with node 10.0.0.182 was cancelled: connection reset by peer (return code 420). Wed Sep 7 13:33:07.486 2016: [N] The client side TLS handshake with node 10.0.0.181 was cancelled: connection reset by peer (return code 420). Wed Sep 7 13:33:07.487 2016: [E] Failed to join remote cluster GPFS_STORAGE.CLUSTER Wed Sep 7 13:33:07.488 2016: [W] Command: err 78: mount GPFS_STORAGE.CLUSTER:gpfs Wed Sep 7 13:33:07.489 2016: Connection timed out In the remote cluster, I see: Wed Sep 7 13:33:07.487 2016: [W] The TLS handshake with node 10.0.0.222 failed with error 447 (server side). Wed Sep 7 13:33:07.488 2016: [X] Connection from 10.10.0.35 refused, authentication failed Wed Sep 7 13:33:07.489 2016: [E] Killing connection from 10.10.0.35, err 703 Wed Sep 7 13:33:07.490 2016: Operation not permitted Weirdly though on other nodes in the client cluster this succeeds fine and can mount, so I think I got all the bits in the mmauth and mmremotecluster configured correctly. Any suggestions? Thanks Simon _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: graycol.gif Type: image/gif Size: 105 bytes Desc: graycol.gif URL: From volobuev at us.ibm.com Fri Sep 9 17:29:35 2016 From: volobuev at us.ibm.com (Yuri L Volobuev) Date: Fri, 9 Sep 2016 09:29:35 -0700 Subject: [gpfsug-discuss] Remote cluster mount failing In-Reply-To: References: Message-ID: It could be "easy" in the end, e.g. regenerating the key ("mmauth genkey new") may fix the issue. 
Figuring out exactly what is going wrong is messy though, and requires looking at a number of debug data points, something that's awkward to do on a public mailing list. I don't think you want to post certificates et al on a mailing list. The PMR channel is more appropriate for this kind of thing. yuri From: "Simon Thompson (Research Computing - IT Services)" To: gpfsug main discussion list , Date: 09/09/2016 07:37 AM Subject: Re: [gpfsug-discuss] Remote cluster mount failing Sent by: gpfsug-discuss-bounces at spectrumscale.org That?s sorta what I was expecting. Though I was hoping someone might have said 'oh just run mmchconfig ....' or something easy. PMR on its way in. Thanks! Simon From: on behalf of Yuri L Volobuev Reply-To: "gpfsug-discuss at spectrumscale.org" < gpfsug-discuss at spectrumscale.org> Date: Wednesday, 7 September 2016 at 17:58 To: "gpfsug-discuss at spectrumscale.org" Subject: Re: [gpfsug-discuss] Remote cluster mount failing It's unclear what's wrong. I'd have two main suspects: (1) TLS protocol version confusion, due to a difference in GSKit version and/or configuration (e.g. NIST SP800 compliance) on two sides (2) firewall. TLS issues are usually messy and tedious to work though. I'd recommend opening a PMR to facilitate debug data collection and analysis. A lot of gory detail may be needed to figure out what's going on. yuri Inactive hide details for "Simon Thompson (Research Computing - IT Services)" ---09/07/2016 05:37:11 AM---Hi All, I'm trying to"Simon Thompson (Research Computing - IT Services)" ---09/07/2016 05:37:11 AM---Hi All, I'm trying to get some multi cluster thing working between two of our GPFS From: "Simon Thompson (Research Computing - IT Services)" < S.J.Thompson at bham.ac.uk> To: "gpfsug-discuss at spectrumscale.org" , Date: 09/07/2016 05:37 AM Subject: [gpfsug-discuss] Remote cluster mount failing Sent by: gpfsug-discuss-bounces at spectrumscale.org Hi All, I'm trying to get some multi cluster thing working between two of our GPFS clusters. In the "client" cluster, when trying to mount the "remote" cluster, I get: # mmmount gpfs Wed 7 Sep 13:33:06 BST 2016: mmmount: Mounting file systems ... mount: mount /dev/gpfs on /gpfs failed: Connection timed out mmmount: Command failed. Examine previous error messages to determine cause. And in the log file: Wed Sep 7 13:33:07.481 2016: [N] The client side TLS handshake with node 10.0.0.182 was cancelled: connection reset by peer (return code 420). Wed Sep 7 13:33:07.486 2016: [N] The client side TLS handshake with node 10.0.0.181 was cancelled: connection reset by peer (return code 420). Wed Sep 7 13:33:07.487 2016: [E] Failed to join remote cluster GPFS_STORAGE.CLUSTER Wed Sep 7 13:33:07.488 2016: [W] Command: err 78: mount GPFS_STORAGE.CLUSTER:gpfs Wed Sep 7 13:33:07.489 2016: Connection timed out In the remote cluster, I see: Wed Sep 7 13:33:07.487 2016: [W] The TLS handshake with node 10.0.0.222 failed with error 447 (server side). Wed Sep 7 13:33:07.488 2016: [X] Connection from 10.10.0.35 refused, authentication failed Wed Sep 7 13:33:07.489 2016: [E] Killing connection from 10.10.0.35, err 703 Wed Sep 7 13:33:07.490 2016: Operation not permitted Weirdly though on other nodes in the client cluster this succeeds fine and can mount, so I think I got all the bits in the mmauth and mmremotecluster configured correctly. Any suggestions? 
Thanks Simon _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss [attachment "graycol.gif" deleted by Yuri L Volobuev/Austin/IBM] -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: graycol.gif Type: image/gif Size: 105 bytes Desc: not available URL: From bbanister at jumptrading.com Sat Sep 10 22:50:25 2016 From: bbanister at jumptrading.com (Bryan Banister) Date: Sat, 10 Sep 2016 21:50:25 +0000 Subject: [gpfsug-discuss] Edge Attendees In-Reply-To: References: Message-ID: <21BC488F0AEA2245B2C3E83FC0B33DBB063297AB@CHI-EXCHANGEW1.w2k.jumptrading.com> Hi Doug, Found out that I get to attend this year. Please put me down for the SS NDA round-table, -Bryan From: gpfsug-discuss-bounces at spectrumscale.org [mailto:gpfsug-discuss-bounces at spectrumscale.org] On Behalf Of Douglas O'flaherty Sent: Monday, August 29, 2016 12:34 AM To: gpfsug-discuss at spectrumscale.org Subject: [gpfsug-discuss] Edge Attendees Greetings: I am organizing an NDA round-table with the IBM Offering Managers at IBM Edge on Tuesday, September 20th at 1pm. The subject will be "The Future of IBM Spectrum Scale." IBM Offering Managers are the Product Owners at IBM. There will be discussions covering licensing, the roadmap for IBM Spectrum Scale RAID (aka GNR), new hardware platforms, etc. This is a unique opportunity to get feedback to the drivers of the IBM Spectrum Scale business plans. It should be a great companion to the content we get from Engineering and Research at most User Group meetings. To get an invitation, please email me privately at douglasof us.ibm.com. All who have a valid NDA are invited. I only need an approximate headcount of attendees. Try not to spam the mailing list. I am pushing to get the Offering Managers to have a similar session at SC16 as an IBM Multi-client Briefing. You can add your voice to that call on this thread, or email me directly. Spectrum Scale User Group at SC16 will once again take place on Sunday afternoon with cocktails to follow. I hope we can blow out the attendance numbers and the number of site speakers we had last year! I know Simon, Bob, and Kristy are already working the agenda. Get your ideas in to them or to me. See you in Vegas, Vegas, SLC, Vegas this Fall... Maybe Australia in between? doug Douglas O'Flaherty IBM Spectrum Storage Marketing ________________________________ Note: This email is for the confidential use of the named addressee(s) only and may contain proprietary, confidential or privileged information. If you are not the intended recipient, you are hereby notified that any review, dissemination or copying of this email is strictly prohibited, and to please notify the sender immediately and destroy this email and any attachments. Email transmission cannot be guaranteed to be secure or error-free. The Company, therefore, does not make any guarantees as to the completeness or accuracy of this email or any attachments. This email is for informational purposes only and does not constitute a recommendation, offer, request or solicitation of any kind to buy, sell, subscribe, redeem or perform any type of transaction of a financial product. 
-------------- next part -------------- An HTML attachment was scrubbed... URL: From aaron.s.knister at nasa.gov Sun Sep 11 22:02:48 2016 From: aaron.s.knister at nasa.gov (Knister, Aaron S. (GSFC-606.2)[COMPUTER SCIENCE CORP]) Date: Sun, 11 Sep 2016 21:02:48 +0000 Subject: [gpfsug-discuss] GPFS Routers Message-ID: <5F910253243E6A47B81A9A2EB424BBA101D137D1@NDMSMBX404.ndc.nasa.gov> Hi Everyone, A while back I seem to recall hearing about a mechanism being developed that would function similarly to Lustre's LNET routers and effectively allow a single set of NSD servers to talk to multiple RDMA fabrics without requiring the NSD servers to have infiniband interfaces on each RDMA fabric. Rather, one would have a set of GPFS gateway nodes on each fabric that would in effect proxy the RDMA requests to the NSD server. Does anyone know what I'm talking about? Just curious if it's still on the roadmap. -Aaron -------------- next part -------------- An HTML attachment was scrubbed... URL: From Robert.Oesterlin at nuance.com Sun Sep 11 23:31:56 2016 From: Robert.Oesterlin at nuance.com (Oesterlin, Robert) Date: Sun, 11 Sep 2016 22:31:56 +0000 Subject: [gpfsug-discuss] Grafana Bridge Code - for GPFS Performance Sensors - Now on the IBM Wiki Message-ID: <2B003708-B2E3-474B-8035-F3A080CB2EAF@nuance.com> IBM has formally published this bridge code - and you can get the details and download it here: IBM Spectrum Scale Performance Monitoring Bridge https://www.ibm.com/developerworks/community/wikis/home?lang=en#!/wiki/General%20Parallel%20File%20System%20(GPFS)/page/IBM%20Spectrum%20Scale%20Performance Monitoring%20Bridge Also, see this Storage Community Blog Post (it references version 4.2.2, but I think they mean 4.2.1) http://storagecommunity.org/easyblog/entry/performance-data-graphical-visualization-for-ibm-spectrum-scale-environment I've been using it for a while - if you have any questions, let me know. Bob Oesterlin Sr Storage Engineer, Nuance HPC Grid -------------- next part -------------- An HTML attachment was scrubbed... URL: From aaron.s.knister at nasa.gov Mon Sep 12 01:00:32 2016 From: aaron.s.knister at nasa.gov (Aaron Knister) Date: Sun, 11 Sep 2016 20:00:32 -0400 Subject: [gpfsug-discuss] GPFS Routers In-Reply-To: <5F910253243E6A47B81A9A2EB424BBA101D137D1@NDMSMBX404.ndc.nasa.gov> References: <5F910253243E6A47B81A9A2EB424BBA101D137D1@NDMSMBX404.ndc.nasa.gov> Message-ID: <712e5024-e6eb-9195-d9cd-f59a9b145e60@nasa.gov> After some googling around, I wonder if perhaps what I'm thinking of was an I/O forwarding layer that I understood was being developed for x86_64 type machines rather than some type of GPFS protocol router or proxy. -Aaron On 9/11/16 5:02 PM, Knister, Aaron S. (GSFC-606.2)[COMPUTER SCIENCE CORP] wrote: > Hi Everyone, > > A while back I seem to recall hearing about a mechanism being developed > that would function similarly to Lustre's LNET routers and effectively > allow a single set of NSD servers to talk to multiple RDMA fabrics > without requiring the NSD servers to have infiniband interfaces on each > RDMA fabric. Rather, one would have a set of GPFS gateway nodes on each > fabric that would in effect proxy the RDMA requests to the NSD server. > Does anyone know what I'm talking about? Just curious if it's still on > the roadmap. 
> > -Aaron > > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > -- Aaron Knister NASA Center for Climate Simulation (Code 606.2) Goddard Space Flight Center (301) 286-2776 From douglasof at us.ibm.com Mon Sep 12 02:38:08 2016 From: douglasof at us.ibm.com (Douglas O'flaherty) Date: Sun, 11 Sep 2016 21:38:08 -0400 Subject: [gpfsug-discuss] gpfsug-discuss Digest, Vol 56, Issue 17 In-Reply-To: References: Message-ID: See you... and anyone else who can make it in Vegas in a couple weeks! From: gpfsug-discuss-request at spectrumscale.org To: gpfsug-discuss at spectrumscale.org Date: 09/11/2016 07:00 AM Subject: gpfsug-discuss Digest, Vol 56, Issue 17 Sent by: gpfsug-discuss-bounces at spectrumscale.org Send gpfsug-discuss mailing list submissions to gpfsug-discuss at spectrumscale.org To subscribe or unsubscribe via the World Wide Web, visit http://gpfsug.org/mailman/listinfo/gpfsug-discuss or, via email, send a message with subject or body 'help' to gpfsug-discuss-request at spectrumscale.org You can reach the person managing the list at gpfsug-discuss-owner at spectrumscale.org When replying, please edit your Subject line so it is more specific than "Re: Contents of gpfsug-discuss digest..." Today's Topics: 1. Re: Edge Attendees (Bryan Banister) ----- Message from Bryan Banister on Sat, 10 Sep 2016 21:50:25 +0000 ----- To: gpfsug main discussion list Subject: Re: [gpfsug-discuss] Edge Attendees Hi Doug, Found out that I get to attend this year. Please put me down for the SS NDA round-table, -Bryan From: gpfsug-discuss-bounces at spectrumscale.org [ mailto:gpfsug-discuss-bounces at spectrumscale.org] On Behalf Of Douglas O'flaherty Sent: Monday, August 29, 2016 12:34 AM To: gpfsug-discuss at spectrumscale.org Subject: [gpfsug-discuss] Edge Attendees Greetings: I am organizing an NDA round-table with the IBM Offering Managers at IBM Edge on Tuesday, September 20th at 1pm. The subject will be "The Future of IBM Spectrum Scale." IBM Offering Managers are the Product Owners at IBM. There will be discussions covering licensing, the roadmap for IBM Spectrum Scale RAID (aka GNR), new hardware platforms, etc. This is a unique opportunity to get feedback to the drivers of the IBM Spectrum Scale business plans. It should be a great companion to the content we get from Engineering and Research at most User Group meetings. To get an invitation, please email me privately at douglasof us.ibm.com. All who have a valid NDA are invited. I only need an approximate headcount of attendees. Try not to spam the mailing list. I am pushing to get the Offering Managers to have a similar session at SC16 as an IBM Multi-client Briefing. You can add your voice to that call on this thread, or email me directly. Spectrum Scale User Group at SC16 will once again take place on Sunday afternoon with cocktails to follow. I hope we can blow out the attendance numbers and the number of site speakers we had last year! I know Simon, Bob, and Kristy are already working the agenda. Get your ideas in to them or to me. See you in Vegas, Vegas, SLC, Vegas this Fall... Maybe Australia in between? doug Douglas O'Flaherty IBM Spectrum Storage Marketing Note: This email is for the confidential use of the named addressee(s) only and may contain proprietary, confidential or privileged information. 
If you are not the intended recipient, you are hereby notified that any review, dissemination or copying of this email is strictly prohibited, and to please notify the sender immediately and destroy this email and any attachments. Email transmission cannot be guaranteed to be secure or error-free. The Company, therefore, does not make any guarantees as to the completeness or accuracy of this email or any attachments. This email is for informational purposes only and does not constitute a recommendation, offer, request or solicitation of any kind to buy, sell, subscribe, redeem or perform any type of transaction of a financial product. _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss -------------- next part -------------- An HTML attachment was scrubbed... URL: From knop at us.ibm.com Mon Sep 12 06:17:05 2016 From: knop at us.ibm.com (Felipe Knop) Date: Mon, 12 Sep 2016 01:17:05 -0400 Subject: [gpfsug-discuss] Remote cluster mount failing In-Reply-To: References: Message-ID: There is a chance the problem might be related to an upgrade from 3.5 to 4.1, or perhaps a remote mount between versions 3.5 and 4.1. It would be useful to know details related to any such migration and different releases when the PMR is opened. Thanks, Felipe ---- Felipe Knop knop at us.ibm.com GPFS Development and Security IBM Systems IBM Building 008 2455 South Rd, Poughkeepsie, NY 12601 (845) 433-9314 T/L 293-9314 From: Yuri L Volobuev/Austin/IBM at IBMUS To: gpfsug main discussion list Date: 09/09/2016 12:30 PM Subject: Re: [gpfsug-discuss] Remote cluster mount failing Sent by: gpfsug-discuss-bounces at spectrumscale.org It could be "easy" in the end, e.g. regenerating the key ("mmauth genkey new") may fix the issue. Figuring out exactly what is going wrong is messy though, and requires looking at a number of debug data points, something that's awkward to do on a public mailing list. I don't think you want to post certificates et al on a mailing list. The PMR channel is more appropriate for this kind of thing. yuri "Simon Thompson (Research Computing - IT Services)" ---09/09/2016 07:37:52 AM---That?s sorta what I was expecting. Though I was hoping someone might have said 'oh just run mmchconf From: "Simon Thompson (Research Computing - IT Services)" To: gpfsug main discussion list , Date: 09/09/2016 07:37 AM Subject: Re: [gpfsug-discuss] Remote cluster mount failing Sent by: gpfsug-discuss-bounces at spectrumscale.org That?s sorta what I was expecting. Though I was hoping someone might have said 'oh just run mmchconfig ....' or something easy. PMR on its way in. Thanks! Simon From: on behalf of Yuri L Volobuev Reply-To: "gpfsug-discuss at spectrumscale.org" < gpfsug-discuss at spectrumscale.org> Date: Wednesday, 7 September 2016 at 17:58 To: "gpfsug-discuss at spectrumscale.org" Subject: Re: [gpfsug-discuss] Remote cluster mount failing It's unclear what's wrong. I'd have two main suspects: (1) TLS protocol version confusion, due to a difference in GSKit version and/or configuration (e.g. NIST SP800 compliance) on two sides (2) firewall. TLS issues are usually messy and tedious to work though. I'd recommend opening a PMR to facilitate debug data collection and analysis. A lot of gory detail may be needed to figure out what's going on. 
yuri "Simon Thompson (Research Computing - IT Services)" ---09/07/2016 05:37:11 AM---Hi All, I'm trying to get some multi cluster thing working between two of our GPFS From: "Simon Thompson (Research Computing - IT Services)" < S.J.Thompson at bham.ac.uk> To: "gpfsug-discuss at spectrumscale.org" , Date: 09/07/2016 05:37 AM Subject: [gpfsug-discuss] Remote cluster mount failing Sent by: gpfsug-discuss-bounces at spectrumscale.org Hi All, I'm trying to get some multi cluster thing working between two of our GPFS clusters. In the "client" cluster, when trying to mount the "remote" cluster, I get: # mmmount gpfs Wed 7 Sep 13:33:06 BST 2016: mmmount: Mounting file systems ... mount: mount /dev/gpfs on /gpfs failed: Connection timed out mmmount: Command failed. Examine previous error messages to determine cause. And in the log file: Wed Sep 7 13:33:07.481 2016: [N] The client side TLS handshake with node 10.0.0.182 was cancelled: connection reset by peer (return code 420). Wed Sep 7 13:33:07.486 2016: [N] The client side TLS handshake with node 10.0.0.181 was cancelled: connection reset by peer (return code 420). Wed Sep 7 13:33:07.487 2016: [E] Failed to join remote cluster GPFS_STORAGE.CLUSTER Wed Sep 7 13:33:07.488 2016: [W] Command: err 78: mount GPFS_STORAGE.CLUSTER:gpfs Wed Sep 7 13:33:07.489 2016: Connection timed out In the remote cluster, I see: Wed Sep 7 13:33:07.487 2016: [W] The TLS handshake with node 10.0.0.222 failed with error 447 (server side). Wed Sep 7 13:33:07.488 2016: [X] Connection from 10.10.0.35 refused, authentication failed Wed Sep 7 13:33:07.489 2016: [E] Killing connection from 10.10.0.35, err 703 Wed Sep 7 13:33:07.490 2016: Operation not permitted Weirdly though on other nodes in the client cluster this succeeds fine and can mount, so I think I got all the bits in the mmauth and mmremotecluster configured correctly. Any suggestions? Thanks Simon _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss [attachment "graycol.gif" deleted by Yuri L Volobuev/Austin/IBM] _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: not available Type: image/gif Size: 105 bytes Desc: not available URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: not available Type: image/gif Size: 105 bytes Desc: not available URL: From makaplan at us.ibm.com Mon Sep 12 15:48:56 2016 From: makaplan at us.ibm.com (Marc A Kaplan) Date: Mon, 12 Sep 2016 10:48:56 -0400 Subject: [gpfsug-discuss] GPFS Routers In-Reply-To: <712e5024-e6eb-9195-d9cd-f59a9b145e60@nasa.gov> References: <5F910253243E6A47B81A9A2EB424BBA101D137D1@NDMSMBX404.ndc.nasa.gov> <712e5024-e6eb-9195-d9cd-f59a9b145e60@nasa.gov> Message-ID: Perhaps if you clearly describe what equipment and connections you have in place and what you're trying to accomplish, someone on this board can propose a solution. In principle, it's always possible to insert proxies/routers to "fake" any two endpoints into "believing" they are communicating directly. 
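For instance, if the NSD servers do have an HCA on each fabric, the fabric-number field of verbsPorts is the usual way today to let one set of servers speak RDMA natively to clients on several fabrics, with anything that cannot match a fabric number falling back to TCP/IP. A rough sketch only -- the device names and node classes below are made up, and the device/port/fabric notation and its exact behaviour should be checked against the documentation for your release:

mmchconfig verbsRdma=enable -N nsdServers,fabric1Nodes,fabric2Nodes
# NSD servers: one port on fabric 1, one on fabric 2 (the fabric numbers are arbitrary labels)
mmchconfig verbsPorts="mlx5_0/1/1 mlx5_1/1/2" -N nsdServers
# clients advertise only the fabric they actually sit on
mmchconfig verbsPorts="mlx5_0/1/1" -N fabric1Nodes
mmchconfig verbsPorts="mlx5_0/1/2" -N fabric2Nodes

That still requires an HCA per fabric in every NSD server, which is exactly the cost an LNET-style router layer would remove, so it is a stop-gap rather than an answer to the original question.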
From: Aaron Knister To: Date: 09/11/2016 08:01 PM Subject: Re: [gpfsug-discuss] GPFS Routers Sent by: gpfsug-discuss-bounces at spectrumscale.org After some googling around, I wonder if perhaps what I'm thinking of was an I/O forwarding layer that I understood was being developed for x86_64 type machines rather than some type of GPFS protocol router or proxy. -Aaron On 9/11/16 5:02 PM, Knister, Aaron S. (GSFC-606.2)[COMPUTER SCIENCE CORP] wrote: > Hi Everyone, > > A while back I seem to recall hearing about a mechanism being developed > that would function similarly to Lustre's LNET routers and effectively > allow a single set of NSD servers to talk to multiple RDMA fabrics > without requiring the NSD servers to have infiniband interfaces on each > RDMA fabric. Rather, one would have a set of GPFS gateway nodes on each > fabric that would in effect proxy the RDMA requests to the NSD server. > Does anyone know what I'm talking about? Just curious if it's still on > the roadmap. > > -Aaron > > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > -- Aaron Knister NASA Center for Climate Simulation (Code 606.2) Goddard Space Flight Center (301) 286-2776 _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss -------------- next part -------------- An HTML attachment was scrubbed... URL: From xhejtman at ics.muni.cz Mon Sep 12 15:57:55 2016 From: xhejtman at ics.muni.cz (Lukas Hejtmanek) Date: Mon, 12 Sep 2016 16:57:55 +0200 Subject: [gpfsug-discuss] gpfs 4.2.1 and samba export Message-ID: <20160912145755.xhx2du4c3aimkkxt@ics.muni.cz> Hello, I have GPFS version 4.2.1 on Centos 7.2 (kernel 3.10.0-327.22.2.el7.x86_64) and I have got some weird behavior of samba. Windows clients get stucked for almost 1 minute when copying files. I traced down the problematic syscall: 27887 16:39:28.000401 utimensat(AT_FDCWD, "000000-My_Documents/Windows/InfusedApps/Packages/Microsoft.Messaging_1.10.22012.0_x86__8wekyb3d8bbwe/SkypeApp/View/HomePage.xaml", {{1473691167, 940424000}, {1473691168, 295355}}, 0) = 0 <74.999775> [...] 27887 16:44:24.000310 utimensat(AT_FDCWD, "000000-My_Documents/Windows/InfusedApps/Packages/Microsoft.Windows.Photos_15.1001.16470.0_x64__8wekyb3d8bbwe/Assets/PhotosAppList.contrast-white_targetsize-16.png", {{1473691463, 931319000}, {1473691464, 96608}}, 0) = 0 <74.999841> [...] 27887 16:50:34.002274 utimensat(AT_FDCWD, "000000-My_Documents/Windows/InfusedApps/Packages/Microsoft.XboxApp_9.9.30030.0_x64__8wekyb3d8bbwe/_Resources/50.rsrc", {{1473691833, 952166000}, {1473691834, 2166223}}, 0) = 0 <74.997877> [...] 27887 16:53:11.000240 utimensat(AT_FDCWD, "000000-My_Documents/Windows/InfusedApps/Packages/Microsoft.ZuneVideo_3.6.13251.0_x64__8wekyb3d8bbwe/Styles/CommonBrushes.xbf", {{1473691990, 948668000}, {1473691991, 131221}}, 0) = 0 <74.999540> it seems that from time to time, utimensat(2) call takes over 70 (!!) seconds. Normal utimensat syscall looks like: 27887 16:55:16.238132 utimensat(AT_FDCWD, "000000-My_Documents/Windows/Installer/$PatchCache$/Managed/00004109210000000000000000F01FEC/14.0.7015/ACEODDBS.DLL", {{1473692116, 196458000}, {1351702318, 0}}, 0) = 0 <0.000065> At the same time, there is untar running. When samba freezes at utimensat call, untar continues to write data to GPFS (same fs as samba), so it does not seem to me as buffers flush. 
When the syscall is stucked, I/O utilization of all GPFS disks is below 10 %. mmfsadm dump waiters shows nothing waiting and any cluster node. So any ideas? Or should I just fire PMR? This is cluster config: clusterId 2745894253048382857 autoload no dmapiFileHandleSize 32 minReleaseLevel 4.2.1.0 ccrEnabled yes maxMBpS 20000 maxblocksize 8M cipherList AUTHONLY maxFilesToCache 10000 nsdSmallThreadRatio 1 nsdMaxWorkerThreads 480 ignorePrefetchLUNCount yes pagepool 48G prefetchThreads 320 worker1Threads 320 writebehindThreshhold 10485760 cifsBypassShareLocksOnRename yes cifsBypassTraversalChecking yes allowWriteWithDeleteChild yes adminMode central And this is file system config: flag value description ------------------- ------------------------ ----------------------------------- -f 65536 Minimum fragment size in bytes -i 4096 Inode size in bytes -I 32768 Indirect block size in bytes -m 1 Default number of metadata replicas -M 2 Maximum number of metadata replicas -r 1 Default number of data replicas -R 2 Maximum number of data replicas -j cluster Block allocation type -D nfs4 File locking semantics in effect -k all ACL semantics in effect -n 32 Estimated number of nodes that will mount file system -B 2097152 Block size -Q user;group;fileset Quotas accounting enabled user;group;fileset Quotas enforced none Default quotas enabled --perfileset-quota Yes Per-fileset quota enforcement --filesetdf Yes Fileset df enabled? -V 15.01 (4.2.0.0) File system version --create-time Wed Aug 24 17:38:40 2016 File system creation time -z No Is DMAPI enabled? -L 4194304 Logfile size -E Yes Exact mtime mount option -S No Suppress atime mount option -K whenpossible Strict replica allocation option --fastea Yes Fast external attributes enabled? --encryption No Encryption enabled? --inode-limit 402653184 Maximum number of inodes in all inode spaces --log-replicas 0 Number of log replicas --is4KAligned Yes is4KAligned? --rapid-repair Yes rapidRepair enabled? --write-cache-threshold 0 HAWC Threshold (max 65536) -P system Disk storage pools in file system -d nsd_A_m;nsd_B_m;nsd_C_m;nsd_D_m;nsd_A_LV1_d;nsd_A_LV2_d;nsd_A_LV3_d;nsd_A_LV4_d;nsd_B_LV1_d;nsd_B_LV2_d;nsd_B_LV3_d;nsd_B_LV4_d;nsd_C_LV1_d;nsd_C_LV2_d;nsd_C_LV3_d; -d nsd_C_LV4_d;nsd_D_LV1_d;nsd_D_LV2_d;nsd_D_LV3_d;nsd_D_LV4_d Disks in file system -A yes Automatic mount option -o none Additional mount options -T /gpfs/vol1 Default mount point --mount-priority 1 Mount priority -- Luk?? Hejtm?nek From chekh at stanford.edu Mon Sep 12 20:03:15 2016 From: chekh at stanford.edu (Alex Chekholko) Date: Mon, 12 Sep 2016 12:03:15 -0700 Subject: [gpfsug-discuss] big difference between output of 'mmlsquota' and 'du'? Message-ID: Hi, For a fileset with a quota on it, we have mmlsquota reporting 39TB utilization (out of 50TB quota), with 0 in_doubt. Running a 'du' on the same directory (where the fileset is junctioned) shows 21TB usage. I looked for sparse files (files that report different size via ls vs du). I looked at 'du --apparent-size ...' https://en.wikipedia.org/wiki/Sparse_file What else could it be? Is there some attribute I can scan for inside GPFS? Maybe where FILE_SIZE does not equal KB_ALLOCATED? 
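A policy scan along these lines should list such files (the policy file name, list name and output prefix are illustrative). Since the quota charge is well above the apparent-size total, the interesting direction is allocation exceeding file size, and the 1 MiB threshold just keeps ordinary block rounding out of the report:

RULE EXTERNAL LIST 'mismatch' EXEC ''
RULE 'findmismatch' LIST 'mismatch'
     SHOW(VARCHAR(KB_ALLOCATED) || ' KB allocated for ' || VARCHAR(FILE_SIZE) || ' bytes')
     WHERE KB_ALLOCATED * 1024 > FILE_SIZE + 1048576

mmapplypolicy /srv/gsfs0/projects/gbsc -P mismatch.pol -I defer -f /tmp/gbsc

With -I defer and -f no data is moved; the matching paths should land in /tmp/gbsc.list.mismatch for inspection.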
https://www.ibm.com/support/knowledgecenter/STXKQY_4.2.0/com.ibm.spectrum.scale.v4r2.adv.doc/bl1adv_usngfileattrbts.htm [root at scg-gs0 ~]# du -sm --apparent-size /srv/gsfs0/projects/gbsc/* 3977 /srv/gsfs0/projects/gbsc/Backups 1 /srv/gsfs0/projects/gbsc/benchmark 13109 /srv/gsfs0/projects/gbsc/Billing 198719 /srv/gsfs0/projects/gbsc/Clinical 1 /srv/gsfs0/projects/gbsc/Clinical_Vendors 1206523 /srv/gsfs0/projects/gbsc/Data 1 /srv/gsfs0/projects/gbsc/iPoP 123165 /srv/gsfs0/projects/gbsc/Macrogen 58676 /srv/gsfs0/projects/gbsc/Misc 6625890 /srv/gsfs0/projects/gbsc/mva 1 /srv/gsfs0/projects/gbsc/Proj 17 /srv/gsfs0/projects/gbsc/Projects 3290502 /srv/gsfs0/projects/gbsc/Resources 1 /srv/gsfs0/projects/gbsc/SeqCenter 1 /srv/gsfs0/projects/gbsc/share 514041 /srv/gsfs0/projects/gbsc/SNAP_Scoring 1 /srv/gsfs0/projects/gbsc/TCGA_Variants 267873 /srv/gsfs0/projects/gbsc/tools 9597797 /srv/gsfs0/projects/gbsc/workspace (adds up to about 21TB) [root at scg-gs0 ~]# mmlsquota -j projects.gbsc --block-size=G gsfs0 Block Limits | File Limits Filesystem type GB quota limit in_doubt grace | files quota limit in_doubt grace Remarks gsfs0 FILESET 39889 51200 51200 0 none | 1663212 0 0 4 none [root at scg-gs0 ~]# mmlsfileset gsfs0 |grep gbsc projects.gbsc Linked /srv/gsfs0/projects/gbsc Regards, -- Alex Chekholko chekh at stanford.edu From bbanister at jumptrading.com Mon Sep 12 20:06:59 2016 From: bbanister at jumptrading.com (Bryan Banister) Date: Mon, 12 Sep 2016 19:06:59 +0000 Subject: [gpfsug-discuss] big difference between output of 'mmlsquota' and 'du'? In-Reply-To: References: Message-ID: <21BC488F0AEA2245B2C3E83FC0B33DBB0632A645@CHI-EXCHANGEW1.w2k.jumptrading.com> I'd recommend running a mmcheckquota and then check mmlsquota again, -Bryan -----Original Message----- From: gpfsug-discuss-bounces at spectrumscale.org [mailto:gpfsug-discuss-bounces at spectrumscale.org] On Behalf Of Alex Chekholko Sent: Monday, September 12, 2016 2:03 PM To: gpfsug-discuss at spectrumscale.org Subject: [gpfsug-discuss] big difference between output of 'mmlsquota' and 'du'? Hi, For a fileset with a quota on it, we have mmlsquota reporting 39TB utilization (out of 50TB quota), with 0 in_doubt. Running a 'du' on the same directory (where the fileset is junctioned) shows 21TB usage. I looked for sparse files (files that report different size via ls vs du). I looked at 'du --apparent-size ...' https://en.wikipedia.org/wiki/Sparse_file What else could it be? Is there some attribute I can scan for inside GPFS? Maybe where FILE_SIZE does not equal KB_ALLOCATED? 
https://www.ibm.com/support/knowledgecenter/STXKQY_4.2.0/com.ibm.spectrum.scale.v4r2.adv.doc/bl1adv_usngfileattrbts.htm [root at scg-gs0 ~]# du -sm --apparent-size /srv/gsfs0/projects/gbsc/* 3977 /srv/gsfs0/projects/gbsc/Backups 1 /srv/gsfs0/projects/gbsc/benchmark 13109 /srv/gsfs0/projects/gbsc/Billing 198719 /srv/gsfs0/projects/gbsc/Clinical 1 /srv/gsfs0/projects/gbsc/Clinical_Vendors 1206523 /srv/gsfs0/projects/gbsc/Data 1 /srv/gsfs0/projects/gbsc/iPoP 123165 /srv/gsfs0/projects/gbsc/Macrogen 58676 /srv/gsfs0/projects/gbsc/Misc 6625890 /srv/gsfs0/projects/gbsc/mva 1 /srv/gsfs0/projects/gbsc/Proj 17 /srv/gsfs0/projects/gbsc/Projects 3290502 /srv/gsfs0/projects/gbsc/Resources 1 /srv/gsfs0/projects/gbsc/SeqCenter 1 /srv/gsfs0/projects/gbsc/share 514041 /srv/gsfs0/projects/gbsc/SNAP_Scoring 1 /srv/gsfs0/projects/gbsc/TCGA_Variants 267873 /srv/gsfs0/projects/gbsc/tools 9597797 /srv/gsfs0/projects/gbsc/workspace (adds up to about 21TB) [root at scg-gs0 ~]# mmlsquota -j projects.gbsc --block-size=G gsfs0 Block Limits | File Limits Filesystem type GB quota limit in_doubt grace | files quota limit in_doubt grace Remarks gsfs0 FILESET 39889 51200 51200 0 none | 1663212 0 0 4 none [root at scg-gs0 ~]# mmlsfileset gsfs0 |grep gbsc projects.gbsc Linked /srv/gsfs0/projects/gbsc Regards, -- Alex Chekholko chekh at stanford.edu _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss ________________________________ Note: This email is for the confidential use of the named addressee(s) only and may contain proprietary, confidential or privileged information. If you are not the intended recipient, you are hereby notified that any review, dissemination or copying of this email is strictly prohibited, and to please notify the sender immediately and destroy this email and any attachments. Email transmission cannot be guaranteed to be secure or error-free. The Company, therefore, does not make any guarantees as to the completeness or accuracy of this email or any attachments. This email is for informational purposes only and does not constitute a recommendation, offer, request or solicitation of any kind to buy, sell, subscribe, redeem or perform any type of transaction of a financial product. From Kevin.Buterbaugh at Vanderbilt.Edu Mon Sep 12 20:08:28 2016 From: Kevin.Buterbaugh at Vanderbilt.Edu (Buterbaugh, Kevin L) Date: Mon, 12 Sep 2016 19:08:28 +0000 Subject: [gpfsug-discuss] big difference between output of 'mmlsquota' and 'du'? In-Reply-To: References: Message-ID: Hi Alex, While the numbers don?t match exactly, they?re close enough to prompt me to ask if data replication is possibly set to two? Thanks? Kevin On Sep 12, 2016, at 2:03 PM, Alex Chekholko > wrote: Hi, For a fileset with a quota on it, we have mmlsquota reporting 39TB utilization (out of 50TB quota), with 0 in_doubt. Running a 'du' on the same directory (where the fileset is junctioned) shows 21TB usage. I looked for sparse files (files that report different size via ls vs du). I looked at 'du --apparent-size ...' https://en.wikipedia.org/wiki/Sparse_file What else could it be? Is there some attribute I can scan for inside GPFS? Maybe where FILE_SIZE does not equal KB_ALLOCATED? 
https://www.ibm.com/support/knowledgecenter/STXKQY_4.2.0/com.ibm.spectrum.scale.v4r2.adv.doc/bl1adv_usngfileattrbts.htm [root at scg-gs0 ~]# du -sm --apparent-size /srv/gsfs0/projects/gbsc/* 3977 /srv/gsfs0/projects/gbsc/Backups 1 /srv/gsfs0/projects/gbsc/benchmark 13109 /srv/gsfs0/projects/gbsc/Billing 198719 /srv/gsfs0/projects/gbsc/Clinical 1 /srv/gsfs0/projects/gbsc/Clinical_Vendors 1206523 /srv/gsfs0/projects/gbsc/Data 1 /srv/gsfs0/projects/gbsc/iPoP 123165 /srv/gsfs0/projects/gbsc/Macrogen 58676 /srv/gsfs0/projects/gbsc/Misc 6625890 /srv/gsfs0/projects/gbsc/mva 1 /srv/gsfs0/projects/gbsc/Proj 17 /srv/gsfs0/projects/gbsc/Projects 3290502 /srv/gsfs0/projects/gbsc/Resources 1 /srv/gsfs0/projects/gbsc/SeqCenter 1 /srv/gsfs0/projects/gbsc/share 514041 /srv/gsfs0/projects/gbsc/SNAP_Scoring 1 /srv/gsfs0/projects/gbsc/TCGA_Variants 267873 /srv/gsfs0/projects/gbsc/tools 9597797 /srv/gsfs0/projects/gbsc/workspace (adds up to about 21TB) [root at scg-gs0 ~]# mmlsquota -j projects.gbsc --block-size=G gsfs0 Block Limits | File Limits Filesystem type GB quota limit in_doubt grace | files quota limit in_doubt grace Remarks gsfs0 FILESET 39889 51200 51200 0 none | 1663212 0 0 4 none [root at scg-gs0 ~]# mmlsfileset gsfs0 |grep gbsc projects.gbsc Linked /srv/gsfs0/projects/gbsc Regards, -- Alex Chekholko chekh at stanford.edu _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss ? Kevin Buterbaugh - Senior System Administrator Vanderbilt University - Advanced Computing Center for Research and Education Kevin.Buterbaugh at vanderbilt.edu - (615)875-9633 -------------- next part -------------- An HTML attachment was scrubbed... URL: From r.sobey at imperial.ac.uk Mon Sep 12 21:26:51 2016 From: r.sobey at imperial.ac.uk (Sobey, Richard A) Date: Mon, 12 Sep 2016 20:26:51 +0000 Subject: [gpfsug-discuss] big difference between output of 'mmlsquota' and 'du'? In-Reply-To: References: Message-ID: My thoughts exactly. Richard From: gpfsug-discuss-bounces at spectrumscale.org [mailto:gpfsug-discuss-bounces at spectrumscale.org] On Behalf Of Buterbaugh, Kevin L Sent: 12 September 2016 20:08 To: gpfsug main discussion list Subject: Re: [gpfsug-discuss] big difference between output of 'mmlsquota' and 'du'? Hi Alex, While the numbers don?t match exactly, they?re close enough to prompt me to ask if data replication is possibly set to two? Thanks? Kevin On Sep 12, 2016, at 2:03 PM, Alex Chekholko > wrote: Hi, For a fileset with a quota on it, we have mmlsquota reporting 39TB utilization (out of 50TB quota), with 0 in_doubt. Running a 'du' on the same directory (where the fileset is junctioned) shows 21TB usage. I looked for sparse files (files that report different size via ls vs du). I looked at 'du --apparent-size ...' https://en.wikipedia.org/wiki/Sparse_file What else could it be? Is there some attribute I can scan for inside GPFS? Maybe where FILE_SIZE does not equal KB_ALLOCATED? 
https://www.ibm.com/support/knowledgecenter/STXKQY_4.2.0/com.ibm.spectrum.scale.v4r2.adv.doc/bl1adv_usngfileattrbts.htm [root at scg-gs0 ~]# du -sm --apparent-size /srv/gsfs0/projects/gbsc/* 3977 /srv/gsfs0/projects/gbsc/Backups 1 /srv/gsfs0/projects/gbsc/benchmark 13109 /srv/gsfs0/projects/gbsc/Billing 198719 /srv/gsfs0/projects/gbsc/Clinical 1 /srv/gsfs0/projects/gbsc/Clinical_Vendors 1206523 /srv/gsfs0/projects/gbsc/Data 1 /srv/gsfs0/projects/gbsc/iPoP 123165 /srv/gsfs0/projects/gbsc/Macrogen 58676 /srv/gsfs0/projects/gbsc/Misc 6625890 /srv/gsfs0/projects/gbsc/mva 1 /srv/gsfs0/projects/gbsc/Proj 17 /srv/gsfs0/projects/gbsc/Projects 3290502 /srv/gsfs0/projects/gbsc/Resources 1 /srv/gsfs0/projects/gbsc/SeqCenter 1 /srv/gsfs0/projects/gbsc/share 514041 /srv/gsfs0/projects/gbsc/SNAP_Scoring 1 /srv/gsfs0/projects/gbsc/TCGA_Variants 267873 /srv/gsfs0/projects/gbsc/tools 9597797 /srv/gsfs0/projects/gbsc/workspace (adds up to about 21TB) [root at scg-gs0 ~]# mmlsquota -j projects.gbsc --block-size=G gsfs0 Block Limits | File Limits Filesystem type GB quota limit in_doubt grace | files quota limit in_doubt grace Remarks gsfs0 FILESET 39889 51200 51200 0 none | 1663212 0 0 4 none [root at scg-gs0 ~]# mmlsfileset gsfs0 |grep gbsc projects.gbsc Linked /srv/gsfs0/projects/gbsc Regards, -- Alex Chekholko chekh at stanford.edu _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss ? Kevin Buterbaugh - Senior System Administrator Vanderbilt University - Advanced Computing Center for Research and Education Kevin.Buterbaugh at vanderbilt.edu - (615)875-9633 -------------- next part -------------- An HTML attachment was scrubbed... URL: From laurence at qsplace.co.uk Mon Sep 12 21:46:55 2016 From: laurence at qsplace.co.uk (Laurence Horrocks-Barlow) Date: Mon, 12 Sep 2016 21:46:55 +0100 Subject: [gpfsug-discuss] big difference between output of 'mmlsquota' and 'du'? In-Reply-To: References: Message-ID: <2C38B1C8-66DB-45C6-AA5D-E612F5BFE935@qsplace.co.uk> However replicated files should show up with ls as taking about double the space. I.e. "ls -lash" 49G -r-------- 1 root root 25G Sep 12 21:11 Somefile I know you've said you checked ls vs du for allocated space it might be worth a double check. Also check that you haven't got a load of snapshots, especially if you have high file churn which will create new blocks; although with your figures it'd have to be very high file churn. -- Lauz On 12 September 2016 21:26:51 BST, "Sobey, Richard A" wrote: >My thoughts exactly. > >Richard > >From: gpfsug-discuss-bounces at spectrumscale.org >[mailto:gpfsug-discuss-bounces at spectrumscale.org] On Behalf Of >Buterbaugh, Kevin L >Sent: 12 September 2016 20:08 >To: gpfsug main discussion list >Subject: Re: [gpfsug-discuss] big difference between output of >'mmlsquota' and 'du'? > >Hi Alex, > >While the numbers don?t match exactly, they?re close enough to prompt >me to ask if data replication is possibly set to two? Thanks? > >Kevin > >On Sep 12, 2016, at 2:03 PM, Alex Chekholko >> wrote: > >Hi, > >For a fileset with a quota on it, we have mmlsquota reporting 39TB >utilization (out of 50TB quota), with 0 in_doubt. > >Running a 'du' on the same directory (where the fileset is junctioned) >shows 21TB usage. > >I looked for sparse files (files that report different size via ls vs >du). I looked at 'du --apparent-size ...' > >https://en.wikipedia.org/wiki/Sparse_file > >What else could it be? 
> >Is there some attribute I can scan for inside GPFS? >Maybe where FILE_SIZE does not equal KB_ALLOCATED? >https://www.ibm.com/support/knowledgecenter/STXKQY_4.2.0/com.ibm.spectrum.scale.v4r2.adv.doc/bl1adv_usngfileattrbts.htm > > >[root at scg-gs0 ~]# du -sm --apparent-size /srv/gsfs0/projects/gbsc/* >3977 /srv/gsfs0/projects/gbsc/Backups >1 /srv/gsfs0/projects/gbsc/benchmark >13109 /srv/gsfs0/projects/gbsc/Billing >198719 /srv/gsfs0/projects/gbsc/Clinical >1 /srv/gsfs0/projects/gbsc/Clinical_Vendors >1206523 /srv/gsfs0/projects/gbsc/Data >1 /srv/gsfs0/projects/gbsc/iPoP >123165 /srv/gsfs0/projects/gbsc/Macrogen >58676 /srv/gsfs0/projects/gbsc/Misc >6625890 /srv/gsfs0/projects/gbsc/mva >1 /srv/gsfs0/projects/gbsc/Proj >17 /srv/gsfs0/projects/gbsc/Projects >3290502 /srv/gsfs0/projects/gbsc/Resources >1 /srv/gsfs0/projects/gbsc/SeqCenter >1 /srv/gsfs0/projects/gbsc/share >514041 /srv/gsfs0/projects/gbsc/SNAP_Scoring >1 /srv/gsfs0/projects/gbsc/TCGA_Variants >267873 /srv/gsfs0/projects/gbsc/tools >9597797 /srv/gsfs0/projects/gbsc/workspace > >(adds up to about 21TB) > >[root at scg-gs0 ~]# mmlsquota -j projects.gbsc --block-size=G gsfs0 > Block Limits | File Limits >Filesystem type GB quota limit in_doubt >grace | files quota limit in_doubt grace Remarks >gsfs0 FILESET 39889 51200 51200 0 >none | 1663212 0 0 4 none > > >[root at scg-gs0 ~]# mmlsfileset gsfs0 |grep gbsc >projects.gbsc Linked /srv/gsfs0/projects/gbsc > >Regards, >-- >Alex Chekholko chekh at stanford.edu > >_______________________________________________ >gpfsug-discuss mailing list >gpfsug-discuss at spectrumscale.org >http://gpfsug.org/mailman/listinfo/gpfsug-discuss > >? >Kevin Buterbaugh - Senior System Administrator >Vanderbilt University - Advanced Computing Center for Research and >Education >Kevin.Buterbaugh at vanderbilt.edu >- (615)875-9633 > > > > > >------------------------------------------------------------------------ > >_______________________________________________ >gpfsug-discuss mailing list >gpfsug-discuss at spectrumscale.org >http://gpfsug.org/mailman/listinfo/gpfsug-discuss -- Sent from my Android device with K-9 Mail. Please excuse my brevity. -------------- next part -------------- An HTML attachment was scrubbed... URL: From janfrode at tanso.net Mon Sep 12 22:37:08 2016 From: janfrode at tanso.net (Jan-Frode Myklebust) Date: Mon, 12 Sep 2016 21:37:08 +0000 Subject: [gpfsug-discuss] big difference between output of 'mmlsquota' and 'du'? In-Reply-To: References: Message-ID: Maybe you have a huge file open, that's been unlinked and still growing? -jf -------------- next part -------------- An HTML attachment was scrubbed... URL: From volobuev at us.ibm.com Mon Sep 12 22:59:36 2016 From: volobuev at us.ibm.com (Yuri L Volobuev) Date: Mon, 12 Sep 2016 14:59:36 -0700 Subject: [gpfsug-discuss] big difference between output of 'mmlsquota' and'du'? In-Reply-To: References: Message-ID: 'du' tallies up 'blocks allocated', not file sizes. So it shouldn't matter whether any sparse files are present. GPFS doesn't charge quota for data in snapshots (whether it should is a separate question). The observed discrepancy has two plausible causes: 1) Inaccuracy in quota accounting (more likely) 2) Artefacts of data replication (less likely) Running mmcheckquota in this situation would be a good idea. yuri From: Alex Chekholko To: gpfsug-discuss at spectrumscale.org, Date: 09/12/2016 12:04 PM Subject: [gpfsug-discuss] big difference between output of 'mmlsquota' and 'du'? 
Sent by: gpfsug-discuss-bounces at spectrumscale.org Hi, For a fileset with a quota on it, we have mmlsquota reporting 39TB utilization (out of 50TB quota), with 0 in_doubt. Running a 'du' on the same directory (where the fileset is junctioned) shows 21TB usage. I looked for sparse files (files that report different size via ls vs du). I looked at 'du --apparent-size ...' https://en.wikipedia.org/wiki/Sparse_file What else could it be? Is there some attribute I can scan for inside GPFS? Maybe where FILE_SIZE does not equal KB_ALLOCATED? https://www.ibm.com/support/knowledgecenter/STXKQY_4.2.0/com.ibm.spectrum.scale.v4r2.adv.doc/bl1adv_usngfileattrbts.htm [root at scg-gs0 ~]# du -sm --apparent-size /srv/gsfs0/projects/gbsc/* 3977 /srv/gsfs0/projects/gbsc/Backups 1 /srv/gsfs0/projects/gbsc/benchmark 13109 /srv/gsfs0/projects/gbsc/Billing 198719 /srv/gsfs0/projects/gbsc/Clinical 1 /srv/gsfs0/projects/gbsc/Clinical_Vendors 1206523 /srv/gsfs0/projects/gbsc/Data 1 /srv/gsfs0/projects/gbsc/iPoP 123165 /srv/gsfs0/projects/gbsc/Macrogen 58676 /srv/gsfs0/projects/gbsc/Misc 6625890 /srv/gsfs0/projects/gbsc/mva 1 /srv/gsfs0/projects/gbsc/Proj 17 /srv/gsfs0/projects/gbsc/Projects 3290502 /srv/gsfs0/projects/gbsc/Resources 1 /srv/gsfs0/projects/gbsc/SeqCenter 1 /srv/gsfs0/projects/gbsc/share 514041 /srv/gsfs0/projects/gbsc/SNAP_Scoring 1 /srv/gsfs0/projects/gbsc/TCGA_Variants 267873 /srv/gsfs0/projects/gbsc/tools 9597797 /srv/gsfs0/projects/gbsc/workspace (adds up to about 21TB) [root at scg-gs0 ~]# mmlsquota -j projects.gbsc --block-size=G gsfs0 Block Limits | File Limits Filesystem type GB quota limit in_doubt grace | files quota limit in_doubt grace Remarks gsfs0 FILESET 39889 51200 51200 0 none | 1663212 0 0 4 none [root at scg-gs0 ~]# mmlsfileset gsfs0 |grep gbsc projects.gbsc Linked /srv/gsfs0/projects/gbsc Regards, -- Alex Chekholko chekh at stanford.edu _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: graycol.gif Type: image/gif Size: 105 bytes Desc: not available URL: From chekh at stanford.edu Mon Sep 12 23:11:12 2016 From: chekh at stanford.edu (Alex Chekholko) Date: Mon, 12 Sep 2016 15:11:12 -0700 Subject: [gpfsug-discuss] big difference between output of 'mmlsquota' and 'du'? In-Reply-To: References: Message-ID: Thanks for all the responses. I will look through the filesystem clients for open file handles; we have definitely had deleted open log files of multi-TB size before. The filesystem has replication set to 1. We don't use snapshots. I'm running a 'mmrestripefs -r' (some files were ill-placed from aborted pool migrations) and then I will run an 'mmcheckquota'. On 9/12/16 2:37 PM, Jan-Frode Myklebust wrote: > Maybe you have a huge file open, that's been unlinked and still growing? 
> > > > -jf > > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > -- Alex Chekholko chekh at stanford.edu From xhejtman at ics.muni.cz Mon Sep 12 23:30:19 2016 From: xhejtman at ics.muni.cz (Lukas Hejtmanek) Date: Tue, 13 Sep 2016 00:30:19 +0200 Subject: [gpfsug-discuss] gpfs snapshots Message-ID: <20160912223019.s665s3ccoltdkq3l@ics.muni.cz> Hello, using gpfs 4.2.1, I do about 60 snapshots per day (one snapshot per 15 minutes during working hours). It seems that snapid is increasing only number. Should I be fine with such a number of snapshots per day? I guess we could reach snapid 100,000. I remove all these snapshots during night so I do not keep huge number of snapshots. -- Luk?? Hejtm?nek From volobuev at us.ibm.com Mon Sep 12 23:42:00 2016 From: volobuev at us.ibm.com (Yuri L Volobuev) Date: Mon, 12 Sep 2016 15:42:00 -0700 Subject: [gpfsug-discuss] gpfs snapshots In-Reply-To: <20160912223019.s665s3ccoltdkq3l@ics.muni.cz> References: <20160912223019.s665s3ccoltdkq3l@ics.muni.cz> Message-ID: The increasing value of snapId is not a problem. Creating snapshots every 15 min is somewhat more frequent than what is customary, but as long as you're able to delete filesets at the same rate you're creating them, this should work OK. yuri From: Lukas Hejtmanek To: gpfsug-discuss at spectrumscale.org, Date: 09/12/2016 03:30 PM Subject: [gpfsug-discuss] gpfs snapshots Sent by: gpfsug-discuss-bounces at spectrumscale.org Hello, using gpfs 4.2.1, I do about 60 snapshots per day (one snapshot per 15 minutes during working hours). It seems that snapid is increasing only number. Should I be fine with such a number of snapshots per day? I guess we could reach snapid 100,000. I remove all these snapshots during night so I do not keep huge number of snapshots. -- Luk?? Hejtm?nek _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: graycol.gif Type: image/gif Size: 105 bytes Desc: not available URL: From r.sobey at imperial.ac.uk Tue Sep 13 04:19:30 2016 From: r.sobey at imperial.ac.uk (Sobey, Richard A) Date: Tue, 13 Sep 2016 03:19:30 +0000 Subject: [gpfsug-discuss] gpfs snapshots In-Reply-To: <20160912223019.s665s3ccoltdkq3l@ics.muni.cz> References: <20160912223019.s665s3ccoltdkq3l@ics.muni.cz> Message-ID: Don't worry. We do 400+ snapshots every 4 hours and that number is only getting bigger. Don't know what our current snapid count is mind you, can find out when in the office. Get Outlook for Android On Mon, Sep 12, 2016 at 11:30 PM +0100, "Lukas Hejtmanek" > wrote: Hello, using gpfs 4.2.1, I do about 60 snapshots per day (one snapshot per 15 minutes during working hours). It seems that snapid is increasing only number. Should I be fine with such a number of snapshots per day? I guess we could reach snapid 100,000. I remove all these snapshots during night so I do not keep huge number of snapshots. -- Luk?? Hejtm?nek _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From laurence at qsplace.co.uk Tue Sep 13 05:06:42 2016 From: laurence at qsplace.co.uk (Laurence Horrocks-Barlow) Date: Tue, 13 Sep 2016 05:06:42 +0100 Subject: [gpfsug-discuss] gpfs snapshots In-Reply-To: References: <20160912223019.s665s3ccoltdkq3l@ics.muni.cz> Message-ID: <7EAC0DD4-6FC1-4DF5-825E-9E2DD966BA4E@qsplace.co.uk> There are many people doing the same thing so nothing to worry about. As your using 4.2.1 you can at least bulk delete the snapshots using a comma separated list, making life just that little bit easier. -- Lauz On 13 September 2016 04:19:30 BST, "Sobey, Richard A" wrote: >Don't worry. We do 400+ snapshots every 4 hours and that number is only >getting bigger. Don't know what our current snapid count is mind you, >can find out when in the office. > >Get Outlook for Android > > > >On Mon, Sep 12, 2016 at 11:30 PM +0100, "Lukas Hejtmanek" >> wrote: > >Hello, > >using gpfs 4.2.1, I do about 60 snapshots per day (one snapshot per 15 >minutes >during working hours). It seems that snapid is increasing only number. >Should >I be fine with such a number of snapshots per day? I guess we could >reach >snapid 100,000. I remove all these snapshots during night so I do not >keep >huge number of snapshots. > >-- >Luk?? Hejtm?nek >_______________________________________________ >gpfsug-discuss mailing list >gpfsug-discuss at spectrumscale.org >http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > >------------------------------------------------------------------------ > >_______________________________________________ >gpfsug-discuss mailing list >gpfsug-discuss at spectrumscale.org >http://gpfsug.org/mailman/listinfo/gpfsug-discuss -- Sent from my Android device with K-9 Mail. Please excuse my brevity. -------------- next part -------------- An HTML attachment was scrubbed... URL: From Valdis.Kletnieks at vt.edu Tue Sep 13 05:32:24 2016 From: Valdis.Kletnieks at vt.edu (Valdis.Kletnieks at vt.edu) Date: Tue, 13 Sep 2016 00:32:24 -0400 Subject: [gpfsug-discuss] gpfs snapshots In-Reply-To: <20160912223019.s665s3ccoltdkq3l@ics.muni.cz> References: <20160912223019.s665s3ccoltdkq3l@ics.muni.cz> Message-ID: <20635.1473741144@turing-police.cc.vt.edu> On Tue, 13 Sep 2016 00:30:19 +0200, Lukas Hejtmanek said: > I guess we could reach snapid 100,000. It probably stores the snap ID as a 32 or 64 bit int, so 100K is peanuts. What you *do* want to do is make the snap *name* meaningful, using a timestamp or something to keep your sanity. mmcrsnapshot fs923 `date +%y%m%d-%H%M` or similar. From jtucker at pixitmedia.com Tue Sep 13 10:10:02 2016 From: jtucker at pixitmedia.com (Jez Tucker) Date: Tue, 13 Sep 2016 10:10:02 +0100 Subject: [gpfsug-discuss] gpfs snapshots In-Reply-To: <20635.1473741144@turing-police.cc.vt.edu> References: <20160912223019.s665s3ccoltdkq3l@ics.muni.cz> <20635.1473741144@turing-police.cc.vt.edu> Message-ID: <00e74006-8f99-280f-8832-1cee019c4b07@pixitmedia.com> Hey Yuri, Perhaps an RFE here, but could I suggest there is much value in adding a -c option to mmcrsnapshot? Use cases: mmcrsnapshot myfsname @GMT-2016.09.13-10.00.00 -j myfilesetname -c "Before phase 2" and mmcrsnapshot myfsname @GMT-2016.09.13-10.00.00 -j myfilesetname -c "expire:GMT-2017.04.21-16.00.00" Ideally also: mmcrsnapshot fs1 fset1:snapA:expirestr,fset2:snapB:expirestr,fset3:snapC:expirestr Then it's easy to iterate over snapshots and subsequently mmdelsnapshot snaps which are no longer required. 
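Even without a comment field, that loop is workable today when the snapshot name itself carries the creation time. A rough sketch, assuming snapshots were created with names like 160913-1015 (i.e. date +%y%m%d-%H%M) and that mmlssnapshot prints its usual two header lines, which is worth double-checking on your release:

#!/bin/bash
# illustrative only: drop global snapshots of gpfs0 older than 48 hours
fs=gpfs0
cutoff=$(date -d '48 hours ago' +%y%m%d-%H%M)
# column 1 of mmlssnapshot is the snapshot name; keep only timestamp-shaped names
expired=$(/usr/lpp/mmfs/bin/mmlssnapshot $fs | awk -v c="$cutoff" \
    'NR > 2 && $1 ~ /^[0-9][0-9][0-9][0-9][0-9][0-9]-[0-9][0-9][0-9][0-9]$/ && $1 < c {print $1}' | paste -sd, -)
# 4.2.1 accepts a comma-separated list (as noted earlier in the thread), so one call per sweep;
# fileset-level snapshots would need the matching -j option
[ -n "$expired" ] && /usr/lpp/mmfs/bin/mmdelsnapshot $fs "$expired"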
There are lots of methods to achieve this, but without external databases / suchlike, this is rather simple and effective for end users. Alternatively a second comment like -expire flag as user metadata may be preferential. Thoughts? Jez On 13/09/16 05:32, Valdis.Kletnieks at vt.edu wrote: > On Tue, 13 Sep 2016 00:30:19 +0200, Lukas Hejtmanek said: >> I guess we could reach snapid 100,000. > It probably stores the snap ID as a 32 or 64 bit int, so 100K is peanuts. > > What you *do* want to do is make the snap *name* meaningful, using > a timestamp or something to keep your sanity. > > mmcrsnapshot fs923 `date +%y%m%d-%H%M` or similar. > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss -- Jez Tucker Head of Research & Product Development Pixit Media www.pixitmedia.com -- This email is confidential in that it is intended for the exclusive attention of the addressee(s) indicated. If you are not the intended recipient, this email should not be read or disclosed to any other person. Please notify the sender immediately and delete this email from your computer system. Any opinions expressed are not necessarily those of the company from which this email was sent and, whilst to the best of our knowledge no viruses or defects exist, no responsibility can be accepted for any loss or damage arising from its receipt or subsequent use of this email. -------------- next part -------------- An HTML attachment was scrubbed... URL: From volobuev at us.ibm.com Tue Sep 13 21:51:16 2016 From: volobuev at us.ibm.com (Yuri L Volobuev) Date: Tue, 13 Sep 2016 13:51:16 -0700 Subject: [gpfsug-discuss] gpfs snapshots In-Reply-To: <00e74006-8f99-280f-8832-1cee019c4b07@pixitmedia.com> References: <20160912223019.s665s3ccoltdkq3l@ics.muni.cz><20635.1473741144@turing-police.cc.vt.edu> <00e74006-8f99-280f-8832-1cee019c4b07@pixitmedia.com> Message-ID: Hi Jez, It sounds to me like the functionality that you're _really_ looking for is an ability to to do automated snapshot management, similar to what's available on other storage systems. For example, "create a new snapshot of filesets X, Y, Z every 30 min, keep the last 16 snapshots". I've seen many examples of sysadmins rolling their own snapshot management system along those lines, and an ability to add an expiration string as a snapshot "comment" appears to be merely an aid in keeping such DIY snapshot management scripts a bit simpler -- not by much though. The end user would still be on the hook for some heavy lifting, in particular figuring out a way to run an equivalent of a cluster-aware cron with acceptable fault tolerance semantics. That is, if a snapshot creation is scheduled, only one node in the cluster should attempt to create the snapshot, but if that node fails, another node needs to step in (as opposed to skipping the scheduled snapshot creation). This is doable outside of GPFS, of course, but is not trivial. Architecturally, the right place to implement a fault-tolerant cluster-aware scheduling framework is GPFS itself, as the most complex pieces are already there. We have some plans for work along those lines, but if you want to reinforce the point with an RFE, that would be fine, too. yuri From: Jez Tucker To: gpfsug-discuss at spectrumscale.org, Date: 09/13/2016 02:10 AM Subject: Re: [gpfsug-discuss] gpfs snapshots Sent by: gpfsug-discuss-bounces at spectrumscale.org Hey Yuri, ? 
Perhaps an RFE here, but could I suggest there is much value in adding a -c option to mmcrsnapshot? Use cases: mmcrsnapshot myfsname @GMT-2016.09.13-10.00.00 -j myfilesetname -c "Before phase 2" and mmcrsnapshot myfsname @GMT-2016.09.13-10.00.00 -j myfilesetname -c "expire:GMT-2017.04.21-16.00.00" Ideally also: mmcrsnapshot fs1 fset1:snapA:expirestr,fset2:snapB:expirestr,fset3:snapC:expirestr Then it's easy to iterate over snapshots and subsequently mmdelsnapshot snaps which are no longer required. There are lots of methods to achieve this, but without external databases / suchlike, this is rather simple and effective for end users. Alternatively a second comment like -expire flag as user metadata may be preferential. Thoughts? Jez On 13/09/16 05:32, Valdis.Kletnieks at vt.edu wrote: On Tue, 13 Sep 2016 00:30:19 +0200, Lukas Hejtmanek said: I guess we could reach snapid 100,000. It probably stores the snap ID as a 32 or 64 bit int, so 100K is peanuts. What you *do* want to do is make the snap *name* meaningful, using a timestamp or something to keep your sanity. mmcrsnapshot fs923 `date +%y%m%d-%H%M` or similar. _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss -- Jez Tucker Head of Research & Product Development Pixit Media www.pixitmedia.com This email is confidential in that it is intended for the exclusive attention of the addressee(s) indicated. If you are not the intended recipient, this email should not be read or disclosed to any other person. Please notify the sender immediately and delete this email from your computer system. Any opinions expressed are not necessarily those of the company from which this email was sent and, whilst to the best of our knowledge no viruses or defects exist, no responsibility can be accepted for any loss or damage arising from its receipt or subsequent use of this email._______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: graycol.gif Type: image/gif Size: 105 bytes Desc: not available URL: From xhejtman at ics.muni.cz Tue Sep 13 21:57:52 2016 From: xhejtman at ics.muni.cz (Lukas Hejtmanek) Date: Tue, 13 Sep 2016 22:57:52 +0200 Subject: [gpfsug-discuss] gpfs snapshots In-Reply-To: References: <20160912223019.s665s3ccoltdkq3l@ics.muni.cz> Message-ID: <20160913205752.3lmmfbhm25mu77j4@ics.muni.cz> Yuri et al. thank you for answers, I should be fine with snapshots as you suggest. On Mon, Sep 12, 2016 at 03:42:00PM -0700, Yuri L Volobuev wrote: > The increasing value of snapId is not a problem. Creating snapshots every > 15 min is somewhat more frequent than what is customary, but as long as > you're able to delete filesets at the same rate you're creating them, this > should work OK. > > yuri > > > > From: Lukas Hejtmanek > To: gpfsug-discuss at spectrumscale.org, > Date: 09/12/2016 03:30 PM > Subject: [gpfsug-discuss] gpfs snapshots > Sent by: gpfsug-discuss-bounces at spectrumscale.org > > > > Hello, > > using gpfs 4.2.1, I do about 60 snapshots per day (one snapshot per 15 > minutes > during working hours). It seems that snapid is increasing only number. > Should > I be fine with such a number of snapshots per day? I guess we could reach > snapid 100,000. 
I remove all these snapshots during night so I do not keep > huge number of snapshots. > > -- > Luk?? Hejtm?nek > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss -- Luk?? Hejtm?nek From S.J.Thompson at bham.ac.uk Tue Sep 13 22:21:59 2016 From: S.J.Thompson at bham.ac.uk (Simon Thompson (Research Computing - IT Services)) Date: Tue, 13 Sep 2016 21:21:59 +0000 Subject: [gpfsug-discuss] gpfs snapshots In-Reply-To: References: <20160912223019.s665s3ccoltdkq3l@ics.muni.cz><20635.1473741144@turing-police.cc.vt.edu> <00e74006-8f99-280f-8832-1cee019c4b07@pixitmedia.com>, Message-ID: I thought the GUI implemented some form of snapshot scheduler. Personal opinion is that is the wrong place and I agree that is should be core functionality to ensure that the scheduler is running properly. But I would suggest that it might be more than just snapshots people might want to schedule. E.g. An ilm pool flush. Simon ________________________________________ From: gpfsug-discuss-bounces at spectrumscale.org [gpfsug-discuss-bounces at spectrumscale.org] on behalf of Yuri L Volobuev [volobuev at us.ibm.com] Sent: 13 September 2016 21:51 To: gpfsug main discussion list Subject: Re: [gpfsug-discuss] gpfs snapshots Hi Jez, It sounds to me like the functionality that you're _really_ looking for is an ability to to do automated snapshot management, similar to what's available on other storage systems. For example, "create a new snapshot of filesets X, Y, Z every 30 min, keep the last 16 snapshots". I've seen many examples of sysadmins rolling their own snapshot management system along those lines, and an ability to add an expiration string as a snapshot "comment" appears to be merely an aid in keeping such DIY snapshot management scripts a bit simpler -- not by much though. The end user would still be on the hook for some heavy lifting, in particular figuring out a way to run an equivalent of a cluster-aware cron with acceptable fault tolerance semantics. That is, if a snapshot creation is scheduled, only one node in the cluster should attempt to create the snapshot, but if that node fails, another node needs to step in (as opposed to skipping the scheduled snapshot creation). This is doable outside of GPFS, of course, but is not trivial. Architecturally, the right place to implement a fault-tolerant cluster-aware scheduling framework is GPFS itself, as the most complex pieces are already there. We have some plans for work along those lines, but if you want to reinforce the point with an RFE, that would be fine, too. yuri [Inactive hide details for Jez Tucker ---09/13/2016 02:10:31 AM---Hey Yuri, Perhaps an RFE here, but could I suggest there is]Jez Tucker ---09/13/2016 02:10:31 AM---Hey Yuri, Perhaps an RFE here, but could I suggest there is much value in From: Jez Tucker To: gpfsug-discuss at spectrumscale.org, Date: 09/13/2016 02:10 AM Subject: Re: [gpfsug-discuss] gpfs snapshots Sent by: gpfsug-discuss-bounces at spectrumscale.org ________________________________ Hey Yuri, Perhaps an RFE here, but could I suggest there is much value in adding a -c option to mmcrsnapshot? 
Use cases: mmcrsnapshot myfsname @GMT-2016.09.13-10.00.00 -j myfilesetname -c "Before phase 2" and mmcrsnapshot myfsname @GMT-2016.09.13-10.00.00 -j myfilesetname -c "expire:GMT-2017.04.21-16.00.00" Ideally also: mmcrsnapshot fs1 fset1:snapA:expirestr,fset2:snapB:expirestr,fset3:snapC:expirestr Then it's easy to iterate over snapshots and subsequently mmdelsnapshot snaps which are no longer required. There are lots of methods to achieve this, but without external databases / suchlike, this is rather simple and effective for end users. Alternatively a second comment like -expire flag as user metadata may be preferential. Thoughts? Jez On 13/09/16 05:32, Valdis.Kletnieks at vt.edu wrote: On Tue, 13 Sep 2016 00:30:19 +0200, Lukas Hejtmanek said: I guess we could reach snapid 100,000. It probably stores the snap ID as a 32 or 64 bit int, so 100K is peanuts. What you *do* want to do is make the snap *name* meaningful, using a timestamp or something to keep your sanity. mmcrsnapshot fs923 `date +%y%m%d-%H%M` or similar. _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss -- Jez Tucker Head of Research & Product Development Pixit Media www.pixitmedia.com This email is confidential in that it is intended for the exclusive attention of the addressee(s) indicated. If you are not the intended recipient, this email should not be read or disclosed to any other person. Please notify the sender immediately and delete this email from your computer system. Any opinions expressed are not necessarily those of the company from which this email was sent and, whilst to the best of our knowledge no viruses or defects exist, no responsibility can be accepted for any loss or damage arising from its receipt or subsequent use of this email._______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss -------------- next part -------------- A non-text attachment was scrubbed... Name: graycol.gif Type: image/gif Size: 105 bytes Desc: graycol.gif URL: From mark.bergman at uphs.upenn.edu Tue Sep 13 22:23:57 2016 From: mark.bergman at uphs.upenn.edu (mark.bergman at uphs.upenn.edu) Date: Tue, 13 Sep 2016 17:23:57 -0400 Subject: [gpfsug-discuss] gpfs snapshots In-Reply-To: Your message of "Tue, 13 Sep 2016 13:51:16 -0700." References: <20160912223019.s665s3ccoltdkq3l@ics.muni.cz><20635.1473741144@turing-police.cc.vt.edu> <00e74006-8f99-280f-8832-1cee019c4b07@pixitmedia.com> Message-ID: <19294-1473801837.563347@J_5h.TM7K.YXzn> In the message dated: Tue, 13 Sep 2016 13:51:16 -0700, The pithy ruminations from Yuri L Volobuev on were: => => Hi Jez, => => It sounds to me like the functionality that you're _really_ looking for is => an ability to to do automated snapshot management, similar to what's Yep. => available on other storage systems. For example, "create a new snapshot of => filesets X, Y, Z every 30 min, keep the last 16 snapshots". I've seen many Or, take a snapshot every 15min, keep the 4 most recent, expire all except 4 that were created within 6hrs, only 4 created between 6:01-24:00 hh:mm ago, and expire all-but-2 created between 24:01-48:00, etc, as we do. => examples of sysadmins rolling their own snapshot management system along => those lines, and an ability to add an expiration string as a snapshot I'd be glad to distribute our local example of this exercise. 
=> "comment" appears to be merely an aid in keeping such DIY snapshot => management scripts a bit simpler -- not by much though. The end user would => still be on the hook for some heavy lifting, in particular figuring out a => way to run an equivalent of a cluster-aware cron with acceptable fault => tolerance semantics. That is, if a snapshot creation is scheduled, only => one node in the cluster should attempt to create the snapshot, but if that => node fails, another node needs to step in (as opposed to skipping the => scheduled snapshot creation). This is doable outside of GPFS, of course, => but is not trivial. Architecturally, the right place to implement a Ah, that part really is trivial....In our case, the snapshot program takes the filesystem name as an argument... we simply rely on the GPFS fault detection/failover. The job itself runs (via cron) on every GPFS server node, but only creates the snapshot on the server that is the active manager for the specified filesystem: ############################################################################## # Check if the node where this script is running is the GPFS manager node for the # specified filesystem manager=`/usr/lpp/mmfs/bin/mmlsmgr $filesys | grep -w "^$filesys" |awk '{print $2}'` ip addr list | grep -qw "$manager" if [ $? != 0 ] ; then # This node is not the manager...exit exit fi # else ... continue and create the snapshot ################################################################################################### => => yuri => => -- Mark Bergman voice: 215-746-4061 mark.bergman at uphs.upenn.edu fax: 215-614-0266 http://www.cbica.upenn.edu/ IT Technical Director, Center for Biomedical Image Computing and Analytics Department of Radiology University of Pennsylvania PGP Key: http://www.cbica.upenn.edu/sbia/bergman From jtolson at us.ibm.com Tue Sep 13 22:47:02 2016 From: jtolson at us.ibm.com (John T Olson) Date: Tue, 13 Sep 2016 14:47:02 -0700 Subject: [gpfsug-discuss] gpfs snapshots In-Reply-To: References: <20160912223019.s665s3ccoltdkq3l@ics.muni.cz><20635.1473741144@turing-police.cc.vt.edu><00e74006-8f99-280f-8832-1cee019c4b07@pixitmedia.com>, Message-ID: We do have a general-purpose scheduler on the books as an item that is needed for a future release and as Yuri mentioned it would be cluster wide to avoid the single point of failure with tools like Cron. However, it's one of many things we want to try to get into the product and so we don't have any definite timeline yet. Thanks, John John T. Olson, Ph.D., MI.C., K.EY. Master Inventor, Software Defined Storage 957/9032-1 Tucson, AZ, 85744 (520) 799-5185, tie 321-5185 (FAX: 520-799-4237) Email: jtolson at us.ibm.com "Do or do not. There is no try." - Yoda Olson's Razor: Any situation that we, as humans, can encounter in life can be modeled by either an episode of The Simpsons or Seinfeld. From: "Simon Thompson (Research Computing - IT Services)" To: gpfsug main discussion list Date: 09/13/2016 02:22 PM Subject: Re: [gpfsug-discuss] gpfs snapshots Sent by: gpfsug-discuss-bounces at spectrumscale.org I thought the GUI implemented some form of snapshot scheduler. Personal opinion is that is the wrong place and I agree that is should be core functionality to ensure that the scheduler is running properly. But I would suggest that it might be more than just snapshots people might want to schedule. E.g. An ilm pool flush. 
Simon ________________________________________ From: gpfsug-discuss-bounces at spectrumscale.org [gpfsug-discuss-bounces at spectrumscale.org] on behalf of Yuri L Volobuev [volobuev at us.ibm.com] Sent: 13 September 2016 21:51 To: gpfsug main discussion list Subject: Re: [gpfsug-discuss] gpfs snapshots Hi Jez, It sounds to me like the functionality that you're _really_ looking for is an ability to to do automated snapshot management, similar to what's available on other storage systems. For example, "create a new snapshot of filesets X, Y, Z every 30 min, keep the last 16 snapshots". I've seen many examples of sysadmins rolling their own snapshot management system along those lines, and an ability to add an expiration string as a snapshot "comment" appears to be merely an aid in keeping such DIY snapshot management scripts a bit simpler -- not by much though. The end user would still be on the hook for some heavy lifting, in particular figuring out a way to run an equivalent of a cluster-aware cron with acceptable fault tolerance semantics. That is, if a snapshot creation is scheduled, only one node in the cluster should attempt to create the snapshot, but if that node fails, another node needs to step in (as opposed to skipping the scheduled snapshot creation). This is doable outside of GPFS, of course, but is not trivial. Architecturally, the right place to implement a fault-tolerant cluster-aware scheduling framework is GPFS itself, as the most complex pieces are already there. We have some plans for work along those lines, but if you want to reinforce the point with an RFE, that would be fine, too. yuri [Inactive hide details for Jez Tucker ---09/13/2016 02:10:31 AM---Hey Yuri, Perhaps an RFE here, but could I suggest there is]Jez Tucker ---09/13/2016 02:10:31 AM---Hey Yuri, Perhaps an RFE here, but could I suggest there is much value in From: Jez Tucker To: gpfsug-discuss at spectrumscale.org, Date: 09/13/2016 02:10 AM Subject: Re: [gpfsug-discuss] gpfs snapshots Sent by: gpfsug-discuss-bounces at spectrumscale.org ________________________________ Hey Yuri, Perhaps an RFE here, but could I suggest there is much value in adding a -c option to mmcrsnapshot? Use cases: mmcrsnapshot myfsname @GMT-2016.09.13-10.00.00 -j myfilesetname -c "Before phase 2" and mmcrsnapshot myfsname @GMT-2016.09.13-10.00.00 -j myfilesetname -c "expire:GMT-2017.04.21-16.00.00" Ideally also: mmcrsnapshot fs1 fset1:snapA:expirestr,fset2:snapB:expirestr,fset3:snapC:expirestr Then it's easy to iterate over snapshots and subsequently mmdelsnapshot snaps which are no longer required. There are lots of methods to achieve this, but without external databases / suchlike, this is rather simple and effective for end users. Alternatively a second comment like -expire flag as user metadata may be preferential. Thoughts? Jez On 13/09/16 05:32, Valdis.Kletnieks at vt.edu wrote: On Tue, 13 Sep 2016 00:30:19 +0200, Lukas Hejtmanek said: I guess we could reach snapid 100,000. It probably stores the snap ID as a 32 or 64 bit int, so 100K is peanuts. What you *do* want to do is make the snap *name* meaningful, using a timestamp or something to keep your sanity. mmcrsnapshot fs923 `date +%y%m%d-%H%M` or similar. 
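For illustration only, a minimal rotation sketch stitched together from the commands already in this thread; the file system name, fileset name and retention count are placeholders, the mmlssnapshot parsing is deliberately naive (it assumes only this fileset uses @GMT- snapshot names), and the -j form follows the examples above (newer releases also accept a fileset:snapshot form, as comes up later in this archive):

fs=gpfs0 ; fset=myfilesetname ; keep=16                  # placeholders
snap="@GMT-$(date -u +%Y.%m.%d-%H.%M.%S)"                # timestamped, sorts chronologically
/usr/lpp/mmfs/bin/mmcrsnapshot $fs $snap -j $fset
# keep only the newest $keep snapshots (GNU head)
/usr/lpp/mmfs/bin/mmlssnapshot $fs | awk '$1 ~ /^@GMT-/ {print $1}' | sort | head -n -$keep |
while read old ; do
    /usr/lpp/mmfs/bin/mmdelsnapshot $fs $old -j $fset
done

Run from a single designated node (or gated with an mmlsmgr test like the one shown earlier in this archive) it covers the "only one node creates the snapshot" concern without any external database.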
_______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss -- Jez Tucker Head of Research & Product Development Pixit Media www.pixitmedia.com This email is confidential in that it is intended for the exclusive attention of the addressee(s) indicated. If you are not the intended recipient, this email should not be read or disclosed to any other person. Please notify the sender immediately and delete this email from your computer system. Any opinions expressed are not necessarily those of the company from which this email was sent and, whilst to the best of our knowledge no viruses or defects exist, no responsibility can be accepted for any loss or damage arising from its receipt or subsequent use of this email._______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss (See attached file: graycol.gif) _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: graycol.gif Type: image/gif Size: 105 bytes Desc: not available URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: graycol.gif Type: image/gif Size: 105 bytes Desc: not available URL: From jtucker at pixitmedia.com Tue Sep 13 23:28:22 2016 From: jtucker at pixitmedia.com (Jez Tucker) Date: Tue, 13 Sep 2016 23:28:22 +0100 Subject: [gpfsug-discuss] gpfs snapshots In-Reply-To: References: <20160912223019.s665s3ccoltdkq3l@ics.muni.cz> <20635.1473741144@turing-police.cc.vt.edu> <00e74006-8f99-280f-8832-1cee019c4b07@pixitmedia.com> Message-ID: <2336bbd5-39ca-dc0d-e1b4-7a301c6b9f2e@pixitmedia.com> Hey So yes, you're quite right - we have higher order fault tolerant cluster wide methods of dealing with such requirements already. However, I still think the end user should be empowered to be able construct such methods themselves if needs be. Yes, the comment is merely an aid [but also useful as a generic comment field] and as such could be utilised to encode basic metadata into the comment field. I'll log an RFE and see where we go from here. Cheers Jez On 13/09/16 21:51, Yuri L Volobuev wrote: > > Hi Jez, > > It sounds to me like the functionality that you're _really_ looking > for is an ability to to do automated snapshot management, similar to > what's available on other storage systems. For example, "create a new > snapshot of filesets X, Y, Z every 30 min, keep the last 16 > snapshots". I've seen many examples of sysadmins rolling their own > snapshot management system along those lines, and an ability to add an > expiration string as a snapshot "comment" appears to be merely an aid > in keeping such DIY snapshot management scripts a bit simpler -- not > by much though. The end user would still be on the hook for some heavy > lifting, in particular figuring out a way to run an equivalent of a > cluster-aware cron with acceptable fault tolerance semantics. That is, > if a snapshot creation is scheduled, only one node in the cluster > should attempt to create the snapshot, but if that node fails, another > node needs to step in (as opposed to skipping the scheduled snapshot > creation). 
This is doable outside of GPFS, of course, but is not > trivial. Architecturally, the right place to implement a > fault-tolerant cluster-aware scheduling framework is GPFS itself, as > the most complex pieces are already there. We have some plans for work > along those lines, but if you want to reinforce the point with an RFE, > that would be fine, too. > > yuri > > Inactive hide details for Jez Tucker ---09/13/2016 02:10:31 AM---Hey > Yuri, Perhaps an RFE here, but could I suggest there isJez Tucker > ---09/13/2016 02:10:31 AM---Hey Yuri, Perhaps an RFE here, but could I > suggest there is much value in > > From: Jez Tucker > To: gpfsug-discuss at spectrumscale.org, > Date: 09/13/2016 02:10 AM > Subject: Re: [gpfsug-discuss] gpfs snapshots > Sent by: gpfsug-discuss-bounces at spectrumscale.org > > ------------------------------------------------------------------------ > > > > Hey Yuri, > > Perhaps an RFE here, but could I suggest there is much value in > adding a -c option to mmcrsnapshot? > > Use cases: > > mmcrsnapshot myfsname @GMT-2016.09.13-10.00.00 -j myfilesetname -c > "Before phase 2" > > and > > mmcrsnapshot myfsname @GMT-2016.09.13-10.00.00 -j myfilesetname -c > "expire:GMT-2017.04.21-16.00.00" > > Ideally also: mmcrsnapshot fs1 > fset1:snapA:expirestr,fset2:snapB:expirestr,fset3:snapC:expirestr > > Then it's easy to iterate over snapshots and subsequently > mmdelsnapshot snaps which are no longer required. > There are lots of methods to achieve this, but without external > databases / suchlike, this is rather simple and effective for end users. > > Alternatively a second comment like -expire flag as user metadata may > be preferential. > > Thoughts? > > Jez > > > On 13/09/16 05:32, _Valdis.Kletnieks at vt.edu_ > wrote: > > On Tue, 13 Sep 2016 00:30:19 +0200, Lukas Hejtmanek said: > I guess we could reach snapid 100,000. > > It probably stores the snap ID as a 32 or 64 bit int, so 100K > is peanuts. > > What you *do* want to do is make the snap *name* meaningful, using > a timestamp or something to keep your sanity. > > mmcrsnapshot fs923 `date +%y%m%d-%H%M` or similar. > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > _http://gpfsug.org/mailman/listinfo/gpfsug-discuss_ > > > > -- > Jez Tucker > Head of Research & Product Development > Pixit Media_ > __www.pixitmedia.com_ > > > This email is confidential in that it is intended for the exclusive > attention of the addressee(s) indicated. If you are not the intended > recipient, this email should not be read or disclosed to any other > person. Please notify the sender immediately and delete this email > from your computer system. 
Any opinions expressed are not necessarily > those of the company from which this email was sent and, whilst to the > best of our knowledge no viruses or defects exist, no responsibility > can be accepted for any loss or damage arising from its receipt or > subsequent use of this > email._______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > > > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss -- Jez Tucker Head of Research & Product Development Pixit Media Mobile: +44 (0) 776 419 3820 www.pixitmedia.com -- This email is confidential in that it is intended for the exclusive attention of the addressee(s) indicated. If you are not the intended recipient, this email should not be read or disclosed to any other person. Please notify the sender immediately and delete this email from your computer system. Any opinions expressed are not necessarily those of the company from which this email was sent and, whilst to the best of our knowledge no viruses or defects exist, no responsibility can be accepted for any loss or damage arising from its receipt or subsequent use of this email. -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: not available Type: image/gif Size: 105 bytes Desc: not available URL: From service at metamodul.com Wed Sep 14 19:10:37 2016 From: service at metamodul.com (service at metamodul.com) Date: Wed, 14 Sep 2016 20:10:37 +0200 Subject: [gpfsug-discuss] gpfs snapshots Message-ID: Why not use a GPFS user extented attribut for that ? In a certain way i see GPFS as a database. ^_^ Hajo Von Samsung Mobile gesendet
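Purely as a hypothetical sketch of that idea (the marker file, junction path and attribute name below are invented, and it assumes the usual .snapshots directory under an independent fileset's junction): stamp the expiry into the fileset with a user extended attribute just before the snapshot is taken, so the read-only snapshot carries it and a cleanup pass can read it back later.

junction=/gpfs/myfsname/myfilesetname                 # placeholder: the fileset junction path
touch $junction/.snapexpiry                           # hypothetical marker file
setfattr -n user.snapexpire -v "GMT-2017.04.21-16.00.00" $junction/.snapexpiry
mmcrsnapshot myfsname @GMT-2016.09.13-10.00.00 -j myfilesetname
# later, read the expiry back from the snapshot's copy of the marker:
getfattr -n user.snapexpire --only-values \
    $junction/.snapshots/@GMT-2016.09.13-10.00.00/.snapexpiry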
-------- Original message --------
From: Jez Tucker
Date: 2016.09.13 11:10 (GMT+01:00)
To: gpfsug-discuss at spectrumscale.org
Subject: Re: [gpfsug-discuss] gpfs snapshots
Hey Yuri, Perhaps an RFE here, but could I suggest there is much value in adding a -c option to mmcrsnapshot? Use cases: mmcrsnapshot myfsname @GMT-2016.09.13-10.00.00 -j myfilesetname -c "Before phase 2" and mmcrsnapshot myfsname @GMT-2016.09.13-10.00.00 -j myfilesetname -c "expire:GMT-2017.04.21-16.00.00" Ideally also: mmcrsnapshot fs1 fset1:snapA:expirestr,fset2:snapB:expirestr,fset3:snapC:expirestr Then it's easy to iterate over snapshots and subsequently mmdelsnapshot snaps which are no longer required. There are lots of methods to achieve this, but without external databases / suchlike, this is rather simple and effective for end users. Alternatively a second comment like -expire flag as user metadata may be preferential. Thoughts? Jez On 13/09/16 05:32, Valdis.Kletnieks at vt.edu wrote: On Tue, 13 Sep 2016 00:30:19 +0200, Lukas Hejtmanek said: I guess we could reach snapid 100,000. It probably stores the snap ID as a 32 or 64 bit int, so 100K is peanuts. What you *do* want to do is make the snap *name* meaningful, using a timestamp or something to keep your sanity. mmcrsnapshot fs923 `date +%y%m%d-%H%M` or similar. _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss -- Jez Tucker Head of Research & Product Development Pixit Media www.pixitmedia.com This email is confidential in that it is intended for the exclusive attention of the addressee(s) indicated. If you are not the intended recipient, this email should not be read or disclosed to any other person. Please notify the sender immediately and delete this email from your computer system. Any opinions expressed are not necessarily those of the company from which this email was sent and, whilst to the best of our knowledge no viruses or defects exist, no responsibility can be accepted for any loss or damage arising from its receipt or subsequent use of this email. -------------- next part -------------- An HTML attachment was scrubbed... URL: From service at metamodul.com Wed Sep 14 19:21:20 2016 From: service at metamodul.com (service at metamodul.com) Date: Wed, 14 Sep 2016 20:21:20 +0200 Subject: [gpfsug-discuss] gpfs snapshots Message-ID: <4fojjlpuwqoalkffaahy7snf.1473877280415@email.android.com> I am missing since ages such a framework. I had my simple one devoloped on the gpfs callbacks which allowed me to have a centralized cron (HA) up to oracle also ?high available and ha nfs on Aix. Hajo Universal Inventor? -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From jtucker at pixitmedia.com Wed Sep 14 19:49:36 2016 From: jtucker at pixitmedia.com (Jez Tucker) Date: Wed, 14 Sep 2016 19:49:36 +0100 Subject: [gpfsug-discuss] gpfs snapshots In-Reply-To: References: Message-ID: Hi I still think I'm coming down on the side of simplistic ease of use: Example: [jtucker at pixstor ~]# mmlssnapshot mmfs1 Snapshots in file system mmfs1: Directory SnapId Status Created Fileset Comment @GMT-2016.09.13-23.00.14 551 Valid Wed Sep 14 00:00:02 2016 myproject Prior to phase 1 @GMT-2016.09.14-05.00.14 552 Valid Wed Sep 14 06:00:01 2016 myproject Added this and that @GMT-2016.09.14-11.00.14 553 Valid Wed Sep 14 12:00:01 2016 myproject Merged project2 @GMT-2016.09.14-17.00.14 554 Valid Wed Sep 14 18:00:02 2016 myproject Before clean of .xmp @GMT-2016.09.14-17.05.30 555 Valid Wed Sep 14 18:05:03 2016 myproject Archival Jez On 14/09/16 19:10, service at metamodul.com wrote: > Why not use a GPFS user extented attribut for that ? > In a certain way i see GPFS as a database. ^_^ > Hajo > > > > Von Samsung Mobile gesendet > > > -------- Urspr?ngliche Nachricht -------- > Von: Jez Tucker > Datum:2016.09.13 11:10 (GMT+01:00) > An: gpfsug-discuss at spectrumscale.org > Betreff: Re: [gpfsug-discuss] gpfs snapshots > > Hey Yuri, > > Perhaps an RFE here, but could I suggest there is much value in > adding a -c option to mmcrsnapshot? > > Use cases: > > mmcrsnapshot myfsname @GMT-2016.09.13-10.00.00 -j myfilesetname -c > "Before phase 2" > > and > > mmcrsnapshot myfsname @GMT-2016.09.13-10.00.00 -j myfilesetname -c > "expire:GMT-2017.04.21-16.00.00" > > Ideally also: mmcrsnapshot fs1 > fset1:snapA:expirestr,fset2:snapB:expirestr,fset3:snapC:expirestr > > Then it's easy to iterate over snapshots and subsequently > mmdelsnapshot snaps which are no longer required. > There are lots of methods to achieve this, but without external > databases / suchlike, this is rather simple and effective for end users. > > Alternatively a second comment like -expire flag as user metadata may > be preferential. > > Thoughts? > > Jez > > > On 13/09/16 05:32, Valdis.Kletnieks at vt.edu wrote: >> On Tue, 13 Sep 2016 00:30:19 +0200, Lukas Hejtmanek said: >>> I guess we could reach snapid 100,000. >> It probably stores the snap ID as a 32 or 64 bit int, so 100K is peanuts. >> >> What you *do* want to do is make the snap *name* meaningful, using >> a timestamp or something to keep your sanity. >> >> mmcrsnapshot fs923 `date +%y%m%d-%H%M` or similar. >> _______________________________________________ >> gpfsug-discuss mailing list >> gpfsug-discuss at spectrumscale.org >> http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > > -- > Jez Tucker > Head of Research & Product Development > Pixit Media > www.pixitmedia.com > -- Jez Tucker Head of Research & Product Development Pixit Media www.pixitmedia.com -- This email is confidential in that it is intended for the exclusive attention of the addressee(s) indicated. If you are not the intended recipient, this email should not be read or disclosed to any other person. Please notify the sender immediately and delete this email from your computer system. Any opinions expressed are not necessarily those of the company from which this email was sent and, whilst to the best of our knowledge no viruses or defects exist, no responsibility can be accepted for any loss or damage arising from its receipt or subsequent use of this email. -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From secretary at gpfsug.org Thu Sep 15 09:42:54 2016 From: secretary at gpfsug.org (Secretary GPFS UG) Date: Thu, 15 Sep 2016 09:42:54 +0100 Subject: [gpfsug-discuss] SSUG Meet the Devs - Cloud Workshop Message-ID: Hi everyone, Back by popular demand! We are holding a UK 'Meet the Developers' event to focus on Cloud topics. We are very lucky to have Dean Hildebrand, Master Inventor, Cloud Storage Software from IBM over in the UK to lead this session. IS IT FOR ME? Slightly different to the past meet the devs format, this is a cloud workshop aimed at looking how Spectrum Scale fits in the world of cloud. Rather than being a series of presentations and discussions led by IBM, this workshop aims to look at how Spectrum Scale can be used in cloud environments. This will include using Spectrum Scale as an infrastructure tool to power private cloud deployments. We will also look at the challenges of accessing data from cloud deployments and discuss ways in which this might be accomplished. If you are currently deploying OpenStack on Spectrum Scale, or plan to in the near future, then this workshop is for you. Also if you currently have Spectrum Scale and are wondering how you might get that data into cloud-enabled workloads or are currently doing so, then again you should attend. To ensure that the workshop is focused, numbers are limited and we will initially be limiting to 2 people per organisation/project/site. WHAT WILL BE DISCUSSED? Our topics for the day will include on-premise (private) clouds, on-premise self-service (public) clouds, off-premise clouds (Amazon etc.) as well as covering technologies including OpenStack, Docker, Kubernetes and security requirements around multi-tenancy. We probably don't have all the answers for these, but we'd like to understand the requirements and hear people's ideas. Please let us know what you would like to discuss when you register. Arrival is from 10:00 with discussion kicking off from 10:30. The agenda is open discussion though we do aim to talk over a number of key topics. We hope to have the ever popular (though usually late!) pizza for lunch. WHEN Thursday 20th October 2016 from 10:00 AM to 3:30 PM WHERE IT Services, University of Birmingham - Elms Road Edgbaston, Birmingham, B15 2TT REGISTER Please register for the event in advance: https://www.eventbrite.com/e/ssug-meet-the-devs-cloud-workshop-tickets-27725390389 [1] Numbers are limited and we will initially be limiting to 2 people per organisation/project/site. We look forward to seeing you there! -- Claire O'Toole Spectrum Scale/GPFS User Group Secretary +44 (0)7508 033896 www.spectrumscaleug.org Links: ------ [1] https://www.eventbrite.com/e/ssug-meet-the-devs-cloud-workshop-tickets-27725390389 -------------- next part -------------- An HTML attachment was scrubbed... URL: From peter.botcherby at kcl.ac.uk Thu Sep 15 09:45:47 2016 From: peter.botcherby at kcl.ac.uk (Botcherby, Peter) Date: Thu, 15 Sep 2016 08:45:47 +0000 Subject: [gpfsug-discuss] SSUG Meet the Devs - Cloud Workshop In-Reply-To: References: Message-ID: Hi Claire, Hope you are well - I will be away for this as going to Indonesia on the 18th October for my nephew?s wedding. Regards Peter From: gpfsug-discuss-bounces at spectrumscale.org [mailto:gpfsug-discuss-bounces at spectrumscale.org] On Behalf Of Secretary GPFS UG Sent: 15 September 2016 09:43 To: gpfsug main discussion list Subject: [gpfsug-discuss] SSUG Meet the Devs - Cloud Workshop Hi everyone, Back by popular demand! 
We are holding a UK 'Meet the Developers' event to focus on Cloud topics. We are very lucky to have Dean Hildebrand, Master Inventor, Cloud Storage Software from IBM over in the UK to lead this session. IS IT FOR ME? Slightly different to the past meet the devs format, this is a cloud workshop aimed at looking how Spectrum Scale fits in the world of cloud. Rather than being a series of presentations and discussions led by IBM, this workshop aims to look at how Spectrum Scale can be used in cloud environments. This will include using Spectrum Scale as an infrastructure tool to power private cloud deployments. We will also look at the challenges of accessing data from cloud deployments and discuss ways in which this might be accomplished. If you are currently deploying OpenStack on Spectrum Scale, or plan to in the near future, then this workshop is for you. Also if you currently have Spectrum Scale and are wondering how you might get that data into cloud-enabled workloads or are currently doing so, then again you should attend. To ensure that the workshop is focused, numbers are limited and we will initially be limiting to 2 people per organisation/project/site. WHAT WILL BE DISCUSSED? Our topics for the day will include on-premise (private) clouds, on-premise self-service (public) clouds, off-premise clouds (Amazon etc.) as well as covering technologies including OpenStack, Docker, Kubernetes and security requirements around multi-tenancy. We probably don't have all the answers for these, but we'd like to understand the requirements and hear people's ideas. Please let us know what you would like to discuss when you register. Arrival is from 10:00 with discussion kicking off from 10:30. The agenda is open discussion though we do aim to talk over a number of key topics. We hope to have the ever popular (though usually late!) pizza for lunch. WHEN Thursday 20th October 2016 from 10:00 AM to 3:30 PM WHERE IT Services, University of Birmingham - Elms Road Edgbaston, Birmingham, B15 2TT REGISTER Please register for the event in advance: https://www.eventbrite.com/e/ssug-meet-the-devs-cloud-workshop-tickets-27725390389 Numbers are limited and we will initially be limiting to 2 people per organisation/project/site. We look forward to seeing you there! -- Claire O'Toole Spectrum Scale/GPFS User Group Secretary +44 (0)7508 033896 www.spectrumscaleug.org -------------- next part -------------- An HTML attachment was scrubbed... URL: From mimarsh2 at vt.edu Thu Sep 15 17:49:27 2016 From: mimarsh2 at vt.edu (Brian Marshall) Date: Thu, 15 Sep 2016 12:49:27 -0400 Subject: [gpfsug-discuss] EDR and omnipath Message-ID: All, I see in the GPFS FAQ A6.3 the statement below. Is it possible to have GPFS do RDMA over EDR infiniband and non-RDMA communication over omnipath (IP over fabric) when each NSD server has an EDR card and a OPA card installed? RDMA is not supported on a node when both Mellanox HCAs and Intel Omni-Path HFIs are enabled for RDMA. -------------- next part -------------- An HTML attachment was scrubbed... URL: From r.sobey at imperial.ac.uk Thu Sep 15 16:33:17 2016 From: r.sobey at imperial.ac.uk (Sobey, Richard A) Date: Thu, 15 Sep 2016 15:33:17 +0000 Subject: [gpfsug-discuss] Bit of a rant about snapshot command syntax changes Message-ID: Why was mmdelsnapshot (and possibly other snapshot related commands) changed from: Mmdelsnapshot device snapshotname -j filesetname TO Mmdelsnapshot device filesetname:snapshotname ..between 4.2.0 and 4.2.1? It's mildly irritating to say the least! 
Richard -------------- next part -------------- An HTML attachment was scrubbed... URL: From makaplan at us.ibm.com Fri Sep 16 15:21:58 2016 From: makaplan at us.ibm.com (Marc A Kaplan) Date: Fri, 16 Sep 2016 10:21:58 -0400 Subject: [gpfsug-discuss] Bit of a rant about snapshot command syntax changes In-Reply-To: References: Message-ID: I think at least the most popular old forms still work, even if the documentation and usage messages were scrubbed. So generally, for example, neither your scripts nor your fingers will break. ;-) From: "Sobey, Richard A" To: "'gpfsug-discuss at spectrumscale.org'" Date: 09/16/2016 07:02 AM Subject: [gpfsug-discuss] Bit of a rant about snapshot command syntax changes Sent by: gpfsug-discuss-bounces at spectrumscale.org Why was mmdelsnapshot (and possibly other snapshot related commands) changed from: Mmdelsnapshot device snapshotname ?j filesetname TO Mmdelsnapshot device filesetname:snapshotname ..between 4.2.0 and 4.2.1? It?s mildly irritating to say the least! Richard _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss -------------- next part -------------- An HTML attachment was scrubbed... URL: From r.sobey at imperial.ac.uk Fri Sep 16 15:40:52 2016 From: r.sobey at imperial.ac.uk (Sobey, Richard A) Date: Fri, 16 Sep 2016 14:40:52 +0000 Subject: [gpfsug-discuss] Bit of a rant about snapshot command syntax changes In-Reply-To: References: Message-ID: Thanks Marc. Regrettably in this case, the only way I knew to delete a snapshot (listed below) has broken going from 3.5 to 4.2.1. Creating snaps has suffered the same fate. From: gpfsug-discuss-bounces at spectrumscale.org [mailto:gpfsug-discuss-bounces at spectrumscale.org] On Behalf Of Marc A Kaplan Sent: 16 September 2016 15:22 To: gpfsug main discussion list Subject: Re: [gpfsug-discuss] Bit of a rant about snapshot command syntax changes I think at least the most popular old forms still work, even if the documentation and usage messages were scrubbed. So generally, for example, neither your scripts nor your fingers will break. ;-) From: "Sobey, Richard A" > To: "'gpfsug-discuss at spectrumscale.org'" > Date: 09/16/2016 07:02 AM Subject: [gpfsug-discuss] Bit of a rant about snapshot command syntax changes Sent by: gpfsug-discuss-bounces at spectrumscale.org ________________________________ Why was mmdelsnapshot (and possibly other snapshot related commands) changed from: Mmdelsnapshot device snapshotname ?j filesetname TO Mmdelsnapshot device filesetname:snapshotname ..between 4.2.0 and 4.2.1? It?s mildly irritating to say the least! Richard _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss -------------- next part -------------- An HTML attachment was scrubbed... URL: From Paul.Sanchez at deshaw.com Fri Sep 16 20:49:14 2016 From: Paul.Sanchez at deshaw.com (Sanchez, Paul) Date: Fri, 16 Sep 2016 19:49:14 +0000 Subject: [gpfsug-discuss] Bit of a rant about snapshot command syntax changes In-Reply-To: References: Message-ID: <3e1f02b30e1a49ef950de7910801f5d1@mbxtoa1.winmail.deshaw.com> The old syntax works unless have a colon in your snapshot names. In that case, the portion before the first colon will be interpreted as a fileset name. 
So if you use RFC 3339/ISO 8601 date/times, that?s a problem: The syntax for creating and deleting snapshots goes from this: mm{cr|del}snapshot fs100 SNAP at 2016-07-31T13:00:07Z ?j 1000466 to this: mm{cr|del}snapshot fs100 1000466:SNAP at 2016-07-31T13:00:07Z If you are dealing with filesystem level snapshots then you just need a leading colon: mm{cr|del}snapshot fs100 :SNAP at 2016-07-31T13:00:07Z Thx Paul From: gpfsug-discuss-bounces at spectrumscale.org [mailto:gpfsug-discuss-bounces at spectrumscale.org] On Behalf Of Marc A Kaplan Sent: Friday, September 16, 2016 10:22 AM To: gpfsug main discussion list Subject: Re: [gpfsug-discuss] Bit of a rant about snapshot command syntax changes I think at least the most popular old forms still work, even if the documentation and usage messages were scrubbed. So generally, for example, neither your scripts nor your fingers will break. ;-) From: "Sobey, Richard A" > To: "'gpfsug-discuss at spectrumscale.org'" > Date: 09/16/2016 07:02 AM Subject: [gpfsug-discuss] Bit of a rant about snapshot command syntax changes Sent by: gpfsug-discuss-bounces at spectrumscale.org ________________________________ Why was mmdelsnapshot (and possibly other snapshot related commands) changed from: Mmdelsnapshot device snapshotname ?j filesetname TO Mmdelsnapshot device filesetname:snapshotname ..between 4.2.0 and 4.2.1? It?s mildly irritating to say the least! Richard _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss -------------- next part -------------- An HTML attachment was scrubbed... URL: From viccornell at gmail.com Mon Sep 19 08:11:38 2016 From: viccornell at gmail.com (Vic Cornell) Date: Mon, 19 Sep 2016 08:11:38 +0100 Subject: [gpfsug-discuss] EDR and omnipath In-Reply-To: References: Message-ID: <87E46193-4D65-41A3-AB0E-B12987F6FFC3@gmail.com> Bump I can see no reason why that wouldn't work. But it would be nice to a have an official answer or evidence that it works. Vic > On 15 Sep 2016, at 5:49 pm, Brian Marshall wrote: > > All, > > I see in the GPFS FAQ A6.3 the statement below. Is it possible to have GPFS do RDMA over EDR infiniband and non-RDMA communication over omnipath (IP over fabric) when each NSD server has an EDR card and a OPA card installed? > > > > RDMA is not supported on a node when both Mellanox HCAs and Intel Omni-Path HFIs are enabled for RDMA. > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss -------------- next part -------------- An HTML attachment was scrubbed... URL: From mweil at wustl.edu Mon Sep 19 20:18:18 2016 From: mweil at wustl.edu (Matt Weil) Date: Mon, 19 Sep 2016 14:18:18 -0500 Subject: [gpfsug-discuss] increasing inode Message-ID: All, What exactly happens that makes the clients hang when a file set inodes are increased? ________________________________ The materials in this message are private and may contain Protected Healthcare Information or other information of a sensitive nature. If you are not the intended recipient, be advised that any unauthorized use, disclosure, copying or the taking of any action in reliance on the contents of this information is strictly prohibited. If you have received this email in error, please immediately notify the sender via telephone or return mail. 
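Not an answer to the "what exactly happens" part, but for anyone planning the operation, a minimal sketch of inspecting and pre-growing a fileset's inode allocation during a quiet window; the file system and fileset names are placeholders and exact option spellings vary a little between releases:

mmlsfileset gpfs0 somefileset -L          # current maximum and allocated inodes for the fileset
# raise the maximum and preallocate part of it in one step, at a planned time
mmchfileset gpfs0 somefileset --inode-limit 20000000:10000000
mmdf gpfs0 -F                             # inode usage summary for the file system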
From aaron.s.knister at nasa.gov Mon Sep 19 21:34:53 2016 From: aaron.s.knister at nasa.gov (Aaron Knister) Date: Mon, 19 Sep 2016 16:34:53 -0400 Subject: [gpfsug-discuss] EDR and omnipath In-Reply-To: <87E46193-4D65-41A3-AB0E-B12987F6FFC3@gmail.com> References: <87E46193-4D65-41A3-AB0E-B12987F6FFC3@gmail.com> Message-ID: <6f33c20f-ff8f-9ccc-4609-920b35058138@nasa.gov> I must admit, I'm curious as to why one cannot use GPFS with IB and OPA both in RDMA mode. Granted, I know very little about OPA but if it just presents as another verbs device I wonder why it wouldn't "Just work" as long as GPFS is configured correctly. -Aaron On 9/19/16 3:11 AM, Vic Cornell wrote: > Bump > > I can see no reason why that wouldn't work. But it would be nice to a > have an official answer or evidence that it works. > > Vic > > >> On 15 Sep 2016, at 5:49 pm, Brian Marshall > > wrote: >> >> All, >> >> I see in the GPFS FAQ A6.3 the statement below. Is it possible to >> have GPFS do RDMA over EDR infiniband and non-RDMA communication over >> omnipath (IP over fabric) when each NSD server has an EDR card and a >> OPA card installed? >> >> >> >> RDMA is not supported on a node when both Mellanox HCAs and Intel >> Omni-Path HFIs are enabled for RDMA. >> >> _______________________________________________ >> gpfsug-discuss mailing list >> gpfsug-discuss at spectrumscale.org >> http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > -- Aaron Knister NASA Center for Climate Simulation (Code 606.2) Goddard Space Flight Center (301) 286-2776 From oehmes at us.ibm.com Mon Sep 19 21:43:31 2016 From: oehmes at us.ibm.com (Sven Oehme) Date: Mon, 19 Sep 2016 20:43:31 +0000 Subject: [gpfsug-discuss] EDR and omnipath In-Reply-To: <6f33c20f-ff8f-9ccc-4609-920b35058138@nasa.gov> Message-ID: Because they both require a different distribution of OFED, which are mutual exclusive to install. in theory if you deploy plain OFED it might work, but that will be hard to find somebody to support. Sent from IBM Verse Aaron Knister --- Re: [gpfsug-discuss] EDR and omnipath --- From:"Aaron Knister" To:gpfsug-discuss at spectrumscale.orgDate:Mon, Sep 19, 2016 1:35 PMSubject:Re: [gpfsug-discuss] EDR and omnipath I must admit, I'm curious as to why one cannot use GPFS with IB and OPA both in RDMA mode. Granted, I know very little about OPA but if it just presents as another verbs device I wonder why it wouldn't "Just work" as long as GPFS is configured correctly.-AaronOn 9/19/16 3:11 AM, Vic Cornell wrote:> Bump>> I can see no reason why that wouldn't work. But it would be nice to a> have an official answer or evidence that it works.>> Vic>>>> On 15 Sep 2016, at 5:49 pm, Brian Marshall > > wrote:>>>> All,>>>> I see in the GPFS FAQ A6.3 the statement below. 
Is it possible to>> have GPFS do RDMA over EDR infiniband and non-RDMA communication over>> omnipath (IP over fabric) when each NSD server has an EDR card and a>> OPA card installed?>>>>>>>> RDMA is not supported on a node when both Mellanox HCAs and Intel>> Omni-Path HFIs are enabled for RDMA.>>>> _______________________________________________>> gpfsug-discuss mailing list>> gpfsug-discuss at spectrumscale.org >> http://gpfsug.org/mailman/listinfo/gpfsug-discuss>>>> _______________________________________________> gpfsug-discuss mailing list> gpfsug-discuss at spectrumscale.org> http://gpfsug.org/mailman/listinfo/gpfsug-discuss>-- Aaron KnisterNASA Center for Climate Simulation (Code 606.2)Goddard Space Flight Center(301) 286-2776_______________________________________________gpfsug-discuss mailing listgpfsug-discuss at spectrumscale.orghttp://gpfsug.org/mailman/listinfo/gpfsug-discuss -------------- next part -------------- An HTML attachment was scrubbed... URL: From aaron.s.knister at nasa.gov Mon Sep 19 21:55:32 2016 From: aaron.s.knister at nasa.gov (Aaron Knister) Date: Mon, 19 Sep 2016 16:55:32 -0400 Subject: [gpfsug-discuss] EDR and omnipath In-Reply-To: References: Message-ID: Ah, that makes complete sense. Thanks! I had been doing some reading about OmniPath and for some reason was under the impression the OmniPath adapter could load itself as a driver under the verbs stack of OFED. Even so, that raises support concerns as you say. I wonder what folks are doing who have IB-based block storage fabrics but wanting to connect to OmniPath-based fabrics? I'm also curious how GNR customers would be able to serve both IB-based and an OmniPath-based fabrics over RDMA where performance is best? This is is along the lines of my GPFS protocol router question from the other day. -Aaron On 9/19/16 4:43 PM, Sven Oehme wrote: > Because they both require a different distribution of OFED, which are > mutual exclusive to install. > in theory if you deploy plain OFED it might work, but that will be hard > to find somebody to support. > > > Sent from IBM Verse > > Aaron Knister --- Re: [gpfsug-discuss] EDR and omnipath --- > > From: "Aaron Knister" > To: gpfsug-discuss at spectrumscale.org > Date: Mon, Sep 19, 2016 1:35 PM > Subject: Re: [gpfsug-discuss] EDR and omnipath > > ------------------------------------------------------------------------ > > I must admit, I'm curious as to why one cannot use GPFS with IB and OPA > both in RDMA mode. Granted, I know very little about OPA but if it just > presents as another verbs device I wonder why it wouldn't "Just work" as > long as GPFS is configured correctly. > > -Aaron > > On 9/19/16 3:11 AM, Vic Cornell wrote: >> Bump >> >> I can see no reason why that wouldn't work. But it would be nice to a >> have an official answer or evidence that it works. >> >> Vic >> >> >>> On 15 Sep 2016, at 5:49 pm, Brian Marshall >> > wrote: >>> >>> All, >>> >>> I see in the GPFS FAQ A6.3 the statement below. Is it possible to >>> have GPFS do RDMA over EDR infiniband and non-RDMA communication over >>> omnipath (IP over fabric) when each NSD server has an EDR card and a >>> OPA card installed? >>> >>> >>> >>> RDMA is not supported on a node when both Mellanox HCAs and Intel >>> Omni-Path HFIs are enabled for RDMA. 
>>> >>> _______________________________________________ >>> gpfsug-discuss mailing list >>> gpfsug-discuss at spectrumscale.org >>> http://gpfsug.org/mailman/listinfo/gpfsug-discuss >> >> >> >> _______________________________________________ >> gpfsug-discuss mailing list >> gpfsug-discuss at spectrumscale.org >> http://gpfsug.org/mailman/listinfo/gpfsug-discuss >> > > -- > Aaron Knister > NASA Center for Climate Simulation (Code 606.2) > Goddard Space Flight Center > (301) 286-2776 > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > > > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > -- Aaron Knister NASA Center for Climate Simulation (Code 606.2) Goddard Space Flight Center (301) 286-2776 From aaron.s.knister at nasa.gov Mon Sep 19 22:03:51 2016 From: aaron.s.knister at nasa.gov (Aaron Knister) Date: Mon, 19 Sep 2016 17:03:51 -0400 Subject: [gpfsug-discuss] EDR and omnipath In-Reply-To: References: Message-ID: <99103c73-baf0-f421-f64d-1d5ee916d340@nasa.gov> Here's where I read about the inter-operability of the two: http://www.intel.com/content/dam/www/public/us/en/documents/white-papers/omni-path-storage-white-paper.pdf This is what Intel says: > In a multi-homed file system server, or in a Lustre Networking (LNet) or IP router, a single OpenFabrics Al- liance (OFA) software environment supporting both an Intel OPA HFI and a Mellanox* InfiniBand HCA is required. The OFA software stack is architected to support multiple tar- geted network types. Currently, the OFA stack simultaneously supports iWARP for Ethernet, RDMA over Converged Ethernet (RoCE), and InfiniBand networks, and the Intel OPA network has been added to that list. As the OS distributions implement their OFA stacks, it will be validated to simultaneously support both Intel OPA Host > Intel is working closely with the major Linux distributors, including Red Hat* and SUSE*, to ensure that Intel OPA support is integrated into their OFA implementation. Once this is accomplished, then simultaneous Mellanox InfiniBand and Intel OPA support will be present in the standard Linux distributions. So it seems as though Intel is relying on the OS vendors to bridge the support gap between them and Mellanox. -Aaron On 9/19/16 4:55 PM, Aaron Knister wrote: > Ah, that makes complete sense. Thanks! > > I had been doing some reading about OmniPath and for some reason was > under the impression the OmniPath adapter could load itself as a driver > under the verbs stack of OFED. Even so, that raises support concerns as > you say. > > I wonder what folks are doing who have IB-based block storage fabrics > but wanting to connect to OmniPath-based fabrics? > > I'm also curious how GNR customers would be able to serve both IB-based > and an OmniPath-based fabrics over RDMA where performance is best? This > is is along the lines of my GPFS protocol router question from the other > day. > > -Aaron > > On 9/19/16 4:43 PM, Sven Oehme wrote: >> Because they both require a different distribution of OFED, which are >> mutual exclusive to install. >> in theory if you deploy plain OFED it might work, but that will be hard >> to find somebody to support. 
>> >> >> Sent from IBM Verse >> >> Aaron Knister --- Re: [gpfsug-discuss] EDR and omnipath --- >> >> From: "Aaron Knister" >> To: gpfsug-discuss at spectrumscale.org >> Date: Mon, Sep 19, 2016 1:35 PM >> Subject: Re: [gpfsug-discuss] EDR and omnipath >> >> ------------------------------------------------------------------------ >> >> I must admit, I'm curious as to why one cannot use GPFS with IB and OPA >> both in RDMA mode. Granted, I know very little about OPA but if it just >> presents as another verbs device I wonder why it wouldn't "Just work" as >> long as GPFS is configured correctly. >> >> -Aaron >> >> On 9/19/16 3:11 AM, Vic Cornell wrote: >>> Bump >>> >>> I can see no reason why that wouldn't work. But it would be nice to a >>> have an official answer or evidence that it works. >>> >>> Vic >>> >>> >>>> On 15 Sep 2016, at 5:49 pm, Brian Marshall >>> > wrote: >>>> >>>> All, >>>> >>>> I see in the GPFS FAQ A6.3 the statement below. Is it possible to >>>> have GPFS do RDMA over EDR infiniband and non-RDMA communication over >>>> omnipath (IP over fabric) when each NSD server has an EDR card and a >>>> OPA card installed? >>>> >>>> >>>> >>>> RDMA is not supported on a node when both Mellanox HCAs and Intel >>>> Omni-Path HFIs are enabled for RDMA. >>>> >>>> _______________________________________________ >>>> gpfsug-discuss mailing list >>>> gpfsug-discuss at spectrumscale.org >>>> http://gpfsug.org/mailman/listinfo/gpfsug-discuss >>> >>> >>> >>> _______________________________________________ >>> gpfsug-discuss mailing list >>> gpfsug-discuss at spectrumscale.org >>> http://gpfsug.org/mailman/listinfo/gpfsug-discuss >>> >> >> -- >> Aaron Knister >> NASA Center for Climate Simulation (Code 606.2) >> Goddard Space Flight Center >> (301) 286-2776 >> _______________________________________________ >> gpfsug-discuss mailing list >> gpfsug-discuss at spectrumscale.org >> http://gpfsug.org/mailman/listinfo/gpfsug-discuss >> >> >> >> >> _______________________________________________ >> gpfsug-discuss mailing list >> gpfsug-discuss at spectrumscale.org >> http://gpfsug.org/mailman/listinfo/gpfsug-discuss >> > -- Aaron Knister NASA Center for Climate Simulation (Code 606.2) Goddard Space Flight Center (301) 286-2776 From aaron.s.knister at nasa.gov Tue Sep 20 14:22:51 2016 From: aaron.s.knister at nasa.gov (Aaron Knister) Date: Tue, 20 Sep 2016 09:22:51 -0400 Subject: [gpfsug-discuss] GPFS Routers In-Reply-To: References: <5F910253243E6A47B81A9A2EB424BBA101D137D1@NDMSMBX404.ndc.nasa.gov> <712e5024-e6eb-9195-d9cd-f59a9b145e60@nasa.gov> Message-ID: Hi Marc, Currently we serve three disparate infiniband fabrics with three separate sets of NSD servers all connected via FC to backend storage. I was exploring the idea of flipping that on its head and having one set of NSD servers but would like something akin to Lustre LNET routers to connect each fabric to the back-end NSD servers over IB. I know there's IB routers out there now but I'm quite drawn to the idea of a GPFS equivalent of Lustre LNET routers, having used them in the past. I suppose I could always smush some extra HCAs in the NSD servers and do it that way but that got really ugly when I started factoring in omnipath. Something like an LNET router would also be useful for GNR users who would like to present to both an IB and an OmniPath fabric over RDMA. 
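For reference, the "extra HCAs in the NSD servers" workaround can at least be expressed with today's knobs by giving each InfiniBand fabric its own fabric number in verbsPorts, so RDMA stays within a fabric; the node classes and device names below are placeholders, the change normally needs a daemon restart, and per the earlier note in this archive it does not help with mixing Mellanox and Omni-Path RDMA on one node:

# NSD servers carry one HCA per fabric; tag each port with a fabric number
mmchconfig verbsPorts="mlx5_0/1/1 mlx5_1/1/2" -N nsdServerClass
# clients advertise only the fabric they actually sit on
mmchconfig verbsPorts="mlx5_0/1/1" -N fabric1Clients
mmchconfig verbsPorts="mlx5_0/1/2" -N fabric2Clients
mmchconfig verbsRdma=enable -N nsdServerClass,fabric1Clients,fabric2Clients
# RDMA is only attempted between ports sharing a fabric number; everything
# else falls back to TCP/IP over whatever routed network exists

It still gives every server a foot on every fabric rather than the LNET-style indirection being asked for here, which is really the gap in question.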
-Aaron On 9/12/16 10:48 AM, Marc A Kaplan wrote: > Perhaps if you clearly describe what equipment and connections you have > in place and what you're trying to accomplish, someone on this board can > propose a solution. > > In principle, it's always possible to insert proxies/routers to "fake" > any two endpoints into "believing" they are communicating directly. > > > > > > From: Aaron Knister > To: > Date: 09/11/2016 08:01 PM > Subject: Re: [gpfsug-discuss] GPFS Routers > Sent by: gpfsug-discuss-bounces at spectrumscale.org > ------------------------------------------------------------------------ > > > > After some googling around, I wonder if perhaps what I'm thinking of was > an I/O forwarding layer that I understood was being developed for x86_64 > type machines rather than some type of GPFS protocol router or proxy. > > -Aaron > > On 9/11/16 5:02 PM, Knister, Aaron S. (GSFC-606.2)[COMPUTER SCIENCE > CORP] wrote: >> Hi Everyone, >> >> A while back I seem to recall hearing about a mechanism being developed >> that would function similarly to Lustre's LNET routers and effectively >> allow a single set of NSD servers to talk to multiple RDMA fabrics >> without requiring the NSD servers to have infiniband interfaces on each >> RDMA fabric. Rather, one would have a set of GPFS gateway nodes on each >> fabric that would in effect proxy the RDMA requests to the NSD server. >> Does anyone know what I'm talking about? Just curious if it's still on >> the roadmap. >> >> -Aaron >> >> >> _______________________________________________ >> gpfsug-discuss mailing list >> gpfsug-discuss at spectrumscale.org >> http://gpfsug.org/mailman/listinfo/gpfsug-discuss >> > > -- > Aaron Knister > NASA Center for Climate Simulation (Code 606.2) > Goddard Space Flight Center > (301) 286-2776 > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > > > > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > -- Aaron Knister NASA Center for Climate Simulation (Code 606.2) Goddard Space Flight Center (301) 286-2776 From makaplan at us.ibm.com Tue Sep 20 15:01:49 2016 From: makaplan at us.ibm.com (Marc A Kaplan) Date: Tue, 20 Sep 2016 10:01:49 -0400 Subject: [gpfsug-discuss] GPFS Routers In-Reply-To: References: <5F910253243E6A47B81A9A2EB424BBA101D137D1@NDMSMBX404.ndc.nasa.gov><712e5024-e6eb-9195-d9cd-f59a9b145e60@nasa.gov> Message-ID: Thanks for spelling out the situation more clearly. This is beyond my knowledge and expertise. But perhaps some other participants on this forum will chime in! I may be missing something, but asking "What is Lustre LNET?" via google does not yield good answers. It would be helpful to have some graphics (pictures!) of typical, useful configurations. Limiting myself to a few minutes of searching, I couldn't find any. I "get" that Lustre users/admin with lots of nodes and several switching fabrics find it useful, but beyond that... I guess the answer will be "Performance!" -- but the obvious question is: Why not "just" use IP - that is the Internetworking Protocol! So rather than sweat over LNET, why not improve IP to work better over several IBs? >From a user/customer point of view where "I needed this yesterday", short of having an "LNET for GPFS", I suggest considering reconfiguring your nodes, switches, storage to get better performance. 
If you need to buy some more hardware, so be it. --marc From: Aaron Knister To: Date: 09/20/2016 09:23 AM Subject: Re: [gpfsug-discuss] GPFS Routers Sent by: gpfsug-discuss-bounces at spectrumscale.org Hi Marc, Currently we serve three disparate infiniband fabrics with three separate sets of NSD servers all connected via FC to backend storage. I was exploring the idea of flipping that on its head and having one set of NSD servers but would like something akin to Lustre LNET routers to connect each fabric to the back-end NSD servers over IB. I know there's IB routers out there now but I'm quite drawn to the idea of a GPFS equivalent of Lustre LNET routers, having used them in the past. I suppose I could always smush some extra HCAs in the NSD servers and do it that way but that got really ugly when I started factoring in omnipath. Something like an LNET router would also be useful for GNR users who would like to present to both an IB and an OmniPath fabric over RDMA. -Aaron On 9/12/16 10:48 AM, Marc A Kaplan wrote: > Perhaps if you clearly describe what equipment and connections you have > in place and what you're trying to accomplish, someone on this board can > propose a solution. > > In principle, it's always possible to insert proxies/routers to "fake" > any two endpoints into "believing" they are communicating directly. > > > > > > From: Aaron Knister > To: > Date: 09/11/2016 08:01 PM > Subject: Re: [gpfsug-discuss] GPFS Routers > Sent by: gpfsug-discuss-bounces at spectrumscale.org > ------------------------------------------------------------------------ > > > > After some googling around, I wonder if perhaps what I'm thinking of was > an I/O forwarding layer that I understood was being developed for x86_64 > type machines rather than some type of GPFS protocol router or proxy. > > -Aaron > > On 9/11/16 5:02 PM, Knister, Aaron S. (GSFC-606.2)[COMPUTER SCIENCE > CORP] wrote: >> Hi Everyone, >> >> A while back I seem to recall hearing about a mechanism being developed >> that would function similarly to Lustre's LNET routers and effectively >> allow a single set of NSD servers to talk to multiple RDMA fabrics >> without requiring the NSD servers to have infiniband interfaces on each >> RDMA fabric. Rather, one would have a set of GPFS gateway nodes on each >> fabric that would in effect proxy the RDMA requests to the NSD server. >> Does anyone know what I'm talking about? Just curious if it's still on >> the roadmap. >> >> -Aaron >> >> >> _______________________________________________ >> gpfsug-discuss mailing list >> gpfsug-discuss at spectrumscale.org >> http://gpfsug.org/mailman/listinfo/gpfsug-discuss >> > > -- > Aaron Knister > NASA Center for Climate Simulation (Code 606.2) > Goddard Space Flight Center > (301) 286-2776 > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > > > > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > -- Aaron Knister NASA Center for Climate Simulation (Code 606.2) Goddard Space Flight Center (301) 286-2776 _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From aaron.s.knister at nasa.gov Tue Sep 20 15:07:38 2016 From: aaron.s.knister at nasa.gov (Knister, Aaron S. (GSFC-606.2)[COMPUTER SCIENCE CORP]) Date: Tue, 20 Sep 2016 14:07:38 +0000 Subject: [gpfsug-discuss] GPFS Routers References: [gpfsug-discuss] GPFS Routers Message-ID: <5F910253243E6A47B81A9A2EB424BBA101D24844@NDMSMBX404.ndc.nasa.gov> Not sure if this image will go through but here's one I found: [X] The "Routers" are LNET routers. LNET is just the name of lustre's network stack. The LNET routers "route" the Lustre protocol between disparate network types (quadrics, Ethernet, myrinet, carrier pigeon). Packet loss on carrier pigeon is particularly brutal, though. From: Marc A Kaplan Sent: 9/20/16, 10:02 AM To: gpfsug main discussion list Subject: Re: [gpfsug-discuss] GPFS Routers Thanks for spelling out the situation more clearly. This is beyond my knowledge and expertise. But perhaps some other participants on this forum will chime in! I may be missing something, but asking "What is Lustre LNET?" via google does not yield good answers. It would be helpful to have some graphics (pictures!) of typical, useful configurations. Limiting myself to a few minutes of searching, I couldn't find any. I "get" that Lustre users/admin with lots of nodes and several switching fabrics find it useful, but beyond that... I guess the answer will be "Performance!" -- but the obvious question is: Why not "just" use IP - that is the Internetworking Protocol! So rather than sweat over LNET, why not improve IP to work better over several IBs? >From a user/customer point of view where "I needed this yesterday", short of having an "LNET for GPFS", I suggest considering reconfiguring your nodes, switches, storage to get better performance. If you need to buy some more hardware, so be it. --marc From: Aaron Knister To: Date: 09/20/2016 09:23 AM Subject: Re: [gpfsug-discuss] GPFS Routers Sent by: gpfsug-discuss-bounces at spectrumscale.org ________________________________ Hi Marc, Currently we serve three disparate infiniband fabrics with three separate sets of NSD servers all connected via FC to backend storage. I was exploring the idea of flipping that on its head and having one set of NSD servers but would like something akin to Lustre LNET routers to connect each fabric to the back-end NSD servers over IB. I know there's IB routers out there now but I'm quite drawn to the idea of a GPFS equivalent of Lustre LNET routers, having used them in the past. I suppose I could always smush some extra HCAs in the NSD servers and do it that way but that got really ugly when I started factoring in omnipath. Something like an LNET router would also be useful for GNR users who would like to present to both an IB and an OmniPath fabric over RDMA. -Aaron On 9/12/16 10:48 AM, Marc A Kaplan wrote: > Perhaps if you clearly describe what equipment and connections you have > in place and what you're trying to accomplish, someone on this board can > propose a solution. > > In principle, it's always possible to insert proxies/routers to "fake" > any two endpoints into "believing" they are communicating directly. 
> > > > > > From: Aaron Knister > To: > Date: 09/11/2016 08:01 PM > Subject: Re: [gpfsug-discuss] GPFS Routers > Sent by: gpfsug-discuss-bounces at spectrumscale.org > ------------------------------------------------------------------------ > > > > After some googling around, I wonder if perhaps what I'm thinking of was > an I/O forwarding layer that I understood was being developed for x86_64 > type machines rather than some type of GPFS protocol router or proxy. > > -Aaron > > On 9/11/16 5:02 PM, Knister, Aaron S. (GSFC-606.2)[COMPUTER SCIENCE > CORP] wrote: >> Hi Everyone, >> >> A while back I seem to recall hearing about a mechanism being developed >> that would function similarly to Lustre's LNET routers and effectively >> allow a single set of NSD servers to talk to multiple RDMA fabrics >> without requiring the NSD servers to have infiniband interfaces on each >> RDMA fabric. Rather, one would have a set of GPFS gateway nodes on each >> fabric that would in effect proxy the RDMA requests to the NSD server. >> Does anyone know what I'm talking about? Just curious if it's still on >> the roadmap. >> >> -Aaron >> >> >> _______________________________________________ >> gpfsug-discuss mailing list >> gpfsug-discuss at spectrumscale.org >> http://gpfsug.org/mailman/listinfo/gpfsug-discuss >> > > -- > Aaron Knister > NASA Center for Climate Simulation (Code 606.2) > Goddard Space Flight Center > (301) 286-2776 > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > > > > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > -- Aaron Knister NASA Center for Climate Simulation (Code 606.2) Goddard Space Flight Center (301) 286-2776 _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss -------------- next part -------------- An HTML attachment was scrubbed... URL: From aaron.s.knister at nasa.gov Tue Sep 20 15:08:46 2016 From: aaron.s.knister at nasa.gov (Knister, Aaron S. (GSFC-606.2)[COMPUTER SCIENCE CORP]) Date: Tue, 20 Sep 2016 14:08:46 +0000 Subject: [gpfsug-discuss] GPFS Routers References: [gpfsug-discuss] GPFS Routers Message-ID: <5F910253243E6A47B81A9A2EB424BBA101D24881@NDMSMBX404.ndc.nasa.gov> Looks like the attachment got scrubbed. Here's the link http://docplayer.net/docs-images/39/19199001/images/7-0.png[X] From: aaron.s.knister at nasa.gov Sent: 9/20/16, 10:07 AM To: gpfsug main discussion list, gpfsug main discussion list Subject: Re: [gpfsug-discuss] GPFS Routers Not sure if this image will go through but here's one I found: [X] The "Routers" are LNET routers. LNET is just the name of lustre's network stack. The LNET routers "route" the Lustre protocol between disparate network types (quadrics, Ethernet, myrinet, carrier pigeon). Packet loss on carrier pigeon is particularly brutal, though. From: Marc A Kaplan Sent: 9/20/16, 10:02 AM To: gpfsug main discussion list Subject: Re: [gpfsug-discuss] GPFS Routers Thanks for spelling out the situation more clearly. This is beyond my knowledge and expertise. But perhaps some other participants on this forum will chime in! I may be missing something, but asking "What is Lustre LNET?" via google does not yield good answers. It would be helpful to have some graphics (pictures!) 
of typical, useful configurations. Limiting myself to a few minutes of searching, I couldn't find any. I "get" that Lustre users/admin with lots of nodes and several switching fabrics find it useful, but beyond that... I guess the answer will be "Performance!" -- but the obvious question is: Why not "just" use IP - that is the Internetworking Protocol! So rather than sweat over LNET, why not improve IP to work better over several IBs? >From a user/customer point of view where "I needed this yesterday", short of having an "LNET for GPFS", I suggest considering reconfiguring your nodes, switches, storage to get better performance. If you need to buy some more hardware, so be it. --marc From: Aaron Knister To: Date: 09/20/2016 09:23 AM Subject: Re: [gpfsug-discuss] GPFS Routers Sent by: gpfsug-discuss-bounces at spectrumscale.org ________________________________ Hi Marc, Currently we serve three disparate infiniband fabrics with three separate sets of NSD servers all connected via FC to backend storage. I was exploring the idea of flipping that on its head and having one set of NSD servers but would like something akin to Lustre LNET routers to connect each fabric to the back-end NSD servers over IB. I know there's IB routers out there now but I'm quite drawn to the idea of a GPFS equivalent of Lustre LNET routers, having used them in the past. I suppose I could always smush some extra HCAs in the NSD servers and do it that way but that got really ugly when I started factoring in omnipath. Something like an LNET router would also be useful for GNR users who would like to present to both an IB and an OmniPath fabric over RDMA. -Aaron On 9/12/16 10:48 AM, Marc A Kaplan wrote: > Perhaps if you clearly describe what equipment and connections you have > in place and what you're trying to accomplish, someone on this board can > propose a solution. > > In principle, it's always possible to insert proxies/routers to "fake" > any two endpoints into "believing" they are communicating directly. > > > > > > From: Aaron Knister > To: > Date: 09/11/2016 08:01 PM > Subject: Re: [gpfsug-discuss] GPFS Routers > Sent by: gpfsug-discuss-bounces at spectrumscale.org > ------------------------------------------------------------------------ > > > > After some googling around, I wonder if perhaps what I'm thinking of was > an I/O forwarding layer that I understood was being developed for x86_64 > type machines rather than some type of GPFS protocol router or proxy. > > -Aaron > > On 9/11/16 5:02 PM, Knister, Aaron S. (GSFC-606.2)[COMPUTER SCIENCE > CORP] wrote: >> Hi Everyone, >> >> A while back I seem to recall hearing about a mechanism being developed >> that would function similarly to Lustre's LNET routers and effectively >> allow a single set of NSD servers to talk to multiple RDMA fabrics >> without requiring the NSD servers to have infiniband interfaces on each >> RDMA fabric. Rather, one would have a set of GPFS gateway nodes on each >> fabric that would in effect proxy the RDMA requests to the NSD server. >> Does anyone know what I'm talking about? Just curious if it's still on >> the roadmap. 
>> >> -Aaron >> >> >> _______________________________________________ >> gpfsug-discuss mailing list >> gpfsug-discuss at spectrumscale.org >> http://gpfsug.org/mailman/listinfo/gpfsug-discuss >> > > -- > Aaron Knister > NASA Center for Climate Simulation (Code 606.2) > Goddard Space Flight Center > (301) 286-2776 > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > > > > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > -- Aaron Knister NASA Center for Climate Simulation (Code 606.2) Goddard Space Flight Center (301) 286-2776 _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss -------------- next part -------------- An HTML attachment was scrubbed... URL: From makaplan at us.ibm.com Tue Sep 20 15:30:43 2016 From: makaplan at us.ibm.com (Marc A Kaplan) Date: Tue, 20 Sep 2016 10:30:43 -0400 Subject: [gpfsug-discuss] GPFS Routers In-Reply-To: <5F910253243E6A47B81A9A2EB424BBA101D24881@NDMSMBX404.ndc.nasa.gov> References: [gpfsug-discuss] GPFS Routers <5F910253243E6A47B81A9A2EB424BBA101D24881@NDMSMBX404.ndc.nasa.gov> Message-ID: Thanks. That example is simpler than I imagined. Question: If that was indeed your situation and you could afford it, why not just go totally with infiniband switching/routing? Are not the routers just a hack to connect Intel OPA to IB? Ref: https://community.mellanox.com/docs/DOC-2384#jive_content_id_Network_Topology_Design -------------- next part -------------- An HTML attachment was scrubbed... URL: From xhejtman at ics.muni.cz Tue Sep 20 16:07:12 2016 From: xhejtman at ics.muni.cz (Lukas Hejtmanek) Date: Tue, 20 Sep 2016 17:07:12 +0200 Subject: [gpfsug-discuss] CES and nfs pseudo root Message-ID: <20160920150712.2v73hsf7pzrqb3g4@ics.muni.cz> Hello, ganesha allows to specify pseudo root for each export using Pseudo="path". mmnfs export sets pseudo path the same as export dir, e.g., I want to export /mnt/nfs, Pseudo is set to '/mnt/nfs' as well. Can I set somehow Pseudo to '/'? -- Luk?? Hejtm?nek From stef.coene at docum.org Tue Sep 20 18:42:57 2016 From: stef.coene at docum.org (Stef Coene) Date: Tue, 20 Sep 2016 19:42:57 +0200 Subject: [gpfsug-discuss] Ubuntu client Message-ID: <5985d614-ebe5-8c85-ec4b-02961e074502@docum.org> Hi, I just installed 4.2.1 on 2 RHEL 7.2 servers without any issue. But I also need 2 clients on Ubuntu 14.04. I installed the GPFS client on the Ubuntu server and used mmbuildgpl to build the required kernel modules. ssh keys are exchanged between GPFS servers and the client. But I can't add the node: [root at gpfs01 ~]# mmaddnode -N client1 Tue Sep 20 19:40:09 CEST 2016: mmaddnode: Processing node client1 mmremote: The CCR environment could not be initialized on node client1. mmaddnode: The CCR environment could not be initialized on node client1. mmaddnode: mmaddnode quitting. None of the specified nodes are valid. mmaddnode: Command failed. Examine previous error messages to determine cause. I don't see any error in /var/mmfs on client and server. What can I try to debug this error? 
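A quick pre-check worth running before mmaddnode is to confirm that every node resolves every other node's name the same way, and that passwordless ssh works in both directions. A minimal sketch, with gpfs01, gpfs02 and client1 standing in for your own node names:

  # from the node where mmaddnode is run
  for h in gpfs01 gpfs02 client1; do
    echo "== $h =="
    getent hosts "$h"                                            # local forward lookup
    ssh "$h" 'hostname -f; getent hosts gpfs01 gpfs02 client1'   # remote view of the same names
  done

Any node where these lookups disagree with /etc/hosts or DNS is worth fixing before retrying mmaddnode.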
Stef From stef.coene at docum.org Tue Sep 20 18:47:47 2016 From: stef.coene at docum.org (Stef Coene) Date: Tue, 20 Sep 2016 19:47:47 +0200 Subject: [gpfsug-discuss] Ubuntu client In-Reply-To: <5985d614-ebe5-8c85-ec4b-02961e074502@docum.org> References: <5985d614-ebe5-8c85-ec4b-02961e074502@docum.org> Message-ID: <3727524d-aa94-a09e-ebf7-a5d4e1c6f301@docum.org> On 09/20/2016 07:42 PM, Stef Coene wrote: > Hi, > > I just installed 4.2.1 on 2 RHEL 7.2 servers without any issue. > But I also need 2 clients on Ubuntu 14.04. > I installed the GPFS client on the Ubuntu server and used mmbuildgpl to > build the required kernel modules. > ssh keys are exchanged between GPFS servers and the client. > > But I can't add the node: > [root at gpfs01 ~]# mmaddnode -N client1 > Tue Sep 20 19:40:09 CEST 2016: mmaddnode: Processing node client1 > mmremote: The CCR environment could not be initialized on node client1. > mmaddnode: The CCR environment could not be initialized on node client1. > mmaddnode: mmaddnode quitting. None of the specified nodes are valid. > mmaddnode: Command failed. Examine previous error messages to determine > cause. > > I don't see any error in /var/mmfs on client and server. > > What can I try to debug this error? Pfff, problem solved. I tailed the logs in /var/adm/ras and found out there was a type in /etc/hosts so the hostname of the client was unresolvable. Stef From YARD at il.ibm.com Tue Sep 20 20:03:39 2016 From: YARD at il.ibm.com (Yaron Daniel) Date: Tue, 20 Sep 2016 22:03:39 +0300 Subject: [gpfsug-discuss] Ubuntu client In-Reply-To: <5985d614-ebe5-8c85-ec4b-02961e074502@docum.org> References: <5985d614-ebe5-8c85-ec4b-02961e074502@docum.org> Message-ID: Hi Check that kernel symbols are installed too Regards Yaron Daniel 94 Em Ha'Moshavot Rd Server, Storage and Data Services - Team Leader Petach Tiqva, 49527 Global Technology Services Israel Phone: +972-3-916-5672 Fax: +972-3-916-5672 Mobile: +972-52-8395593 e-mail: yard at il.ibm.com IBM Israel From: Stef Coene To: gpfsug main discussion list Date: 09/20/2016 08:43 PM Subject: [gpfsug-discuss] Ubuntu client Sent by: gpfsug-discuss-bounces at spectrumscale.org Hi, I just installed 4.2.1 on 2 RHEL 7.2 servers without any issue. But I also need 2 clients on Ubuntu 14.04. I installed the GPFS client on the Ubuntu server and used mmbuildgpl to build the required kernel modules. ssh keys are exchanged between GPFS servers and the client. But I can't add the node: [root at gpfs01 ~]# mmaddnode -N client1 Tue Sep 20 19:40:09 CEST 2016: mmaddnode: Processing node client1 mmremote: The CCR environment could not be initialized on node client1. mmaddnode: The CCR environment could not be initialized on node client1. mmaddnode: mmaddnode quitting. None of the specified nodes are valid. mmaddnode: Command failed. Examine previous error messages to determine cause. I don't see any error in /var/mmfs on client and server. What can I try to debug this error? Stef _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... 
Name: not available Type: image/gif Size: 1851 bytes Desc: not available URL: From olaf.weiser at de.ibm.com Wed Sep 21 04:35:57 2016 From: olaf.weiser at de.ibm.com (Olaf Weiser) Date: Wed, 21 Sep 2016 05:35:57 +0200 Subject: [gpfsug-discuss] Ubuntu client In-Reply-To: References: <5985d614-ebe5-8c85-ec4b-02961e074502@docum.org> Message-ID: An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: not available Type: image/gif Size: 1851 bytes Desc: not available URL: From stef.coene at docum.org Wed Sep 21 07:03:05 2016 From: stef.coene at docum.org (Stef Coene) Date: Wed, 21 Sep 2016 08:03:05 +0200 Subject: [gpfsug-discuss] Ubuntu client In-Reply-To: References: <5985d614-ebe5-8c85-ec4b-02961e074502@docum.org> Message-ID: <01a37d7a-b5ef-cb3e-5ccb-d5f942df6487@docum.org> On 09/21/2016 05:35 AM, Olaf Weiser wrote: > CCR issues are often related to DNS issues, so check, that you Ubuntu > nodes can resolve the existing nodes accordingly and vise versa > in one line: .. all nodes must be resolvable on every node It was a type in the hostname and /etc/hosts. So problem solved. Stef From xhejtman at ics.muni.cz Wed Sep 21 20:09:32 2016 From: xhejtman at ics.muni.cz (Lukas Hejtmanek) Date: Wed, 21 Sep 2016 21:09:32 +0200 Subject: [gpfsug-discuss] CES NFS with Kerberos Message-ID: <20160921190932.fibmmccccs5kit6x@ics.muni.cz> Hello, does nfs server (ganesha) work for someone with Kerberos authentication? I got random permission denied: :/mnt/nfs-test/tmp# for i in `seq 1 20`; do rm testf; dd if=/dev/zero of=testf bs=1M count=100000; done 100000+0 records in 100000+0 records out 104857600000 bytes (105 GB) copied, 642.849 s, 163 MB/s 100000+0 records in 100000+0 records out 104857600000 bytes (105 GB) copied, 925.326 s, 113 MB/s 100000+0 records in 100000+0 records out 104857600000 bytes (105 GB) copied, 762.749 s, 137 MB/s 100000+0 records in 100000+0 records out 104857600000 bytes (105 GB) copied, 860.608 s, 122 MB/s 100000+0 records in 100000+0 records out 104857600000 bytes (105 GB) copied, 788.62 s, 133 MB/s dd: error writing ?testf?: Permission denied 51949+0 records in 51948+0 records out 54471426048 bytes (54 GB) copied, 566.667 s, 96.1 MB/s 100000+0 records in 100000+0 records out 104857600000 bytes (105 GB) copied, 1082.63 s, 96.9 MB/s 100000+0 records in 100000+0 records out 104857600000 bytes (105 GB) copied, 1080.65 s, 97.0 MB/s 100000+0 records in 100000+0 records out 104857600000 bytes (105 GB) copied, 949.683 s, 110 MB/s dd: error writing ?testf?: Permission denied 30076+0 records in 30075+0 records out 31535923200 bytes (32 GB) copied, 308.009 s, 102 MB/s dd: error writing ?testf?: Permission denied 89837+0 records in 89836+0 records out 94199873536 bytes (94 GB) copied, 976.368 s, 96.5 MB/s It seems that it is a bug in ganesha: http://permalink.gmane.org/gmane.comp.file-systems.nfs.ganesha.devel/2000 but it is still not resolved. -- Luk?? Hejtm?nek From Greg.Lehmann at csiro.au Wed Sep 21 23:34:09 2016 From: Greg.Lehmann at csiro.au (Greg.Lehmann at csiro.au) Date: Wed, 21 Sep 2016 22:34:09 +0000 Subject: [gpfsug-discuss] CES NFS with Kerberos In-Reply-To: <20160921190932.fibmmccccs5kit6x@ics.muni.cz> References: <20160921190932.fibmmccccs5kit6x@ics.muni.cz> Message-ID: <94bedafe10a9473c93b0bcc5d34cbea6@exch1-cdc.nexus.csiro.au> It may not be NFS. Check your GPFS logs too. 
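When the failures are intermittent like this, it is also worth checking whether they line up with Kerberos ticket or GSS context expiry rather than with anything in the filesystem. A rough sketch of what to compare; the Ganesha log path is an assumption and may differ on your CES nodes:

  # on the NFS client: note the ticket and renew lifetimes, then compare with when the writes fail
  klist
  # on the CES node serving the export: look for GSS/credential errors around the failure time
  grep -Ei 'gss|krb|expire' /var/log/ganesha.log
  tail -n 100 /var/adm/ras/mmfs.log.latest

If the permission denied errors land roughly one ticket lifetime into each long write, that points at the Ganesha credential-expiry issue referenced above rather than at GPFS itself.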
-----Original Message----- From: gpfsug-discuss-bounces at spectrumscale.org [mailto:gpfsug-discuss-bounces at spectrumscale.org] On Behalf Of Lukas Hejtmanek Sent: Thursday, 22 September 2016 5:10 AM To: gpfsug-discuss at spectrumscale.org Subject: [gpfsug-discuss] CES NFS with Kerberos Hello, does nfs server (ganesha) work for someone with Kerberos authentication? I got random permission denied: :/mnt/nfs-test/tmp# for i in `seq 1 20`; do rm testf; dd if=/dev/zero of=testf bs=1M count=100000; done 100000+0 records in 100000+0 records out 104857600000 bytes (105 GB) copied, 642.849 s, 163 MB/s 100000+0 records in 100000+0 records out 104857600000 bytes (105 GB) copied, 925.326 s, 113 MB/s 100000+0 records in 100000+0 records out 104857600000 bytes (105 GB) copied, 762.749 s, 137 MB/s 100000+0 records in 100000+0 records out 104857600000 bytes (105 GB) copied, 860.608 s, 122 MB/s 100000+0 records in 100000+0 records out 104857600000 bytes (105 GB) copied, 788.62 s, 133 MB/s dd: error writing ?testf?: Permission denied 51949+0 records in 51948+0 records out 54471426048 bytes (54 GB) copied, 566.667 s, 96.1 MB/s 100000+0 records in 100000+0 records out 104857600000 bytes (105 GB) copied, 1082.63 s, 96.9 MB/s 100000+0 records in 100000+0 records out 104857600000 bytes (105 GB) copied, 1080.65 s, 97.0 MB/s 100000+0 records in 100000+0 records out 104857600000 bytes (105 GB) copied, 949.683 s, 110 MB/s dd: error writing ?testf?: Permission denied 30076+0 records in 30075+0 records out 31535923200 bytes (32 GB) copied, 308.009 s, 102 MB/s dd: error writing ?testf?: Permission denied 89837+0 records in 89836+0 records out 94199873536 bytes (94 GB) copied, 976.368 s, 96.5 MB/s It seems that it is a bug in ganesha: http://permalink.gmane.org/gmane.comp.file-systems.nfs.ganesha.devel/2000 but it is still not resolved. -- Luk?? Hejtm?nek _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss From xhejtman at ics.muni.cz Thu Sep 22 09:25:09 2016 From: xhejtman at ics.muni.cz (Lukas Hejtmanek) Date: Thu, 22 Sep 2016 10:25:09 +0200 Subject: [gpfsug-discuss] CES NFS with Kerberos In-Reply-To: <94bedafe10a9473c93b0bcc5d34cbea6@exch1-cdc.nexus.csiro.au> References: <20160921190932.fibmmccccs5kit6x@ics.muni.cz> <94bedafe10a9473c93b0bcc5d34cbea6@exch1-cdc.nexus.csiro.au> Message-ID: <20160922082509.rc53tseeovjnixtz@ics.muni.cz> Hello, thanks, I do not see any error in GPFS logs. The link, I posted below is not related to GPFS at all, it seems that it is bug in ganesha. On Wed, Sep 21, 2016 at 10:34:09PM +0000, Greg.Lehmann at csiro.au wrote: > It may not be NFS. Check your GPFS logs too. > > -----Original Message----- > From: gpfsug-discuss-bounces at spectrumscale.org [mailto:gpfsug-discuss-bounces at spectrumscale.org] On Behalf Of Lukas Hejtmanek > Sent: Thursday, 22 September 2016 5:10 AM > To: gpfsug-discuss at spectrumscale.org > Subject: [gpfsug-discuss] CES NFS with Kerberos > > Hello, > > does nfs server (ganesha) work for someone with Kerberos authentication? 
> > I got random permission denied: > :/mnt/nfs-test/tmp# for i in `seq 1 20`; do rm testf; dd if=/dev/zero of=testf bs=1M count=100000; done > 100000+0 records in > 100000+0 records out > 104857600000 bytes (105 GB) copied, 642.849 s, 163 MB/s > 100000+0 records in > 100000+0 records out > 104857600000 bytes (105 GB) copied, 925.326 s, 113 MB/s > 100000+0 records in > 100000+0 records out > 104857600000 bytes (105 GB) copied, 762.749 s, 137 MB/s > 100000+0 records in > 100000+0 records out > 104857600000 bytes (105 GB) copied, 860.608 s, 122 MB/s > 100000+0 records in > 100000+0 records out > 104857600000 bytes (105 GB) copied, 788.62 s, 133 MB/s > dd: error writing ?testf?: Permission denied > 51949+0 records in > 51948+0 records out > 54471426048 bytes (54 GB) copied, 566.667 s, 96.1 MB/s > 100000+0 records in > 100000+0 records out > 104857600000 bytes (105 GB) copied, 1082.63 s, 96.9 MB/s > 100000+0 records in > 100000+0 records out > 104857600000 bytes (105 GB) copied, 1080.65 s, 97.0 MB/s > 100000+0 records in > 100000+0 records out > 104857600000 bytes (105 GB) copied, 949.683 s, 110 MB/s > dd: error writing ?testf?: Permission denied > 30076+0 records in > 30075+0 records out > 31535923200 bytes (32 GB) copied, 308.009 s, 102 MB/s > dd: error writing ?testf?: Permission denied > 89837+0 records in > 89836+0 records out > 94199873536 bytes (94 GB) copied, 976.368 s, 96.5 MB/s > > It seems that it is a bug in ganesha: > http://permalink.gmane.org/gmane.comp.file-systems.nfs.ganesha.devel/2000 > > but it is still not resolved. > > -- > Luk?? Hejtm?nek > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss -- Luk?? Hejtm?nek From stef.coene at docum.org Thu Sep 22 19:36:48 2016 From: stef.coene at docum.org (Stef Coene) Date: Thu, 22 Sep 2016 20:36:48 +0200 Subject: [gpfsug-discuss] Blocksize Message-ID: <4890d1b0-2f12-f8b6-684d-c98ca2b71ab7@docum.org> Hi, Is it needed to specify a different blocksize for the system pool that holds the metadata? IBM recommends a 1 MB blocksize for the file system. But I wonder a smaller blocksize (256 KB or so) for metadata is a good idea or not... Stef From eric.wonderley at vt.edu Thu Sep 22 20:07:30 2016 From: eric.wonderley at vt.edu (J. Eric Wonderley) Date: Thu, 22 Sep 2016 15:07:30 -0400 Subject: [gpfsug-discuss] Blocksize In-Reply-To: <4890d1b0-2f12-f8b6-684d-c98ca2b71ab7@docum.org> References: <4890d1b0-2f12-f8b6-684d-c98ca2b71ab7@docum.org> Message-ID: It defaults to 4k: mmlsfs testbs8M -i flag value description ------------------- ------------------------ ----------------------------------- -i 4096 Inode size in bytes I think you can make as small as 512b. Gpfs will store very small files in the inode. Typically you want your average file size to be your blocksize and your filesystem has one blocksize and one inodesize. On Thu, Sep 22, 2016 at 2:36 PM, Stef Coene wrote: > Hi, > > Is it needed to specify a different blocksize for the system pool that > holds the metadata? > > IBM recommends a 1 MB blocksize for the file system. > But I wonder a smaller blocksize (256 KB or so) for metadata is a good > idea or not... 
> > > Stef > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > -------------- next part -------------- An HTML attachment was scrubbed... URL: From ewahl at osc.edu Thu Sep 22 20:19:00 2016 From: ewahl at osc.edu (Wahl, Edward) Date: Thu, 22 Sep 2016 19:19:00 +0000 Subject: [gpfsug-discuss] Blocksize In-Reply-To: <4890d1b0-2f12-f8b6-684d-c98ca2b71ab7@docum.org> References: <4890d1b0-2f12-f8b6-684d-c98ca2b71ab7@docum.org> Message-ID: <9DA9EC7A281AC7428A9618AFDC49049958EFBB06@CIO-KRC-D1MBX02.osuad.osu.edu> This is a great idea. However there are quite a few other things to consider: -max file count? If you need say a couple of billion files, this will affect things. -wish to store small files in the system pool in late model SS/GPFS? -encryption? No data will be stored in the system pool so large blocks for small file storage in system is pointless. -system pool replication? -HDD vs SSD for system pool? -xxD or array tuning recommendations from your vendor? -streaming vs random IO? Do you have a single dedicated app that has performance like xxx? -probably more I can't think of off the top of my head. etc etc Ed ________________________________________ From: gpfsug-discuss-bounces at spectrumscale.org [gpfsug-discuss-bounces at spectrumscale.org] on behalf of Stef Coene [stef.coene at docum.org] Sent: Thursday, September 22, 2016 2:36 PM To: gpfsug main discussion list Subject: [gpfsug-discuss] Blocksize Hi, Is it needed to specify a different blocksize for the system pool that holds the metadata? IBM recommends a 1 MB blocksize for the file system. But I wonder a smaller blocksize (256 KB or so) for metadata is a good idea or not... Stef _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss From janfrode at tanso.net Thu Sep 22 20:25:03 2016 From: janfrode at tanso.net (Jan-Frode Myklebust) Date: Thu, 22 Sep 2016 21:25:03 +0200 Subject: [gpfsug-discuss] Blocksize In-Reply-To: <4890d1b0-2f12-f8b6-684d-c98ca2b71ab7@docum.org> References: <4890d1b0-2f12-f8b6-684d-c98ca2b71ab7@docum.org> Message-ID: https://www.ibm.com/developerworks/community/forums/html/topic?id=77777777-0000-0000-0000-000014774266 "Use 256K. Anything smaller makes allocation blocks for the inode file inefficient. Anything larger wastes space for directories. These are the two largest consumers of metadata space." --dlmcnabb A bit old, but I would assume it still applies. -jf On Thu, Sep 22, 2016 at 8:36 PM, Stef Coene wrote: > Hi, > > Is it needed to specify a different blocksize for the system pool that > holds the metadata? > > IBM recommends a 1 MB blocksize for the file system. > But I wonder a smaller blocksize (256 KB or so) for metadata is a good > idea or not... > > > Stef > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > -------------- next part -------------- An HTML attachment was scrubbed... URL: From stef.coene at docum.org Thu Sep 22 20:29:43 2016 From: stef.coene at docum.org (Stef Coene) Date: Thu, 22 Sep 2016 21:29:43 +0200 Subject: [gpfsug-discuss] Blocksize In-Reply-To: References: <4890d1b0-2f12-f8b6-684d-c98ca2b71ab7@docum.org> Message-ID: On 09/22/2016 09:07 PM, J. 
Eric Wonderley wrote: > It defaults to 4k: > mmlsfs testbs8M -i > flag value description > ------------------- ------------------------ > ----------------------------------- > -i 4096 Inode size in bytes > > I think you can make as small as 512b. Gpfs will store very small > files in the inode. > > Typically you want your average file size to be your blocksize and your > filesystem has one blocksize and one inodesize. The files are not small, but around 20 MB on average. So I calculated with IBM that a 1 MB or 2 MB block size is best. But I'm not sure if it's better to use a smaller block size for the metadata. The file system is not that large (400 TB) and will hold backup data from CommVault. Stef From luis.bolinches at fi.ibm.com Thu Sep 22 20:37:02 2016 From: luis.bolinches at fi.ibm.com (Luis Bolinches) Date: Thu, 22 Sep 2016 19:37:02 +0000 Subject: [gpfsug-discuss] Blocksize In-Reply-To: References: , <4890d1b0-2f12-f8b6-684d-c98ca2b71ab7@docum.org> Message-ID: An HTML attachment was scrubbed... URL: From S.J.Thompson at bham.ac.uk Thu Sep 22 21:02:24 2016 From: S.J.Thompson at bham.ac.uk (Simon Thompson (Research Computing - IT Services)) Date: Thu, 22 Sep 2016 20:02:24 +0000 Subject: [gpfsug-discuss] SSUG Meet the Devs - Cloud Workshop In-Reply-To: References: Message-ID: We are down to our last few places, so if you do intend to attend, I encourage you to register now! Simon ________________________________ From: gpfsug-discuss-bounces at spectrumscale.org [gpfsug-discuss-bounces at spectrumscale.org] on behalf of Secretary GPFS UG [secretary at gpfsug.org] Sent: 15 September 2016 09:42 To: gpfsug main discussion list Subject: [gpfsug-discuss] SSUG Meet the Devs - Cloud Workshop Hi everyone, Back by popular demand! We are holding a UK 'Meet the Developers' event to focus on Cloud topics. We are very lucky to have Dean Hildebrand, Master Inventor, Cloud Storage Software from IBM over in the UK to lead this session. IS IT FOR ME? Slightly different to the past meet the devs format, this is a cloud workshop aimed at looking how Spectrum Scale fits in the world of cloud. Rather than being a series of presentations and discussions led by IBM, this workshop aims to look at how Spectrum Scale can be used in cloud environments. This will include using Spectrum Scale as an infrastructure tool to power private cloud deployments. We will also look at the challenges of accessing data from cloud deployments and discuss ways in which this might be accomplished. If you are currently deploying OpenStack on Spectrum Scale, or plan to in the near future, then this workshop is for you. Also if you currently have Spectrum Scale and are wondering how you might get that data into cloud-enabled workloads or are currently doing so, then again you should attend. To ensure that the workshop is focused, numbers are limited and we will initially be limiting to 2 people per organisation/project/site. WHAT WILL BE DISCUSSED? Our topics for the day will include on-premise (private) clouds, on-premise self-service (public) clouds, off-premise clouds (Amazon etc.) as well as covering technologies including OpenStack, Docker, Kubernetes and security requirements around multi-tenancy. We probably don't have all the answers for these, but we'd like to understand the requirements and hear people's ideas. Please let us know what you would like to discuss when you register. Arrival is from 10:00 with discussion kicking off from 10:30. The agenda is open discussion though we do aim to talk over a number of key topics. 
We hope to have the ever popular (though usually late!) pizza for lunch. WHEN Thursday 20th October 2016 from 10:00 AM to 3:30 PM WHERE IT Services, University of Birmingham - Elms Road Edgbaston, Birmingham, B15 2TT REGISTER Please register for the event in advance: https://www.eventbrite.com/e/ssug-meet-the-devs-cloud-workshop-tickets-27725390389 Numbers are limited and we will initially be limiting to 2 people per organisation/project/site. We look forward to seeing you there! -- Claire O'Toole Spectrum Scale/GPFS User Group Secretary +44 (0)7508 033896 www.spectrumscaleug.org -------------- next part -------------- An HTML attachment was scrubbed... URL:
From makaplan at us.ibm.com Thu Sep 22 21:25:10 2016 From: makaplan at us.ibm.com (Marc A Kaplan) Date: Thu, 22 Sep 2016 16:25:10 -0400 Subject: [gpfsug-discuss] Blocksize and space and performance for Metadata, release 4.2.x In-Reply-To: References: <4890d1b0-2f12-f8b6-684d-c98ca2b71ab7@docum.org> Message-ID: There have been a few changes over the years that may invalidate some of the old advice about metadata and disk allocations there for. These have been phased in over the last few years, I am discussing the present situation for release 4.2.x 1) Inode size. Used to be 512. Now you can set the inodesize at mmcrfs time. Defaults to 4096. 2) Data in inode. If it fits, then the inode holds the data. Since a 512 byte inode still works, you can have more than 3.5KB of data in a 4KB inode. 3) Extended Attributes in Inode. Again, if it fits... Extended attributes used to be stored in a separate file of metadata. So extended attributes performance is way better than the old days. 4) (small) Directories in Inode. If it fits, the inode of a directory can hold the directory entries. That gives you about 2x performance on directory reads, for smallish directories. 5) Big directory blocks. Directories used to use a maximum of 32KB per block, potentially wasting a lot of space and yielding poor performance for large directories. Now directory blocks are the lesser of metadata-blocksize and 256KB. 6) Big directories are shrinkable. Used to be directories would grow in 32KB chunks but never shrink. Yup, even an almost(?) "empty" directory would remain the size the directory had to be at its lifetime maximum. That means just a few remaining entries could be "sprinkled" over many directory blocks. (See also 5.) But now directories autoshrink to avoid wasteful sparsity. Last I looked, the implementation just stopped short of "pushing" tiny directories back into the inode. But a huge directory can be shrunk down to a single (meta)data block. (See --compact in the docs.) --marc of GPFS -------------- next part -------------- An HTML attachment was scrubbed... URL: From volobuev at us.ibm.com Thu Sep 22 21:49:32 2016 From: volobuev at us.ibm.com (Yuri L Volobuev) Date: Thu, 22 Sep 2016 13:49:32 -0700 Subject: [gpfsug-discuss] Blocksize In-Reply-To: References: <4890d1b0-2f12-f8b6-684d-c98ca2b71ab7@docum.org> Message-ID: The current (V4.2+) levels of code support bigger directory block sizes, so it's no longer an issue with something like 1M metadata block size. In fact, there isn't a whole lot of difference between 256K and 1M metadata block sizes, either would work fine. There isn't really a downside in selecting a different block size for metadata though. Inode size (mmcrfs -i option) is orthogonal to the metadata block size selection. We do strongly recommend using 4K inodes to anyone. There's the obvious downside of needing more metadata storage for inodes, but the advantages are significant.
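For reference, those two independent choices are both made at file system creation time. A sketch only, with the device name, stanza file and sizes illustrative rather than a recommendation:

  # 1M data blocks, a metadata-only system pool with 256K blocks, and 4K inodes
  mmcrfs fs1 -F nsd.stanza -B 1M --metadata-block-size 256K -i 4096
  # confirm what was actually set
  mmlsfs fs1 -B -i

Note that --metadata-block-size only applies when the system pool holds metadata only, i.e. the NSD stanzas for that pool are marked usage=metadataOnly.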
yuri From: Jan-Frode Myklebust To: gpfsug main discussion list , Date: 09/22/2016 12:25 PM Subject: Re: [gpfsug-discuss] Blocksize Sent by: gpfsug-discuss-bounces at spectrumscale.org https://www.ibm.com/developerworks/community/forums/html/topic?id=77777777-0000-0000-0000-000014774266 "Use 256K. Anything smaller makes allocation blocks for the inode file inefficient. Anything larger wastes space for directories. These are the two largest consumers of metadata space." --dlmcnabb A bit old, but I would assume it still applies. ? -jf On Thu, Sep 22, 2016 at 8:36 PM, Stef Coene wrote: Hi, Is it needed to specify a different blocksize for the system pool that holds the metadata? IBM recommends a 1 MB blocksize for the file system. But I wonder a smaller blocksize (256 KB or so) for metadata is a good idea or not... Stef _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: graycol.gif Type: image/gif Size: 105 bytes Desc: not available URL: From Mark.Bush at siriuscom.com Fri Sep 23 02:48:44 2016 From: Mark.Bush at siriuscom.com (Mark.Bush at siriuscom.com) Date: Fri, 23 Sep 2016 01:48:44 +0000 Subject: [gpfsug-discuss] Learn a new cluster Message-ID: <7E08F93E-F018-4C00-A78E-71EFDBEAC87C@siriuscom.com> What commands would you run to learn all you need to know about a cluster you?ve never seen before? Captain Obvious (me) says: mmlscluster mmlsconfig mmlsnode mmlsnsd mmlsfs all What others? Mark R. Bush | Solutions Architect This message (including any attachments) is intended only for the use of the individual or entity to which it is addressed and may contain information that is non-public, proprietary, privileged, confidential, and exempt from disclosure under applicable law. If you are not the intended recipient, you are hereby notified that any use, dissemination, distribution, or copying of this communication is strictly prohibited. This message may be viewed by parties at Sirius Computer Solutions other than those named in the message header. This message does not contain an official representation of Sirius Computer Solutions. If you have received this communication in error, notify Sirius Computer Solutions immediately and (i) destroy this message if a facsimile or (ii) delete this message immediately if this is an electronic communication. Thank you. Sirius Computer Solutions -------------- next part -------------- An HTML attachment was scrubbed... URL: From aaron.s.knister at nasa.gov Fri Sep 23 02:50:52 2016 From: aaron.s.knister at nasa.gov (Aaron Knister) Date: Thu, 22 Sep 2016 21:50:52 -0400 Subject: [gpfsug-discuss] Learn a new cluster In-Reply-To: <7E08F93E-F018-4C00-A78E-71EFDBEAC87C@siriuscom.com> References: <7E08F93E-F018-4C00-A78E-71EFDBEAC87C@siriuscom.com> Message-ID: <1ff27b42-aead-fafa-5415-520334d299c1@nasa.gov> Perhaps a gpfs.snap? This could tell you a *lot* about a cluster. -Aaron On 9/22/16 9:48 PM, Mark.Bush at siriuscom.com wrote: > What commands would you run to learn all you need to know about a > cluster you?ve never seen before? 
> > Captain Obvious (me) says: > > mmlscluster > > mmlsconfig > > mmlsnode > > mmlsnsd > > mmlsfs all > > > > What others? > > > > > > Mark R. Bush | Solutions Architect > > > > This message (including any attachments) is intended only for the use of > the individual or entity to which it is addressed and may contain > information that is non-public, proprietary, privileged, confidential, > and exempt from disclosure under applicable law. If you are not the > intended recipient, you are hereby notified that any use, dissemination, > distribution, or copying of this communication is strictly prohibited. > This message may be viewed by parties at Sirius Computer Solutions other > than those named in the message header. This message does not contain an > official representation of Sirius Computer Solutions. If you have > received this communication in error, notify Sirius Computer Solutions > immediately and (i) destroy this message if a facsimile or (ii) delete > this message immediately if this is an electronic communication. Thank you. > > Sirius Computer Solutions > > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > -- Aaron Knister NASA Center for Climate Simulation (Code 606.2) Goddard Space Flight Center (301) 286-2776 From Greg.Lehmann at csiro.au Fri Sep 23 02:53:14 2016 From: Greg.Lehmann at csiro.au (Greg.Lehmann at csiro.au) Date: Fri, 23 Sep 2016 01:53:14 +0000 Subject: [gpfsug-discuss] Learn a new cluster In-Reply-To: <7E08F93E-F018-4C00-A78E-71EFDBEAC87C@siriuscom.com> References: <7E08F93E-F018-4C00-A78E-71EFDBEAC87C@siriuscom.com> Message-ID: <40b22b40d6ed4e38be115e9f6ae8d48d@exch1-cdc.nexus.csiro.au> Nice question. I?d also look at the non-GPFS settings IBM recommend in various places like the FAQ for things like ssh, network, etc. The importance of these is variable depending on cluster size/network configuration etc. Cheers, Greg From: gpfsug-discuss-bounces at spectrumscale.org [mailto:gpfsug-discuss-bounces at spectrumscale.org] On Behalf Of Mark.Bush at siriuscom.com Sent: Friday, 23 September 2016 11:49 AM To: gpfsug main discussion list Subject: [gpfsug-discuss] Learn a new cluster What commands would you run to learn all you need to know about a cluster you?ve never seen before? Captain Obvious (me) says: mmlscluster mmlsconfig mmlsnode mmlsnsd mmlsfs all What others? Mark R. Bush | Solutions Architect This message (including any attachments) is intended only for the use of the individual or entity to which it is addressed and may contain information that is non-public, proprietary, privileged, confidential, and exempt from disclosure under applicable law. If you are not the intended recipient, you are hereby notified that any use, dissemination, distribution, or copying of this communication is strictly prohibited. This message may be viewed by parties at Sirius Computer Solutions other than those named in the message header. This message does not contain an official representation of Sirius Computer Solutions. If you have received this communication in error, notify Sirius Computer Solutions immediately and (i) destroy this message if a facsimile or (ii) delete this message immediately if this is an electronic communication. Thank you. Sirius Computer Solutions -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From ulmer at ulmer.org Fri Sep 23 17:31:59 2016 From: ulmer at ulmer.org (Stephen Ulmer) Date: Fri, 23 Sep 2016 12:31:59 -0400 Subject: [gpfsug-discuss] Learn a new cluster In-Reply-To: <1ff27b42-aead-fafa-5415-520334d299c1@nasa.gov> References: <7E08F93E-F018-4C00-A78E-71EFDBEAC87C@siriuscom.com> <1ff27b42-aead-fafa-5415-520334d299c1@nasa.gov> Message-ID: <078081B8-E50E-46BE-B3AC-4C1DB6D963E1@ulmer.org> This was going to be my exact suggestion. My short to-learn list includes learn how to look inside a gpfs.snap for what I want to know. I?ve found the ability to do this with other snapshot bundles very useful in the past (for example I?ve used snap on AIX rather than my own scripts in some cases). Do be aware the gpfs.snap (and actually most ?create a bundle for support? commands on most platforms) are a little heavy. Liberty, -- Stephen > On Sep 22, 2016, at 9:50 PM, Aaron Knister wrote: > > Perhaps a gpfs.snap? This could tell you a *lot* about a cluster. > > -Aaron > > On 9/22/16 9:48 PM, Mark.Bush at siriuscom.com wrote: >> What commands would you run to learn all you need to know about a >> cluster you?ve never seen before? >> >> Captain Obvious (me) says: >> >> mmlscluster >> >> mmlsconfig >> >> mmlsnode >> >> mmlsnsd >> >> mmlsfs all >> >> >> >> What others? >> >> >> >> >> >> Mark R. Bush | Solutions Architect >> >> >> >> This message (including any attachments) is intended only for the use of >> the individual or entity to which it is addressed and may contain >> information that is non-public, proprietary, privileged, confidential, >> and exempt from disclosure under applicable law. If you are not the >> intended recipient, you are hereby notified that any use, dissemination, >> distribution, or copying of this communication is strictly prohibited. >> This message may be viewed by parties at Sirius Computer Solutions other >> than those named in the message header. This message does not contain an >> official representation of Sirius Computer Solutions. If you have >> received this communication in error, notify Sirius Computer Solutions >> immediately and (i) destroy this message if a facsimile or (ii) delete >> this message immediately if this is an electronic communication. Thank you. >> >> Sirius Computer Solutions > >> >> >> _______________________________________________ >> gpfsug-discuss mailing list >> gpfsug-discuss at spectrumscale.org >> http://gpfsug.org/mailman/listinfo/gpfsug-discuss >> > > -- > Aaron Knister > NASA Center for Climate Simulation (Code 606.2) > Goddard Space Flight Center > (301) 286-2776 > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss -------------- next part -------------- An HTML attachment was scrubbed... URL: From ulmer at ulmer.org Fri Sep 23 20:16:06 2016 From: ulmer at ulmer.org (Stephen Ulmer) Date: Fri, 23 Sep 2016 15:16:06 -0400 Subject: [gpfsug-discuss] Blocksize In-Reply-To: References: <4890d1b0-2f12-f8b6-684d-c98ca2b71ab7@docum.org> Message-ID: <17781503-26B3-4448-B7B9-1EE27ABE6D1F@ulmer.org> Not to be too pedantic, but I believe the the subblock size is 1/32 of the block size (which strengthens Luis?s arguments below). I thought the the original question was NOT about inode size, but about metadata block size. You can specify that the system pool have a different block size from the rest of the filesystem, providing that it ONLY holds metadata (?metadata-block-size option to mmcrfs). 
So with 4K inodes (which should be used for all new filesystems without some counter-indication), I would think that we?d want to use a metadata block size of 4K*32=128K. This is independent of the regular block size, which you can calculate based on the workload if you?re lucky. There could be a great reason NOT to use 128K metadata block size, but I don?t know what it is. I?d be happy to be corrected about this if it?s out of whack. -- Stephen > On Sep 22, 2016, at 3:37 PM, Luis Bolinches wrote: > > Hi > > My 2 cents. > > Leave at least 4K inodes, then you get massive improvement on small files (less 3.5K minus whatever you use on xattr) > > About blocksize for data, unless you have actual data that suggest that you will actually benefit from smaller than 1MB block, leave there. GPFS uses sublocks where 1/16th of the BS can be allocated to different files, so the "waste" is much less than you think on 1MB and you get the throughput and less structures of much more data blocks. > > No warranty at all but I try to do this when the BS talk comes in: (might need some clean up it could not be last note but you get the idea) > > POSIX > find . -type f -name '*' -exec ls -l {} \; > find_ls_files.out > GPFS > cd /usr/lpp/mmfs/samples/ilm > gcc mmfindUtil_processOutputFile.c -o mmfindUtil_processOutputFile > ./mmfind /gpfs/shared -ls -type f > find_ls_files.out > CONVERT to CSV > > POSIX > cat find_ls_files.out | awk '{print $5","}' > find_ls_files.out.csv > GPFS > cat find_ls_files.out | awk '{print $7","}' > find_ls_files.out.csv > LOAD in octave > > FILESIZE = int32 (dlmread ("find_ls_files.out.csv", ",")); > Clean the second column (OPTIONAL as the next clean up will do the same) > > FILESIZE(:,[2]) = []; > If we are on 4K aligment we need to clean the files that go to inodes (WELL not exactly ... extended attributes! so maybe use a lower number!) > > FILESIZE(FILESIZE<=3584) =[]; > If we are not we need to clean the 0 size files > > FILESIZE(FILESIZE==0) =[]; > Median > > FILESIZEMEDIAN = int32 (median (FILESIZE)) > Mean > > FILESIZEMEAN = int32 (mean (FILESIZE)) > Variance > > int32 (var (FILESIZE)) > iqr interquartile range, i.e., the difference between the upper and lower quartile, of the input data. > > int32 (iqr (FILESIZE)) > Standard deviation > > > For some FS with lots of files you might need a rather powerful machine to run the calculations on octave, I never hit anything could not manage on a 64GB RAM Power box. Most of the times it is enough with my laptop. > > > > -- > Yst?v?llisin terveisin / Kind regards / Saludos cordiales / Salutations > > Luis Bolinches > Lab Services > http://www-03.ibm.com/systems/services/labservices/ > > IBM Laajalahdentie 23 (main Entrance) Helsinki, 00330 Finland > Phone: +358 503112585 > > "If you continually give you will continually have." Anonymous > > > ----- Original message ----- > From: Stef Coene > Sent by: gpfsug-discuss-bounces at spectrumscale.org > To: gpfsug main discussion list > Cc: > Subject: Re: [gpfsug-discuss] Blocksize > Date: Thu, Sep 22, 2016 10:30 PM > > On 09/22/2016 09:07 PM, J. Eric Wonderley wrote: > > It defaults to 4k: > > mmlsfs testbs8M -i > > flag value description > > ------------------- ------------------------ > > ----------------------------------- > > -i 4096 Inode size in bytes > > > > I think you can make as small as 512b. Gpfs will store very small > > files in the inode. > > > > Typically you want your average file size to be your blocksize and your > > filesystem has one blocksize and one inodesize. 
> > The files are not small, but around 20 MB on average. > So I calculated with IBM that a 1 MB or 2 MB block size is best. > > But I'm not sure if it's better to use a smaller block size for the > metadata. > > The file system is not that large (400 TB) and will hold backup data > from CommVault. > > > Stef > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > > > Ellei edell? ole toisin mainittu: / Unless stated otherwise above: > Oy IBM Finland Ab > PL 265, 00101 Helsinki, Finland > Business ID, Y-tunnus: 0195876-3 > Registered in Finland > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss -------------- next part -------------- An HTML attachment was scrubbed... URL: From oehmes at us.ibm.com Fri Sep 23 22:35:12 2016 From: oehmes at us.ibm.com (Sven Oehme) Date: Fri, 23 Sep 2016 14:35:12 -0700 Subject: [gpfsug-discuss] Blocksize In-Reply-To: <17781503-26B3-4448-B7B9-1EE27ABE6D1F@ulmer.org> References: <4890d1b0-2f12-f8b6-684d-c98ca2b71ab7@docum.org> <17781503-26B3-4448-B7B9-1EE27ABE6D1F@ulmer.org> Message-ID: your metadata block size these days should be 1 MB and there are only very few workloads for which you should run with a filesystem blocksize below 1 MB. so if you don't know exactly what to pick, 1 MB is a good starting point. the general rule still applies that your filesystem blocksize (metadata or data pool) should match your raid controller (or GNR vdisk) stripe size of the particular pool. so if you use a 128k strip size(defaut in many midrange storage controllers) in a 8+2p raid array, your stripe or track size is 1 MB and therefore the blocksize of this pool should be 1 MB. i see many customers in the field using 1MB or even smaller blocksize on RAID stripes of 2 MB or above and your performance will be significant impacted by that. Sven ------------------------------------------ Sven Oehme Scalable Storage Research email: oehmes at us.ibm.com Phone: +1 (408) 824-8904 IBM Almaden Research Lab ------------------------------------------ From: Stephen Ulmer To: gpfsug main discussion list Date: 09/23/2016 12:16 PM Subject: Re: [gpfsug-discuss] Blocksize Sent by: gpfsug-discuss-bounces at spectrumscale.org Not to be too pedantic, but I believe the the subblock size is 1/32 of the block size (which strengthens Luis?s arguments below). I thought the the original question was NOT about inode size, but about metadata block size. You can specify that the system pool have a different block size from the rest of the filesystem, providing that it ONLY holds metadata (?metadata-block-size option to mmcrfs). So with 4K inodes (which should be used for all new filesystems without some counter-indication), I would think that we?d want to use a metadata block size of 4K*32=128K. This is independent of the regular block size, which you can calculate based on the workload if you?re lucky. There could be a great reason NOT to use 128K metadata block size, but I don?t know what it is. I?d be happy to be corrected about this if it?s out of whack. -- Stephen On Sep 22, 2016, at 3:37 PM, Luis Bolinches < luis.bolinches at fi.ibm.com> wrote: Hi My 2 cents. 
Leave at least 4K inodes, then you get massive improvement on small files (less 3.5K minus whatever you use on xattr) About blocksize for data, unless you have actual data that suggest that you will actually benefit from smaller than 1MB block, leave there. GPFS uses sublocks where 1/16th of the BS can be allocated to different files, so the "waste" is much less than you think on 1MB and you get the throughput and less structures of much more data blocks. No warranty at all but I try to do this when the BS talk comes in: (might need some clean up it could not be last note but you get the idea) POSIX find . -type f -name '*' -exec ls -l {} \; > find_ls_files.out GPFS cd /usr/lpp/mmfs/samples/ilm gcc mmfindUtil_processOutputFile.c -o mmfindUtil_processOutputFile ./mmfind /gpfs/shared -ls -type f > find_ls_files.out CONVERT to CSV POSIX cat find_ls_files.out | awk '{print $5","}' > find_ls_files.out.csv GPFS cat find_ls_files.out | awk '{print $7","}' > find_ls_files.out.csv LOAD in octave FILESIZE = int32 (dlmread ("find_ls_files.out.csv", ",")); Clean the second column (OPTIONAL as the next clean up will do the same) FILESIZE(:,[2]) = []; If we are on 4K aligment we need to clean the files that go to inodes (WELL not exactly ... extended attributes! so maybe use a lower number!) FILESIZE(FILESIZE<=3584) =[]; If we are not we need to clean the 0 size files FILESIZE(FILESIZE==0) =[]; Median FILESIZEMEDIAN = int32 (median (FILESIZE)) Mean FILESIZEMEAN = int32 (mean (FILESIZE)) Variance int32 (var (FILESIZE)) iqr interquartile range, i.e., the difference between the upper and lower quartile, of the input data. int32 (iqr (FILESIZE)) Standard deviation For some FS with lots of files you might need a rather powerful machine to run the calculations on octave, I never hit anything could not manage on a 64GB RAM Power box. Most of the times it is enough with my laptop. -- Yst?v?llisin terveisin / Kind regards / Saludos cordiales / Salutations Luis Bolinches Lab Services http://www-03.ibm.com/systems/services/labservices/ IBM Laajalahdentie 23 (main Entrance) Helsinki, 00330 Finland Phone: +358 503112585 "If you continually give you will continually have." Anonymous ----- Original message ----- From: Stef Coene Sent by: gpfsug-discuss-bounces at spectrumscale.org To: gpfsug main discussion list Cc: Subject: Re: [gpfsug-discuss] Blocksize Date: Thu, Sep 22, 2016 10:30 PM On 09/22/2016 09:07 PM, J. Eric Wonderley wrote: > It defaults to 4k: > mmlsfs testbs8M -i > flag value description > ------------------- ------------------------ > ----------------------------------- > -i 4096 Inode size in bytes > > I think you can make as small as 512b. Gpfs will store very small > files in the inode. > > Typically you want your average file size to be your blocksize and your > filesystem has one blocksize and one inodesize. The files are not small, but around 20 MB on average. So I calculated with IBM that a 1 MB or 2 MB block size is best. But I'm not sure if it's better to use a smaller block size for the metadata. The file system is not that large (400 TB) and will hold backup data from CommVault. Stef _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss Ellei edell? 
ole toisin mainittu: / Unless stated otherwise above: Oy IBM Finland Ab PL 265, 00101 Helsinki, Finland Business ID, Y-tunnus: 0195876-3 Registered in Finland _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: graycol.gif Type: image/gif Size: 105 bytes Desc: not available URL: From stef.coene at docum.org Fri Sep 23 23:34:49 2016 From: stef.coene at docum.org (Stef Coene) Date: Sat, 24 Sep 2016 00:34:49 +0200 Subject: [gpfsug-discuss] Blocksize In-Reply-To: <4890d1b0-2f12-f8b6-684d-c98ca2b71ab7@docum.org> References: <4890d1b0-2f12-f8b6-684d-c98ca2b71ab7@docum.org> Message-ID: <1d7481bd-c08f-df14-4708-1c2e2a4ac1c0@docum.org> On 09/22/2016 08:36 PM, Stef Coene wrote: > Hi, > > Is it needed to specify a different blocksize for the system pool that > holds the metadata? > > IBM recommends a 1 MB blocksize for the file system. > But I wonder a smaller blocksize (256 KB or so) for metadata is a good > idea or not... I have read the replies and at the end, this is what we will do: Since the back-end storage will be V5000 with a default stripe size of 256KB and we use 8 data disk in an array, this means that 256KB * 8 = 2M is the best choice for block size. So 2 MB block size for data is the best choice. Since the block size for metadata is not that important in the latest releases, we will also go for 2 MB block size for metadata. Inode size will be left at the default: 4 KB. Stef From mimarsh2 at vt.edu Sat Sep 24 02:21:30 2016 From: mimarsh2 at vt.edu (Brian Marshall) Date: Fri, 23 Sep 2016 21:21:30 -0400 Subject: [gpfsug-discuss] Blocksize In-Reply-To: <1d7481bd-c08f-df14-4708-1c2e2a4ac1c0@docum.org> References: <4890d1b0-2f12-f8b6-684d-c98ca2b71ab7@docum.org> <1d7481bd-c08f-df14-4708-1c2e2a4ac1c0@docum.org> Message-ID: To keep this great chain going: If my metadata is on FLASH, would having a smaller blocksize for the system pool (metadata only) be helpful. My filesystem blocksize is 8MB On Fri, Sep 23, 2016 at 6:34 PM, Stef Coene wrote: > On 09/22/2016 08:36 PM, Stef Coene wrote: > >> Hi, >> >> Is it needed to specify a different blocksize for the system pool that >> holds the metadata? >> >> IBM recommends a 1 MB blocksize for the file system. >> But I wonder a smaller blocksize (256 KB or so) for metadata is a good >> idea or not... >> > I have read the replies and at the end, this is what we will do: > Since the back-end storage will be V5000 with a default stripe size of > 256KB and we use 8 data disk in an array, this means that 256KB * 8 = 2M is > the best choice for block size. > So 2 MB block size for data is the best choice. > > Since the block size for metadata is not that important in the latest > releases, we will also go for 2 MB block size for metadata. > > Inode size will be left at the default: 4 KB. > > > > Stef > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From luis.bolinches at fi.ibm.com Sat Sep 24 05:07:02 2016 From: luis.bolinches at fi.ibm.com (Luis Bolinches) Date: Sat, 24 Sep 2016 04:07:02 +0000 Subject: [gpfsug-discuss] Blocksize In-Reply-To: <17781503-26B3-4448-B7B9-1EE27ABE6D1F@ulmer.org> Message-ID: Not pendant but correct I flip there it is 1/32 -- Cheers > On 23 Sep 2016, at 22.16, Stephen Ulmer wrote: > > Not to be too pedantic, but I believe the the subblock size is 1/32 of the block size (which strengthens Luis?s arguments below). > > I thought the the original question was NOT about inode size, but about metadata block size. You can specify that the system pool have a different block size from the rest of the filesystem, providing that it ONLY holds metadata (?metadata-block-size option to mmcrfs). > > So with 4K inodes (which should be used for all new filesystems without some counter-indication), I would think that we?d want to use a metadata block size of 4K*32=128K. This is independent of the regular block size, which you can calculate based on the workload if you?re lucky. > > There could be a great reason NOT to use 128K metadata block size, but I don?t know what it is. I?d be happy to be corrected about this if it?s out of whack. > > -- > Stephen > > > >> On Sep 22, 2016, at 3:37 PM, Luis Bolinches wrote: >> >> Hi >> >> My 2 cents. >> >> Leave at least 4K inodes, then you get massive improvement on small files (less 3.5K minus whatever you use on xattr) >> >> About blocksize for data, unless you have actual data that suggest that you will actually benefit from smaller than 1MB block, leave there. GPFS uses sublocks where 1/16th of the BS can be allocated to different files, so the "waste" is much less than you think on 1MB and you get the throughput and less structures of much more data blocks. >> >> No warranty at all but I try to do this when the BS talk comes in: (might need some clean up it could not be last note but you get the idea) >> >> POSIX >> find . -type f -name '*' -exec ls -l {} \; > find_ls_files.out >> GPFS >> cd /usr/lpp/mmfs/samples/ilm >> gcc mmfindUtil_processOutputFile.c -o mmfindUtil_processOutputFile >> ./mmfind /gpfs/shared -ls -type f > find_ls_files.out >> CONVERT to CSV >> >> POSIX >> cat find_ls_files.out | awk '{print $5","}' > find_ls_files.out.csv >> GPFS >> cat find_ls_files.out | awk '{print $7","}' > find_ls_files.out.csv >> LOAD in octave >> >> FILESIZE = int32 (dlmread ("find_ls_files.out.csv", ",")); >> Clean the second column (OPTIONAL as the next clean up will do the same) >> >> FILESIZE(:,[2]) = []; >> If we are on 4K aligment we need to clean the files that go to inodes (WELL not exactly ... extended attributes! so maybe use a lower number!) >> >> FILESIZE(FILESIZE<=3584) =[]; >> If we are not we need to clean the 0 size files >> >> FILESIZE(FILESIZE==0) =[]; >> Median >> >> FILESIZEMEDIAN = int32 (median (FILESIZE)) >> Mean >> >> FILESIZEMEAN = int32 (mean (FILESIZE)) >> Variance >> >> int32 (var (FILESIZE)) >> iqr interquartile range, i.e., the difference between the upper and lower quartile, of the input data. >> >> int32 (iqr (FILESIZE)) >> Standard deviation >> >> >> For some FS with lots of files you might need a rather powerful machine to run the calculations on octave, I never hit anything could not manage on a 64GB RAM Power box. Most of the times it is enough with my laptop. 
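For anyone without octave at hand, a rough shell-only sketch of the same mean/median calculation can be run against the find_ls_files.out.csv built above (the 3584-byte cut-off mirrors the in-inode filter; drop it if you are not on 4K-aligned inodes):

  # strip the trailing commas, drop in-inode sizes, sort numerically for the median
  tr -d ',' < find_ls_files.out.csv | awk '$1 > 3584' | sort -n > sizes.txt
  # mean and median of the sorted sizes
  awk '{ sum += $1; v[NR] = $1 }
       END { n = NR
             mean = sum / n
             median = (n % 2) ? v[(n + 1) / 2] : (v[n / 2] + v[n / 2 + 1]) / 2
             printf "files %d  mean %.0f bytes  median %.0f bytes\n", n, mean, median }' sizes.txt

The numbers should land close to the octave FILESIZEMEAN / FILESIZEMEDIAN values.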
>> >> >> >> -- >> Yst?v?llisin terveisin / Kind regards / Saludos cordiales / Salutations >> >> Luis Bolinches >> Lab Services >> http://www-03.ibm.com/systems/services/labservices/ >> >> IBM Laajalahdentie 23 (main Entrance) Helsinki, 00330 Finland >> Phone: +358 503112585 >> >> "If you continually give you will continually have." Anonymous >> >> >> ----- Original message ----- >> From: Stef Coene >> Sent by: gpfsug-discuss-bounces at spectrumscale.org >> To: gpfsug main discussion list >> Cc: >> Subject: Re: [gpfsug-discuss] Blocksize >> Date: Thu, Sep 22, 2016 10:30 PM >> >> On 09/22/2016 09:07 PM, J. Eric Wonderley wrote: >> > It defaults to 4k: >> > mmlsfs testbs8M -i >> > flag value description >> > ------------------- ------------------------ >> > ----------------------------------- >> > -i 4096 Inode size in bytes >> > >> > I think you can make as small as 512b. Gpfs will store very small >> > files in the inode. >> > >> > Typically you want your average file size to be your blocksize and your >> > filesystem has one blocksize and one inodesize. >> >> The files are not small, but around 20 MB on average. >> So I calculated with IBM that a 1 MB or 2 MB block size is best. >> >> But I'm not sure if it's better to use a smaller block size for the >> metadata. >> >> The file system is not that large (400 TB) and will hold backup data >> from CommVault. >> >> >> Stef >> _______________________________________________ >> gpfsug-discuss mailing list >> gpfsug-discuss at spectrumscale.org >> http://gpfsug.org/mailman/listinfo/gpfsug-discuss >> >> >> >> Ellei edell? ole toisin mainittu: / Unless stated otherwise above: >> Oy IBM Finland Ab >> PL 265, 00101 Helsinki, Finland >> Business ID, Y-tunnus: 0195876-3 >> Registered in Finland >> >> _______________________________________________ >> gpfsug-discuss mailing list >> gpfsug-discuss at spectrumscale.org >> http://gpfsug.org/mailman/listinfo/gpfsug-discuss > Ellei edell? ole toisin mainittu: / Unless stated otherwise above: Oy IBM Finland Ab PL 265, 00101 Helsinki, Finland Business ID, Y-tunnus: 0195876-3 Registered in Finland -------------- next part -------------- An HTML attachment was scrubbed... URL: From Kevin.Buterbaugh at Vanderbilt.Edu Sat Sep 24 15:18:38 2016 From: Kevin.Buterbaugh at Vanderbilt.Edu (Buterbaugh, Kevin L) Date: Sat, 24 Sep 2016 14:18:38 +0000 Subject: [gpfsug-discuss] Blocksize In-Reply-To: References: <4890d1b0-2f12-f8b6-684d-c98ca2b71ab7@docum.org> <17781503-26B3-4448-B7B9-1EE27ABE6D1F@ulmer.org> Message-ID: Hi Sven, I am confused by your statement that the metadata block size should be 1 MB and am very interested in learning the rationale behind this as I am currently looking at all aspects of our current GPFS configuration and the possibility of making major changes. If you have a filesystem with only metadataOnly disks in the system pool and the default size of an inode is 4K (which we would do, since we have recently discovered that even on our scratch filesystem we have a bazillion files that are 4K or smaller and could therefore have their data stored in the inode, right?), then why would you set the metadata block size to anything larger than 128K when a sub-block is 1/32nd of a block? I.e., with a 1 MB block size for metadata wouldn?t you be wasting a massive amount of space? What am I missing / confused about there? Oh, and here?s a related question ? let?s just say I have the above configuration ? my system pool is metadata only and is on SSD?s. 
Then I have two other dataOnly pools that are spinning disk. One is for ?regular? access and the other is the ?capacity? pool ? i.e. a pool of slower storage where we move files with large access times. I have a policy that says something like ?move all files with an access time > 6 months to the capacity pool.? Of those bazillion files less than 4K in size that are fitting in the inode currently, probably half a bazillion () of them would be subject to that rule. Will they get moved to the spinning disk capacity pool or will they stay in the inode?? Thanks! This is a very timely and interesting discussion for me as well... Kevin On Sep 23, 2016, at 4:35 PM, Sven Oehme > wrote: your metadata block size these days should be 1 MB and there are only very few workloads for which you should run with a filesystem blocksize below 1 MB. so if you don't know exactly what to pick, 1 MB is a good starting point. the general rule still applies that your filesystem blocksize (metadata or data pool) should match your raid controller (or GNR vdisk) stripe size of the particular pool. so if you use a 128k strip size(defaut in many midrange storage controllers) in a 8+2p raid array, your stripe or track size is 1 MB and therefore the blocksize of this pool should be 1 MB. i see many customers in the field using 1MB or even smaller blocksize on RAID stripes of 2 MB or above and your performance will be significant impacted by that. Sven ------------------------------------------ Sven Oehme Scalable Storage Research email: oehmes at us.ibm.com Phone: +1 (408) 824-8904 IBM Almaden Research Lab ------------------------------------------ Stephen Ulmer ---09/23/2016 12:16:34 PM---Not to be too pedantic, but I believe the the subblock size is 1/32 of the block size (which strengt From: Stephen Ulmer > To: gpfsug main discussion list > Date: 09/23/2016 12:16 PM Subject: Re: [gpfsug-discuss] Blocksize Sent by: gpfsug-discuss-bounces at spectrumscale.org ________________________________ Not to be too pedantic, but I believe the the subblock size is 1/32 of the block size (which strengthens Luis?s arguments below). I thought the the original question was NOT about inode size, but about metadata block size. You can specify that the system pool have a different block size from the rest of the filesystem, providing that it ONLY holds metadata (?metadata-block-size option to mmcrfs). So with 4K inodes (which should be used for all new filesystems without some counter-indication), I would think that we?d want to use a metadata block size of 4K*32=128K. This is independent of the regular block size, which you can calculate based on the workload if you?re lucky. There could be a great reason NOT to use 128K metadata block size, but I don?t know what it is. I?d be happy to be corrected about this if it?s out of whack. -- Stephen On Sep 22, 2016, at 3:37 PM, Luis Bolinches > wrote: Hi My 2 cents. Leave at least 4K inodes, then you get massive improvement on small files (less 3.5K minus whatever you use on xattr) About blocksize for data, unless you have actual data that suggest that you will actually benefit from smaller than 1MB block, leave there. GPFS uses sublocks where 1/16th of the BS can be allocated to different files, so the "waste" is much less than you think on 1MB and you get the throughput and less structures of much more data blocks. No warranty at all but I try to do this when the BS talk comes in: (might need some clean up it could not be last note but you get the idea) POSIX find . 
-type f -name '*' -exec ls -l {} \; > find_ls_files.out GPFS cd /usr/lpp/mmfs/samples/ilm gcc mmfindUtil_processOutputFile.c -o mmfindUtil_processOutputFile ./mmfind /gpfs/shared -ls -type f > find_ls_files.out CONVERT to CSV POSIX cat find_ls_files.out | awk '{print $5","}' > find_ls_files.out.csv GPFS cat find_ls_files.out | awk '{print $7","}' > find_ls_files.out.csv LOAD in octave FILESIZE = int32 (dlmread ("find_ls_files.out.csv", ",")); Clean the second column (OPTIONAL as the next clean up will do the same) FILESIZE(:,[2]) = []; If we are on 4K aligment we need to clean the files that go to inodes (WELL not exactly ... extended attributes! so maybe use a lower number!) FILESIZE(FILESIZE<=3584) =[]; If we are not we need to clean the 0 size files FILESIZE(FILESIZE==0) =[]; Median FILESIZEMEDIAN = int32 (median (FILESIZE)) Mean FILESIZEMEAN = int32 (mean (FILESIZE)) Variance int32 (var (FILESIZE)) iqr interquartile range, i.e., the difference between the upper and lower quartile, of the input data. int32 (iqr (FILESIZE)) Standard deviation For some FS with lots of files you might need a rather powerful machine to run the calculations on octave, I never hit anything could not manage on a 64GB RAM Power box. Most of the times it is enough with my laptop. -- Yst?v?llisin terveisin / Kind regards / Saludos cordiales / Salutations Luis Bolinches Lab Services http://www-03.ibm.com/systems/services/labservices/ IBM Laajalahdentie 23 (main Entrance) Helsinki, 00330 Finland Phone: +358 503112585 "If you continually give you will continually have." Anonymous ----- Original message ----- From: Stef Coene > Sent by: gpfsug-discuss-bounces at spectrumscale.org To: gpfsug main discussion list > Cc: Subject: Re: [gpfsug-discuss] Blocksize Date: Thu, Sep 22, 2016 10:30 PM On 09/22/2016 09:07 PM, J. Eric Wonderley wrote: > It defaults to 4k: > mmlsfs testbs8M -i > flag value description > ------------------- ------------------------ > ----------------------------------- > -i 4096 Inode size in bytes > > I think you can make as small as 512b. Gpfs will store very small > files in the inode. > > Typically you want your average file size to be your blocksize and your > filesystem has one blocksize and one inodesize. The files are not small, but around 20 MB on average. So I calculated with IBM that a 1 MB or 2 MB block size is best. But I'm not sure if it's better to use a smaller block size for the metadata. The file system is not that large (400 TB) and will hold backup data from CommVault. Stef _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss Ellei edell? ole toisin mainittu: / Unless stated otherwise above: Oy IBM Finland Ab PL 265, 00101 Helsinki, Finland Business ID, Y-tunnus: 0195876-3 Registered in Finland _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From makaplan at us.ibm.com Sat Sep 24 17:18:11 2016 From: makaplan at us.ibm.com (Marc A Kaplan) Date: Sat, 24 Sep 2016 12:18:11 -0400 Subject: [gpfsug-discuss] Blocksize and MetaData Blocksizes - FORGET the old advice In-Reply-To: References: <17781503-26B3-4448-B7B9-1EE27ABE6D1F@ulmer.org> Message-ID: Metadata is inodes, directories, indirect blocks (indices). Spectrum Scale (GPFS) Version 4.1 introduced significant improvements to the data structures used to represent directories. Larger inodes supporting data and extended attributes in the inode are other significant relatively recent improvements. Now small directories are stored in the inode, while for large directories blocks can be bigger than 32MB, and any and all directory blocks that are smaller than the metadata-blocksize, are allocated just like "fragments" - so directories are now space efficient. SO MUCH SO, that THE OLD ADVICE, about using smallish blocksizes for metadata, GOES "OUT THE WINDOW". Period. FORGET most of what you thought you knew about "best" or "optimal" metadata-blocksize. The new advice is, as Sven wrote: Use a blocksize that optimizes IO transfer efficiency and speed. This is true for BOTH data and metadata. Now, IF you have system pool set up as metadata only AND system pool is on devices that have a different "optimal" block size than your other pools, THEN, it may make sense to use two different blocksizes, one for data and another for metadata. For example, maybe you have massively striped RAID or RAID-LIKE (GSS or ESS)) storage for huge files - so maybe 8MB is a good blocksize for that. But maybe you have your metadata on SSD devices and maybe 1MB is the "best" blocksize for that. -------------- next part -------------- An HTML attachment was scrubbed... URL: From makaplan at us.ibm.com Sat Sep 24 18:31:37 2016 From: makaplan at us.ibm.com (Marc A Kaplan) Date: Sat, 24 Sep 2016 13:31:37 -0400 Subject: [gpfsug-discuss] Blocksize - consider IO transfer efficiency above your other prejudices In-Reply-To: References: <4890d1b0-2f12-f8b6-684d-c98ca2b71ab7@docum.org><17781503-26B3-4448-B7B9-1EE27ABE6D1F@ulmer.org> Message-ID: (I can answer your basic questions, Sven has more experience with tuning very large file systems, so perhaps he will have more to say...) 1. Inodes are packed into the file of inodes. (There is one file of all the inodes in a filesystem). If you have metadata-blocksize 1MB you will have 256 of 4KB inodes per block. Forget about sub-blocks when it comes to the file of inodes. 2. IF a file's data fits in its inode, then migrating that file from one pool to another just changes the preferred pool name in the inode. No data movement. Should the file later "grow" to require a data block, that data block will be allocated from whatever pool is named in the inode at that time. See the email I posted earlier today. Basically: FORGET what you thought you knew about optimal metadata blocksize (perhaps based on how you thought metadata was laid out on disk) and just stick to optimal IO transfer blocksizes. Yes, there may be contrived scenarios or even a few real live special cases, but those would be few and far between. Try following the newer general, easier, rule and see how well it works. 
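To make point 2 concrete, here is a minimal sketch in the spirit of Kevin's scenario (the device gpfs0, the pool names 'data' and 'capacity', and the file path are placeholders, not names taken from the thread):

  cat > capacity.pol <<'EOF'
  /* move files not accessed for roughly six months to the capacity pool */
  RULE 'toCapacity' MIGRATE FROM POOL 'data' TO POOL 'capacity'
       WHERE (DAYS(CURRENT_TIMESTAMP) - DAYS(ACCESS_TIME)) > 180
  EOF

  # -I test evaluates the rule and reports candidates without moving anything
  mmapplypolicy gpfs0 -P capacity.pol -I test -L 2

  # for a small file whose data lives in its 4K inode, a real run only changes the
  # assigned pool name shown here; no data block is allocated until the file
  # outgrows the inode
  mmlsattr -L /gpfs0/some/small/file

That is, the "migration" of an in-inode file is just the pool-name update described in point 2.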
From: "Buterbaugh, Kevin L" To: gpfsug main discussion list Date: 09/24/2016 10:19 AM Subject: Re: [gpfsug-discuss] Blocksize Sent by: gpfsug-discuss-bounces at spectrumscale.org Hi Sven, I am confused by your statement that the metadata block size should be 1 MB and am very interested in learning the rationale behind this as I am currently looking at all aspects of our current GPFS configuration and the possibility of making major changes. If you have a filesystem with only metadataOnly disks in the system pool and the default size of an inode is 4K (which we would do, since we have recently discovered that even on our scratch filesystem we have a bazillion files that are 4K or smaller and could therefore have their data stored in the inode, right?), then why would you set the metadata block size to anything larger than 128K when a sub-block is 1/32nd of a block? I.e., with a 1 MB block size for metadata wouldn?t you be wasting a massive amount of space? What am I missing / confused about there? Oh, and here?s a related question ? let?s just say I have the above configuration ? my system pool is metadata only and is on SSD?s. Then I have two other dataOnly pools that are spinning disk. One is for ?regular? access and the other is the ?capacity? pool ? i.e. a pool of slower storage where we move files with large access times. I have a policy that says something like ?move all files with an access time > 6 months to the capacity pool.? Of those bazillion files less than 4K in size that are fitting in the inode currently, probably half a bazillion () of them would be subject to that rule. Will they get moved to the spinning disk capacity pool or will they stay in the inode?? Thanks! This is a very timely and interesting discussion for me as well... Kevin On Sep 23, 2016, at 4:35 PM, Sven Oehme wrote: your metadata block size these days should be 1 MB and there are only very few workloads for which you should run with a filesystem blocksize below 1 MB. so if you don't know exactly what to pick, 1 MB is a good starting point. the general rule still applies that your filesystem blocksize (metadata or data pool) should match your raid controller (or GNR vdisk) stripe size of the particular pool. so if you use a 128k strip size(defaut in many midrange storage controllers) in a 8+2p raid array, your stripe or track size is 1 MB and therefore the blocksize of this pool should be 1 MB. i see many customers in the field using 1MB or even smaller blocksize on RAID stripes of 2 MB or above and your performance will be significant impacted by that. Sven ------------------------------------------ Sven Oehme Scalable Storage Research email: oehmes at us.ibm.com Phone: +1 (408) 824-8904 IBM Almaden Research Lab ------------------------------------------ Stephen Ulmer ---09/23/2016 12:16:34 PM---Not to be too pedantic, but I believe the the subblock size is 1/32 of the block size (which strengt From: Stephen Ulmer To: gpfsug main discussion list Date: 09/23/2016 12:16 PM Subject: Re: [gpfsug-discuss] Blocksize Sent by: gpfsug-discuss-bounces at spectrumscale.org Not to be too pedantic, but I believe the the subblock size is 1/32 of the block size (which strengthens Luis?s arguments below). I thought the the original question was NOT about inode size, but about metadata block size. You can specify that the system pool have a different block size from the rest of the filesystem, providing that it ONLY holds metadata (?metadata-block-size option to mmcrfs). 
So with 4K inodes (which should be used for all new filesystems without some counter-indication), I would think that we?d want to use a metadata block size of 4K*32=128K. This is independent of the regular block size, which you can calculate based on the workload if you?re lucky. There could be a great reason NOT to use 128K metadata block size, but I don?t know what it is. I?d be happy to be corrected about this if it?s out of whack. -- Stephen On Sep 22, 2016, at 3:37 PM, Luis Bolinches wrote: Hi My 2 cents. Leave at least 4K inodes, then you get massive improvement on small files (less 3.5K minus whatever you use on xattr) About blocksize for data, unless you have actual data that suggest that you will actually benefit from smaller than 1MB block, leave there. GPFS uses sublocks where 1/16th of the BS can be allocated to different files, so the "waste" is much less than you think on 1MB and you get the throughput and less structures of much more data blocks. No warranty at all but I try to do this when the BS talk comes in: (might need some clean up it could not be last note but you get the idea) POSIX find . -type f -name '*' -exec ls -l {} \; > find_ls_files.out GPFS cd /usr/lpp/mmfs/samples/ilm gcc mmfindUtil_processOutputFile.c -o mmfindUtil_processOutputFile ./mmfind /gpfs/shared -ls -type f > find_ls_files.out CONVERT to CSV POSIX cat find_ls_files.out | awk '{print $5","}' > find_ls_files.out.csv GPFS cat find_ls_files.out | awk '{print $7","}' > find_ls_files.out.csv LOAD in octave FILESIZE = int32 (dlmread ("find_ls_files.out.csv", ",")); Clean the second column (OPTIONAL as the next clean up will do the same) FILESIZE(:,[2]) = []; If we are on 4K aligment we need to clean the files that go to inodes (WELL not exactly ... extended attributes! so maybe use a lower number!) FILESIZE(FILESIZE<=3584) =[]; If we are not we need to clean the 0 size files FILESIZE(FILESIZE==0) =[]; Median FILESIZEMEDIAN = int32 (median (FILESIZE)) Mean FILESIZEMEAN = int32 (mean (FILESIZE)) Variance int32 (var (FILESIZE)) iqr interquartile range, i.e., the difference between the upper and lower quartile, of the input data. int32 (iqr (FILESIZE)) Standard deviation For some FS with lots of files you might need a rather powerful machine to run the calculations on octave, I never hit anything could not manage on a 64GB RAM Power box. Most of the times it is enough with my laptop. -- Yst?v?llisin terveisin / Kind regards / Saludos cordiales / Salutations Luis Bolinches Lab Services http://www-03.ibm.com/systems/services/labservices/ IBM Laajalahdentie 23 (main Entrance) Helsinki, 00330 Finland Phone: +358 503112585 "If you continually give you will continually have." Anonymous ----- Original message ----- From: Stef Coene Sent by: gpfsug-discuss-bounces at spectrumscale.org To: gpfsug main discussion list Cc: Subject: Re: [gpfsug-discuss] Blocksize Date: Thu, Sep 22, 2016 10:30 PM On 09/22/2016 09:07 PM, J. Eric Wonderley wrote: > It defaults to 4k: > mmlsfs testbs8M -i > flag value description > ------------------- ------------------------ > ----------------------------------- > -i 4096 Inode size in bytes > > I think you can make as small as 512b. Gpfs will store very small > files in the inode. > > Typically you want your average file size to be your blocksize and your > filesystem has one blocksize and one inodesize. The files are not small, but around 20 MB on average. So I calculated with IBM that a 1 MB or 2 MB block size is best. 
But I'm not sure if it's better to use a smaller block size for the metadata. The file system is not that large (400 TB) and will hold backup data from CommVault. Stef _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss Ellei edell? ole toisin mainittu: / Unless stated otherwise above: Oy IBM Finland Ab PL 265, 00101 Helsinki, Finland Business ID, Y-tunnus: 0195876-3 Registered in Finland _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss -------------- next part -------------- An HTML attachment was scrubbed... URL: From stef.coene at docum.org Sat Sep 24 19:16:49 2016 From: stef.coene at docum.org (Stef Coene) Date: Sat, 24 Sep 2016 20:16:49 +0200 Subject: [gpfsug-discuss] Maximum NSD size Message-ID: <239fd428-3544-917d-5439-d40ea36f0668@docum.org> Hi, When formatting the NDS for a new file system, I noticed a warning about a maximum size: Formatting file system ... Disks up to size 8.8 TB can be added to storage pool system. Disks up to size 9.0 TB can be added to storage pool V5000. I searched the docs, but I couldn't find any reference regarding the maximum size of NSDs? Stef From oehmes at gmail.com Sun Sep 25 17:25:40 2016 From: oehmes at gmail.com (Sven Oehme) Date: Sun, 25 Sep 2016 16:25:40 +0000 Subject: [gpfsug-discuss] Maximum NSD size In-Reply-To: <239fd428-3544-917d-5439-d40ea36f0668@docum.org> References: <239fd428-3544-917d-5439-d40ea36f0668@docum.org> Message-ID: the limit you see above is NOT the max NSD limit for Scale/GPFS, its rather the limit of the NSD size you can add to this Filesystems pool. depending on which version of code you are running, we limit the maximum size of a NSD that can be added to a pool so you don't have mixtures of lets say 1 TB and 100 TB disks in one pool as this will negatively affect performance. in older versions we where more restrictive than in newer versions. Sven On Sat, Sep 24, 2016 at 11:16 AM Stef Coene wrote: > Hi, > > When formatting the NDS for a new file system, I noticed a warning about > a maximum size: > > Formatting file system ... > Disks up to size 8.8 TB can be added to storage pool system. > Disks up to size 9.0 TB can be added to storage pool V5000. > > I searched the docs, but I couldn't find any reference regarding the > maximum size of NSDs? > > > Stef > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From oehmes at gmail.com Sun Sep 25 18:11:12 2016 From: oehmes at gmail.com (Sven Oehme) Date: Sun, 25 Sep 2016 17:11:12 +0000 Subject: [gpfsug-discuss] Blocksize In-Reply-To: References: <4890d1b0-2f12-f8b6-684d-c98ca2b71ab7@docum.org> <17781503-26B3-4448-B7B9-1EE27ABE6D1F@ulmer.org> Message-ID: well, its not that easy and there is no perfect answer here. so lets start with some data points that might help decide: inodes, directory blocks, allocation maps for data as well as metadata don't follow the same restrictions as data 'fragments' or subblocks, means they are not bond to the 1/32 of the blocksize. they rather get organized on calculated sized blocks which can be very small (significant smaller than 1/32th) or close to the max of the blocksize for a single object. therefore the space waste concern doesn't really apply here. policy scans loves larger blocks as the blocks will be randomly scattered across the NSD's and therefore larger contiguous blocks for inode scan will perform significantly faster on larger metadata blocksizes than on smaller (assuming this is disk, with SSD's this doesn't matter that much) so for disk based systems it is advantageous to use larger blocks , for SSD based its less of an issue. you shouldn't choose on the other hand too large blocks even for disk drive based systems as there is one catch to all this. small updates on metadata typically end up writing the whole metadata block e.g. 256k for a directory block which now need to be destaged and read back from another node changing the same block. hope this helps. Sven On Sat, Sep 24, 2016 at 7:18 AM Buterbaugh, Kevin L < Kevin.Buterbaugh at vanderbilt.edu> wrote: > Hi Sven, > > I am confused by your statement that the metadata block size should be 1 > MB and am very interested in learning the rationale behind this as I am > currently looking at all aspects of our current GPFS configuration and the > possibility of making major changes. > > If you have a filesystem with only metadataOnly disks in the system pool > and the default size of an inode is 4K (which we would do, since we have > recently discovered that even on our scratch filesystem we have a bazillion > files that are 4K or smaller and could therefore have their data stored in > the inode, right?), then why would you set the metadata block size to > anything larger than 128K when a sub-block is 1/32nd of a block? I.e., > with a 1 MB block size for metadata wouldn?t you be wasting a massive > amount of space? > > What am I missing / confused about there? > > Oh, and here?s a related question ? let?s just say I have the above > configuration ? my system pool is metadata only and is on SSD?s. Then I > have two other dataOnly pools that are spinning disk. One is for ?regular? > access and the other is the ?capacity? pool ? i.e. a pool of slower storage > where we move files with large access times. I have a policy that says > something like ?move all files with an access time > 6 months to the > capacity pool.? Of those bazillion files less than 4K in size that are > fitting in the inode currently, probably half a bazillion () of them > would be subject to that rule. Will they get moved to the spinning disk > capacity pool or will they stay in the inode?? > > Thanks! This is a very timely and interesting discussion for me as well... 
> > Kevin > > On Sep 23, 2016, at 4:35 PM, Sven Oehme wrote: > > your metadata block size these days should be 1 MB and there are only very > few workloads for which you should run with a filesystem blocksize below 1 > MB. so if you don't know exactly what to pick, 1 MB is a good starting > point. > the general rule still applies that your filesystem blocksize (metadata or > data pool) should match your raid controller (or GNR vdisk) stripe size of > the particular pool. > > so if you use a 128k strip size(defaut in many midrange storage > controllers) in a 8+2p raid array, your stripe or track size is 1 MB and > therefore the blocksize of this pool should be 1 MB. i see many customers > in the field using 1MB or even smaller blocksize on RAID stripes of 2 MB or > above and your performance will be significant impacted by that. > > Sven > > ------------------------------------------ > Sven Oehme > Scalable Storage Research > email: oehmes at us.ibm.com > Phone: +1 (408) 824-8904 > IBM Almaden Research Lab > ------------------------------------------ > > Stephen Ulmer ---09/23/2016 12:16:34 PM---Not to be too > pedantic, but I believe the the subblock size is 1/32 of the block size > (which strengt > > > > From: Stephen Ulmer > To: gpfsug main discussion list > Date: 09/23/2016 12:16 PM > Subject: Re: [gpfsug-discuss] Blocksize > Sent by: gpfsug-discuss-bounces at spectrumscale.org > > ------------------------------ > > > > Not to be too pedantic, but I believe the the subblock size is 1/32 of the > block size (which strengthens Luis?s arguments below). > > I thought the the original question was NOT about inode size, but about > metadata block size. You can specify that the system pool have a different > block size from the rest of the filesystem, providing that it ONLY holds > metadata (?metadata-block-size option to mmcrfs). > > So with 4K inodes (which should be used for all new filesystems without > some counter-indication), I would think that we?d want to use a metadata > block size of 4K*32=128K. This is independent of the regular block size, > which you can calculate based on the workload if you?re lucky. > > There could be a great reason NOT to use 128K metadata block size, but I > don?t know what it is. I?d be happy to be corrected about this if it?s out > of whack. > > -- > Stephen > > > > On Sep 22, 2016, at 3:37 PM, Luis Bolinches < > *luis.bolinches at fi.ibm.com* > wrote: > > Hi > > My 2 cents. > > Leave at least 4K inodes, then you get massive improvement on small > files (less 3.5K minus whatever you use on xattr) > > About blocksize for data, unless you have actual data that suggest > that you will actually benefit from smaller than 1MB block, leave there. > GPFS uses sublocks where 1/16th of the BS can be allocated to different > files, so the "waste" is much less than you think on 1MB and you get the > throughput and less structures of much more data blocks. > > No* warranty at all* but I try to do this when the BS talk comes > in: (might need some clean up it could not be last note but you get the > idea) > > POSIX > find . 
-type f -name '*' -exec ls -l {} \; > find_ls_files.out > GPFS > cd /usr/lpp/mmfs/samples/ilm > gcc mmfindUtil_processOutputFile.c -o mmfindUtil_processOutputFile > ./mmfind /gpfs/shared -ls -type f > find_ls_files.out > CONVERT to CSV > > POSIX > cat find_ls_files.out | awk '{print $5","}' > find_ls_files.out.csv > GPFS > cat find_ls_files.out | awk '{print $7","}' > find_ls_files.out.csv > LOAD in octave > > FILESIZE = int32 (dlmread ("find_ls_files.out.csv", ",")); > Clean the second column (OPTIONAL as the next clean up will do the > same) > > FILESIZE(:,[2]) = []; > If we are on 4K aligment we need to clean the files that go to > inodes (WELL not exactly ... extended attributes! so maybe use a lower > number!) > > FILESIZE(FILESIZE<=3584) =[]; > If we are not we need to clean the 0 size files > > FILESIZE(FILESIZE==0) =[]; > Median > > FILESIZEMEDIAN = int32 (median (FILESIZE)) > Mean > > FILESIZEMEAN = int32 (mean (FILESIZE)) > Variance > > int32 (var (FILESIZE)) > iqr interquartile range, i.e., the difference between the upper and > lower quartile, of the input data. > > int32 (iqr (FILESIZE)) > Standard deviation > > > For some FS with lots of files you might need a rather powerful > machine to run the calculations on octave, I never hit anything could not > manage on a 64GB RAM Power box. Most of the times it is enough with my > laptop. > > > > -- > Yst?v?llisin terveisin / Kind regards / Saludos cordiales / > Salutations > > Luis Bolinches > Lab Services > *http://www-03.ibm.com/systems/services/labservices/* > > > IBM Laajalahdentie 23 (main Entrance) Helsinki, 00330 Finland > Phone: +358 503112585 > > "If you continually give you will continually have." Anonymous > > > ----- Original message ----- > From: Stef Coene <*stef.coene at docum.org* > > Sent by: *gpfsug-discuss-bounces at spectrumscale.org* > > To: gpfsug main discussion list <*gpfsug-discuss at spectrumscale.org* > > > Cc: > Subject: Re: [gpfsug-discuss] Blocksize > Date: Thu, Sep 22, 2016 10:30 PM > > On 09/22/2016 09:07 PM, J. Eric Wonderley wrote: > > It defaults to 4k: > > mmlsfs testbs8M -i > > flag value description > > ------------------- ------------------------ > > ----------------------------------- > > -i 4096 Inode size in bytes > > > > I think you can make as small as 512b. Gpfs will store very > small > > files in the inode. > > > > Typically you want your average file size to be your blocksize > and your > > filesystem has one blocksize and one inodesize. > > The files are not small, but around 20 MB on average. > So I calculated with IBM that a 1 MB or 2 MB block size is best. > > But I'm not sure if it's better to use a smaller block size for the > metadata. > > The file system is not that large (400 TB) and will hold backup data > from CommVault. > > > Stef > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at *spectrumscale.org* > *http://gpfsug.org/mailman/listinfo/gpfsug-discuss* > > > > > Ellei edell? 
ole toisin mainittu: / Unless stated otherwise above: > Oy IBM Finland Ab > PL 265, 00101 Helsinki, Finland > Business ID, Y-tunnus: 0195876-3 > Registered in Finland > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at *spectrumscale.org* > *http://gpfsug.org/mailman/listinfo/gpfsug-discuss* > > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > -------------- next part -------------- An HTML attachment was scrubbed... URL: From alandhae at gmx.de Mon Sep 26 08:53:48 2016 From: alandhae at gmx.de (=?ISO-8859-15?Q?Andreas_Landh=E4u=DFer?=) Date: Mon, 26 Sep 2016 09:53:48 +0200 (CEST) Subject: [gpfsug-discuss] File-Access Reporting Message-ID: hello all GPFS 'ehmm Spectrum Scale experts out there, we are using GPFS as a Filesystem for a new Data Application. They have defined the need to get reports about: Transfer volume [or file access]: by user, ..., by service, by product type ... at least on a daily basis. they need a report about: fileopen, fileclose, or requestEndTime, requestDuration, fileProductName [path and filename], dataSize. userId. I could think of, using sysstat (sar) for getting some of the numbers, but not being sure, if the numbers we will be receiving are correct. Andreas -- Andreas Landh?u?er +49 151 12133027 (mobile) alandhae at gmx.de From alandhae at gmx.de Mon Sep 26 13:12:18 2016 From: alandhae at gmx.de (=?ISO-8859-15?Q?Andreas_Landh=E4u=DFer?=) Date: Mon, 26 Sep 2016 14:12:18 +0200 (CEST) Subject: [gpfsug-discuss] File_heat for GPFS File Systems Questions over Questions ... Message-ID: Hello GPFS experts, customer wanting a report about the usage of the usage including file_heat in a large Filesystem. The report should be taken every month. mmchconfig fileHeatLossPercent=10,fileHeatPeriodMinutes=30240 -i fileHeatPeriodMinutes=30240 equals to 21 days. I#m wondering about the behavior of fileHeatLossPercent. - If it is set to 10, will file_heat decrease from 1 to 0 in 10 steps? - Or does file_heat have an asymptotic behavior, and heat 0 will never be reached? Anyways the results will be similar ;-) latter taking longer. We want to achieve following file lists: - File_Heat > 50% -> rather hot data - File_Heat 50% < x < 20 -> lukewarm data - File_Heat 20% <= x <= 0% -> ice cold data We will have to work on the limits between the File_Heat classes, depending on customers wishes. Are there better parameter settings for achieving this? Do any scripts/programs exist for analyzing the file_heat data? We have observed when taking policy runs on a large GPFS file system, the meta data performance significantly dropped, until job was finished. It took about 15 minutes on a 880 TB with 150 Mio entries GPFS file system. How is the behavior, when file_heat is being switched on? Do all files in the GPFS have the same temperature? 
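One way to get those three buckets is a set of LIST rules keyed on FILE_HEAT. Note that FILE_HEAT is an absolute number rather than a percentage, so the cut-offs in this sketch are invented values that would have to be calibrated against your own file system (for instance from a first run that only SHOWs the heat values); gpfs0 is a placeholder device:

  cat > heat.pol <<'EOF'
  RULE 'hot'  LIST 'hot'      SHOW('heat=' || varchar(FILE_HEAT)) WHERE FILE_HEAT >= 5.0
  RULE 'warm' LIST 'lukewarm' SHOW('heat=' || varchar(FILE_HEAT)) WHERE FILE_HEAT >= 1.0 AND FILE_HEAT < 5.0
  RULE 'cold' LIST 'cold'     SHOW('heat=' || varchar(FILE_HEAT)) WHERE FILE_HEAT < 1.0
  EOF

  # dry run: evaluate the rules and print the matching files with their heat values
  mmapplypolicy gpfs0 -P heat.pol -I test -L 2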
Thanks for your help Ciao Andreas -- Andreas Landh?u?er +49 151 12133027 (mobile) alandhae at gmx.de From makaplan at us.ibm.com Mon Sep 26 16:11:52 2016 From: makaplan at us.ibm.com (Marc A Kaplan) Date: Mon, 26 Sep 2016 11:11:52 -0400 Subject: [gpfsug-discuss] File_heat for GPFS File Systems In-Reply-To: References: Message-ID: fileHeatLossPercent=10, fileHeatPeriodMinutes=1440 means any file that has not been accessed for 1440 minutes (24 hours = 1 day) will lose 10% of its Heat. So if it's heat was X at noon today, tomorrow 0.90 X, the next day 0.81X, on the k'th day (.90)**k * X. After 63 fileHeatPeriods, we always round down and compute file heat as 0.0. The computation (in floating point with some approximations) is done "on demand" based on a heat value stored in the Inode the last time the unix access "atime" and the current time. So the cost of maintaining FILE_HEAT for a file is some bit twiddling, but only when the file is accessed and the atime would be updated in the inode anyway. File heat increases by approximately 1.0 each time the entire file is read from disk. This is done proportionately so if you read in half of the blocks the increase is 0.5. If you read all the blocks twice FROM DISK the file heat is increased by 2. And so on. But only IOPs are charged. If you repeatedly do posix read()s but the data is in cache, no heat is added. The easiest way to observe FILE_HEAT is with the mmapplypolicy directory -I test -L 2 -P fileheatrule.policy RULE 'fileheatrule' LIST 'hot' SHOW('Heat=' || varchar(FILE_HEAT)) /* in file fileheatfule.policy */ Because policy reads metadata from inodes as stored on disk, when experimenting/testing you may need to mmfsctl fs suspend-write; mmfsctl fs resume to see results immediately. From: Andreas Landh?u?er To: gpfsug-discuss at spectrumscale.org Date: 09/26/2016 08:12 AM Subject: [gpfsug-discuss] File_heat for GPFS File Systems Questions over Questions ... Sent by: gpfsug-discuss-bounces at spectrumscale.org Hello GPFS experts, customer wanting a report about the usage of the usage including file_heat in a large Filesystem. The report should be taken every month. mmchconfig fileHeatLossPercent=10,fileHeatPeriodMinutes=30240 -i fileHeatPeriodMinutes=30240 equals to 21 days. I#m wondering about the behavior of fileHeatLossPercent. - If it is set to 10, will file_heat decrease from 1 to 0 in 10 steps? - Or does file_heat have an asymptotic behavior, and heat 0 will never be reached? Anyways the results will be similar ;-) latter taking longer. We want to achieve following file lists: - File_Heat > 50% -> rather hot data - File_Heat 50% < x < 20 -> lukewarm data - File_Heat 20% <= x <= 0% -> ice cold data We will have to work on the limits between the File_Heat classes, depending on customers wishes. Are there better parameter settings for achieving this? Do any scripts/programs exist for analyzing the file_heat data? We have observed when taking policy runs on a large GPFS file system, the meta data performance significantly dropped, until job was finished. It took about 15 minutes on a 880 TB with 150 Mio entries GPFS file system. How is the behavior, when file_heat is being switched on? Do all files in the GPFS have the same temperature? 
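The decay described above is easy to tabulate; a quick sketch with the 10%-per-period example (pure arithmetic, the starting heat of 1.0 is arbitrary):

  awk 'BEGIN { h = 1.0
               for (k = 0; k <= 21; k++) { printf "period %2d  heat %.4f\n", k, h; h *= 0.90 } }'

So the value decays geometrically rather than in ten linear steps, and per the note above it is rounded down to 0.0 after enough periods (about 63 with a 10% loss) instead of approaching zero forever.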
Thanks for your help Ciao Andreas -- Andreas Landh?u?er +49 151 12133027 (mobile) alandhae at gmx.de _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss -------------- next part -------------- An HTML attachment was scrubbed... URL: From volobuev at us.ibm.com Mon Sep 26 19:18:15 2016 From: volobuev at us.ibm.com (Yuri L Volobuev) Date: Mon, 26 Sep 2016 11:18:15 -0700 Subject: [gpfsug-discuss] Blocksize In-Reply-To: References: <4890d1b0-2f12-f8b6-684d-c98ca2b71ab7@docum.org><17781503-26B3-4448-B7B9-1EE27ABE6D1F@ulmer.org> Message-ID: It's important to understand the differences between different metadata types, in particular where it comes to space allocation. System metadata files (inode file, inode and block allocation maps, ACL file, fileset metadata file, EA file in older versions) are allocated at well-defined moments (file system format, new storage pool creation in the case of block allocation map, etc), and those contain multiple records packed into a single block. From the block allocator point of view, the individual metadata record size is invisible, only larger blocks get actually allocated, and space usage efficiency generally isn't an issue. For user metadata (indirect blocks, directory blocks, EA overflow blocks) the situation is different. Those get allocated as the need arises, generally one at a time. So the size of an individual metadata structure matters, a lot. The smallest unit of allocation in GPFS is a subblock (1/32nd of a block). If an IB or a directory block is smaller than a subblock, the unused space in the subblock is wasted. So if one chooses to use, say, 16 MiB block size for metadata, the smallest unit of space that can be allocated is 512 KiB. If one chooses 1 MiB block size, the smallest allocation unit is 32 KiB. IBs are generally 16 KiB or 32 KiB in size (32 KiB with any reasonable data block size); directory blocks used to be limited to 32 KiB, but in the current code can be as large as 256 KiB. As one can observe, using 16 MiB metadata block size would lead to a considerable amount of wasted space for IBs and large directories (small directories can live in inodes). On the other hand, with 1 MiB block size, there'll be no wasted metadata space. Does any of this actually make a practical difference? That depends on the file system composition, namely the number of IBs (which is a function of the number of large files) and larger directories. Calculating this scientifically can be pretty involved, and really should be the job of a tool that ought to exist, but doesn't (yet). A more practical approach is doing a ballpark estimate using local file counts and typical fractions of large files and directories, using statistics available from published papers. The performance implications of a given metadata block size choice is a subject of nearly infinite depth, and this question ultimately can only be answered by doing experiments with a specific workload on specific hardware. The metadata space utilization efficiency is something that can be answered conclusively though. 
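That space argument can be sanity-checked with a quick back-of-the-envelope loop (the candidate block sizes are arbitrary; 32 KiB is the indirect-block size quoted above, and gpfs0 is a placeholder device):

  for bs_kib in 256 1024 4096 16384; do
      sub=$(( bs_kib / 32 ))                  # subblock = 1/32 of the block size, in KiB
      waste=$(( sub > 32 ? sub - 32 : 0 ))    # unused KiB when a 32 KiB indirect block occupies one subblock
      echo "metadata blocksize ${bs_kib} KiB -> subblock ${sub} KiB -> ~${waste} KiB wasted per indirect block"
  done

  # the actual minimum fragment (subblock) size of an existing file system:
  mmlsfs gpfs0 -f

With a 1 MiB metadata block the 32 KiB indirect block fills its subblock exactly (no waste); at 16 MiB roughly 480 KiB per indirect block goes unused, which is the 512 KiB minimum-allocation case described above.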
yuri From: "Buterbaugh, Kevin L" To: gpfsug main discussion list , Date: 09/24/2016 07:19 AM Subject: Re: [gpfsug-discuss] Blocksize Sent by: gpfsug-discuss-bounces at spectrumscale.org Hi Sven, I am confused by your statement that the metadata block size should be 1 MB and am very interested in learning the rationale behind this as I am currently looking at all aspects of our current GPFS configuration and the possibility of making major changes. If you have a filesystem with only metadataOnly disks in the system pool and the default size of an inode is 4K (which we would do, since we have recently discovered that even on our scratch filesystem we have a bazillion files that are 4K or smaller and could therefore have their data stored in the inode, right?), then why would you set the metadata block size to anything larger than 128K when a sub-block is 1/32nd of a block? I.e., with a 1 MB block size for metadata wouldn?t you be wasting a massive amount of space? What am I missing / confused about there? Oh, and here?s a related question ? let?s just say I have the above configuration ? my system pool is metadata only and is on SSD?s. Then I have two other dataOnly pools that are spinning disk. One is for ?regular? access and the other is the ?capacity? pool ? i.e. a pool of slower storage where we move files with large access times. I have a policy that says something like ?move all files with an access time > 6 months to the capacity pool.? Of those bazillion files less than 4K in size that are fitting in the inode currently, probably half a bazillion () of them would be subject to that rule. Will they get moved to the spinning disk capacity pool or will they stay in the inode?? Thanks! This is a very timely and interesting discussion for me as well... Kevin On Sep 23, 2016, at 4:35 PM, Sven Oehme wrote: your metadata block size these days should be 1 MB and there are only very few workloads for which you should run with a filesystem blocksize below 1 MB. so if you don't know exactly what to pick, 1 MB is a good starting point. the general rule still applies that your filesystem blocksize (metadata or data pool) should match your raid controller (or GNR vdisk) stripe size of the particular pool. so if you use a 128k strip size(defaut in many midrange storage controllers) in a 8+2p raid array, your stripe or track size is 1 MB and therefore the blocksize of this pool should be 1 MB. i see many customers in the field using 1MB or even smaller blocksize on RAID stripes of 2 MB or above and your performance will be significant impacted by that. Sven ------------------------------------------ Sven Oehme Scalable Storage Research email: oehmes at us.ibm.com Phone: +1 (408) 824-8904 IBM Almaden Research Lab ------------------------------------------ Stephen Ulmer ---09/23/2016 12:16:34 PM---Not to be too pedantic, but I believe the the subblock size is 1/32 of the block size (which strengt From: Stephen Ulmer To: gpfsug main discussion list Date: 09/23/2016 12:16 PM Subject: Re: [gpfsug-discuss] Blocksize Sent by: gpfsug-discuss-bounces at spectrumscale.org Not to be too pedantic, but I believe the the subblock size is 1/32 of the block size (which strengthens Luis?s arguments below). I thought the the original question was NOT about inode size, but about metadata block size. You can specify that the system pool have a different block size from the rest of the filesystem, providing that it ONLY holds metadata (?metadata-block-size option to mmcrfs). 
So with 4K inodes (which should be used for all new filesystems without some counter-indication), I would think that we?d want to use a metadata block size of 4K*32=128K. This is independent of the regular block size, which you can calculate based on the workload if you?re lucky. There could be a great reason NOT to use 128K metadata block size, but I don?t know what it is. I?d be happy to be corrected about this if it?s out of whack. -- Stephen On Sep 22, 2016, at 3:37 PM, Luis Bolinches < luis.bolinches at fi.ibm.com> wrote: Hi My 2 cents. Leave at least 4K inodes, then you get massive improvement on small files (less 3.5K minus whatever you use on xattr) About blocksize for data, unless you have actual data that suggest that you will actually benefit from smaller than 1MB block, leave there. GPFS uses sublocks where 1/16th of the BS can be allocated to different files, so the "waste" is much less than you think on 1MB and you get the throughput and less structures of much more data blocks. No warranty at all but I try to do this when the BS talk comes in: (might need some clean up it could not be last note but you get the idea) POSIX find . -type f -name '*' -exec ls -l {} \; > find_ls_files.out GPFS cd /usr/lpp/mmfs/samples/ilm gcc mmfindUtil_processOutputFile.c -o mmfindUtil_processOutputFile ./mmfind /gpfs/shared -ls -type f > find_ls_files.out CONVERT to CSV POSIX cat find_ls_files.out | awk '{print $5","}' > find_ls_files.out.csv GPFS cat find_ls_files.out | awk '{print $7","}' > find_ls_files.out.csv LOAD in octave FILESIZE = int32 (dlmread ("find_ls_files.out.csv", ",")); Clean the second column (OPTIONAL as the next clean up will do the same) FILESIZE(:,[2]) = []; If we are on 4K aligment we need to clean the files that go to inodes (WELL not exactly ... extended attributes! so maybe use a lower number!) FILESIZE(FILESIZE<=3584) =[]; If we are not we need to clean the 0 size files FILESIZE(FILESIZE==0) =[]; Median FILESIZEMEDIAN = int32 (median (FILESIZE)) Mean FILESIZEMEAN = int32 (mean (FILESIZE)) Variance int32 (var (FILESIZE)) iqr interquartile range, i.e., the difference between the upper and lower quartile, of the input data. int32 (iqr (FILESIZE)) Standard deviation For some FS with lots of files you might need a rather powerful machine to run the calculations on octave, I never hit anything could not manage on a 64GB RAM Power box. Most of the times it is enough with my laptop. -- Yst?v?llisin terveisin / Kind regards / Saludos cordiales / Salutations Luis Bolinches Lab Services http://www-03.ibm.com/systems/services/labservices/ IBM Laajalahdentie 23 (main Entrance) Helsinki, 00330 Finland Phone: +358 503112585 "If you continually give you will continually have." Anonymous ----- Original message ----- From: Stef Coene Sent by: gpfsug-discuss-bounces at spectrumscale.org To: gpfsug main discussion list < gpfsug-discuss at spectrumscale.org> Cc: Subject: Re: [gpfsug-discuss] Blocksize Date: Thu, Sep 22, 2016 10:30 PM On 09/22/2016 09:07 PM, J. Eric Wonderley wrote: > It defaults to 4k: > mmlsfs testbs8M -i > flag value description > ------------------- ------------------------ > ----------------------------------- > -i 4096 Inode size in bytes > > I think you can make as small as 512b. Gpfs will store very small > files in the inode. > > Typically you want your average file size to be your blocksize and your > filesystem has one blocksize and one inodesize. The files are not small, but around 20 MB on average. 
So I calculated with IBM that a 1 MB or 2 MB block size is best. But I'm not sure if it's better to use a smaller block size for the metadata. The file system is not that large (400 TB) and will hold backup data from CommVault. Stef _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss Ellei edell? ole toisin mainittu: / Unless stated otherwise above: Oy IBM Finland Ab PL 265, 00101 Helsinki, Finland Business ID, Y-tunnus: 0195876-3 Registered in Finland _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: graycol.gif Type: image/gif Size: 105 bytes Desc: not available URL: From ulmer at ulmer.org Mon Sep 26 20:01:56 2016 From: ulmer at ulmer.org (Stephen Ulmer) Date: Mon, 26 Sep 2016 15:01:56 -0400 Subject: [gpfsug-discuss] Blocksize In-Reply-To: References: <4890d1b0-2f12-f8b6-684d-c98ca2b71ab7@docum.org> <17781503-26B3-4448-B7B9-1EE27ABE6D1F@ulmer.org> Message-ID: <13272A1B-425D-4DDD-A931-490604F92D61@ulmer.org> Now I?ve got anther question? which I?ll let bake for a while. Okay, to (poorly) summarize: There are items OTHER THAN INODES stored as metadata in GPFS. These items have a VARIETY OF SIZES, but are packed in such a way that we should just not worry about wasted space unless we pick a LARGE metadata block size ? or if we don?t pick a ?reasonable? metadata block size after picking a ?large? file system block size that applies to both. Performance is hard, and the gain from calculating exactly the best metadata block size is much smaller than performance gains attained through code optimization. If we were to try and calculate the appropriate metadata block size we would likely be wrong anyway, since none of us get our data at the idealized physics shop that sells massless rulers and frictionless pulleys. We should probably all use a metadata block size around 1MB. Nobody has said this outright, but it?s been the example as the ?good? size at least three times in this thread. Under no circumstances should we do what many of us would have done and pick 128K, which made sense based on all of our previous education that is no longer applicable. Did I miss anything? :) Liberty, -- Stephen > On Sep 26, 2016, at 2:18 PM, Yuri L Volobuev wrote: > > It's important to understand the differences between different metadata types, in particular where it comes to space allocation. > > System metadata files (inode file, inode and block allocation maps, ACL file, fileset metadata file, EA file in older versions) are allocated at well-defined moments (file system format, new storage pool creation in the case of block allocation map, etc), and those contain multiple records packed into a single block. 
From the block allocator point of view, the individual metadata record size is invisible, only larger blocks get actually allocated, and space usage efficiency generally isn't an issue. > > For user metadata (indirect blocks, directory blocks, EA overflow blocks) the situation is different. Those get allocated as the need arises, generally one at a time. So the size of an individual metadata structure matters, a lot. The smallest unit of allocation in GPFS is a subblock (1/32nd of a block). If an IB or a directory block is smaller than a subblock, the unused space in the subblock is wasted. So if one chooses to use, say, 16 MiB block size for metadata, the smallest unit of space that can be allocated is 512 KiB. If one chooses 1 MiB block size, the smallest allocation unit is 32 KiB. IBs are generally 16 KiB or 32 KiB in size (32 KiB with any reasonable data block size); directory blocks used to be limited to 32 KiB, but in the current code can be as large as 256 KiB. As one can observe, using 16 MiB metadata block size would lead to a considerable amount of wasted space for IBs and large directories (small directories can live in inodes). On the other hand, with 1 MiB block size, there'll be no wasted metadata space. Does any of this actually make a practical difference? That depends on the file system composition, namely the number of IBs (which is a function of the number of large files) and larger directories. Calculating this scientifically can be pretty involved, and really should be the job of a tool that ought to exist, but doesn't (yet). A more practical approach is doing a ballpark estimate using local file counts and typical fractions of large files and directories, using statistics available from published papers. > > The performance implications of a given metadata block size choice is a subject of nearly infinite depth, and this question ultimately can only be answered by doing experiments with a specific workload on specific hardware. The metadata space utilization efficiency is something that can be answered conclusively though. > > yuri > > "Buterbaugh, Kevin L" ---09/24/2016 07:19:09 AM---Hi Sven, I am confused by your statement that the metadata block size should be 1 MB and am very int > > From: "Buterbaugh, Kevin L" > To: gpfsug main discussion list , > Date: 09/24/2016 07:19 AM > Subject: Re: [gpfsug-discuss] Blocksize > Sent by: gpfsug-discuss-bounces at spectrumscale.org > > > > > Hi Sven, > > I am confused by your statement that the metadata block size should be 1 MB and am very interested in learning the rationale behind this as I am currently looking at all aspects of our current GPFS configuration and the possibility of making major changes. > > If you have a filesystem with only metadataOnly disks in the system pool and the default size of an inode is 4K (which we would do, since we have recently discovered that even on our scratch filesystem we have a bazillion files that are 4K or smaller and could therefore have their data stored in the inode, right?), then why would you set the metadata block size to anything larger than 128K when a sub-block is 1/32nd of a block? I.e., with a 1 MB block size for metadata wouldn?t you be wasting a massive amount of space? > > What am I missing / confused about there? > > Oh, and here?s a related question ? let?s just say I have the above configuration ? my system pool is metadata only and is on SSD?s. Then I have two other dataOnly pools that are spinning disk. One is for ?regular? access and the other is the ?capacity? 
pool ? i.e. a pool of slower storage where we move files with large access times. I have a policy that says something like ?move all files with an access time > 6 months to the capacity pool.? Of those bazillion files less than 4K in size that are fitting in the inode currently, probably half a bazillion () of them would be subject to that rule. Will they get moved to the spinning disk capacity pool or will they stay in the inode?? > > Thanks! This is a very timely and interesting discussion for me as well... > > Kevin > On Sep 23, 2016, at 4:35 PM, Sven Oehme > wrote: > your metadata block size these days should be 1 MB and there are only very few workloads for which you should run with a filesystem blocksize below 1 MB. so if you don't know exactly what to pick, 1 MB is a good starting point. > the general rule still applies that your filesystem blocksize (metadata or data pool) should match your raid controller (or GNR vdisk) stripe size of the particular pool. > > so if you use a 128k strip size(defaut in many midrange storage controllers) in a 8+2p raid array, your stripe or track size is 1 MB and therefore the blocksize of this pool should be 1 MB. i see many customers in the field using 1MB or even smaller blocksize on RAID stripes of 2 MB or above and your performance will be significant impacted by that. > > Sven > > ------------------------------------------ > Sven Oehme > Scalable Storage Research > email: oehmes at us.ibm.com > Phone: +1 (408) 824-8904 > IBM Almaden Research Lab > ------------------------------------------ > > Stephen Ulmer ---09/23/2016 12:16:34 PM---Not to be too pedantic, but I believe the the subblock size is 1/32 of the block size (which strengt > > From: Stephen Ulmer > > To: gpfsug main discussion list > > Date: 09/23/2016 12:16 PM > Subject: Re: [gpfsug-discuss] Blocksize > Sent by: gpfsug-discuss-bounces at spectrumscale.org > > > > > Not to be too pedantic, but I believe the the subblock size is 1/32 of the block size (which strengthens Luis?s arguments below). > > I thought the the original question was NOT about inode size, but about metadata block size. You can specify that the system pool have a different block size from the rest of the filesystem, providing that it ONLY holds metadata (?metadata-block-size option to mmcrfs). > > So with 4K inodes (which should be used for all new filesystems without some counter-indication), I would think that we?d want to use a metadata block size of 4K*32=128K. This is independent of the regular block size, which you can calculate based on the workload if you?re lucky. > > There could be a great reason NOT to use 128K metadata block size, but I don?t know what it is. I?d be happy to be corrected about this if it?s out of whack. > > -- > Stephen > > On Sep 22, 2016, at 3:37 PM, Luis Bolinches > wrote: > > Hi > > My 2 cents. > > Leave at least 4K inodes, then you get massive improvement on small files (less 3.5K minus whatever you use on xattr) > > About blocksize for data, unless you have actual data that suggest that you will actually benefit from smaller than 1MB block, leave there. GPFS uses sublocks where 1/16th of the BS can be allocated to different files, so the "waste" is much less than you think on 1MB and you get the throughput and less structures of much more data blocks. > > No warranty at all but I try to do this when the BS talk comes in: (might need some clean up it could not be last note but you get the idea) > > POSIX > find . 
-type f -name '*' -exec ls -l {} \; > find_ls_files.out > GPFS > cd /usr/lpp/mmfs/samples/ilm > gcc mmfindUtil_processOutputFile.c -o mmfindUtil_processOutputFile > ./mmfind /gpfs/shared -ls -type f > find_ls_files.out > CONVERT to CSV > > POSIX > cat find_ls_files.out | awk '{print $5","}' > find_ls_files.out.csv > GPFS > cat find_ls_files.out | awk '{print $7","}' > find_ls_files.out.csv > LOAD in octave > > FILESIZE = int32 (dlmread ("find_ls_files.out.csv", ",")); > Clean the second column (OPTIONAL as the next clean up will do the same) > > FILESIZE(:,[2]) = []; > If we are on 4K aligment we need to clean the files that go to inodes (WELL not exactly ... extended attributes! so maybe use a lower number!) > > FILESIZE(FILESIZE<=3584) =[]; > If we are not we need to clean the 0 size files > > FILESIZE(FILESIZE==0) =[]; > Median > > FILESIZEMEDIAN = int32 (median (FILESIZE)) > Mean > > FILESIZEMEAN = int32 (mean (FILESIZE)) > Variance > > int32 (var (FILESIZE)) > iqr interquartile range, i.e., the difference between the upper and lower quartile, of the input data. > > int32 (iqr (FILESIZE)) > Standard deviation > > > For some FS with lots of files you might need a rather powerful machine to run the calculations on octave, I never hit anything could not manage on a 64GB RAM Power box. Most of the times it is enough with my laptop. > > > > -- > Yst?v?llisin terveisin / Kind regards / Saludos cordiales / Salutations > > Luis Bolinches > Lab Services > http://www-03.ibm.com/systems/services/labservices/ > > IBM Laajalahdentie 23 (main Entrance) Helsinki, 00330 Finland > Phone: +358 503112585 > > "If you continually give you will continually have." Anonymous > > > ----- Original message ----- > From: Stef Coene > > Sent by: gpfsug-discuss-bounces at spectrumscale.org > To: gpfsug main discussion list > > Cc: > Subject: Re: [gpfsug-discuss] Blocksize > Date: Thu, Sep 22, 2016 10:30 PM > > On 09/22/2016 09:07 PM, J. Eric Wonderley wrote: > > It defaults to 4k: > > mmlsfs testbs8M -i > > flag value description > > ------------------- ------------------------ > > ----------------------------------- > > -i 4096 Inode size in bytes > > > > I think you can make as small as 512b. Gpfs will store very small > > files in the inode. > > > > Typically you want your average file size to be your blocksize and your > > filesystem has one blocksize and one inodesize. > > The files are not small, but around 20 MB on average. > So I calculated with IBM that a 1 MB or 2 MB block size is best. > > But I'm not sure if it's better to use a smaller block size for the > metadata. > > The file system is not that large (400 TB) and will hold backup data > from CommVault. > > > Stef > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > > > Ellei edell? 
ole toisin mainittu: / Unless stated otherwise above: > Oy IBM Finland Ab > PL 265, 00101 Helsinki, Finland > Business ID, Y-tunnus: 0195876-3 > Registered in Finland > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss -------------- next part -------------- An HTML attachment was scrubbed... URL: From volobuev at us.ibm.com Mon Sep 26 20:29:18 2016 From: volobuev at us.ibm.com (Yuri L Volobuev) Date: Mon, 26 Sep 2016 12:29:18 -0700 Subject: [gpfsug-discuss] Blocksize In-Reply-To: <13272A1B-425D-4DDD-A931-490604F92D61@ulmer.org> References: <4890d1b0-2f12-f8b6-684d-c98ca2b71ab7@docum.org><17781503-26B3-4448-B7B9-1EE27ABE6D1F@ulmer.org> <13272A1B-425D-4DDD-A931-490604F92D61@ulmer.org> Message-ID: I would put the net summary this way: in GPFS, the "Goldilocks zone" for metadata block size is 256K - 1M. If one plans to create a new file system using GPFS V4.2+, 1M is a sound choice. In an ideal world, block size choice shouldn't really be a choice. It's a low-level implementation detail that one day should go the way of the manual ignition timing adjustment -- something that used to be necessary in the olden days, and something that select enthusiasts like to tweak to this day, but something that's irrelevant for the overwhelming majority of the folks who just want the engine to run. There's work being done in that general direction in GPFS, but we aren't there yet. yuri From: Stephen Ulmer To: gpfsug main discussion list , Date: 09/26/2016 12:02 PM Subject: Re: [gpfsug-discuss] Blocksize Sent by: gpfsug-discuss-bounces at spectrumscale.org Now I?ve got anther question? which I?ll let bake for a while. Okay, to (poorly) summarize: There are items OTHER THAN INODES stored as metadata in GPFS. These items have a VARIETY OF SIZES, but are packed in such a way that we should just not worry about wasted space unless we pick a LARGE metadata block size ? or if we don?t pick a ?reasonable? metadata block size after picking a ?large? file system block size that applies to both. Performance is hard, and the gain from calculating exactly the best metadata block size is much smaller than performance gains attained through code optimization. If we were to try and calculate the appropriate metadata block size we would likely be wrong anyway, since none of us get our data at the idealized physics shop that sells massless rulers and frictionless pulleys. We should probably all use a metadata block size around 1MB. Nobody has said this outright, but it?s been the example as the ?good? size at least three times in this thread. Under no circumstances should we do what many of us would have done and pick 128K, which made sense based on all of our previous education that is no longer applicable. Did I miss anything? 
:) Liberty, -- Stephen On Sep 26, 2016, at 2:18 PM, Yuri L Volobuev wrote: It's important to understand the differences between different metadata types, in particular where it comes to space allocation. System metadata files (inode file, inode and block allocation maps, ACL file, fileset metadata file, EA file in older versions) are allocated at well-defined moments (file system format, new storage pool creation in the case of block allocation map, etc), and those contain multiple records packed into a single block. From the block allocator point of view, the individual metadata record size is invisible, only larger blocks get actually allocated, and space usage efficiency generally isn't an issue. For user metadata (indirect blocks, directory blocks, EA overflow blocks) the situation is different. Those get allocated as the need arises, generally one at a time. So the size of an individual metadata structure matters, a lot. The smallest unit of allocation in GPFS is a subblock (1/32nd of a block). If an IB or a directory block is smaller than a subblock, the unused space in the subblock is wasted. So if one chooses to use, say, 16 MiB block size for metadata, the smallest unit of space that can be allocated is 512 KiB. If one chooses 1 MiB block size, the smallest allocation unit is 32 KiB. IBs are generally 16 KiB or 32 KiB in size (32 KiB with any reasonable data block size); directory blocks used to be limited to 32 KiB, but in the current code can be as large as 256 KiB. As one can observe, using 16 MiB metadata block size would lead to a considerable amount of wasted space for IBs and large directories (small directories can live in inodes). On the other hand, with 1 MiB block size, there'll be no wasted metadata space. Does any of this actually make a practical difference? That depends on the file system composition, namely the number of IBs (which is a function of the number of large files) and larger directories. Calculating this scientifically can be pretty involved, and really should be the job of a tool that ought to exist, but doesn't (yet). A more practical approach is doing a ballpark estimate using local file counts and typical fractions of large files and directories, using statistics available from published papers. The performance implications of a given metadata block size choice is a subject of nearly infinite depth, and this question ultimately can only be answered by doing experiments with a specific workload on specific hardware. The metadata space utilization efficiency is something that can be answered conclusively though. yuri "Buterbaugh, Kevin L" ---09/24/2016 07:19:09 AM---Hi Sven, I am confused by your statement that the metadata block size should be 1 MB and am very int From: "Buterbaugh, Kevin L" To: gpfsug main discussion list , Date: 09/24/2016 07:19 AM Subject: Re: [gpfsug-discuss] Blocksize Sent by: gpfsug-discuss-bounces at spectrumscale.org Hi Sven, I am confused by your statement that the metadata block size should be 1 MB and am very interested in learning the rationale behind this as I am currently looking at all aspects of our current GPFS configuration and the possibility of making major changes. 
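Pulling the guidance in this thread together into one hedged sketch -- a 1 MiB "Goldilocks" metadata block size plus 4 KiB inodes -- an mmcrfs invocation might look roughly like the lines below. The file system name, stanza file and mount point are placeholders, and the option list should be checked against the mmcrfs documentation for your release:

# Sketch only. nsd.stanza is assumed to put metadataOnly NSDs in the system pool
# (see the stanza example later in the thread), which is what allows a separate
# --metadata-block-size. -B is the data block size, often matched to the RAID
# full-stripe size; -i 4096 gives 4K inodes so small files and xattrs can live
# in the inode; -m/-r set the default metadata/data replica counts (site choice).
mmcrfs gpfs1 -F /tmp/nsd.stanza -B 2M --metadata-block-size 1M -i 4096 -m 2 -r 2 -T /gpfs/gpfs1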
If you have a filesystem with only metadataOnly disks in the system pool and the default size of an inode is 4K (which we would do, since we have recently discovered that even on our scratch filesystem we have a bazillion files that are 4K or smaller and could therefore have their data stored in the inode, right?), then why would you set the metadata block size to anything larger than 128K when a sub-block is 1/32nd of a block? I.e., with a 1 MB block size for metadata wouldn?t you be wasting a massive amount of space? What am I missing / confused about there? Oh, and here?s a related question ? let?s just say I have the above configuration ? my system pool is metadata only and is on SSD?s. Then I have two other dataOnly pools that are spinning disk. One is for ?regular? access and the other is the ?capacity? pool ? i.e. a pool of slower storage where we move files with large access times. I have a policy that says something like ?move all files with an access time > 6 months to the capacity pool.? Of those bazillion files less than 4K in size that are fitting in the inode currently, probably half a bazillion () of them would be subject to that rule. Will they get moved to the spinning disk capacity pool or will they stay in the inode?? Thanks! This is a very timely and interesting discussion for me as well... Kevin On Sep 23, 2016, at 4:35 PM, Sven Oehme < oehmes at us.ibm.com> wrote: your metadata block size these days should be 1 MB and there are only very few workloads for which you should run with a filesystem blocksize below 1 MB. so if you don't know exactly what to pick, 1 MB is a good starting point. the general rule still applies that your filesystem blocksize (metadata or data pool) should match your raid controller (or GNR vdisk) stripe size of the particular pool. so if you use a 128k strip size(defaut in many midrange storage controllers) in a 8+2p raid array, your stripe or track size is 1 MB and therefore the blocksize of this pool should be 1 MB. i see many customers in the field using 1MB or even smaller blocksize on RAID stripes of 2 MB or above and your performance will be significant impacted by that. Sven ------------------------------------------ Sven Oehme Scalable Storage Research email: oehmes at us.ibm.com Phone: +1 (408) 824-8904 IBM Almaden Research Lab ------------------------------------------ Stephen Ulmer ---09/23/2016 12:16:34 PM---Not to be too pedantic, but I believe the the subblock size is 1/32 of the block size (which strengt From: Stephen Ulmer To: gpfsug main discussion list < gpfsug-discuss at spectrumscale.org> Date: 09/23/2016 12:16 PM Subject: Re: [gpfsug-discuss] Blocksize Sent by: gpfsug-discuss-bounces at spectrumscale.org Not to be too pedantic, but I believe the the subblock size is 1/32 of the block size (which strengthens Luis?s arguments below). I thought the the original question was NOT about inode size, but about metadata block size. You can specify that the system pool have a different block size from the rest of the filesystem, providing that it ONLY holds metadata (?metadata-block-size option to mmcrfs). So with 4K inodes (which should be used for all new filesystems without some counter-indication), I would think that we?d want to use a metadata block size of 4K*32=128K. This is independent of the regular block size, which you can calculate based on the workload if you?re lucky. There could be a great reason NOT to use 128K metadata block size, but I don?t know what it is. 
I?d be happy to be corrected about this if it?s out of whack. -- Stephen On Sep 22, 2016, at 3:37 PM, Luis Bolinches < luis.bolinches at fi.ibm.com> wrote: Hi My 2 cents. Leave at least 4K inodes, then you get massive improvement on small files (less 3.5K minus whatever you use on xattr) About blocksize for data, unless you have actual data that suggest that you will actually benefit from smaller than 1MB block, leave there. GPFS uses sublocks where 1/16th of the BS can be allocated to different files, so the "waste" is much less than you think on 1MB and you get the throughput and less structures of much more data blocks. No warranty at all but I try to do this when the BS talk comes in: (might need some clean up it could not be last note but you get the idea) POSIX find . -type f -name '*' -exec ls -l {} \; > find_ls_files.out GPFS cd /usr/lpp/mmfs/samples/ilm gcc mmfindUtil_processOutputFile.c -o mmfindUtil_processOutputFile ./mmfind /gpfs/shared -ls -type f > find_ls_files.out CONVERT to CSV POSIX cat find_ls_files.out | awk '{print $5","}' > find_ls_files.out.csv GPFS cat find_ls_files.out | awk '{print $7","}' > find_ls_files.out.csv LOAD in octave FILESIZE = int32 (dlmread ("find_ls_files.out.csv", ",")); Clean the second column (OPTIONAL as the next clean up will do the same) FILESIZE(:,[2]) = []; If we are on 4K aligment we need to clean the files that go to inodes (WELL not exactly ... extended attributes! so maybe use a lower number!) FILESIZE(FILESIZE<=3584) =[]; If we are not we need to clean the 0 size files FILESIZE(FILESIZE==0) =[]; Median FILESIZEMEDIAN = int32 (median (FILESIZE)) Mean FILESIZEMEAN = int32 (mean (FILESIZE)) Variance int32 (var (FILESIZE)) iqr interquartile range, i.e., the difference between the upper and lower quartile, of the input data. int32 (iqr (FILESIZE)) Standard deviation For some FS with lots of files you might need a rather powerful machine to run the calculations on octave, I never hit anything could not manage on a 64GB RAM Power box. Most of the times it is enough with my laptop. -- Yst?v?llisin terveisin / Kind regards / Saludos cordiales / Salutations Luis Bolinches Lab Services http://www-03.ibm.com/systems/services/labservices/ IBM Laajalahdentie 23 (main Entrance) Helsinki, 00330 Finland Phone: +358 503112585 "If you continually give you will continually have." Anonymous ----- Original message ----- From: Stef Coene < stef.coene at docum.org> Sent by: gpfsug-discuss-bounces at spectrumscale.org To: gpfsug main discussion list < gpfsug-discuss at spectrumscale.org> Cc: Subject: Re: [gpfsug-discuss] Blocksize Date: Thu, Sep 22, 2016 10:30 PM On 09/22/2016 09:07 PM, J. Eric Wonderley wrote: > It defaults to 4k: > mmlsfs testbs8M -i > flag value description > ------------------- ------------------------ > ----------------------------------- > -i 4096 Inode size in bytes > > I think you can make as small as 512b. Gpfs will store very small > files in the inode. > > Typically you want your average file size to be your blocksize and your > filesystem has one blocksize and one inodesize. The files are not small, but around 20 MB on average. So I calculated with IBM that a 1 MB or 2 MB block size is best. But I'm not sure if it's better to use a smaller block size for the metadata. The file system is not that large (400 TB) and will hold backup data from CommVault. Stef _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss Ellei edell? 
ole toisin mainittu: / Unless stated otherwise above: Oy IBM Finland Ab PL 265, 00101 Helsinki, Finland Business ID, Y-tunnus: 0195876-3 Registered in Finland _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: graycol.gif Type: image/gif Size: 105 bytes Desc: not available URL: From alandhae at gmx.de Tue Sep 27 10:04:02 2016 From: alandhae at gmx.de (=?ISO-8859-15?Q?Andreas_Landh=E4u=DFer?=) Date: Tue, 27 Sep 2016 11:04:02 +0200 (CEST) Subject: [gpfsug-discuss] File_heat for GPFS File Systems In-Reply-To: References: Message-ID: On Mon, 26 Sep 2016, Marc A Kaplan wrote: Marc, thanks for your explanation, > fileHeatLossPercent=10, fileHeatPeriodMinutes=1440 > > means any file that has not been accessed for 1440 minutes (24 hours = 1 > day) will lose 10% of its Heat. > > So if it's heat was X at noon today, tomorrow 0.90 X, the next day 0.81X, > on the k'th day (.90)**k * X. > After 63 fileHeatPeriods, we always round down and compute file heat as > 0.0. > > The computation (in floating point with some approximations) is done "on > demand" based on a heat value stored in the Inode the last time the unix > access "atime" and the current time. So the cost of maintaining > FILE_HEAT for a file is some bit twiddling, but only when the file is > accessed and the atime would be updated in the inode anyway. > > File heat increases by approximately 1.0 each time the entire file is read > from disk. This is done proportionately so if you read in half of the > blocks the increase is 0.5. > If you read all the blocks twice FROM DISK the file heat is increased by > 2. And so on. But only IOPs are charged. If you repeatedly do posix > read()s but the data is in cache, no heat is added. with the above definition file heat >= 0.0 e.g. any positive floating point value is valid. I need to categorize the files into categories hot, warm, lukewarm and cold. How do I achieve this, since the maximum heat is varying and need to be defined every time when requesting the report. 
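One pragmatic way to turn the open-ended FILE_HEAT value into hot/warm/lukewarm/cold buckets is a set of LIST rules with hand-picked cut-offs, run report-only in the same way as the fileheatrule example quoted a little further on. This is only a hedged sketch -- the numeric thresholds and paths are invented and would need tuning against the heat values actually observed on the file system:

# File heat must already be enabled, e.g. (parameter names as in Marc's note above):
#   mmchconfig fileHeatPeriodMinutes=1440,fileHeatLossPercent=10
cat > /tmp/heat-buckets.policy <<'EOF'
/* sketch: the cut-offs below are arbitrary examples, not recommendations */
RULE 'hot'      LIST 'hot'      SHOW('heat=' || varchar(FILE_HEAT)) WHERE FILE_HEAT >= 5.0
RULE 'warm'     LIST 'warm'     SHOW('heat=' || varchar(FILE_HEAT)) WHERE FILE_HEAT >= 1.0 AND FILE_HEAT < 5.0
RULE 'lukewarm' LIST 'lukewarm' SHOW('heat=' || varchar(FILE_HEAT)) WHERE FILE_HEAT >= 0.1 AND FILE_HEAT < 1.0
RULE 'cold'     LIST 'cold'     SHOW('heat=' || varchar(FILE_HEAT)) WHERE FILE_HEAT <  0.1
EOF
mmapplypolicy /gpfs/shared -P /tmp/heat-buckets.policy -I test -L 2

Because the absolute heat numbers drift as files cool, the alternative is to skip fixed cut-offs entirely and simply order files by WEIGHT(FILE_HEAT) (or -FILE_HEAT), which is what the THRESHOLD and GROUP POOL migration rules in the replies that follow do.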
We are wishing to migrate data according to the heat onto different storage categories (expensive --> cheap devices) > The easiest way to observe FILE_HEAT is with the mmapplypolicy directory > -I test -L 2 -P fileheatrule.policy > > RULE 'fileheatrule' LIST 'hot' SHOW('Heat=' || varchar(FILE_HEAT)) /* in > file fileheatfule.policy */ > > Because policy reads metadata from inodes as stored on disk, when > experimenting/testing you may need to > > mmfsctl fs suspend-write; mmfsctl fs resume Doing this on a production file system, a valid change request need to be filed, and description of the risks for customers data and so on have to be defined (ITIL) ... Any help and ideas will be appreciated Andreas -- Andreas Landh?u?er +49 151 12133027 (mobile) alandhae at gmx.de From makaplan at us.ibm.com Tue Sep 27 15:25:04 2016 From: makaplan at us.ibm.com (Marc A Kaplan) Date: Tue, 27 Sep 2016 10:25:04 -0400 Subject: [gpfsug-discuss] File_heat for GPFS File Systems In-Reply-To: References: Message-ID: You asked ... "We are wishing to migrate data according to the heat onto different storage categories (expensive --> cheap devices)" We suggest a policy rule like this: Rule 'm' Migrate From Pool 'Expensive' To Pool 'Thrifty' Threshold(90,75) Weight(-FILE_HEAT) /* minus sign! */ Which you can interpret as: When The 'Expensive' pool is 90% or more full, Migrate the lowest heat (coldest!) files to pool 'Thrifty', until the occupancy of 'Expensive' has been reduced to 75%. The concepts of Threshold and Weight have been in the produce since the MIGRATE rule was introduced. Another concept we introduced at the same time as FILE_HEAT was GROUP POOL. We've had little feedback and very few questions about this, so either it works great or is not being used much. (Maybe both are true ;-) ) GROUP POOL migration is documented in the Information Lifecycle Management chapter along with the other elements of the policy rules. In the 4.2.1 doc we suggest you can "repack" several pools with one GROUP POOL rule and one MIGRATE rule like this: You can ?repack? a group pool by WEIGHT. Migrate files of higher weight to preferred disk pools by specifying a group pool as both the source and the target of a MIGRATE rule. rule ?grpdef? GROUP POOL ?gpool? IS ?ssd? LIMIT(90) THEN ?fast? LIMIT(85) THEN ?sata? rule ?repack? MIGRATE FROM POOL ?gpool? TO POOL ?gpool? WEIGHT(FILE_HEAT) This should rank all the files in the three pools from hottest to coldest, and migrate them as necessary (if feasible) so that 'ssd' is up to 90% full of the hottest, 'fast' is up to 85% full of the next most hot, and the coolest files will be migrated to 'sata'. -------------- next part -------------- An HTML attachment was scrubbed... URL: From Kevin.Buterbaugh at Vanderbilt.Edu Tue Sep 27 18:02:45 2016 From: Kevin.Buterbaugh at Vanderbilt.Edu (Buterbaugh, Kevin L) Date: Tue, 27 Sep 2016 17:02:45 +0000 Subject: [gpfsug-discuss] Blocksize In-Reply-To: References: <4890d1b0-2f12-f8b6-684d-c98ca2b71ab7@docum.org> <17781503-26B3-4448-B7B9-1EE27ABE6D1F@ulmer.org> <13272A1B-425D-4DDD-A931-490604F92D61@ulmer.org> Message-ID: <6C51B04A-9097-4598-8B4A-C484A3D98EE2@vanderbilt.edu> Yuri / Sven / anyone else who wants to jump in, First off, thank you very much for your answers. I?d like to follow up with a couple of more questions. 1) Let?s assume that our overarching goal in configuring the block size for metadata is performance from the user perspective ? i.e. how fast is an ?ls -l? on my directory? 
Space savings aren?t important, and how long policy scans or other ?administrative? type tasks take is not nearly as important as that directory listing. Does that change the recommended metadata block size? 2) Let?s assume we have 3 filesystems, /home, /scratch (traditional HPC use for those two) and /data (project space). Our storage arrays are 24-bay units with two 8+2P RAID 6 LUNs, one RAID 1 mirror, and two hot spare drives. The RAID 1 mirrors are for /home, the RAID 6 LUNs are for /scratch or /data. /home has tons of small files - so small that a 64K block size is currently used. /scratch and /data have a mixture, but a 1 MB block size is the ?sweet spot? there. If you could ?start all over? with the same hardware being the only restriction, would you: a) merge /scratch and /data into one filesystem but keep /home separate since the LUN sizes are so very different, or b) merge all three into one filesystem and use storage pools so that /home is just a separate pool within the one filesystem? And if you chose this option would you assign different block sizes to the pools? Again, I?m asking these questions because I may have the opportunity to effectively ?start all over? and want to make sure I?m doing things as optimally as possible. Thanks? Kevin On Sep 26, 2016, at 2:29 PM, Yuri L Volobuev > wrote: I would put the net summary this way: in GPFS, the "Goldilocks zone" for metadata block size is 256K - 1M. If one plans to create a new file system using GPFS V4.2+, 1M is a sound choice. In an ideal world, block size choice shouldn't really be a choice. It's a low-level implementation detail that one day should go the way of the manual ignition timing adjustment -- something that used to be necessary in the olden days, and something that select enthusiasts like to tweak to this day, but something that's irrelevant for the overwhelming majority of the folks who just want the engine to run. There's work being done in that general direction in GPFS, but we aren't there yet. yuri Stephen Ulmer ---09/26/2016 12:02:25 PM---Now I?ve got anther question? which I?ll let bake for a while. Okay, to (poorly) summarize: From: Stephen Ulmer > To: gpfsug main discussion list >, Date: 09/26/2016 12:02 PM Subject: Re: [gpfsug-discuss] Blocksize Sent by: gpfsug-discuss-bounces at spectrumscale.org ________________________________ Now I?ve got anther question? which I?ll let bake for a while. Okay, to (poorly) summarize: * There are items OTHER THAN INODES stored as metadata in GPFS. * These items have a VARIETY OF SIZES, but are packed in such a way that we should just not worry about wasted space unless we pick a LARGE metadata block size ? or if we don?t pick a ?reasonable? metadata block size after picking a ?large? file system block size that applies to both. * Performance is hard, and the gain from calculating exactly the best metadata block size is much smaller than performance gains attained through code optimization. * If we were to try and calculate the appropriate metadata block size we would likely be wrong anyway, since none of us get our data at the idealized physics shop that sells massless rulers and frictionless pulleys. * We should probably all use a metadata block size around 1MB. Nobody has said this outright, but it?s been the example as the ?good? size at least three times in this thread. * Under no circumstances should we do what many of us would have done and pick 128K, which made sense based on all of our previous education that is no longer applicable. Did I miss anything? 
:) Liberty, -- Stephen On Sep 26, 2016, at 2:18 PM, Yuri L Volobuev > wrote: It's important to understand the differences between different metadata types, in particular where it comes to space allocation. System metadata files (inode file, inode and block allocation maps, ACL file, fileset metadata file, EA file in older versions) are allocated at well-defined moments (file system format, new storage pool creation in the case of block allocation map, etc), and those contain multiple records packed into a single block. From the block allocator point of view, the individual metadata record size is invisible, only larger blocks get actually allocated, and space usage efficiency generally isn't an issue. For user metadata (indirect blocks, directory blocks, EA overflow blocks) the situation is different. Those get allocated as the need arises, generally one at a time. So the size of an individual metadata structure matters, a lot. The smallest unit of allocation in GPFS is a subblock (1/32nd of a block). If an IB or a directory block is smaller than a subblock, the unused space in the subblock is wasted. So if one chooses to use, say, 16 MiB block size for metadata, the smallest unit of space that can be allocated is 512 KiB. If one chooses 1 MiB block size, the smallest allocation unit is 32 KiB. IBs are generally 16 KiB or 32 KiB in size (32 KiB with any reasonable data block size); directory blocks used to be limited to 32 KiB, but in the current code can be as large as 256 KiB. As one can observe, using 16 MiB metadata block size would lead to a considerable amount of wasted space for IBs and large directories (small directories can live in inodes). On the other hand, with 1 MiB block size, there'll be no wasted metadata space. Does any of this actually make a practical difference? That depends on the file system composition, namely the number of IBs (which is a function of the number of large files) and larger directories. Calculating this scientifically can be pretty involved, and really should be the job of a tool that ought to exist, but doesn't (yet). A more practical approach is doing a ballpark estimate using local file counts and typical fractions of large files and directories, using statistics available from published papers. The performance implications of a given metadata block size choice is a subject of nearly infinite depth, and this question ultimately can only be answered by doing experiments with a specific workload on specific hardware. The metadata space utilization efficiency is something that can be answered conclusively though. yuri "Buterbaugh, Kevin L" ---09/24/2016 07:19:09 AM---Hi Sven, I am confused by your statement that the metadata block size should be 1 MB and am very int From: "Buterbaugh, Kevin L" > To: gpfsug main discussion list >, Date: 09/24/2016 07:19 AM Subject: Re: [gpfsug-discuss] Blocksize Sent by: gpfsug-discuss-bounces at spectrumscale.org ________________________________ Hi Sven, I am confused by your statement that the metadata block size should be 1 MB and am very interested in learning the rationale behind this as I am currently looking at all aspects of our current GPFS configuration and the possibility of making major changes. 
If you have a filesystem with only metadataOnly disks in the system pool and the default size of an inode is 4K (which we would do, since we have recently discovered that even on our scratch filesystem we have a bazillion files that are 4K or smaller and could therefore have their data stored in the inode, right?), then why would you set the metadata block size to anything larger than 128K when a sub-block is 1/32nd of a block? I.e., with a 1 MB block size for metadata wouldn?t you be wasting a massive amount of space? What am I missing / confused about there? Oh, and here?s a related question ? let?s just say I have the above configuration ? my system pool is metadata only and is on SSD?s. Then I have two other dataOnly pools that are spinning disk. One is for ?regular? access and the other is the ?capacity? pool ? i.e. a pool of slower storage where we move files with large access times. I have a policy that says something like ?move all files with an access time > 6 months to the capacity pool.? Of those bazillion files less than 4K in size that are fitting in the inode currently, probably half a bazillion () of them would be subject to that rule. Will they get moved to the spinning disk capacity pool or will they stay in the inode?? Thanks! This is a very timely and interesting discussion for me as well... Kevin On Sep 23, 2016, at 4:35 PM, Sven Oehme > wrote: your metadata block size these days should be 1 MB and there are only very few workloads for which you should run with a filesystem blocksize below 1 MB. so if you don't know exactly what to pick, 1 MB is a good starting point. the general rule still applies that your filesystem blocksize (metadata or data pool) should match your raid controller (or GNR vdisk) stripe size of the particular pool. so if you use a 128k strip size(defaut in many midrange storage controllers) in a 8+2p raid array, your stripe or track size is 1 MB and therefore the blocksize of this pool should be 1 MB. i see many customers in the field using 1MB or even smaller blocksize on RAID stripes of 2 MB or above and your performance will be significant impacted by that. Sven ------------------------------------------ Sven Oehme Scalable Storage Research email: oehmes at us.ibm.com Phone: +1 (408) 824-8904 IBM Almaden Research Lab ------------------------------------------ Stephen Ulmer ---09/23/2016 12:16:34 PM---Not to be too pedantic, but I believe the the subblock size is 1/32 of the block size (which strengt From: Stephen Ulmer > To: gpfsug main discussion list > Date: 09/23/2016 12:16 PM Subject: Re: [gpfsug-discuss] Blocksize Sent by: gpfsug-discuss-bounces at spectrumscale.org ________________________________ Not to be too pedantic, but I believe the the subblock size is 1/32 of the block size (which strengthens Luis?s arguments below). I thought the the original question was NOT about inode size, but about metadata block size. You can specify that the system pool have a different block size from the rest of the filesystem, providing that it ONLY holds metadata (?metadata-block-size option to mmcrfs). So with 4K inodes (which should be used for all new filesystems without some counter-indication), I would think that we?d want to use a metadata block size of 4K*32=128K. This is independent of the regular block size, which you can calculate based on the workload if you?re lucky. There could be a great reason NOT to use 128K metadata block size, but I don?t know what it is. I?d be happy to be corrected about this if it?s out of whack. 
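To spell out the RAID-stripe arithmetic behind Sven's rule of thumb quoted above:

# segment ("strip") size per data disk : 128 KiB  (a common midrange controller default)
# data disks in an 8+2P RAID-6 LUN     : 8
# full stripe = 8 x 128 KiB            = 1 MiB  -> a 1 MiB file system block fills
#                                                  exactly one stripe
# a 2 MiB stripe (say 256 KiB strips x 8) would by the same logic suggest a 2 MiB block size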
-- Stephen On Sep 22, 2016, at 3:37 PM, Luis Bolinches > wrote: Hi My 2 cents. Leave at least 4K inodes, then you get massive improvement on small files (less 3.5K minus whatever you use on xattr) About blocksize for data, unless you have actual data that suggest that you will actually benefit from smaller than 1MB block, leave there. GPFS uses sublocks where 1/16th of the BS can be allocated to different files, so the "waste" is much less than you think on 1MB and you get the throughput and less structures of much more data blocks. No warranty at all but I try to do this when the BS talk comes in: (might need some clean up it could not be last note but you get the idea) POSIX find . -type f -name '*' -exec ls -l {} \; > find_ls_files.out GPFS cd /usr/lpp/mmfs/samples/ilm gcc mmfindUtil_processOutputFile.c -o mmfindUtil_processOutputFile ./mmfind /gpfs/shared -ls -type f > find_ls_files.out CONVERT to CSV POSIX cat find_ls_files.out | awk '{print $5","}' > find_ls_files.out.csv GPFS cat find_ls_files.out | awk '{print $7","}' > find_ls_files.out.csv LOAD in octave FILESIZE = int32 (dlmread ("find_ls_files.out.csv", ",")); Clean the second column (OPTIONAL as the next clean up will do the same) FILESIZE(:,[2]) = []; If we are on 4K aligment we need to clean the files that go to inodes (WELL not exactly ... extended attributes! so maybe use a lower number!) FILESIZE(FILESIZE<=3584) =[]; If we are not we need to clean the 0 size files FILESIZE(FILESIZE==0) =[]; Median FILESIZEMEDIAN = int32 (median (FILESIZE)) Mean FILESIZEMEAN = int32 (mean (FILESIZE)) Variance int32 (var (FILESIZE)) iqr interquartile range, i.e., the difference between the upper and lower quartile, of the input data. int32 (iqr (FILESIZE)) Standard deviation For some FS with lots of files you might need a rather powerful machine to run the calculations on octave, I never hit anything could not manage on a 64GB RAM Power box. Most of the times it is enough with my laptop. -- Yst?v?llisin terveisin / Kind regards / Saludos cordiales / Salutations Luis Bolinches Lab Services http://www-03.ibm.com/systems/services/labservices/ IBM Laajalahdentie 23 (main Entrance) Helsinki, 00330 Finland Phone: +358 503112585 "If you continually give you will continually have." Anonymous ----- Original message ----- From: Stef Coene > Sent by: gpfsug-discuss-bounces at spectrumscale.org To: gpfsug main discussion list > Cc: Subject: Re: [gpfsug-discuss] Blocksize Date: Thu, Sep 22, 2016 10:30 PM On 09/22/2016 09:07 PM, J. Eric Wonderley wrote: > It defaults to 4k: > mmlsfs testbs8M -i > flag value description > ------------------- ------------------------ > ----------------------------------- > -i 4096 Inode size in bytes > > I think you can make as small as 512b. Gpfs will store very small > files in the inode. > > Typically you want your average file size to be your blocksize and your > filesystem has one blocksize and one inodesize. The files are not small, but around 20 MB on average. So I calculated with IBM that a 1 MB or 2 MB block size is best. But I'm not sure if it's better to use a smaller block size for the metadata. The file system is not that large (400 TB) and will hold backup data from CommVault. Stef _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss Ellei edell? 
ole toisin mainittu: / Unless stated otherwise above: Oy IBM Finland Ab PL 265, 00101 Helsinki, Finland Business ID, Y-tunnus: 0195876-3 Registered in Finland _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss -------------- next part -------------- An HTML attachment was scrubbed... URL: From makaplan at us.ibm.com Tue Sep 27 18:16:52 2016 From: makaplan at us.ibm.com (Marc A Kaplan) Date: Tue, 27 Sep 2016 13:16:52 -0400 Subject: [gpfsug-discuss] Blocksize, yea, inode size! In-Reply-To: <6C51B04A-9097-4598-8B4A-C484A3D98EE2@vanderbilt.edu> References: <4890d1b0-2f12-f8b6-684d-c98ca2b71ab7@docum.org><17781503-26B3-4448-B7B9-1EE27ABE6D1F@ulmer.org><13272A1B-425D-4DDD-A931-490604F92D61@ulmer.org> <6C51B04A-9097-4598-8B4A-C484A3D98EE2@vanderbilt.edu> Message-ID: inode size will be a crucial choice in the scenario you describe. Consider the conflict: A large inode can hold a complete file or a complete directory. But the bigger the inode size, the less that fit in any given block size -- so when you have to read several inodes ... more IO, less likely that inodes you want are in the same block. -------------- next part -------------- An HTML attachment was scrubbed... URL: From chekh at stanford.edu Tue Sep 27 18:23:34 2016 From: chekh at stanford.edu (Alex Chekholko) Date: Tue, 27 Sep 2016 10:23:34 -0700 Subject: [gpfsug-discuss] Blocksize In-Reply-To: <6C51B04A-9097-4598-8B4A-C484A3D98EE2@vanderbilt.edu> References: <4890d1b0-2f12-f8b6-684d-c98ca2b71ab7@docum.org> <17781503-26B3-4448-B7B9-1EE27ABE6D1F@ulmer.org> <13272A1B-425D-4DDD-A931-490604F92D61@ulmer.org> <6C51B04A-9097-4598-8B4A-C484A3D98EE2@vanderbilt.edu> Message-ID: On 09/27/2016 10:02 AM, Buterbaugh, Kevin L wrote: > 1) Let?s assume that our overarching goal in configuring the block size > for metadata is performance from the user perspective ? i.e. how fast is > an ?ls -l? on my directory? Space savings aren?t important, and how > long policy scans or other ?administrative? type tasks take is not > nearly as important as that directory listing. Does that change the > recommended metadata block size? You need to put your metadata on SSDs. Make your SSDs the only members in your 'system' pool and put your other devices into another pool, and make that pool 'dataOnly'. 
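A minimal sketch of how that metadata/data split is usually expressed in an NSD stanza file; every NSD, device and server name below is a placeholder, and failure groups need to follow the actual site layout:

# nsd.stanza -- hypothetical devices and servers
%nsd: nsd=md_ssd_01 device=/dev/mapper/ssd01 servers=nsd01,nsd02 usage=metadataOnly failureGroup=1 pool=system
%nsd: nsd=md_ssd_02 device=/dev/mapper/ssd02 servers=nsd02,nsd01 usage=metadataOnly failureGroup=2 pool=system
%nsd: nsd=data_01   device=/dev/mapper/lun01 servers=nsd01,nsd02 usage=dataOnly     failureGroup=1 pool=data
%nsd: nsd=data_02   device=/dev/mapper/lun02 servers=nsd02,nsd01 usage=dataOnly     failureGroup=2 pool=data

mmcrnsd -F nsd.stanza
# the same stanza file is then passed to mmcrfs -F for a new file system,
# or to mmadddisk for an existing one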
If your SSDs are large enough to also hold some data, that's great; I typically do a migration policy to copy files smaller than filesystem block size (or definitely smaller than sub-block size) to the SSDs. Also, files smaller than 4k will usually fit into the inode (if you are using the 4k inode size). I have a system where the SSDs are regularly doing 6-7k IOPS for metadata stuff. If those same 7k IOPS were spread out over the slow data LUNs... which only have like 100 IOPS per 8+2P LUN... I'd be consuming 700 disks just for metadata IOPS. -- Alex Chekholko chekh at stanford.edu From kevindjo at us.ibm.com Tue Sep 27 18:33:29 2016 From: kevindjo at us.ibm.com (Kevin D Johnson) Date: Tue, 27 Sep 2016 17:33:29 +0000 Subject: [gpfsug-discuss] Blocksize In-Reply-To: <6C51B04A-9097-4598-8B4A-C484A3D98EE2@vanderbilt.edu> Message-ID: An HTML attachment was scrubbed... URL: From alandhae at gmx.de Tue Sep 27 19:04:06 2016 From: alandhae at gmx.de (=?UTF-8?Q?Andreas_Landh=c3=a4u=c3=9fer?=) Date: Tue, 27 Sep 2016 20:04:06 +0200 Subject: [gpfsug-discuss] File_heat for GPFS File Systems In-Reply-To: References: Message-ID: as far as I understand, if a file gets hot again, there is no rule for putting the file back into a faster storage device? We would like having something like a storage elevator depending on the fileheat. In our setup, customer likes to migrate/move data even when the the threshold is not hit, just because it's cold and the price of the storage is less. On 27.09.2016 16:25, Marc A Kaplan wrote: > > You asked ... "We are wishing to migrate data according to the heat > onto different > storage categories (expensive --> cheap devices)" > > > We suggest a policy rule like this: > > Rule 'm' Migrate From Pool 'Expensive' To Pool 'Thrifty' > Threshold(90,75) Weight(-FILE_HEAT) /* minus sign! */ > > > Which you can interpret as: > > When The 'Expensive' pool is 90% or more full, Migrate the lowest heat > (coldest!) files to pool 'Thrifty', until > the occupancy of 'Expensive' has been reduced to 75%. > > The concepts of Threshold and Weight have been in the produce since > the MIGRATE rule was introduced. > > Another concept we introduced at the same time as FILE_HEAT was GROUP > POOL. We've had little feedback and very > few questions about this, so either it works great or is not being > used much. (Maybe both are true ;-) ) > > GROUP POOL migration is documented in the Information Lifecycle > Management chapter along with the other elements of the policy rules. > > In the 4.2.1 doc we suggest you can "repack" several pools with one > GROUP POOL rule and one MIGRATE rule like this: > > You can ?repack? a group pool by *WEIGHT*. Migrate files of higher > weight to preferred disk pools > by specifying a group pool as both the source and the target of a > *MIGRATE *rule. > > rule ?grpdef? GROUP POOL ?gpool? IS ?ssd? LIMIT(90) THEN ?fast? > LIMIT(85) THEN ?sata? > rule ?repack? MIGRATE FROM POOL ?gpool? TO POOL ?gpool? WEIGHT(FILE_HEAT) > > > This should rank all the files in the three pools from hottest to > coldest, and migrate them > as necessary (if feasible) so that 'ssd' is up to 90% full of the > hottest, 'fast' is up to 85% full of the next > most hot, and the coolest files will be migrated to 'sata'. > > > > -- Andreas Landh?u?er +49 151 12133027 (mobile) alandhae at gmx.de -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From Robert.Oesterlin at nuance.com Tue Sep 27 19:12:19 2016 From: Robert.Oesterlin at nuance.com (Oesterlin, Robert) Date: Tue, 27 Sep 2016 18:12:19 +0000 Subject: [gpfsug-discuss] File_heat for GPFS File Systems Message-ID: <0217AC60-11F0-4CEB-AE91-22D25E4649DC@nuance.com> Sure, if you use a policy to migrate between two tiers, it will move files up or down based on heat. Something like this (flas and disk pools): rule grpdef GROUP POOL gpool IS flash LIMIT(75) THEN Disk rule repack MIGRATE FROM POOL gpool TO POOL gpool WEIGHT(FILE_HEAT) Bob Oesterlin Sr Storage Engineer, Nuance HPC Grid 507-269-0413 From: on behalf of Andreas Landh?u?er Reply-To: gpfsug main discussion list Date: Tuesday, September 27, 2016 at 1:04 PM To: Marc A Kaplan , gpfsug main discussion list Subject: [EXTERNAL] Re: [gpfsug-discuss] File_heat for GPFS File Systems as far as I understand, if a file gets hot again, there is no rule for putting the file back into a faster storage device? -------------- next part -------------- An HTML attachment was scrubbed... URL: From volobuev at us.ibm.com Tue Sep 27 19:26:46 2016 From: volobuev at us.ibm.com (Yuri L Volobuev) Date: Tue, 27 Sep 2016 11:26:46 -0700 Subject: [gpfsug-discuss] Blocksize In-Reply-To: <6C51B04A-9097-4598-8B4A-C484A3D98EE2@vanderbilt.edu> References: <4890d1b0-2f12-f8b6-684d-c98ca2b71ab7@docum.org><17781503-26B3-4448-B7B9-1EE27ABE6D1F@ulmer.org><13272A1B-425D-4DDD-A931-490604F92D61@ulmer.org> <6C51B04A-9097-4598-8B4A-C484A3D98EE2@vanderbilt.edu> Message-ID: > 1) Let?s assume that our overarching goal in configuring the block > size for metadata is performance from the user perspective ? i.e. > how fast is an ?ls -l? on my directory? Space savings aren?t > important, and how long policy scans or other ?administrative? type > tasks take is not nearly as important as that directory listing. > Does that change the recommended metadata block size? The performance challenges for the "ls -l" scenario are quite different from the policy scan scenario, so the same rules do not necessarily apply. During "ls -l" the code has to read inodes one by one (there's some prefetching going on, to take the edge off for the actual 'ls' thread, but prefetching is still one inode at a time). Metadata block size doesn't really come into the picture in this case, but inode size can be important -- depending on the storage performance characteristics. Does the storage you use for metadata exhibit a meaningfully different latency for 4K random reads vs 512 byte random reads? In my personal experience, on any modern storage device the difference is non-existent; in fact many devices (like all flash-based storage) use 4K native physical block size, and merely emulate 512 byte "sectors", so there's no way to read less than 4K. So from the inode read latency point of view 4K vs 512B is most likely a wash, but then 4K inodes can help improve performance of other operations, e.g. readdir of a small directory which fits entirely into the inode. If you use xattrs (e.g. as a side effect of using HSM), 4K inodes definitely help, but allowing xattrs to be stored in the inode. Policy scans reads inodes in full blocks, and there both metadata block size and inode size matter. Larger blocks could improve the inode read performance, while larger inodes mean that the same number of blocks hold fewer inodes and thus more blocks need to be read. So the policy inode scan performance is benefited by larger metadata block size and smaller inodes. 
However, policy scans also have to perform a directory traversal step, and that step tends to dominate the runtime of the overall run, and using larger inodes actually helps to speed up traversal of smaller directories. So whether larger inodes help or hurt the policy scan performance depends, yet again, on your file system composition. Overall, I believe that with all angles considered, larger inodes help with performance, and that was one of the considerations for making 4K inodes the default in V4.2+ versions. > 2) Let?s assume we have 3 filesystems, /home, /scratch (traditional > HPC use for those two) and /data (project space). Our storage > arrays are 24-bay units with two 8+2P RAID 6 LUNs, one RAID 1 > mirror, and two hot spare drives. The RAID 1 mirrors are for /home, > the RAID 6 LUNs are for /scratch or /data. /home has tons of small > files - so small that a 64K block size is currently used. /scratch > and /data have a mixture, but a 1 MB block size is the ?sweet spot? there. > > If you could ?start all over? with the same hardware being the only > restriction, would you: > > a) merge /scratch and /data into one filesystem but keep /home > separate since the LUN sizes are so very different, or > b) merge all three into one filesystem and use storage pools so that > /home is just a separate pool within the one filesystem? And if you > chose this option would you assign different block sizes to the pools? It's not possible to have different block sizes for different data pools. We are very aware that many people would like to be able to do just that, but this is counter to where the product is going. Supporting different block sizes for different pools is actually pretty hard: it's tricky to describe a large file that has some blocks in poolA and some in poolB where poolB has a different block size (perhaps during a migration) with the existing inode/indirect block format where each disk address pointer addresses a block of fixed size. With some effort, and some changes to how block addressing works, it would be possible to implement the support for this. However, as I mentioned in another post in this thread, we don't really want to glorify manual block size selection any further, we want to move away from it, by addressing the reasons that drive different block size selection today (like disk space utilization and performance). I'd recommend calculating a file size distribution histogram for your file systems. You may, for example, discover that 80% of the small files you have in /home would fit into 4K inodes, and then the storage space efficiency gains for the remaining 20% don't justify the complexity of managing an extra file system with a small block size. We don't recommend using block sizes smaller than 256K, because smaller block size is not good for disk space allocation code efficiency. It's a quadratic dependency: with a smaller block size, one block worth of the block allocation map covers that much less disk space, because each bit in the map covers fewer disk sectors, and fewer bits fit into a block. This means having to create a lot more block allocation map segments than what is needed for an ample level of parallelism. This hurts performance of many block allocation-related operations. I don't see a reason for /scratch and /data to be separate file systems, aside from perhaps failure domain considerations. 
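For the file-size distribution Yuri recommends, even a crude GNU find + awk pass over a representative subtree (a hedged sketch, with a made-up path) shows what fraction of files could live in a 4 KiB inode and where the bulk of the data sits; on very large trees the inode-scan approaches mentioned earlier in the thread (mmapplypolicy LIST rules, or the mmfind sample) get the same numbers much faster:

# power-of-two size histogram; 3584 bytes approximates the in-inode limit for 4K inodes
find /gpfs/home -xdev -type f -printf '%s\n' | awk '
  { n++
    if ($1 <= 3584) small++
    b = 0; while ((2 ^ b) < $1) b++      # smallest power of two >= file size
    if (b > maxb) maxb = b
    hist[b]++ }
  END {
    if (n == 0) { print "no files found"; exit }
    printf("files: %d   <= 3584 bytes (in-inode candidates): %d (%.1f%%)\n", n, small, 100*small/n)
    for (b = 0; b <= maxb; b++)
      if (b in hist) printf("<= %10d bytes: %d\n", 2 ^ b, hist[b])
  }'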
yuri > On Sep 26, 2016, at 2:29 PM, Yuri L Volobuev wrote: > > I would put the net summary this way: in GPFS, the "Goldilocks zone" > for metadata block size is 256K - 1M. If one plans to create a new > file system using GPFS V4.2+, 1M is a sound choice. > > In an ideal world, block size choice shouldn't really be a choice. > It's a low-level implementation detail that one day should go the > way of the manual ignition timing adjustment -- something that used > to be necessary in the olden days, and something that select > enthusiasts like to tweak to this day, but something that's > irrelevant for the overwhelming majority of the folks who just want > the engine to run. There's work being done in that general direction > in GPFS, but we aren't there yet. > > yuri > > Stephen Ulmer ---09/26/2016 12:02:25 PM---Now I?ve got > anther question? which I?ll let bake for a while. Okay, to (poorly) summarize: > > From: Stephen Ulmer > To: gpfsug main discussion list , > Date: 09/26/2016 12:02 PM > Subject: Re: [gpfsug-discuss] Blocksize > Sent by: gpfsug-discuss-bounces at spectrumscale.org > > > > Now I?ve got anther question? which I?ll let bake for a while. > > Okay, to (poorly) summarize: > There are items OTHER THAN INODES stored as metadata in GPFS. > These items have a VARIETY OF SIZES, but are packed in such a way > that we should just not worry about wasted space unless we pick a > LARGE metadata block size ? or if we don?t pick a ?reasonable? > metadata block size after picking a ?large? file system block size > that applies to both. > Performance is hard, and the gain from calculating exactly the best > metadata block size is much smaller than performance gains attained > through code optimization. > If we were to try and calculate the appropriate metadata block size > we would likely be wrong anyway, since none of us get our data at > the idealized physics shop that sells massless rulers and > frictionless pulleys. > We should probably all use a metadata block size around 1MB. Nobody > has said this outright, but it?s been the example as the ?good? size > at least three times in this thread. > Under no circumstances should we do what many of us would have done > and pick 128K, which made sense based on all of our previous > education that is no longer applicable. > > Did I miss anything? :) > > Liberty, > > -- > Stephen > > On Sep 26, 2016, at 2:18 PM, Yuri L Volobuev wrote: > It's important to understand the differences between different > metadata types, in particular where it comes to space allocation. > > System metadata files (inode file, inode and block allocation maps, > ACL file, fileset metadata file, EA file in older versions) are > allocated at well-defined moments (file system format, new storage > pool creation in the case of block allocation map, etc), and those > contain multiple records packed into a single block. From the block > allocator point of view, the individual metadata record size is > invisible, only larger blocks get actually allocated, and space > usage efficiency generally isn't an issue. > > For user metadata (indirect blocks, directory blocks, EA overflow > blocks) the situation is different. Those get allocated as the need > arises, generally one at a time. So the size of an individual > metadata structure matters, a lot. The smallest unit of allocation > in GPFS is a subblock (1/32nd of a block). If an IB or a directory > block is smaller than a subblock, the unused space in the subblock > is wasted. 
So if one chooses to use, say, 16 MiB block size for > metadata, the smallest unit of space that can be allocated is 512 > KiB. If one chooses 1 MiB block size, the smallest allocation unit > is 32 KiB. IBs are generally 16 KiB or 32 KiB in size (32 KiB with > any reasonable data block size); directory blocks used to be limited > to 32 KiB, but in the current code can be as large as 256 KiB. As > one can observe, using 16 MiB metadata block size would lead to a > considerable amount of wasted space for IBs and large directories > (small directories can live in inodes). On the other hand, with 1 > MiB block size, there'll be no wasted metadata space. Does any of > this actually make a practical difference? That depends on the file > system composition, namely the number of IBs (which is a function of > the number of large files) and larger directories. Calculating this > scientifically can be pretty involved, and really should be the job > of a tool that ought to exist, but doesn't (yet). A more practical > approach is doing a ballpark estimate using local file counts and > typical fractions of large files and directories, using statistics > available from published papers. > > The performance implications of a given metadata block size choice > is a subject of nearly infinite depth, and this question ultimately > can only be answered by doing experiments with a specific workload > on specific hardware. The metadata space utilization efficiency is > something that can be answered conclusively though. > > yuri > > "Buterbaugh, Kevin L" ---09/24/2016 07:19:09 AM---Hi > Sven, I am confused by your statement that the metadata block size > should be 1 MB and am very int > > From: "Buterbaugh, Kevin L" > To: gpfsug main discussion list , > Date: 09/24/2016 07:19 AM > Subject: Re: [gpfsug-discuss] Blocksize > Sent by: gpfsug-discuss-bounces at spectrumscale.org > > > > > Hi Sven, > > I am confused by your statement that the metadata block size should > be 1 MB and am very interested in learning the rationale behind this > as I am currently looking at all aspects of our current GPFS > configuration and the possibility of making major changes. > > If you have a filesystem with only metadataOnly disks in the system > pool and the default size of an inode is 4K (which we would do, > since we have recently discovered that even on our scratch > filesystem we have a bazillion files that are 4K or smaller and > could therefore have their data stored in the inode, right?), then > why would you set the metadata block size to anything larger than > 128K when a sub-block is 1/32nd of a block? I.e., with a 1 MB block > size for metadata wouldn?t you be wasting a massive amount of space? > > What am I missing / confused about there? > > Oh, and here?s a related question ? let?s just say I have the above > configuration ? my system pool is metadata only and is on SSD?s. > Then I have two other dataOnly pools that are spinning disk. One is > for ?regular? access and the other is the ?capacity? pool ? i.e. a > pool of slower storage where we move files with large access times. > I have a policy that says something like ?move all files with an > access time > 6 months to the capacity pool.? Of those bazillion > files less than 4K in size that are fitting in the inode currently, > probably half a bazillion () of them would be subject to that > rule. Will they get moved to the spinning disk capacity pool or will > they stay in the inode?? > > Thanks! This is a very timely and interesting discussion for me as well... 
> > Kevin > On Sep 23, 2016, at 4:35 PM, Sven Oehme wrote: > your metadata block size these days should be 1 MB and there are > only very few workloads for which you should run with a filesystem > blocksize below 1 MB. so if you don't know exactly what to pick, 1 > MB is a good starting point. > the general rule still applies that your filesystem blocksize > (metadata or data pool) should match your raid controller (or GNR > vdisk) stripe size of the particular pool. > > so if you use a 128k strip size(defaut in many midrange storage > controllers) in a 8+2p raid array, your stripe or track size is 1 MB > and therefore the blocksize of this pool should be 1 MB. i see many > customers in the field using 1MB or even smaller blocksize on RAID > stripes of 2 MB or above and your performance will be significant > impacted by that. > > Sven > > ------------------------------------------ > Sven Oehme > Scalable Storage Research > email: oehmes at us.ibm.com > Phone: +1 (408) 824-8904 > IBM Almaden Research Lab > ------------------------------------------ > > Stephen Ulmer ---09/23/2016 12:16:34 PM---Not to be too > pedantic, but I believe the the subblock size is 1/32 of the block > size (which strengt > > From: Stephen Ulmer > To: gpfsug main discussion list > Date: 09/23/2016 12:16 PM > Subject: Re: [gpfsug-discuss] Blocksize > Sent by: gpfsug-discuss-bounces at spectrumscale.org > > > > > Not to be too pedantic, but I believe the the subblock size is 1/32 > of the block size (which strengthens Luis?s arguments below). > > I thought the the original question was NOT about inode size, but > about metadata block size. You can specify that the system pool have > a different block size from the rest of the filesystem, providing > that it ONLY holds metadata (?metadata-block-size option to mmcrfs). > > So with 4K inodes (which should be used for all new filesystems > without some counter-indication), I would think that we?d want to > use a metadata block size of 4K*32=128K. This is independent of the > regular block size, which you can calculate based on the workload if > you?re lucky. > > There could be a great reason NOT to use 128K metadata block size, > but I don?t know what it is. I?d be happy to be corrected about this > if it?s out of whack. > > -- > Stephen > On Sep 22, 2016, at 3:37 PM, Luis Bolinches wrote: > > Hi > > My 2 cents. > > Leave at least 4K inodes, then you get massive improvement on small > files (less 3.5K minus whatever you use on xattr) > > About blocksize for data, unless you have actual data that suggest > that you will actually benefit from smaller than 1MB block, leave > there. GPFS uses sublocks where 1/16th of the BS can be allocated to > different files, so the "waste" is much less than you think on 1MB > and you get the throughput and less structures of much more data blocks. > > No warranty at all but I try to do this when the BS talk comes in: > (might need some clean up it could not be last note but you get the idea) > > POSIX > find . 
-type f -name '*' -exec ls -l {} \; > find_ls_files.out > GPFS > cd /usr/lpp/mmfs/samples/ilm > gcc mmfindUtil_processOutputFile.c -o mmfindUtil_processOutputFile > ./mmfind /gpfs/shared -ls -type f > find_ls_files.out > CONVERT to CSV > > POSIX > cat find_ls_files.out | awk '{print $5","}' > find_ls_files.out.csv > GPFS > cat find_ls_files.out | awk '{print $7","}' > find_ls_files.out.csv > LOAD in octave > > FILESIZE = int32 (dlmread ("find_ls_files.out.csv", ",")); > Clean the second column (OPTIONAL as the next clean up will do the same) > > FILESIZE(:,[2]) = []; > If we are on 4K aligment we need to clean the files that go to > inodes (WELL not exactly ... extended attributes! so maybe use a > lower number!) > > FILESIZE(FILESIZE<=3584) =[]; > If we are not we need to clean the 0 size files > > FILESIZE(FILESIZE==0) =[]; > Median > > FILESIZEMEDIAN = int32 (median (FILESIZE)) > Mean > > FILESIZEMEAN = int32 (mean (FILESIZE)) > Variance > > int32 (var (FILESIZE)) > iqr interquartile range, i.e., the difference between the upper and > lower quartile, of the input data. > > int32 (iqr (FILESIZE)) > Standard deviation > > > For some FS with lots of files you might need a rather powerful > machine to run the calculations on octave, I never hit anything > could not manage on a 64GB RAM Power box. Most of the times it is > enough with my laptop. > > > > -- > Yst?v?llisin terveisin / Kind regards / Saludos cordiales / Salutations > > Luis Bolinches > Lab Services > http://www-03.ibm.com/systems/services/labservices/ > > IBM Laajalahdentie 23 (main Entrance) Helsinki, 00330 Finland > Phone: +358 503112585 > > "If you continually give you will continually have." Anonymous > > > ----- Original message ----- > From: Stef Coene > Sent by: gpfsug-discuss-bounces at spectrumscale.org > To: gpfsug main discussion list > Cc: > Subject: Re: [gpfsug-discuss] Blocksize > Date: Thu, Sep 22, 2016 10:30 PM > > On 09/22/2016 09:07 PM, J. Eric Wonderley wrote: > > It defaults to 4k: > > mmlsfs testbs8M -i > > flag value description > > ------------------- ------------------------ > > ----------------------------------- > > -i 4096 Inode size in bytes > > > > I think you can make as small as 512b. Gpfs will store very small > > files in the inode. > > > > Typically you want your average file size to be your blocksize and your > > filesystem has one blocksize and one inodesize. > > The files are not small, but around 20 MB on average. > So I calculated with IBM that a 1 MB or 2 MB block size is best. > > But I'm not sure if it's better to use a smaller block size for the > metadata. > > The file system is not that large (400 TB) and will hold backup data > from CommVault. > > > Stef > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > > > Ellei edell? 
ole toisin mainittu: / Unless stated otherwise above: > Oy IBM Finland Ab > PL 265, 00101 Helsinki, Finland > Business ID, Y-tunnus: 0195876-3 > Registered in Finland > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss -------------- next part -------------- An HTML attachment was scrubbed... URL: From chekh at stanford.edu Tue Sep 27 19:51:50 2016 From: chekh at stanford.edu (Alex Chekholko) Date: Tue, 27 Sep 2016 11:51:50 -0700 Subject: [gpfsug-discuss] File_heat for GPFS File Systems In-Reply-To: References: Message-ID: <8c7fd395-1efc-7197-4a98-763ba784cafd@stanford.edu> On 09/27/2016 11:04 AM, Andreas Landh?u?er wrote: > if a file gets hot again, there is no rule for putting the file back > into a faster storage device? The file will get moved when you run the policy again. You can run the policy as often as you like. There is also a way to use a GPFS hook to trigger policy run. Check 'mmaddcallback' But I think you have to be careful and think through the complexity. e.g. load spikes and pool fills up and your callback kicks in and starts a migration which increases the I/O load further, etc... Regards, -- Alex Chekholko chekh at stanford.edu From makaplan at us.ibm.com Tue Sep 27 20:27:47 2016 From: makaplan at us.ibm.com (Marc A Kaplan) Date: Tue, 27 Sep 2016 15:27:47 -0400 Subject: [gpfsug-discuss] File_heat for GPFS File Systems In-Reply-To: References: Message-ID: Read about GROUP POOL - you can call as often as you like to "repack" the files into several pools from hot to cold. Of course, there is a cost to running mmapplypolicy... So maybe you'd just run it once every day or so... -------------- next part -------------- An HTML attachment was scrubbed... URL: From xhejtman at ics.muni.cz Tue Sep 27 20:38:16 2016 From: xhejtman at ics.muni.cz (Lukas Hejtmanek) Date: Tue, 27 Sep 2016 21:38:16 +0200 Subject: [gpfsug-discuss] Samba via CES Message-ID: <20160927193815.kpppiy76wpudg6cj@ics.muni.cz> Hello, does CES offer high availability of SMB? I.e., does used samba server provide cluster wide persistent handles? Or failover from node to node is currently not supported without the client disruption? -- Luk?? 
Hejtm?nek From erich at uw.edu Tue Sep 27 21:56:20 2016 From: erich at uw.edu (Eric Horst) Date: Tue, 27 Sep 2016 13:56:20 -0700 Subject: [gpfsug-discuss] File_heat for GPFS File Systems In-Reply-To: <8c7fd395-1efc-7197-4a98-763ba784cafd@stanford.edu> References: <8c7fd395-1efc-7197-4a98-763ba784cafd@stanford.edu> Message-ID: >> >> if a file gets hot again, there is no rule for putting the file back >> into a faster storage device? > > > The file will get moved when you run the policy again. You can run the > policy as often as you like. I think its worth stating clearly that if a file is in the Thrifty slow pool and a user opens and reads/writes the file there is nothing that moves this file to a different tier. A policy run is the only action that relocates files. So if you apply the policy daily and over the course of the day users access many cold files, the performance accessing those cold files may not be ideal until the next day when they are repacked by heat. A file is not automatically moved to the fast tier on access read or write. I mention this because this aspect of tiering was not immediately clear from the docs when I was a neophyte GPFS admin and I had to learn by observation. It is easy for one to make an assumption that it is a more dynamic tiering system than it is. -Eric -- Eric Horst University of Washington From Kevin.Buterbaugh at Vanderbilt.Edu Tue Sep 27 22:21:23 2016 From: Kevin.Buterbaugh at Vanderbilt.Edu (Buterbaugh, Kevin L) Date: Tue, 27 Sep 2016 21:21:23 +0000 Subject: [gpfsug-discuss] Blocksize In-Reply-To: References: <4890d1b0-2f12-f8b6-684d-c98ca2b71ab7@docum.org> <17781503-26B3-4448-B7B9-1EE27ABE6D1F@ulmer.org> <13272A1B-425D-4DDD-A931-490604F92D61@ulmer.org> <6C51B04A-9097-4598-8B4A-C484A3D98EE2@vanderbilt.edu> Message-ID: <4BDC0E5A-176A-48CC-8DE6-93C7B5A3F138@vanderbilt.edu> Hi All, Again, my thanks to all who responded to my last post. Let me begin by stating something I unintentionally omitted in my last post ? we already use SSDs for our metadata. Which leads me to yet another question ? of my three filesystems, two (/home and /scratch) are much older (created in 2010) and therefore currently have a 512 byte inode size. /data is newer and has a 4K inode size. Now if I combine /scratch and /data into one filesystem with a 4K inode size, the amount of space used by all the inodes coming over from /scratch is going to grow by a factor of eight unless I?m horribly confused. And I would assume I need to count the amount of space taken up by allocated inodes, not just used inodes. Therefore ? how much space my metadata takes up just grew significantly in importance since: 1) metadata is on very expensive enterprise class, vendor certified SSDs, 2) we use RAID 1 mirrors of those SSDs, and 3) we have metadata replication set to two. Some of the information presented by Sven and Yuri seems to contradict each other in regards to how much space inodes take up ? or I?m misunderstanding one or both of them! Leaving aside replication, if I use a 256K block size for my metadata and I use 4K inodes, are those inodes going to take up 4K each or are they going to take up 8K each (1/32nd of a 256K block)? By the way, I do have a file size / file age spreadsheet for each of my filesystems (which I would be willing to share with interested parties) and while I was not surprised to learn that I have over 10 million sub-1K files on /home, I was a bit surprised to find that I have almost as many sub-1K files on /scratch (and a few million more on /data). 
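To put rough numbers on the 512-byte-versus-4K worry, a back-of-the-envelope sketch (the inode count below is made up purely for illustration; mmdf reports the used, free and allocated inode counts at the end of its output, and it is the allocated inodes that consume the space):

awk 'BEGIN {
  inodes = 250 * 1000 * 1000      # allocated inodes -- made-up figure, substitute your own
  repl   = 2                      # metadata replication factor
  printf "512B inodes: %6.2f TB of metadata NSD space\n", inodes *  512 * repl / 1e12
  printf "  4K inodes: %6.2f TB of metadata NSD space\n", inodes * 4096 * repl / 1e12
}'

With RAID 1 underneath the metadata NSDs the raw SSD purchase is double those figures again, which is where the factor-of-eight growth really starts to hurt.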
So there?s a huge potential win in having those files in the inode on SSD as opposed to on spinning disk, but there?s also a huge potential $$$ cost. Thanks again ? I hope others are gaining useful information from this thread. I sure am! Kevin On Sep 27, 2016, at 1:26 PM, Yuri L Volobuev > wrote: > 1) Let?s assume that our overarching goal in configuring the block > size for metadata is performance from the user perspective ? i.e. > how fast is an ?ls -l? on my directory? Space savings aren?t > important, and how long policy scans or other ?administrative? type > tasks take is not nearly as important as that directory listing. > Does that change the recommended metadata block size? The performance challenges for the "ls -l" scenario are quite different from the policy scan scenario, so the same rules do not necessarily apply. During "ls -l" the code has to read inodes one by one (there's some prefetching going on, to take the edge off for the actual 'ls' thread, but prefetching is still one inode at a time). Metadata block size doesn't really come into the picture in this case, but inode size can be important -- depending on the storage performance characteristics. Does the storage you use for metadata exhibit a meaningfully different latency for 4K random reads vs 512 byte random reads? In my personal experience, on any modern storage device the difference is non-existent; in fact many devices (like all flash-based storage) use 4K native physical block size, and merely emulate 512 byte "sectors", so there's no way to read less than 4K. So from the inode read latency point of view 4K vs 512B is most likely a wash, but then 4K inodes can help improve performance of other operations, e.g. readdir of a small directory which fits entirely into the inode. If you use xattrs (e.g. as a side effect of using HSM), 4K inodes definitely help, but allowing xattrs to be stored in the inode. Policy scans reads inodes in full blocks, and there both metadata block size and inode size matter. Larger blocks could improve the inode read performance, while larger inodes mean that the same number of blocks hold fewer inodes and thus more blocks need to be read. So the policy inode scan performance is benefited by larger metadata block size and smaller inodes. However, policy scans also have to perform a directory traversal step, and that step tends to dominate the runtime of the overall run, and using larger inodes actually helps to speed up traversal of smaller directories. So whether larger inodes help or hurt the policy scan performance depends, yet again, on your file system composition. Overall, I believe that with all angles considered, larger inodes help with performance, and that was one of the considerations for making 4K inodes the default in V4.2+ versions. > 2) Let?s assume we have 3 filesystems, /home, /scratch (traditional > HPC use for those two) and /data (project space). Our storage > arrays are 24-bay units with two 8+2P RAID 6 LUNs, one RAID 1 > mirror, and two hot spare drives. The RAID 1 mirrors are for /home, > the RAID 6 LUNs are for /scratch or /data. /home has tons of small > files - so small that a 64K block size is currently used. /scratch > and /data have a mixture, but a 1 MB block size is the ?sweet spot? there. > > If you could ?start all over? 
with the same hardware being the only
> restriction, would you:
> a) merge /scratch and /data into one filesystem but keep /home
> separate since the LUN sizes are so very different, or
> b) merge all three into one filesystem and use storage pools?
> [...]
_______________________________________________
gpfsug-discuss mailing list
gpfsug-discuss at spectrumscale.org
http://gpfsug.org/mailman/listinfo/gpfsug-discuss
-------------- next part --------------
An HTML attachment was scrubbed...
URL: From christof.schmitt at us.ibm.com Tue Sep 27 22:36:37 2016 From: christof.schmitt at us.ibm.com (Christof Schmitt) Date: Tue, 27 Sep 2016 14:36:37 -0700 Subject: [gpfsug-discuss] Samba via CES In-Reply-To: <20160927193815.kpppiy76wpudg6cj@ics.muni.cz> References: <20160927193815.kpppiy76wpudg6cj@ics.muni.cz> Message-ID: When a CES node fails, protocol clients have to reconnect to one of the remaining nodes. Samba in CES does not support persistent handles. This is indicated in the documentation: http://www.ibm.com/support/knowledgecenter/en/STXKQY_4.2.1/com.ibm.spectrum.scale.v4r21.doc/bl1adm_smbexportlimits.htm#bl1adm_smbexportlimits "Only mandatory SMB3 protocol features are supported. " Regards, Christof Schmitt || IBM || Spectrum Scale Development || Tucson, AZ christof.schmitt at us.ibm.com || +1-520-799-2469 (T/L: 321-2469) From: Lukas Hejtmanek To: gpfsug-discuss at spectrumscale.org Date: 09/27/2016 12:38 PM Subject: [gpfsug-discuss] Samba via CES Sent by: gpfsug-discuss-bounces at spectrumscale.org Hello, does CES offer high availability of SMB? I.e., does used samba server provide cluster wide persistent handles? Or failover from node to node is currently not supported without the client disruption? -- Luk?? Hejtm?nek _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss From xhejtman at ics.muni.cz Tue Sep 27 22:42:57 2016 From: xhejtman at ics.muni.cz (Lukas Hejtmanek) Date: Tue, 27 Sep 2016 23:42:57 +0200 Subject: [gpfsug-discuss] Samba via CES In-Reply-To: References: <20160927193815.kpppiy76wpudg6cj@ics.muni.cz> Message-ID: <20160927214257.z4ezmssnpwhmm4rk@ics.muni.cz> On Tue, Sep 27, 2016 at 02:36:37PM -0700, Christof Schmitt wrote: > When a CES node fails, protocol clients have to reconnect to one of the > remaining nodes. > > Samba in CES does not support persistent handles. This is indicated in the > documentation: > > http://www.ibm.com/support/knowledgecenter/en/STXKQY_4.2.1/com.ibm.spectrum.scale.v4r21.doc/bl1adm_smbexportlimits.htm#bl1adm_smbexportlimits > > "Only mandatory SMB3 protocol features are supported. " well, but in this case, HA feature is a bit pointless as node fail results in a client failure as well as reconnect does not seem to be automatic if there is on going traffic.. more precisely reconnect is automatic but without persistent handles, the client receives write protect error immediately. -- Luk?? Hejtm?nek From Greg.Lehmann at csiro.au Wed Sep 28 08:40:35 2016 From: Greg.Lehmann at csiro.au (Greg.Lehmann at csiro.au) Date: Wed, 28 Sep 2016 07:40:35 +0000 Subject: [gpfsug-discuss] Blocksize In-Reply-To: <4BDC0E5A-176A-48CC-8DE6-93C7B5A3F138@vanderbilt.edu> References: <4890d1b0-2f12-f8b6-684d-c98ca2b71ab7@docum.org> <17781503-26B3-4448-B7B9-1EE27ABE6D1F@ulmer.org> <13272A1B-425D-4DDD-A931-490604F92D61@ulmer.org> <6C51B04A-9097-4598-8B4A-C484A3D98EE2@vanderbilt.edu> <4BDC0E5A-176A-48CC-8DE6-93C7B5A3F138@vanderbilt.edu> Message-ID: <428599f3d6cb47ebb74d05178eeba2b8@exch1-cdc.nexus.csiro.au> I am wondering what people use to produce a file size distribution report for their filesystems. Has everyone rolled their own or is there some goto app to use. 
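For what it's worth, one roll-your-own route is a policy LIST rule plus a little awk; the sketch below is untested boilerplate (the file system name, file paths and the awk field number are placeholders to adapt, and the column the SHOW() text lands in can differ a little between releases):

cat > /tmp/listsizes.pol <<'EOF'
RULE EXTERNAL LIST 'sizes' EXEC ''
RULE 'allfiles' LIST 'sizes' SHOW(VARCHAR(FILE_SIZE))
EOF

mmapplypolicy gpfs0 -P /tmp/listsizes.pol -I defer -f /tmp/gpfs0
# With EXEC '' and -I defer the candidate list should be left behind as
# /tmp/gpfs0.list.sizes (prefix.list.<listname>), one line per file with the
# SHOW() text -- here FILE_SIZE -- in one of the leading columns.

awk '{ sz = $4; n++; tot += sz; if (sz <= 3584) small++ }    # adjust $4 to wherever SHOW() lands
     END { printf "%d files, %.1f TB, %.0f%% would fit in a 4K inode\n",
                  n, tot/1e12, 100*small/n }' /tmp/gpfs0.list.sizes

The policy scan can be spread across helper nodes with -N (plus a shared work directory via -g), so it finishes in a fraction of the time a serial find takes on a few hundred million files; the size column can also be fed into the octave recipe posted earlier in the Blocksize thread.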
Cheers, Greg From: gpfsug-discuss-bounces at spectrumscale.org [mailto:gpfsug-discuss-bounces at spectrumscale.org] On Behalf Of Buterbaugh, Kevin L Sent: Wednesday, 28 September 2016 7:21 AM To: gpfsug main discussion list Subject: Re: [gpfsug-discuss] Blocksize Hi All, Again, my thanks to all who responded to my last post. Let me begin by stating something I unintentionally omitted in my last post ? we already use SSDs for our metadata. Which leads me to yet another question ? of my three filesystems, two (/home and /scratch) are much older (created in 2010) and therefore currently have a 512 byte inode size. /data is newer and has a 4K inode size. Now if I combine /scratch and /data into one filesystem with a 4K inode size, the amount of space used by all the inodes coming over from /scratch is going to grow by a factor of eight unless I?m horribly confused. And I would assume I need to count the amount of space taken up by allocated inodes, not just used inodes. Therefore ? how much space my metadata takes up just grew significantly in importance since: 1) metadata is on very expensive enterprise class, vendor certified SSDs, 2) we use RAID 1 mirrors of those SSDs, and 3) we have metadata replication set to two. Some of the information presented by Sven and Yuri seems to contradict each other in regards to how much space inodes take up ? or I?m misunderstanding one or both of them! Leaving aside replication, if I use a 256K block size for my metadata and I use 4K inodes, are those inodes going to take up 4K each or are they going to take up 8K each (1/32nd of a 256K block)? By the way, I do have a file size / file age spreadsheet for each of my filesystems (which I would be willing to share with interested parties) and while I was not surprised to learn that I have over 10 million sub-1K files on /home, I was a bit surprised to find that I have almost as many sub-1K files on /scratch (and a few million more on /data). So there?s a huge potential win in having those files in the inode on SSD as opposed to on spinning disk, but there?s also a huge potential $$$ cost. Thanks again ? I hope others are gaining useful information from this thread. I sure am! Kevin On Sep 27, 2016, at 1:26 PM, Yuri L Volobuev > wrote: > 1) Let?s assume that our overarching goal in configuring the block > size for metadata is performance from the user perspective ? i.e. > how fast is an ?ls -l? on my directory? Space savings aren?t > important, and how long policy scans or other ?administrative? type > tasks take is not nearly as important as that directory listing. > Does that change the recommended metadata block size? The performance challenges for the "ls -l" scenario are quite different from the policy scan scenario, so the same rules do not necessarily apply. During "ls -l" the code has to read inodes one by one (there's some prefetching going on, to take the edge off for the actual 'ls' thread, but prefetching is still one inode at a time). Metadata block size doesn't really come into the picture in this case, but inode size can be important -- depending on the storage performance characteristics. Does the storage you use for metadata exhibit a meaningfully different latency for 4K random reads vs 512 byte random reads? In my personal experience, on any modern storage device the difference is non-existent; in fact many devices (like all flash-based storage) use 4K native physical block size, and merely emulate 512 byte "sectors", so there's no way to read less than 4K. 
So from the inode read latency point of view 4K vs 512B is most likely a wash, but then 4K inodes can help improve performance of other operations, e.g. readdir of a small directory which fits entirely into the inode. If you use xattrs (e.g. as a side effect of using HSM), 4K inodes definitely help, but allowing xattrs to be stored in the inode. Policy scans reads inodes in full blocks, and there both metadata block size and inode size matter. Larger blocks could improve the inode read performance, while larger inodes mean that the same number of blocks hold fewer inodes and thus more blocks need to be read. So the policy inode scan performance is benefited by larger metadata block size and smaller inodes. However, policy scans also have to perform a directory traversal step, and that step tends to dominate the runtime of the overall run, and using larger inodes actually helps to speed up traversal of smaller directories. So whether larger inodes help or hurt the policy scan performance depends, yet again, on your file system composition. Overall, I believe that with all angles considered, larger inodes help with performance, and that was one of the considerations for making 4K inodes the default in V4.2+ versions. > 2) Let?s assume we have 3 filesystems, /home, /scratch (traditional > HPC use for those two) and /data (project space). Our storage > arrays are 24-bay units with two 8+2P RAID 6 LUNs, one RAID 1 > mirror, and two hot spare drives. The RAID 1 mirrors are for /home, > the RAID 6 LUNs are for /scratch or /data. /home has tons of small > files - so small that a 64K block size is currently used. /scratch > and /data have a mixture, but a 1 MB block size is the ?sweet spot? there. > > If you could ?start all over? with the same hardware being the only > restriction, would you: > > a) merge /scratch and /data into one filesystem but keep /home > separate since the LUN sizes are so very different, or > b) merge all three into one filesystem and use storage pools so that > /home is just a separate pool within the one filesystem? And if you > chose this option would you assign different block sizes to the pools? It's not possible to have different block sizes for different data pools. We are very aware that many people would like to be able to do just that, but this is counter to where the product is going. Supporting different block sizes for different pools is actually pretty hard: it's tricky to describe a large file that has some blocks in poolA and some in poolB where poolB has a different block size (perhaps during a migration) with the existing inode/indirect block format where each disk address pointer addresses a block of fixed size. With some effort, and some changes to how block addressing works, it would be possible to implement the support for this. However, as I mentioned in another post in this thread, we don't really want to glorify manual block size selection any further, we want to move away from it, by addressing the reasons that drive different block size selection today (like disk space utilization and performance). I'd recommend calculating a file size distribution histogram for your file systems. You may, for example, discover that 80% of the small files you have in /home would fit into 4K inodes, and then the storage space efficiency gains for the remaining 20% don't justify the complexity of managing an extra file system with a small block size. 
We don't recommend using block sizes smaller than 256K, because smaller block size is not good for disk space allocation code efficiency. It's a quadratic dependency: with a smaller block size, one block worth of the block allocation map covers that much less disk space, because each bit in the map covers fewer disk sectors, and fewer bits fit into a block. This means having to create a lot more block allocation map segments than what is needed for an ample level of parallelism. This hurts performance of many block allocation-related operations. I don't see a reason for /scratch and /data to be separate file systems, aside from perhaps failure domain considerations. yuri > On Sep 26, 2016, at 2:29 PM, Yuri L Volobuev > wrote: > > I would put the net summary this way: in GPFS, the "Goldilocks zone" > for metadata block size is 256K - 1M. If one plans to create a new > file system using GPFS V4.2+, 1M is a sound choice. > > In an ideal world, block size choice shouldn't really be a choice. > It's a low-level implementation detail that one day should go the > way of the manual ignition timing adjustment -- something that used > to be necessary in the olden days, and something that select > enthusiasts like to tweak to this day, but something that's > irrelevant for the overwhelming majority of the folks who just want > the engine to run. There's work being done in that general direction > in GPFS, but we aren't there yet. > > yuri > > Stephen Ulmer ---09/26/2016 12:02:25 PM---Now I?ve got > anther question? which I?ll let bake for a while. Okay, to (poorly) summarize: > > From: Stephen Ulmer > > To: gpfsug main discussion list >, > Date: 09/26/2016 12:02 PM > Subject: Re: [gpfsug-discuss] Blocksize > Sent by: gpfsug-discuss-bounces at spectrumscale.org > > > > Now I?ve got anther question? which I?ll let bake for a while. > > Okay, to (poorly) summarize: > There are items OTHER THAN INODES stored as metadata in GPFS. > These items have a VARIETY OF SIZES, but are packed in such a way > that we should just not worry about wasted space unless we pick a > LARGE metadata block size ? or if we don?t pick a ?reasonable? > metadata block size after picking a ?large? file system block size > that applies to both. > Performance is hard, and the gain from calculating exactly the best > metadata block size is much smaller than performance gains attained > through code optimization. > If we were to try and calculate the appropriate metadata block size > we would likely be wrong anyway, since none of us get our data at > the idealized physics shop that sells massless rulers and > frictionless pulleys. > We should probably all use a metadata block size around 1MB. Nobody > has said this outright, but it?s been the example as the ?good? size > at least three times in this thread. > Under no circumstances should we do what many of us would have done > and pick 128K, which made sense based on all of our previous > education that is no longer applicable. > > Did I miss anything? :) > > Liberty, > > -- > Stephen > > On Sep 26, 2016, at 2:18 PM, Yuri L Volobuev > wrote: > It's important to understand the differences between different > metadata types, in particular where it comes to space allocation. 
> > System metadata files (inode file, inode and block allocation maps, > ACL file, fileset metadata file, EA file in older versions) are > allocated at well-defined moments (file system format, new storage > pool creation in the case of block allocation map, etc), and those > contain multiple records packed into a single block. From the block > allocator point of view, the individual metadata record size is > invisible, only larger blocks get actually allocated, and space > usage efficiency generally isn't an issue. > > For user metadata (indirect blocks, directory blocks, EA overflow > blocks) the situation is different. Those get allocated as the need > arises, generally one at a time. So the size of an individual > metadata structure matters, a lot. The smallest unit of allocation > in GPFS is a subblock (1/32nd of a block). If an IB or a directory > block is smaller than a subblock, the unused space in the subblock > is wasted. So if one chooses to use, say, 16 MiB block size for > metadata, the smallest unit of space that can be allocated is 512 > KiB. If one chooses 1 MiB block size, the smallest allocation unit > is 32 KiB. IBs are generally 16 KiB or 32 KiB in size (32 KiB with > any reasonable data block size); directory blocks used to be limited > to 32 KiB, but in the current code can be as large as 256 KiB. As > one can observe, using 16 MiB metadata block size would lead to a > considerable amount of wasted space for IBs and large directories > (small directories can live in inodes). On the other hand, with 1 > MiB block size, there'll be no wasted metadata space. Does any of > this actually make a practical difference? That depends on the file > system composition, namely the number of IBs (which is a function of > the number of large files) and larger directories. Calculating this > scientifically can be pretty involved, and really should be the job > of a tool that ought to exist, but doesn't (yet). A more practical > approach is doing a ballpark estimate using local file counts and > typical fractions of large files and directories, using statistics > available from published papers. > > The performance implications of a given metadata block size choice > is a subject of nearly infinite depth, and this question ultimately > can only be answered by doing experiments with a specific workload > on specific hardware. The metadata space utilization efficiency is > something that can be answered conclusively though. > > yuri > > "Buterbaugh, Kevin L" ---09/24/2016 07:19:09 AM---Hi > Sven, I am confused by your statement that the metadata block size > should be 1 MB and am very int > > From: "Buterbaugh, Kevin L" > > To: gpfsug main discussion list >, > Date: 09/24/2016 07:19 AM > Subject: Re: [gpfsug-discuss] Blocksize > Sent by: gpfsug-discuss-bounces at spectrumscale.org > > > > > Hi Sven, > > I am confused by your statement that the metadata block size should > be 1 MB and am very interested in learning the rationale behind this > as I am currently looking at all aspects of our current GPFS > configuration and the possibility of making major changes. 
> > If you have a filesystem with only metadataOnly disks in the system > pool and the default size of an inode is 4K (which we would do, > since we have recently discovered that even on our scratch > filesystem we have a bazillion files that are 4K or smaller and > could therefore have their data stored in the inode, right?), then > why would you set the metadata block size to anything larger than > 128K when a sub-block is 1/32nd of a block? I.e., with a 1 MB block > size for metadata wouldn?t you be wasting a massive amount of space? > > What am I missing / confused about there? > > Oh, and here?s a related question ? let?s just say I have the above > configuration ? my system pool is metadata only and is on SSD?s. > Then I have two other dataOnly pools that are spinning disk. One is > for ?regular? access and the other is the ?capacity? pool ? i.e. a > pool of slower storage where we move files with large access times. > I have a policy that says something like ?move all files with an > access time > 6 months to the capacity pool.? Of those bazillion > files less than 4K in size that are fitting in the inode currently, > probably half a bazillion () of them would be subject to that > rule. Will they get moved to the spinning disk capacity pool or will > they stay in the inode?? > > Thanks! This is a very timely and interesting discussion for me as well... > > Kevin > On Sep 23, 2016, at 4:35 PM, Sven Oehme > wrote: > your metadata block size these days should be 1 MB and there are > only very few workloads for which you should run with a filesystem > blocksize below 1 MB. so if you don't know exactly what to pick, 1 > MB is a good starting point. > the general rule still applies that your filesystem blocksize > (metadata or data pool) should match your raid controller (or GNR > vdisk) stripe size of the particular pool. > > so if you use a 128k strip size(defaut in many midrange storage > controllers) in a 8+2p raid array, your stripe or track size is 1 MB > and therefore the blocksize of this pool should be 1 MB. i see many > customers in the field using 1MB or even smaller blocksize on RAID > stripes of 2 MB or above and your performance will be significant > impacted by that. > > Sven > > ------------------------------------------ > Sven Oehme > Scalable Storage Research > email: oehmes at us.ibm.com > Phone: +1 (408) 824-8904 > IBM Almaden Research Lab > ------------------------------------------ > > Stephen Ulmer ---09/23/2016 12:16:34 PM---Not to be too > pedantic, but I believe the the subblock size is 1/32 of the block > size (which strengt > > From: Stephen Ulmer > > To: gpfsug main discussion list > > Date: 09/23/2016 12:16 PM > Subject: Re: [gpfsug-discuss] Blocksize > Sent by: gpfsug-discuss-bounces at spectrumscale.org > > > > > Not to be too pedantic, but I believe the the subblock size is 1/32 > of the block size (which strengthens Luis?s arguments below). > > I thought the the original question was NOT about inode size, but > about metadata block size. You can specify that the system pool have > a different block size from the rest of the filesystem, providing > that it ONLY holds metadata (?metadata-block-size option to mmcrfs). > > So with 4K inodes (which should be used for all new filesystems > without some counter-indication), I would think that we?d want to > use a metadata block size of 4K*32=128K. This is independent of the > regular block size, which you can calculate based on the workload if > you?re lucky. 
> > There could be a great reason NOT to use 128K metadata block size, > but I don?t know what it is. I?d be happy to be corrected about this > if it?s out of whack. > > -- > Stephen > On Sep 22, 2016, at 3:37 PM, Luis Bolinches > wrote: > > Hi > > My 2 cents. > > Leave at least 4K inodes, then you get massive improvement on small > files (less 3.5K minus whatever you use on xattr) > > About blocksize for data, unless you have actual data that suggest > that you will actually benefit from smaller than 1MB block, leave > there. GPFS uses sublocks where 1/16th of the BS can be allocated to > different files, so the "waste" is much less than you think on 1MB > and you get the throughput and less structures of much more data blocks. > > No warranty at all but I try to do this when the BS talk comes in: > (might need some clean up it could not be last note but you get the idea) > > POSIX > find . -type f -name '*' -exec ls -l {} \; > find_ls_files.out > GPFS > cd /usr/lpp/mmfs/samples/ilm > gcc mmfindUtil_processOutputFile.c -o mmfindUtil_processOutputFile > ./mmfind /gpfs/shared -ls -type f > find_ls_files.out > CONVERT to CSV > > POSIX > cat find_ls_files.out | awk '{print $5","}' > find_ls_files.out.csv > GPFS > cat find_ls_files.out | awk '{print $7","}' > find_ls_files.out.csv > LOAD in octave > > FILESIZE = int32 (dlmread ("find_ls_files.out.csv", ",")); > Clean the second column (OPTIONAL as the next clean up will do the same) > > FILESIZE(:,[2]) = []; > If we are on 4K aligment we need to clean the files that go to > inodes (WELL not exactly ... extended attributes! so maybe use a > lower number!) > > FILESIZE(FILESIZE<=3584) =[]; > If we are not we need to clean the 0 size files > > FILESIZE(FILESIZE==0) =[]; > Median > > FILESIZEMEDIAN = int32 (median (FILESIZE)) > Mean > > FILESIZEMEAN = int32 (mean (FILESIZE)) > Variance > > int32 (var (FILESIZE)) > iqr interquartile range, i.e., the difference between the upper and > lower quartile, of the input data. > > int32 (iqr (FILESIZE)) > Standard deviation > > > For some FS with lots of files you might need a rather powerful > machine to run the calculations on octave, I never hit anything > could not manage on a 64GB RAM Power box. Most of the times it is > enough with my laptop. > > > > -- > Yst?v?llisin terveisin / Kind regards / Saludos cordiales / Salutations > > Luis Bolinches > Lab Services > http://www-03.ibm.com/systems/services/labservices/ > > IBM Laajalahdentie 23 (main Entrance) Helsinki, 00330 Finland > Phone: +358 503112585 > > "If you continually give you will continually have." Anonymous > > > ----- Original message ----- > From: Stef Coene > > Sent by: gpfsug-discuss-bounces at spectrumscale.org > To: gpfsug main discussion list > > Cc: > Subject: Re: [gpfsug-discuss] Blocksize > Date: Thu, Sep 22, 2016 10:30 PM > > On 09/22/2016 09:07 PM, J. Eric Wonderley wrote: > > It defaults to 4k: > > mmlsfs testbs8M -i > > flag value description > > ------------------- ------------------------ > > ----------------------------------- > > -i 4096 Inode size in bytes > > > > I think you can make as small as 512b. Gpfs will store very small > > files in the inode. > > > > Typically you want your average file size to be your blocksize and your > > filesystem has one blocksize and one inodesize. > > The files are not small, but around 20 MB on average. > So I calculated with IBM that a 1 MB or 2 MB block size is best. > > But I'm not sure if it's better to use a smaller block size for the > metadata. 
> > The file system is not that large (400 TB) and will hold backup data > from CommVault. > > > Stef > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > > > Ellei edell? ole toisin mainittu: / Unless stated otherwise above: > Oy IBM Finland Ab > PL 265, 00101 Helsinki, Finland > Business ID, Y-tunnus: 0195876-3 > Registered in Finland > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss -------------- next part -------------- An HTML attachment was scrubbed... URL: From alandhae at gmx.de Wed Sep 28 10:13:55 2016 From: alandhae at gmx.de (=?ISO-8859-15?Q?Andreas_Landh=E4u=DFer?=) Date: Wed, 28 Sep 2016 11:13:55 +0200 (CEST) Subject: [gpfsug-discuss] Proposal for dynamic heat assisted file tiering Message-ID: On Tue, 27 Sep 2016, Eric Horst wrote: Thanks Eric for the hint, shouldn't we as the users define a requirement for such a dynamic heat assisted file tiering option (DHAFTO). Keeping track which files have increased heat and triggering a transparent move to a faster tier. Since I haven't tested it on a GPFS FS, I would like to know about the performance penalties being observed, when frequently running the policies, just a rough estimate. Of course its depending on the speed of the Metadata disks (yes, we use different devices for Metadata) we are also running GPFS on various GSS Systems. IBM might also want bundling this option together with GSS/ESS hardware for better performance. Just my 2? Andreas >>> >>> if a file gets hot again, there is no rule for putting the file back >>> into a faster storage device? >> >> >> The file will get moved when you run the policy again. You can run the >> policy as often as you like. > > I think its worth stating clearly that if a file is in the Thrifty > slow pool and a user opens and reads/writes the file there is nothing > that moves this file to a different tier. A policy run is the only > action that relocates files. 
So if you apply the policy daily and over > the course of the day users access many cold files, the performance > accessing those cold files may not be ideal until the next day when > they are repacked by heat. A file is not automatically moved to the > fast tier on access read or write. I mention this because this aspect > of tiering was not immediately clear from the docs when I was a > neophyte GPFS admin and I had to learn by observation. It is easy for > one to make an assumption that it is a more dynamic tiering system > than it is. -- Andreas Landh?u?er +49 151 12133027 (mobile) alandhae at gmx.de From Robert.Oesterlin at nuance.com Wed Sep 28 11:56:51 2016 From: Robert.Oesterlin at nuance.com (Oesterlin, Robert) Date: Wed, 28 Sep 2016 10:56:51 +0000 Subject: [gpfsug-discuss] Blocksize - file size distribution Message-ID: /usr/lpp/mmfs/samples/debugtools/filehist Look at the README in that directory. Bob Oesterlin Sr Storage Engineer, Nuance HPC Grid From: on behalf of "Greg.Lehmann at csiro.au" Reply-To: gpfsug main discussion list Date: Wednesday, September 28, 2016 at 2:40 AM To: "gpfsug-discuss at spectrumscale.org" Subject: [EXTERNAL] Re: [gpfsug-discuss] Blocksize I am wondering what people use to produce a file size distribution report for their filesystems. Has everyone rolled their own or is there some goto app to use. Cheers, Greg From: gpfsug-discuss-bounces at spectrumscale.org [mailto:gpfsug-discuss-bounces at spectrumscale.org] On Behalf Of Buterbaugh, Kevin L Sent: Wednesday, 28 September 2016 7:21 AM To: gpfsug main discussion list Subject: Re: [gpfsug-discuss] Blocksize Hi All, Again, my thanks to all who responded to my last post. Let me begin by stating something I unintentionally omitted in my last post ? we already use SSDs for our metadata. Which leads me to yet another question ? of my three filesystems, two (/home and /scratch) are much older (created in 2010) and therefore currently have a 512 byte inode size. /data is newer and has a 4K inode size. Now if I combine /scratch and /data into one filesystem with a 4K inode size, the amount of space used by all the inodes coming over from /scratch is going to grow by a factor of eight unless I?m horribly confused. And I would assume I need to count the amount of space taken up by allocated inodes, not just used inodes. Therefore ? how much space my metadata takes up just grew significantly in importance since: 1) metadata is on very expensive enterprise class, vendor certified SSDs, 2) we use RAID 1 mirrors of those SSDs, and 3) we have metadata replication set to two. Some of the information presented by Sven and Yuri seems to contradict each other in regards to how much space inodes take up ? or I?m misunderstanding one or both of them! Leaving aside replication, if I use a 256K block size for my metadata and I use 4K inodes, are those inodes going to take up 4K each or are they going to take up 8K each (1/32nd of a 256K block)? By the way, I do have a file size / file age spreadsheet for each of my filesystems (which I would be willing to share with interested parties) and while I was not surprised to learn that I have over 10 million sub-1K files on /home, I was a bit surprised to find that I have almost as many sub-1K files on /scratch (and a few million more on /data). So there?s a huge potential win in having those files in the inode on SSD as opposed to on spinning disk, but there?s also a huge potential $$$ cost. Thanks again ? I hope others are gaining useful information from this thread. I sure am! 
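For a rough feel for those numbers, here is a back-of-the-envelope sketch (the figures below are made-up placeholders and the flags in the comments are from memory -- substitute whatever your cluster actually reports; this also only covers the inode file itself, not indirect blocks, directories or the rest of the metadata):

  # hypothetical values -- replace with your own
  ALLOC_INODES=50000000      # allocated (not just used) inodes, e.g. from mmdf <fs> -F
  INODE_SIZE=4096            # bytes per inode, e.g. from mmlsfs <fs> -i
  META_REPLICAS=2            # metadata replication factor, e.g. from mmlsfs <fs> -m
  echo "inode file: $(( ALLOC_INODES * INODE_SIZE * META_REPLICAS / 1024 / 1024 / 1024 )) GiB"

With those invented numbers the inode file alone comes to roughly 381 GiB of metadata space before the RAID-1 mirroring underneath, which is why counting allocated rather than used inodes matters when pricing out the metadata disks.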
Kevin On Sep 27, 2016, at 1:26 PM, Yuri L Volobuev > wrote: > 1) Let?s assume that our overarching goal in configuring the block > size for metadata is performance from the user perspective ? i.e. > how fast is an ?ls -l? on my directory? Space savings aren?t > important, and how long policy scans or other ?administrative? type > tasks take is not nearly as important as that directory listing. > Does that change the recommended metadata block size? The performance challenges for the "ls -l" scenario are quite different from the policy scan scenario, so the same rules do not necessarily apply. During "ls -l" the code has to read inodes one by one (there's some prefetching going on, to take the edge off for the actual 'ls' thread, but prefetching is still one inode at a time). Metadata block size doesn't really come into the picture in this case, but inode size can be important -- depending on the storage performance characteristics. Does the storage you use for metadata exhibit a meaningfully different latency for 4K random reads vs 512 byte random reads? In my personal experience, on any modern storage device the difference is non-existent; in fact many devices (like all flash-based storage) use 4K native physical block size, and merely emulate 512 byte "sectors", so there's no way to read less than 4K. So from the inode read latency point of view 4K vs 512B is most likely a wash, but then 4K inodes can help improve performance of other operations, e.g. readdir of a small directory which fits entirely into the inode. If you use xattrs (e.g. as a side effect of using HSM), 4K inodes definitely help, but allowing xattrs to be stored in the inode. Policy scans reads inodes in full blocks, and there both metadata block size and inode size matter. Larger blocks could improve the inode read performance, while larger inodes mean that the same number of blocks hold fewer inodes and thus more blocks need to be read. So the policy inode scan performance is benefited by larger metadata block size and smaller inodes. However, policy scans also have to perform a directory traversal step, and that step tends to dominate the runtime of the overall run, and using larger inodes actually helps to speed up traversal of smaller directories. So whether larger inodes help or hurt the policy scan performance depends, yet again, on your file system composition. Overall, I believe that with all angles considered, larger inodes help with performance, and that was one of the considerations for making 4K inodes the default in V4.2+ versions. > 2) Let?s assume we have 3 filesystems, /home, /scratch (traditional > HPC use for those two) and /data (project space). Our storage > arrays are 24-bay units with two 8+2P RAID 6 LUNs, one RAID 1 > mirror, and two hot spare drives. The RAID 1 mirrors are for /home, > the RAID 6 LUNs are for /scratch or /data. /home has tons of small > files - so small that a 64K block size is currently used. /scratch > and /data have a mixture, but a 1 MB block size is the ?sweet spot? there. > > If you could ?start all over? with the same hardware being the only > restriction, would you: > > a) merge /scratch and /data into one filesystem but keep /home > separate since the LUN sizes are so very different, or > b) merge all three into one filesystem and use storage pools so that > /home is just a separate pool within the one filesystem? And if you > chose this option would you assign different block sizes to the pools? It's not possible to have different block sizes for different data pools. 
We are very aware that many people would like to be able to do just that, but this is counter to where the product is going. Supporting different block sizes for different pools is actually pretty hard: it's tricky to describe a large file that has some blocks in poolA and some in poolB where poolB has a different block size (perhaps during a migration) with the existing inode/indirect block format where each disk address pointer addresses a block of fixed size. With some effort, and some changes to how block addressing works, it would be possible to implement the support for this. However, as I mentioned in another post in this thread, we don't really want to glorify manual block size selection any further, we want to move away from it, by addressing the reasons that drive different block size selection today (like disk space utilization and performance). I'd recommend calculating a file size distribution histogram for your file systems. You may, for example, discover that 80% of the small files you have in /home would fit into 4K inodes, and then the storage space efficiency gains for the remaining 20% don't justify the complexity of managing an extra file system with a small block size. We don't recommend using block sizes smaller than 256K, because smaller block size is not good for disk space allocation code efficiency. It's a quadratic dependency: with a smaller block size, one block worth of the block allocation map covers that much less disk space, because each bit in the map covers fewer disk sectors, and fewer bits fit into a block. This means having to create a lot more block allocation map segments than what is needed for an ample level of parallelism. This hurts performance of many block allocation-related operations. I don't see a reason for /scratch and /data to be separate file systems, aside from perhaps failure domain considerations. yuri > On Sep 26, 2016, at 2:29 PM, Yuri L Volobuev > wrote: > > I would put the net summary this way: in GPFS, the "Goldilocks zone" > for metadata block size is 256K - 1M. If one plans to create a new > file system using GPFS V4.2+, 1M is a sound choice. > > In an ideal world, block size choice shouldn't really be a choice. > It's a low-level implementation detail that one day should go the > way of the manual ignition timing adjustment -- something that used > to be necessary in the olden days, and something that select > enthusiasts like to tweak to this day, but something that's > irrelevant for the overwhelming majority of the folks who just want > the engine to run. There's work being done in that general direction > in GPFS, but we aren't there yet. > > yuri > > Stephen Ulmer ---09/26/2016 12:02:25 PM---Now I?ve got > anther question? which I?ll let bake for a while. Okay, to (poorly) summarize: > > From: Stephen Ulmer > > To: gpfsug main discussion list >, > Date: 09/26/2016 12:02 PM > Subject: Re: [gpfsug-discuss] Blocksize > Sent by: gpfsug-discuss-bounces at spectrumscale.org > > > > Now I?ve got anther question? which I?ll let bake for a while. > > Okay, to (poorly) summarize: > There are items OTHER THAN INODES stored as metadata in GPFS. > These items have a VARIETY OF SIZES, but are packed in such a way > that we should just not worry about wasted space unless we pick a > LARGE metadata block size ? or if we don?t pick a ?reasonable? > metadata block size after picking a ?large? file system block size > that applies to both. 
> Performance is hard, and the gain from calculating exactly the best > metadata block size is much smaller than performance gains attained > through code optimization. > If we were to try and calculate the appropriate metadata block size > we would likely be wrong anyway, since none of us get our data at > the idealized physics shop that sells massless rulers and > frictionless pulleys. > We should probably all use a metadata block size around 1MB. Nobody > has said this outright, but it?s been the example as the ?good? size > at least three times in this thread. > Under no circumstances should we do what many of us would have done > and pick 128K, which made sense based on all of our previous > education that is no longer applicable. > > Did I miss anything? :) > > Liberty, > > -- > Stephen > > On Sep 26, 2016, at 2:18 PM, Yuri L Volobuev > wrote: > It's important to understand the differences between different > metadata types, in particular where it comes to space allocation. > > System metadata files (inode file, inode and block allocation maps, > ACL file, fileset metadata file, EA file in older versions) are > allocated at well-defined moments (file system format, new storage > pool creation in the case of block allocation map, etc), and those > contain multiple records packed into a single block. From the block > allocator point of view, the individual metadata record size is > invisible, only larger blocks get actually allocated, and space > usage efficiency generally isn't an issue. > > For user metadata (indirect blocks, directory blocks, EA overflow > blocks) the situation is different. Those get allocated as the need > arises, generally one at a time. So the size of an individual > metadata structure matters, a lot. The smallest unit of allocation > in GPFS is a subblock (1/32nd of a block). If an IB or a directory > block is smaller than a subblock, the unused space in the subblock > is wasted. So if one chooses to use, say, 16 MiB block size for > metadata, the smallest unit of space that can be allocated is 512 > KiB. If one chooses 1 MiB block size, the smallest allocation unit > is 32 KiB. IBs are generally 16 KiB or 32 KiB in size (32 KiB with > any reasonable data block size); directory blocks used to be limited > to 32 KiB, but in the current code can be as large as 256 KiB. As > one can observe, using 16 MiB metadata block size would lead to a > considerable amount of wasted space for IBs and large directories > (small directories can live in inodes). On the other hand, with 1 > MiB block size, there'll be no wasted metadata space. Does any of > this actually make a practical difference? That depends on the file > system composition, namely the number of IBs (which is a function of > the number of large files) and larger directories. Calculating this > scientifically can be pretty involved, and really should be the job > of a tool that ought to exist, but doesn't (yet). A more practical > approach is doing a ballpark estimate using local file counts and > typical fractions of large files and directories, using statistics > available from published papers. > > The performance implications of a given metadata block size choice > is a subject of nearly infinite depth, and this question ultimately > can only be answered by doing experiments with a specific workload > on specific hardware. The metadata space utilization efficiency is > something that can be answered conclusively though. 
> > yuri > > "Buterbaugh, Kevin L" ---09/24/2016 07:19:09 AM---Hi > Sven, I am confused by your statement that the metadata block size > should be 1 MB and am very int > > From: "Buterbaugh, Kevin L" > > To: gpfsug main discussion list >, > Date: 09/24/2016 07:19 AM > Subject: Re: [gpfsug-discuss] Blocksize > Sent by: gpfsug-discuss-bounces at spectrumscale.org > > > > > Hi Sven, > > I am confused by your statement that the metadata block size should > be 1 MB and am very interested in learning the rationale behind this > as I am currently looking at all aspects of our current GPFS > configuration and the possibility of making major changes. > > If you have a filesystem with only metadataOnly disks in the system > pool and the default size of an inode is 4K (which we would do, > since we have recently discovered that even on our scratch > filesystem we have a bazillion files that are 4K or smaller and > could therefore have their data stored in the inode, right?), then > why would you set the metadata block size to anything larger than > 128K when a sub-block is 1/32nd of a block? I.e., with a 1 MB block > size for metadata wouldn?t you be wasting a massive amount of space? > > What am I missing / confused about there? > > Oh, and here?s a related question ? let?s just say I have the above > configuration ? my system pool is metadata only and is on SSD?s. > Then I have two other dataOnly pools that are spinning disk. One is > for ?regular? access and the other is the ?capacity? pool ? i.e. a > pool of slower storage where we move files with large access times. > I have a policy that says something like ?move all files with an > access time > 6 months to the capacity pool.? Of those bazillion > files less than 4K in size that are fitting in the inode currently, > probably half a bazillion () of them would be subject to that > rule. Will they get moved to the spinning disk capacity pool or will > they stay in the inode?? > > Thanks! This is a very timely and interesting discussion for me as well... > > Kevin > On Sep 23, 2016, at 4:35 PM, Sven Oehme > wrote: > your metadata block size these days should be 1 MB and there are > only very few workloads for which you should run with a filesystem > blocksize below 1 MB. so if you don't know exactly what to pick, 1 > MB is a good starting point. > the general rule still applies that your filesystem blocksize > (metadata or data pool) should match your raid controller (or GNR > vdisk) stripe size of the particular pool. > > so if you use a 128k strip size(defaut in many midrange storage > controllers) in a 8+2p raid array, your stripe or track size is 1 MB > and therefore the blocksize of this pool should be 1 MB. i see many > customers in the field using 1MB or even smaller blocksize on RAID > stripes of 2 MB or above and your performance will be significant > impacted by that. 
> > Sven > > ------------------------------------------ > Sven Oehme > Scalable Storage Research > email: oehmes at us.ibm.com > Phone: +1 (408) 824-8904 > IBM Almaden Research Lab > ------------------------------------------ > > Stephen Ulmer ---09/23/2016 12:16:34 PM---Not to be too > pedantic, but I believe the the subblock size is 1/32 of the block > size (which strengt > > From: Stephen Ulmer > > To: gpfsug main discussion list > > Date: 09/23/2016 12:16 PM > Subject: Re: [gpfsug-discuss] Blocksize > Sent by: gpfsug-discuss-bounces at spectrumscale.org > > > > > Not to be too pedantic, but I believe the the subblock size is 1/32 > of the block size (which strengthens Luis?s arguments below). > > I thought the the original question was NOT about inode size, but > about metadata block size. You can specify that the system pool have > a different block size from the rest of the filesystem, providing > that it ONLY holds metadata (?metadata-block-size option to mmcrfs). > > So with 4K inodes (which should be used for all new filesystems > without some counter-indication), I would think that we?d want to > use a metadata block size of 4K*32=128K. This is independent of the > regular block size, which you can calculate based on the workload if > you?re lucky. > > There could be a great reason NOT to use 128K metadata block size, > but I don?t know what it is. I?d be happy to be corrected about this > if it?s out of whack. > > -- > Stephen > On Sep 22, 2016, at 3:37 PM, Luis Bolinches > wrote: > > Hi > > My 2 cents. > > Leave at least 4K inodes, then you get massive improvement on small > files (less 3.5K minus whatever you use on xattr) > > About blocksize for data, unless you have actual data that suggest > that you will actually benefit from smaller than 1MB block, leave > there. GPFS uses sublocks where 1/16th of the BS can be allocated to > different files, so the "waste" is much less than you think on 1MB > and you get the throughput and less structures of much more data blocks. > > No warranty at all but I try to do this when the BS talk comes in: > (might need some clean up it could not be last note but you get the idea) > > POSIX > find . -type f -name '*' -exec ls -l {} \; > find_ls_files.out > GPFS > cd /usr/lpp/mmfs/samples/ilm > gcc mmfindUtil_processOutputFile.c -o mmfindUtil_processOutputFile > ./mmfind /gpfs/shared -ls -type f > find_ls_files.out > CONVERT to CSV > > POSIX > cat find_ls_files.out | awk '{print $5","}' > find_ls_files.out.csv > GPFS > cat find_ls_files.out | awk '{print $7","}' > find_ls_files.out.csv > LOAD in octave > > FILESIZE = int32 (dlmread ("find_ls_files.out.csv", ",")); > Clean the second column (OPTIONAL as the next clean up will do the same) > > FILESIZE(:,[2]) = []; > If we are on 4K aligment we need to clean the files that go to > inodes (WELL not exactly ... extended attributes! so maybe use a > lower number!) > > FILESIZE(FILESIZE<=3584) =[]; > If we are not we need to clean the 0 size files > > FILESIZE(FILESIZE==0) =[]; > Median > > FILESIZEMEDIAN = int32 (median (FILESIZE)) > Mean > > FILESIZEMEAN = int32 (mean (FILESIZE)) > Variance > > int32 (var (FILESIZE)) > iqr interquartile range, i.e., the difference between the upper and > lower quartile, of the input data. > > int32 (iqr (FILESIZE)) > Standard deviation > > > For some FS with lots of files you might need a rather powerful > machine to run the calculations on octave, I never hit anything > could not manage on a 64GB RAM Power box. Most of the times it is > enough with my laptop. 
> > > > -- > Yst?v?llisin terveisin / Kind regards / Saludos cordiales / Salutations > > Luis Bolinches > Lab Services > http://www-03.ibm.com/systems/services/labservices/ > > IBM Laajalahdentie 23 (main Entrance) Helsinki, 00330 Finland > Phone: +358 503112585 > > "If you continually give you will continually have." Anonymous > > > ----- Original message ----- > From: Stef Coene > > Sent by: gpfsug-discuss-bounces at spectrumscale.org > To: gpfsug main discussion list > > Cc: > Subject: Re: [gpfsug-discuss] Blocksize > Date: Thu, Sep 22, 2016 10:30 PM > > On 09/22/2016 09:07 PM, J. Eric Wonderley wrote: > > It defaults to 4k: > > mmlsfs testbs8M -i > > flag value description > > ------------------- ------------------------ > > ----------------------------------- > > -i 4096 Inode size in bytes > > > > I think you can make as small as 512b. Gpfs will store very small > > files in the inode. > > > > Typically you want your average file size to be your blocksize and your > > filesystem has one blocksize and one inodesize. > > The files are not small, but around 20 MB on average. > So I calculated with IBM that a 1 MB or 2 MB block size is best. > > But I'm not sure if it's better to use a smaller block size for the > metadata. > > The file system is not that large (400 TB) and will hold backup data > from CommVault. > > > Stef > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > > > Ellei edell? ole toisin mainittu: / Unless stated otherwise above: > Oy IBM Finland Ab > PL 265, 00101 Helsinki, Finland > Business ID, Y-tunnus: 0195876-3 > Registered in Finland > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From Kevin.Buterbaugh at Vanderbilt.Edu Wed Sep 28 14:45:14 2016 From: Kevin.Buterbaugh at Vanderbilt.Edu (Buterbaugh, Kevin L) Date: Wed, 28 Sep 2016 13:45:14 +0000 Subject: [gpfsug-discuss] Blocksize - file size distribution In-Reply-To: References: Message-ID: Greg, Not saying this is the right way to go, but I rolled my own. I wrote a very simple Perl script that essentially does the Perl equivalent of a find on my GPFS filesystems, then stat?s files and directories and writes the output to a text file. I run that one overnight or on the weekends. Takes about 6 hours to run across our 3 GPFS filesystems with metadata on SSDs. Then I?ve written a couple of different Perl scripts to analyze that data in different ways (the idea being that collecting the data is ?expensive? ? but once you?ve got it it?s cheap to analyze it in different ways). But the one I?ve been using for this project just breaks down the number of files and directories by size and age and produces a table. Rather than try to describe this, here?s sample output: For input file: gpfsFileInfo_20160915.txt <1 day | <1 wk | <1 mo | <2 mo | <3 mo | <4 mo | <5 mo | <6 mo | <1 yr | >1 year | Total Files <1 KB 29538 111364 458260 634398 150199 305715 4388443 93733 966618 3499535 10637803 <2 KB 9875 20580 119414 295167 35961 67761 80462 33688 269595 851641 1784144 <4 KB 9212 45282 168678 496796 27771 23887 105135 23161 259242 1163327 2322491 <8 KB 4992 29284 105836 303349 28341 20346 246430 28394 216061 1148459 2131492 <16 KB 3987 18391 92492 218639 20513 19698 675097 30976 190691 851533 2122017 <32 KB 4844 12479 50235 265222 24830 18789 1058433 18030 196729 1066287 2715878 <64 KB 6358 24259 29474 222134 17493 10744 1381445 11358 240528 1123540 3067333 <128 KB 6531 59107 206269 186213 71823 114235 1008724 36722 186357 845921 2721902 <256 KB 1995 17638 19355 436611 8505 7554 3582738 7519 249510 744885 5076310 <512 KB 20645 12401 24700 111463 5659 22132 1121269 10774 273010 725155 2327208 <1 MB 2681 6482 37447 58459 6998 14945 305108 5857 160360 386152 984489 <4 MB 4554 84551 23320 100407 6818 32833 129758 22774 210935 528458 1144408 <1 GB 56652 33538 99667 87778 24313 68372 118928 42554 251528 916493 1699823 <10 GB 1245 2482 4524 3184 1043 1794 2733 1694 8731 20462 47892 <100 GB 47 230 470 265 92 198 172 122 1276 2061 4933 >100 GB 2 3 12 1 14 4 5 1 37 165 244 Total TB: 6.49 13.22 30.56 18.00 10.22 15.69 19.87 12.48 73.47 187.44 Grand Total: 387.46 TB Everything other than the total space lines at the bottom are counts of number of files meeting that criteria. I?ve got another variation on the same script that we used when we were trying to determine how many old files we have and therefore how much data was an excellent candidate for moving to a slower, less expensive ?capacity? pool. I?m not sure how useful my tools would be to others ? I?m certainly not a professional programmer by any stretch of the imagination (and yes, I do hear those of you who are saying, ?Yeah, he?s barely a professional SysAdmin!? ). But others of you have been so helpful to me ? I?d like to try in some small way to help someone else. Kevin On Sep 28, 2016, at 5:56 AM, Oesterlin, Robert > wrote: /usr/lpp/mmfs/samples/debugtools/filehist Look at the README in that directory. 
Bob Oesterlin Sr Storage Engineer, Nuance HPC Grid From: > on behalf of "Greg.Lehmann at csiro.au" > Reply-To: gpfsug main discussion list > Date: Wednesday, September 28, 2016 at 2:40 AM To: "gpfsug-discuss at spectrumscale.org" > Subject: [EXTERNAL] Re: [gpfsug-discuss] Blocksize I am wondering what people use to produce a file size distribution report for their filesystems. Has everyone rolled their own or is there some goto app to use. Cheers, Greg From: gpfsug-discuss-bounces at spectrumscale.org [mailto:gpfsug-discuss-bounces at spectrumscale.org] On Behalf Of Buterbaugh, Kevin L Sent: Wednesday, 28 September 2016 7:21 AM To: gpfsug main discussion list > Subject: Re: [gpfsug-discuss] Blocksize Hi All, Again, my thanks to all who responded to my last post. Let me begin by stating something I unintentionally omitted in my last post ? we already use SSDs for our metadata. Which leads me to yet another question ? of my three filesystems, two (/home and /scratch) are much older (created in 2010) and therefore currently have a 512 byte inode size. /data is newer and has a 4K inode size. Now if I combine /scratch and /data into one filesystem with a 4K inode size, the amount of space used by all the inodes coming over from /scratch is going to grow by a factor of eight unless I?m horribly confused. And I would assume I need to count the amount of space taken up by allocated inodes, not just used inodes. Therefore ? how much space my metadata takes up just grew significantly in importance since: 1) metadata is on very expensive enterprise class, vendor certified SSDs, 2) we use RAID 1 mirrors of those SSDs, and 3) we have metadata replication set to two. Some of the information presented by Sven and Yuri seems to contradict each other in regards to how much space inodes take up ? or I?m misunderstanding one or both of them! Leaving aside replication, if I use a 256K block size for my metadata and I use 4K inodes, are those inodes going to take up 4K each or are they going to take up 8K each (1/32nd of a 256K block)? By the way, I do have a file size / file age spreadsheet for each of my filesystems (which I would be willing to share with interested parties) and while I was not surprised to learn that I have over 10 million sub-1K files on /home, I was a bit surprised to find that I have almost as many sub-1K files on /scratch (and a few million more on /data). So there?s a huge potential win in having those files in the inode on SSD as opposed to on spinning disk, but there?s also a huge potential $$$ cost. Thanks again ? I hope others are gaining useful information from this thread. I sure am! Kevin On Sep 27, 2016, at 1:26 PM, Yuri L Volobuev > wrote: > 1) Let?s assume that our overarching goal in configuring the block > size for metadata is performance from the user perspective ? i.e. > how fast is an ?ls -l? on my directory? Space savings aren?t > important, and how long policy scans or other ?administrative? type > tasks take is not nearly as important as that directory listing. > Does that change the recommended metadata block size? The performance challenges for the "ls -l" scenario are quite different from the policy scan scenario, so the same rules do not necessarily apply. During "ls -l" the code has to read inodes one by one (there's some prefetching going on, to take the edge off for the actual 'ls' thread, but prefetching is still one inode at a time). 
Metadata block size doesn't really come into the picture in this case, but inode size can be important -- depending on the storage performance characteristics. Does the storage you use for metadata exhibit a meaningfully different latency for 4K random reads vs 512 byte random reads? In my personal experience, on any modern storage device the difference is non-existent; in fact many devices (like all flash-based storage) use 4K native physical block size, and merely emulate 512 byte "sectors", so there's no way to read less than 4K. So from the inode read latency point of view 4K vs 512B is most likely a wash, but then 4K inodes can help improve performance of other operations, e.g. readdir of a small directory which fits entirely into the inode. If you use xattrs (e.g. as a side effect of using HSM), 4K inodes definitely help, but allowing xattrs to be stored in the inode. Policy scans reads inodes in full blocks, and there both metadata block size and inode size matter. Larger blocks could improve the inode read performance, while larger inodes mean that the same number of blocks hold fewer inodes and thus more blocks need to be read. So the policy inode scan performance is benefited by larger metadata block size and smaller inodes. However, policy scans also have to perform a directory traversal step, and that step tends to dominate the runtime of the overall run, and using larger inodes actually helps to speed up traversal of smaller directories. So whether larger inodes help or hurt the policy scan performance depends, yet again, on your file system composition. Overall, I believe that with all angles considered, larger inodes help with performance, and that was one of the considerations for making 4K inodes the default in V4.2+ versions. > 2) Let?s assume we have 3 filesystems, /home, /scratch (traditional > HPC use for those two) and /data (project space). Our storage > arrays are 24-bay units with two 8+2P RAID 6 LUNs, one RAID 1 > mirror, and two hot spare drives. The RAID 1 mirrors are for /home, > the RAID 6 LUNs are for /scratch or /data. /home has tons of small > files - so small that a 64K block size is currently used. /scratch > and /data have a mixture, but a 1 MB block size is the ?sweet spot? there. > > If you could ?start all over? with the same hardware being the only > restriction, would you: > > a) merge /scratch and /data into one filesystem but keep /home > separate since the LUN sizes are so very different, or > b) merge all three into one filesystem and use storage pools so that > /home is just a separate pool within the one filesystem? And if you > chose this option would you assign different block sizes to the pools? It's not possible to have different block sizes for different data pools. We are very aware that many people would like to be able to do just that, but this is counter to where the product is going. Supporting different block sizes for different pools is actually pretty hard: it's tricky to describe a large file that has some blocks in poolA and some in poolB where poolB has a different block size (perhaps during a migration) with the existing inode/indirect block format where each disk address pointer addresses a block of fixed size. With some effort, and some changes to how block addressing works, it would be possible to implement the support for this. 
However, as I mentioned in another post in this thread, we don't really want to glorify manual block size selection any further, we want to move away from it, by addressing the reasons that drive different block size selection today (like disk space utilization and performance). I'd recommend calculating a file size distribution histogram for your file systems. You may, for example, discover that 80% of the small files you have in /home would fit into 4K inodes, and then the storage space efficiency gains for the remaining 20% don't justify the complexity of managing an extra file system with a small block size. We don't recommend using block sizes smaller than 256K, because smaller block size is not good for disk space allocation code efficiency. It's a quadratic dependency: with a smaller block size, one block worth of the block allocation map covers that much less disk space, because each bit in the map covers fewer disk sectors, and fewer bits fit into a block. This means having to create a lot more block allocation map segments than what is needed for an ample level of parallelism. This hurts performance of many block allocation-related operations. I don't see a reason for /scratch and /data to be separate file systems, aside from perhaps failure domain considerations. yuri > On Sep 26, 2016, at 2:29 PM, Yuri L Volobuev > wrote: > > I would put the net summary this way: in GPFS, the "Goldilocks zone" > for metadata block size is 256K - 1M. If one plans to create a new > file system using GPFS V4.2+, 1M is a sound choice. > > In an ideal world, block size choice shouldn't really be a choice. > It's a low-level implementation detail that one day should go the > way of the manual ignition timing adjustment -- something that used > to be necessary in the olden days, and something that select > enthusiasts like to tweak to this day, but something that's > irrelevant for the overwhelming majority of the folks who just want > the engine to run. There's work being done in that general direction > in GPFS, but we aren't there yet. > > yuri > > Stephen Ulmer ---09/26/2016 12:02:25 PM---Now I?ve got > anther question? which I?ll let bake for a while. Okay, to (poorly) summarize: > > From: Stephen Ulmer > > To: gpfsug main discussion list >, > Date: 09/26/2016 12:02 PM > Subject: Re: [gpfsug-discuss] Blocksize > Sent by: gpfsug-discuss-bounces at spectrumscale.org > > > > Now I?ve got anther question? which I?ll let bake for a while. > > Okay, to (poorly) summarize: > There are items OTHER THAN INODES stored as metadata in GPFS. > These items have a VARIETY OF SIZES, but are packed in such a way > that we should just not worry about wasted space unless we pick a > LARGE metadata block size ? or if we don?t pick a ?reasonable? > metadata block size after picking a ?large? file system block size > that applies to both. > Performance is hard, and the gain from calculating exactly the best > metadata block size is much smaller than performance gains attained > through code optimization. > If we were to try and calculate the appropriate metadata block size > we would likely be wrong anyway, since none of us get our data at > the idealized physics shop that sells massless rulers and > frictionless pulleys. > We should probably all use a metadata block size around 1MB. Nobody > has said this outright, but it?s been the example as the ?good? size > at least three times in this thread. 
> Under no circumstances should we do what many of us would have done > and pick 128K, which made sense based on all of our previous > education that is no longer applicable. > > Did I miss anything? :) > > Liberty, > > -- > Stephen > > On Sep 26, 2016, at 2:18 PM, Yuri L Volobuev > wrote: > It's important to understand the differences between different > metadata types, in particular where it comes to space allocation. > > System metadata files (inode file, inode and block allocation maps, > ACL file, fileset metadata file, EA file in older versions) are > allocated at well-defined moments (file system format, new storage > pool creation in the case of block allocation map, etc), and those > contain multiple records packed into a single block. From the block > allocator point of view, the individual metadata record size is > invisible, only larger blocks get actually allocated, and space > usage efficiency generally isn't an issue. > > For user metadata (indirect blocks, directory blocks, EA overflow > blocks) the situation is different. Those get allocated as the need > arises, generally one at a time. So the size of an individual > metadata structure matters, a lot. The smallest unit of allocation > in GPFS is a subblock (1/32nd of a block). If an IB or a directory > block is smaller than a subblock, the unused space in the subblock > is wasted. So if one chooses to use, say, 16 MiB block size for > metadata, the smallest unit of space that can be allocated is 512 > KiB. If one chooses 1 MiB block size, the smallest allocation unit > is 32 KiB. IBs are generally 16 KiB or 32 KiB in size (32 KiB with > any reasonable data block size); directory blocks used to be limited > to 32 KiB, but in the current code can be as large as 256 KiB. As > one can observe, using 16 MiB metadata block size would lead to a > considerable amount of wasted space for IBs and large directories > (small directories can live in inodes). On the other hand, with 1 > MiB block size, there'll be no wasted metadata space. Does any of > this actually make a practical difference? That depends on the file > system composition, namely the number of IBs (which is a function of > the number of large files) and larger directories. Calculating this > scientifically can be pretty involved, and really should be the job > of a tool that ought to exist, but doesn't (yet). A more practical > approach is doing a ballpark estimate using local file counts and > typical fractions of large files and directories, using statistics > available from published papers. > > The performance implications of a given metadata block size choice > is a subject of nearly infinite depth, and this question ultimately > can only be answered by doing experiments with a specific workload > on specific hardware. The metadata space utilization efficiency is > something that can be answered conclusively though. > > yuri > > "Buterbaugh, Kevin L" ---09/24/2016 07:19:09 AM---Hi > Sven, I am confused by your statement that the metadata block size > should be 1 MB and am very int > > From: "Buterbaugh, Kevin L" > > To: gpfsug main discussion list >, > Date: 09/24/2016 07:19 AM > Subject: Re: [gpfsug-discuss] Blocksize > Sent by: gpfsug-discuss-bounces at spectrumscale.org > > > > > Hi Sven, > > I am confused by your statement that the metadata block size should > be 1 MB and am very interested in learning the rationale behind this > as I am currently looking at all aspects of our current GPFS > configuration and the possibility of making major changes. 
> > If you have a filesystem with only metadataOnly disks in the system > pool and the default size of an inode is 4K (which we would do, > since we have recently discovered that even on our scratch > filesystem we have a bazillion files that are 4K or smaller and > could therefore have their data stored in the inode, right?), then > why would you set the metadata block size to anything larger than > 128K when a sub-block is 1/32nd of a block? I.e., with a 1 MB block > size for metadata wouldn?t you be wasting a massive amount of space? > > What am I missing / confused about there? > > Oh, and here?s a related question ? let?s just say I have the above > configuration ? my system pool is metadata only and is on SSD?s. > Then I have two other dataOnly pools that are spinning disk. One is > for ?regular? access and the other is the ?capacity? pool ? i.e. a > pool of slower storage where we move files with large access times. > I have a policy that says something like ?move all files with an > access time > 6 months to the capacity pool.? Of those bazillion > files less than 4K in size that are fitting in the inode currently, > probably half a bazillion () of them would be subject to that > rule. Will they get moved to the spinning disk capacity pool or will > they stay in the inode?? > > Thanks! This is a very timely and interesting discussion for me as well... > > Kevin > On Sep 23, 2016, at 4:35 PM, Sven Oehme > wrote: > your metadata block size these days should be 1 MB and there are > only very few workloads for which you should run with a filesystem > blocksize below 1 MB. so if you don't know exactly what to pick, 1 > MB is a good starting point. > the general rule still applies that your filesystem blocksize > (metadata or data pool) should match your raid controller (or GNR > vdisk) stripe size of the particular pool. > > so if you use a 128k strip size(defaut in many midrange storage > controllers) in a 8+2p raid array, your stripe or track size is 1 MB > and therefore the blocksize of this pool should be 1 MB. i see many > customers in the field using 1MB or even smaller blocksize on RAID > stripes of 2 MB or above and your performance will be significant > impacted by that. > > Sven > > ------------------------------------------ > Sven Oehme > Scalable Storage Research > email: oehmes at us.ibm.com > Phone: +1 (408) 824-8904 > IBM Almaden Research Lab > ------------------------------------------ > > Stephen Ulmer ---09/23/2016 12:16:34 PM---Not to be too > pedantic, but I believe the the subblock size is 1/32 of the block > size (which strengt > > From: Stephen Ulmer > > To: gpfsug main discussion list > > Date: 09/23/2016 12:16 PM > Subject: Re: [gpfsug-discuss] Blocksize > Sent by: gpfsug-discuss-bounces at spectrumscale.org > > > > > Not to be too pedantic, but I believe the the subblock size is 1/32 > of the block size (which strengthens Luis?s arguments below). > > I thought the the original question was NOT about inode size, but > about metadata block size. You can specify that the system pool have > a different block size from the rest of the filesystem, providing > that it ONLY holds metadata (?metadata-block-size option to mmcrfs). > > So with 4K inodes (which should be used for all new filesystems > without some counter-indication), I would think that we?d want to > use a metadata block size of 4K*32=128K. This is independent of the > regular block size, which you can calculate based on the workload if > you?re lucky. 
> > There could be a great reason NOT to use 128K metadata block size, > but I don?t know what it is. I?d be happy to be corrected about this > if it?s out of whack. > > -- > Stephen > On Sep 22, 2016, at 3:37 PM, Luis Bolinches > wrote: > > Hi > > My 2 cents. > > Leave at least 4K inodes, then you get massive improvement on small > files (less 3.5K minus whatever you use on xattr) > > About blocksize for data, unless you have actual data that suggest > that you will actually benefit from smaller than 1MB block, leave > there. GPFS uses sublocks where 1/16th of the BS can be allocated to > different files, so the "waste" is much less than you think on 1MB > and you get the throughput and less structures of much more data blocks. > > No warranty at all but I try to do this when the BS talk comes in: > (might need some clean up it could not be last note but you get the idea) > > POSIX > find . -type f -name '*' -exec ls -l {} \; > find_ls_files.out > GPFS > cd /usr/lpp/mmfs/samples/ilm > gcc mmfindUtil_processOutputFile.c -o mmfindUtil_processOutputFile > ./mmfind /gpfs/shared -ls -type f > find_ls_files.out > CONVERT to CSV > > POSIX > cat find_ls_files.out | awk '{print $5","}' > find_ls_files.out.csv > GPFS > cat find_ls_files.out | awk '{print $7","}' > find_ls_files.out.csv > LOAD in octave > > FILESIZE = int32 (dlmread ("find_ls_files.out.csv", ",")); > Clean the second column (OPTIONAL as the next clean up will do the same) > > FILESIZE(:,[2]) = []; > If we are on 4K aligment we need to clean the files that go to > inodes (WELL not exactly ... extended attributes! so maybe use a > lower number!) > > FILESIZE(FILESIZE<=3584) =[]; > If we are not we need to clean the 0 size files > > FILESIZE(FILESIZE==0) =[]; > Median > > FILESIZEMEDIAN = int32 (median (FILESIZE)) > Mean > > FILESIZEMEAN = int32 (mean (FILESIZE)) > Variance > > int32 (var (FILESIZE)) > iqr interquartile range, i.e., the difference between the upper and > lower quartile, of the input data. > > int32 (iqr (FILESIZE)) > Standard deviation > > > For some FS with lots of files you might need a rather powerful > machine to run the calculations on octave, I never hit anything > could not manage on a 64GB RAM Power box. Most of the times it is > enough with my laptop. > > > > -- > Yst?v?llisin terveisin / Kind regards / Saludos cordiales / Salutations > > Luis Bolinches > Lab Services > http://www-03.ibm.com/systems/services/labservices/ > > IBM Laajalahdentie 23 (main Entrance) Helsinki, 00330 Finland > Phone: +358 503112585 > > "If you continually give you will continually have." Anonymous > > > ----- Original message ----- > From: Stef Coene > > Sent by: gpfsug-discuss-bounces at spectrumscale.org > To: gpfsug main discussion list > > Cc: > Subject: Re: [gpfsug-discuss] Blocksize > Date: Thu, Sep 22, 2016 10:30 PM > > On 09/22/2016 09:07 PM, J. Eric Wonderley wrote: > > It defaults to 4k: > > mmlsfs testbs8M -i > > flag value description > > ------------------- ------------------------ > > ----------------------------------- > > -i 4096 Inode size in bytes > > > > I think you can make as small as 512b. Gpfs will store very small > > files in the inode. > > > > Typically you want your average file size to be your blocksize and your > > filesystem has one blocksize and one inodesize. > > The files are not small, but around 20 MB on average. > So I calculated with IBM that a 1 MB or 2 MB block size is best. > > But I'm not sure if it's better to use a smaller block size for the > metadata. 
> > The file system is not that large (400 TB) and will hold backup data > from CommVault. > > > Stef > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > > > Ellei edell? ole toisin mainittu: / Unless stated otherwise above: > Oy IBM Finland Ab > PL 265, 00101 Helsinki, Finland > Business ID, Y-tunnus: 0195876-3 > Registered in Finland > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss -------------- next part -------------- An HTML attachment was scrubbed... URL: From makaplan at us.ibm.com Wed Sep 28 15:34:05 2016 From: makaplan at us.ibm.com (Marc A Kaplan) Date: Wed, 28 Sep 2016 10:34:05 -0400 Subject: [gpfsug-discuss] Blocksize - file size distribution In-Reply-To: References: Message-ID: Consider using samples/ilm/mmfind (or mmapplypolicy with a LIST ... SHOW rule) to gather the stats much faster. Should be minutes, not hours. -------------- next part -------------- An HTML attachment was scrubbed... URL: From makaplan at us.ibm.com Wed Sep 28 16:23:12 2016 From: makaplan at us.ibm.com (Marc A Kaplan) Date: Wed, 28 Sep 2016 11:23:12 -0400 Subject: [gpfsug-discuss] Blocksize In-Reply-To: <4BDC0E5A-176A-48CC-8DE6-93C7B5A3F138@vanderbilt.edu> References: <17781503-26B3-4448-B7B9-1EE27ABE6D1F@ulmer.org><13272A1B-425D-4DDD-A931-490604F92D61@ulmer.org><6C51B04A-9097-4598-8B4A-C484A3D98EE2@vanderbilt.edu> <4BDC0E5A-176A-48CC-8DE6-93C7B5A3F138@vanderbilt.edu> Message-ID: OKAY, I'll say it again. inodes are PACKED into a single inode file. So a 4KB inode takes 4KB, REGARDLESS of metadata blocksize. There is no wasted space. (Of course if you have metadata replication = 2, then yes, double that. And yes, there overhead for indirect blocks (indices), allocation maps, etc, etc.) And your choice is not just 512 or 4096. 
Maybe 1KB or 2KB is a good choice for your data distribution, to optimize packing of data and/or directories into inodes... Hmmm... I don't know why the doc leaves out 2048, perhaps a typo... mmcrfs x2K -i 2048 [root at n2 charts]# mmlsfs x2K -i flag value description ------------------- ------------------------ ----------------------------------- -i 2048 Inode size in bytes Works for me! -------------- next part -------------- An HTML attachment was scrubbed... URL: From makaplan at us.ibm.com Wed Sep 28 16:33:29 2016 From: makaplan at us.ibm.com (Marc A Kaplan) Date: Wed, 28 Sep 2016 11:33:29 -0400 Subject: [gpfsug-discuss] Proposal for dynamic heat assisted file tiering In-Reply-To: References: Message-ID: Suppose, we could "dynamically" change the pool assignment of a file. How/when would you have us do that? When will that generate unnecessary, "wasteful" IOPs? How do we know if/when/how often you will access a file in the future? This is similar to other classical caching policies, but there the choice is usually just which pages to flush from the cache when we need space ... The usual compromise is "LRU" but maybe some systems allow hints. When there are multiple pools, it seems more complicated, more degrees of freedom ... Would you be willing and able to write some new policy rules to provide directions to Spectrum Scale for dynamic tiering? What would that look like? Would it be worth the time and effort over what we have now? -------------- next part -------------- An HTML attachment was scrubbed... URL: From Robert.Oesterlin at nuance.com Wed Sep 28 19:13:35 2016 From: Robert.Oesterlin at nuance.com (Oesterlin, Robert) Date: Wed, 28 Sep 2016 18:13:35 +0000 Subject: [gpfsug-discuss] Biggest file that will fit inside an inode? Message-ID: <0D55AB1D-DB9D-45CF-AB27-157CDA1172D9@nuance.com> What the largest file that will fit inside a 1K, 2K, or 4K inode? Bob Oesterlin Sr Storage Engineer, Nuance HPC Grid -------------- next part -------------- An HTML attachment was scrubbed... URL: From ewahl at osc.edu Wed Sep 28 21:18:55 2016 From: ewahl at osc.edu (Edward Wahl) Date: Wed, 28 Sep 2016 16:18:55 -0400 Subject: [gpfsug-discuss] Blocksize - file size distribution In-Reply-To: References: Message-ID: <20160928161855.1df32434@osc.edu> On Wed, 28 Sep 2016 10:34:05 -0400 Marc A Kaplan wrote: > Consider using samples/ilm/mmfind (or mmapplypolicy with a LIST ... > SHOW rule) to gather the stats much faster. Should be minutes, not > hours. > I'll agree with the policy engine. Runs like a beast if you tune it a little for nodes and threads. Only takes a couple of minutes to collect info on over a hundred million files. Show where the data is now by pool and sort it by age with queries? quick hack up example. you could sort the mess on the front end fairly quickly. (use fileset or pool, etc as your storage needs) RULE '2yrold_files' LIST '2yrold_filelist.txt' SHOW (varchar(file_size) || ' ' || varchar(USER_ID) || ' ' || varchar(POOL_NAME)) WHERE DAYS(CURRENT_TIMESTAMP) - DAYS(ACCESS_TIME) >= 730 AND DAYS(CURRENT_TIMESTAMP) - DAYS(ACCESS_TIME) < 1095 don't forget to run the engine with the -I defer for this kind of list/show policy. 
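For illustration, a run of a list/show rule like the one above, tuned for a bit of parallelism, could look roughly like this (policy file name, node names, thread count and the work/output directories are placeholders only):

mmapplypolicy gpfs0 -P filelist.pol -I defer -f /gpfs/gpfs0/tmp/policylists -N node01,node02,node03 -m 8 -g /gpfs/gpfs0/tmp/policywork -L 1

-N and -m spread the directory walk and inode scan over helper nodes and threads, -g points the shared work directory somewhere every helper node can reach, and -f sets the prefix for the generated list files.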
Ed -- Ed Wahl Ohio Supercomputer Center 614-292-9302 From christof.schmitt at us.ibm.com Wed Sep 28 21:33:45 2016 From: christof.schmitt at us.ibm.com (Christof Schmitt) Date: Wed, 28 Sep 2016 13:33:45 -0700 Subject: [gpfsug-discuss] Samba via CES In-Reply-To: <20160927214257.z4ezmssnpwhmm4rk@ics.muni.cz> References: <20160927193815.kpppiy76wpudg6cj@ics.muni.cz> <20160927214257.z4ezmssnpwhmm4rk@ics.muni.cz> Message-ID: The client has to reconnect, open the file again and reissue request that have not been completed. Without persistent handles, the main risk is that another client can step in and access the same file in the meantime. With persistent handles, access from other clients would be prevented for a defined amount of time. Christof Schmitt || IBM || Spectrum Scale Development || Tucson, AZ christof.schmitt at us.ibm.com || +1-520-799-2469 (T/L: 321-2469) From: Lukas Hejtmanek To: gpfsug main discussion list Date: 09/27/2016 02:43 PM Subject: Re: [gpfsug-discuss] Samba via CES Sent by: gpfsug-discuss-bounces at spectrumscale.org On Tue, Sep 27, 2016 at 02:36:37PM -0700, Christof Schmitt wrote: > When a CES node fails, protocol clients have to reconnect to one of the > remaining nodes. > > Samba in CES does not support persistent handles. This is indicated in the > documentation: > > http://www.ibm.com/support/knowledgecenter/en/STXKQY_4.2.1/com.ibm.spectrum.scale.v4r21.doc/bl1adm_smbexportlimits.htm#bl1adm_smbexportlimits > > "Only mandatory SMB3 protocol features are supported. " well, but in this case, HA feature is a bit pointless as node fail results in a client failure as well as reconnect does not seem to be automatic if there is on going traffic.. more precisely reconnect is automatic but without persistent handles, the client receives write protect error immediately. -- Luk?? Hejtm?nek _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss From bbanister at jumptrading.com Wed Sep 28 21:56:47 2016 From: bbanister at jumptrading.com (Bryan Banister) Date: Wed, 28 Sep 2016 20:56:47 +0000 Subject: [gpfsug-discuss] Biggest file that will fit inside an inode? In-Reply-To: <0D55AB1D-DB9D-45CF-AB27-157CDA1172D9@nuance.com> References: <0D55AB1D-DB9D-45CF-AB27-157CDA1172D9@nuance.com> Message-ID: <21BC488F0AEA2245B2C3E83FC0B33DBB0633CA80@CHI-EXCHANGEW1.w2k.jumptrading.com> I think the guideline for 4K inodes is roughly 3.5KB depending on use of extended attributes, -Bryan From: gpfsug-discuss-bounces at spectrumscale.org [mailto:gpfsug-discuss-bounces at spectrumscale.org] On Behalf Of Oesterlin, Robert Sent: Wednesday, September 28, 2016 1:14 PM To: gpfsug main discussion list Subject: [gpfsug-discuss] Biggest file that will fit inside an inode? What the largest file that will fit inside a 1K, 2K, or 4K inode? Bob Oesterlin Sr Storage Engineer, Nuance HPC Grid ________________________________ Note: This email is for the confidential use of the named addressee(s) only and may contain proprietary, confidential or privileged information. If you are not the intended recipient, you are hereby notified that any review, dissemination or copying of this email is strictly prohibited, and to please notify the sender immediately and destroy this email and any attachments. Email transmission cannot be guaranteed to be secure or error-free. The Company, therefore, does not make any guarantees as to the completeness or accuracy of this email or any attachments. 
This email is for informational purposes only and does not constitute a recommendation, offer, request or solicitation of any kind to buy, sell, subscribe, redeem or perform any type of transaction of a financial product. -------------- next part -------------- An HTML attachment was scrubbed... URL: From xhejtman at ics.muni.cz Wed Sep 28 23:03:36 2016 From: xhejtman at ics.muni.cz (Lukas Hejtmanek) Date: Thu, 29 Sep 2016 00:03:36 +0200 Subject: [gpfsug-discuss] Samba via CES In-Reply-To: References: <20160927193815.kpppiy76wpudg6cj@ics.muni.cz> <20160927214257.z4ezmssnpwhmm4rk@ics.muni.cz> Message-ID: <20160928220336.d5vdwp7fejbj2bzf@ics.muni.cz> On Wed, Sep 28, 2016 at 01:33:45PM -0700, Christof Schmitt wrote: > The client has to reconnect, open the file again and reissue request that > have not been completed. Without persistent handles, the main risk is that > another client can step in and access the same file in the meantime. With > persistent handles, access from other clients would be prevented for a > defined amount of time. well, I guess I cannot reconfigure the client so that reissuing request is done by OS and not rised up to the user? E.g., if user runs video encoding directly to Samba share and encoding runs for several hours, reissuing request, i.e., restart encoding, is not exactly what user accepts. -- Luk?? Hejtm?nek From abeattie at au1.ibm.com Wed Sep 28 23:25:01 2016 From: abeattie at au1.ibm.com (Andrew Beattie) Date: Wed, 28 Sep 2016 22:25:01 +0000 Subject: [gpfsug-discuss] Samba via CES In-Reply-To: <20160928220336.d5vdwp7fejbj2bzf@ics.muni.cz> References: <20160928220336.d5vdwp7fejbj2bzf@ics.muni.cz>, <20160927193815.kpppiy76wpudg6cj@ics.muni.cz><20160927214257.z4ezmssnpwhmm4rk@ics.muni.cz> Message-ID: An HTML attachment was scrubbed... URL: From Greg.Lehmann at csiro.au Wed Sep 28 23:49:31 2016 From: Greg.Lehmann at csiro.au (Greg.Lehmann at csiro.au) Date: Wed, 28 Sep 2016 22:49:31 +0000 Subject: [gpfsug-discuss] Blocksize - file size distribution In-Reply-To: References: Message-ID: <2ed56fe8c9c34eb5a1da25800b2951e0@exch1-cdc.nexus.csiro.au> Kevin, Thanks for the offer of help. I am capable of writing my own, but it looks like the best approach is to use mmapplypolicy, something I had not thought of. This is precisely the reason I asked what looks like a silly question. You don?t know what you don?t know! The quality of content on this list has been exceptional of late! Cheers, Greg From: gpfsug-discuss-bounces at spectrumscale.org [mailto:gpfsug-discuss-bounces at spectrumscale.org] On Behalf Of Buterbaugh, Kevin L Sent: Wednesday, 28 September 2016 11:45 PM To: gpfsug main discussion list Subject: Re: [gpfsug-discuss] Blocksize - file size distribution Greg, Not saying this is the right way to go, but I rolled my own. I wrote a very simple Perl script that essentially does the Perl equivalent of a find on my GPFS filesystems, then stat?s files and directories and writes the output to a text file. I run that one overnight or on the weekends. Takes about 6 hours to run across our 3 GPFS filesystems with metadata on SSDs. Then I?ve written a couple of different Perl scripts to analyze that data in different ways (the idea being that collecting the data is ?expensive? ? but once you?ve got it it?s cheap to analyze it in different ways). But the one I?ve been using for this project just breaks down the number of files and directories by size and age and produces a table. 
Rather than try to describe this, here?s sample output: For input file: gpfsFileInfo_20160915.txt <1 day | <1 wk | <1 mo | <2 mo | <3 mo | <4 mo | <5 mo | <6 mo | <1 yr | >1 year | Total Files <1 KB 29538 111364 458260 634398 150199 305715 4388443 93733 966618 3499535 10637803 <2 KB 9875 20580 119414 295167 35961 67761 80462 33688 269595 851641 1784144 <4 KB 9212 45282 168678 496796 27771 23887 105135 23161 259242 1163327 2322491 <8 KB 4992 29284 105836 303349 28341 20346 246430 28394 216061 1148459 2131492 <16 KB 3987 18391 92492 218639 20513 19698 675097 30976 190691 851533 2122017 <32 KB 4844 12479 50235 265222 24830 18789 1058433 18030 196729 1066287 2715878 <64 KB 6358 24259 29474 222134 17493 10744 1381445 11358 240528 1123540 3067333 <128 KB 6531 59107 206269 186213 71823 114235 1008724 36722 186357 845921 2721902 <256 KB 1995 17638 19355 436611 8505 7554 3582738 7519 249510 744885 5076310 <512 KB 20645 12401 24700 111463 5659 22132 1121269 10774 273010 725155 2327208 <1 MB 2681 6482 37447 58459 6998 14945 305108 5857 160360 386152 984489 <4 MB 4554 84551 23320 100407 6818 32833 129758 22774 210935 528458 1144408 <1 GB 56652 33538 99667 87778 24313 68372 118928 42554 251528 916493 1699823 <10 GB 1245 2482 4524 3184 1043 1794 2733 1694 8731 20462 47892 <100 GB 47 230 470 265 92 198 172 122 1276 2061 4933 >100 GB 2 3 12 1 14 4 5 1 37 165 244 Total TB: 6.49 13.22 30.56 18.00 10.22 15.69 19.87 12.48 73.47 187.44 Grand Total: 387.46 TB Everything other than the total space lines at the bottom are counts of number of files meeting that criteria. I?ve got another variation on the same script that we used when we were trying to determine how many old files we have and therefore how much data was an excellent candidate for moving to a slower, less expensive ?capacity? pool. I?m not sure how useful my tools would be to others ? I?m certainly not a professional programmer by any stretch of the imagination (and yes, I do hear those of you who are saying, ?Yeah, he?s barely a professional SysAdmin!? ). But others of you have been so helpful to me ? I?d like to try in some small way to help someone else. Kevin On Sep 28, 2016, at 5:56 AM, Oesterlin, Robert > wrote: /usr/lpp/mmfs/samples/debugtools/filehist Look at the README in that directory. Bob Oesterlin Sr Storage Engineer, Nuance HPC Grid From: > on behalf of "Greg.Lehmann at csiro.au" > Reply-To: gpfsug main discussion list > Date: Wednesday, September 28, 2016 at 2:40 AM To: "gpfsug-discuss at spectrumscale.org" > Subject: [EXTERNAL] Re: [gpfsug-discuss] Blocksize I am wondering what people use to produce a file size distribution report for their filesystems. Has everyone rolled their own or is there some goto app to use. Cheers, Greg From: gpfsug-discuss-bounces at spectrumscale.org [mailto:gpfsug-discuss-bounces at spectrumscale.org] On Behalf Of Buterbaugh, Kevin L Sent: Wednesday, 28 September 2016 7:21 AM To: gpfsug main discussion list > Subject: Re: [gpfsug-discuss] Blocksize Hi All, Again, my thanks to all who responded to my last post. Let me begin by stating something I unintentionally omitted in my last post ? we already use SSDs for our metadata. Which leads me to yet another question ? of my three filesystems, two (/home and /scratch) are much older (created in 2010) and therefore currently have a 512 byte inode size. /data is newer and has a 4K inode size. 
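(Quick aside for anyone checking their own file systems: the inode size a file system was created with can be read back with mmlsfs <device> -i, the same 'Inode size in bytes' line quoted elsewhere in this thread, and mmlsfs <device> -B reports the block size.)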
Now if I combine /scratch and /data into one filesystem with a 4K inode size, the amount of space used by all the inodes coming over from /scratch is going to grow by a factor of eight unless I?m horribly confused. And I would assume I need to count the amount of space taken up by allocated inodes, not just used inodes. Therefore ? how much space my metadata takes up just grew significantly in importance since: 1) metadata is on very expensive enterprise class, vendor certified SSDs, 2) we use RAID 1 mirrors of those SSDs, and 3) we have metadata replication set to two. Some of the information presented by Sven and Yuri seems to contradict each other in regards to how much space inodes take up ? or I?m misunderstanding one or both of them! Leaving aside replication, if I use a 256K block size for my metadata and I use 4K inodes, are those inodes going to take up 4K each or are they going to take up 8K each (1/32nd of a 256K block)? By the way, I do have a file size / file age spreadsheet for each of my filesystems (which I would be willing to share with interested parties) and while I was not surprised to learn that I have over 10 million sub-1K files on /home, I was a bit surprised to find that I have almost as many sub-1K files on /scratch (and a few million more on /data). So there?s a huge potential win in having those files in the inode on SSD as opposed to on spinning disk, but there?s also a huge potential $$$ cost. Thanks again ? I hope others are gaining useful information from this thread. I sure am! Kevin On Sep 27, 2016, at 1:26 PM, Yuri L Volobuev > wrote: > 1) Let?s assume that our overarching goal in configuring the block > size for metadata is performance from the user perspective ? i.e. > how fast is an ?ls -l? on my directory? Space savings aren?t > important, and how long policy scans or other ?administrative? type > tasks take is not nearly as important as that directory listing. > Does that change the recommended metadata block size? The performance challenges for the "ls -l" scenario are quite different from the policy scan scenario, so the same rules do not necessarily apply. During "ls -l" the code has to read inodes one by one (there's some prefetching going on, to take the edge off for the actual 'ls' thread, but prefetching is still one inode at a time). Metadata block size doesn't really come into the picture in this case, but inode size can be important -- depending on the storage performance characteristics. Does the storage you use for metadata exhibit a meaningfully different latency for 4K random reads vs 512 byte random reads? In my personal experience, on any modern storage device the difference is non-existent; in fact many devices (like all flash-based storage) use 4K native physical block size, and merely emulate 512 byte "sectors", so there's no way to read less than 4K. So from the inode read latency point of view 4K vs 512B is most likely a wash, but then 4K inodes can help improve performance of other operations, e.g. readdir of a small directory which fits entirely into the inode. If you use xattrs (e.g. as a side effect of using HSM), 4K inodes definitely help, but allowing xattrs to be stored in the inode. Policy scans reads inodes in full blocks, and there both metadata block size and inode size matter. Larger blocks could improve the inode read performance, while larger inodes mean that the same number of blocks hold fewer inodes and thus more blocks need to be read. 
So the policy inode scan performance is benefited by larger metadata block size and smaller inodes. However, policy scans also have to perform a directory traversal step, and that step tends to dominate the runtime of the overall run, and using larger inodes actually helps to speed up traversal of smaller directories. So whether larger inodes help or hurt the policy scan performance depends, yet again, on your file system composition. Overall, I believe that with all angles considered, larger inodes help with performance, and that was one of the considerations for making 4K inodes the default in V4.2+ versions. > 2) Let?s assume we have 3 filesystems, /home, /scratch (traditional > HPC use for those two) and /data (project space). Our storage > arrays are 24-bay units with two 8+2P RAID 6 LUNs, one RAID 1 > mirror, and two hot spare drives. The RAID 1 mirrors are for /home, > the RAID 6 LUNs are for /scratch or /data. /home has tons of small > files - so small that a 64K block size is currently used. /scratch > and /data have a mixture, but a 1 MB block size is the ?sweet spot? there. > > If you could ?start all over? with the same hardware being the only > restriction, would you: > > a) merge /scratch and /data into one filesystem but keep /home > separate since the LUN sizes are so very different, or > b) merge all three into one filesystem and use storage pools so that > /home is just a separate pool within the one filesystem? And if you > chose this option would you assign different block sizes to the pools? It's not possible to have different block sizes for different data pools. We are very aware that many people would like to be able to do just that, but this is counter to where the product is going. Supporting different block sizes for different pools is actually pretty hard: it's tricky to describe a large file that has some blocks in poolA and some in poolB where poolB has a different block size (perhaps during a migration) with the existing inode/indirect block format where each disk address pointer addresses a block of fixed size. With some effort, and some changes to how block addressing works, it would be possible to implement the support for this. However, as I mentioned in another post in this thread, we don't really want to glorify manual block size selection any further, we want to move away from it, by addressing the reasons that drive different block size selection today (like disk space utilization and performance). I'd recommend calculating a file size distribution histogram for your file systems. You may, for example, discover that 80% of the small files you have in /home would fit into 4K inodes, and then the storage space efficiency gains for the remaining 20% don't justify the complexity of managing an extra file system with a small block size. We don't recommend using block sizes smaller than 256K, because smaller block size is not good for disk space allocation code efficiency. It's a quadratic dependency: with a smaller block size, one block worth of the block allocation map covers that much less disk space, because each bit in the map covers fewer disk sectors, and fewer bits fit into a block. This means having to create a lot more block allocation map segments than what is needed for an ample level of parallelism. This hurts performance of many block allocation-related operations. I don't see a reason for /scratch and /data to be separate file systems, aside from perhaps failure domain considerations. 
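To put rough numbers on that quadratic dependency, here is a simplified back-of-the-envelope model; it assumes one allocation-map bit per subblock and ignores the real on-disk map layout, so treat it as illustrative only. With a 64 KiB block size the subblock is 2 KiB, and one 64 KiB map block holds 512 Ki bits, so a single map block covers about 1 GiB of disk. With a 1 MiB block size the subblock is 32 KiB, one 1 MiB map block holds 8 Mi bits, and a single map block covers about 256 GiB. A 16x larger block size therefore buys roughly 16 x 16 = 256x more coverage per map block, which is why very small block sizes force the allocation map to be carved into so many more segments.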
yuri > On Sep 26, 2016, at 2:29 PM, Yuri L Volobuev > wrote: > > I would put the net summary this way: in GPFS, the "Goldilocks zone" > for metadata block size is 256K - 1M. If one plans to create a new > file system using GPFS V4.2+, 1M is a sound choice. > > In an ideal world, block size choice shouldn't really be a choice. > It's a low-level implementation detail that one day should go the > way of the manual ignition timing adjustment -- something that used > to be necessary in the olden days, and something that select > enthusiasts like to tweak to this day, but something that's > irrelevant for the overwhelming majority of the folks who just want > the engine to run. There's work being done in that general direction > in GPFS, but we aren't there yet. > > yuri > > Stephen Ulmer ---09/26/2016 12:02:25 PM---Now I?ve got > anther question? which I?ll let bake for a while. Okay, to (poorly) summarize: > > From: Stephen Ulmer > > To: gpfsug main discussion list >, > Date: 09/26/2016 12:02 PM > Subject: Re: [gpfsug-discuss] Blocksize > Sent by: gpfsug-discuss-bounces at spectrumscale.org > > > > Now I?ve got anther question? which I?ll let bake for a while. > > Okay, to (poorly) summarize: > There are items OTHER THAN INODES stored as metadata in GPFS. > These items have a VARIETY OF SIZES, but are packed in such a way > that we should just not worry about wasted space unless we pick a > LARGE metadata block size ? or if we don?t pick a ?reasonable? > metadata block size after picking a ?large? file system block size > that applies to both. > Performance is hard, and the gain from calculating exactly the best > metadata block size is much smaller than performance gains attained > through code optimization. > If we were to try and calculate the appropriate metadata block size > we would likely be wrong anyway, since none of us get our data at > the idealized physics shop that sells massless rulers and > frictionless pulleys. > We should probably all use a metadata block size around 1MB. Nobody > has said this outright, but it?s been the example as the ?good? size > at least three times in this thread. > Under no circumstances should we do what many of us would have done > and pick 128K, which made sense based on all of our previous > education that is no longer applicable. > > Did I miss anything? :) > > Liberty, > > -- > Stephen > > On Sep 26, 2016, at 2:18 PM, Yuri L Volobuev > wrote: > It's important to understand the differences between different > metadata types, in particular where it comes to space allocation. > > System metadata files (inode file, inode and block allocation maps, > ACL file, fileset metadata file, EA file in older versions) are > allocated at well-defined moments (file system format, new storage > pool creation in the case of block allocation map, etc), and those > contain multiple records packed into a single block. From the block > allocator point of view, the individual metadata record size is > invisible, only larger blocks get actually allocated, and space > usage efficiency generally isn't an issue. > > For user metadata (indirect blocks, directory blocks, EA overflow > blocks) the situation is different. Those get allocated as the need > arises, generally one at a time. So the size of an individual > metadata structure matters, a lot. The smallest unit of allocation > in GPFS is a subblock (1/32nd of a block). If an IB or a directory > block is smaller than a subblock, the unused space in the subblock > is wasted. 
So if one chooses to use, say, 16 MiB block size for > metadata, the smallest unit of space that can be allocated is 512 > KiB. If one chooses 1 MiB block size, the smallest allocation unit > is 32 KiB. IBs are generally 16 KiB or 32 KiB in size (32 KiB with > any reasonable data block size); directory blocks used to be limited > to 32 KiB, but in the current code can be as large as 256 KiB. As > one can observe, using 16 MiB metadata block size would lead to a > considerable amount of wasted space for IBs and large directories > (small directories can live in inodes). On the other hand, with 1 > MiB block size, there'll be no wasted metadata space. Does any of > this actually make a practical difference? That depends on the file > system composition, namely the number of IBs (which is a function of > the number of large files) and larger directories. Calculating this > scientifically can be pretty involved, and really should be the job > of a tool that ought to exist, but doesn't (yet). A more practical > approach is doing a ballpark estimate using local file counts and > typical fractions of large files and directories, using statistics > available from published papers. > > The performance implications of a given metadata block size choice > is a subject of nearly infinite depth, and this question ultimately > can only be answered by doing experiments with a specific workload > on specific hardware. The metadata space utilization efficiency is > something that can be answered conclusively though. > > yuri > > "Buterbaugh, Kevin L" ---09/24/2016 07:19:09 AM---Hi > Sven, I am confused by your statement that the metadata block size > should be 1 MB and am very int > > From: "Buterbaugh, Kevin L" > > To: gpfsug main discussion list >, > Date: 09/24/2016 07:19 AM > Subject: Re: [gpfsug-discuss] Blocksize > Sent by: gpfsug-discuss-bounces at spectrumscale.org > > > > > Hi Sven, > > I am confused by your statement that the metadata block size should > be 1 MB and am very interested in learning the rationale behind this > as I am currently looking at all aspects of our current GPFS > configuration and the possibility of making major changes. > > If you have a filesystem with only metadataOnly disks in the system > pool and the default size of an inode is 4K (which we would do, > since we have recently discovered that even on our scratch > filesystem we have a bazillion files that are 4K or smaller and > could therefore have their data stored in the inode, right?), then > why would you set the metadata block size to anything larger than > 128K when a sub-block is 1/32nd of a block? I.e., with a 1 MB block > size for metadata wouldn?t you be wasting a massive amount of space? > > What am I missing / confused about there? > > Oh, and here?s a related question ? let?s just say I have the above > configuration ? my system pool is metadata only and is on SSD?s. > Then I have two other dataOnly pools that are spinning disk. One is > for ?regular? access and the other is the ?capacity? pool ? i.e. a > pool of slower storage where we move files with large access times. > I have a policy that says something like ?move all files with an > access time > 6 months to the capacity pool.? Of those bazillion > files less than 4K in size that are fitting in the inode currently, > probably half a bazillion () of them would be subject to that > rule. Will they get moved to the spinning disk capacity pool or will > they stay in the inode?? > > Thanks! 
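(For concreteness, the kind of rule described in that quoted question would normally be written along these lines, with the pool names and the 180-day threshold as placeholders:

RULE 'oldToCapacity' MIGRATE FROM POOL 'regular' TO POOL 'capacity' WHERE (DAYS(CURRENT_TIMESTAMP) - DAYS(ACCESS_TIME)) > 180

and applied with mmapplypolicy, like the other policy examples in this thread.)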
This is a very timely and interesting discussion for me as well... > > Kevin > On Sep 23, 2016, at 4:35 PM, Sven Oehme > wrote: > your metadata block size these days should be 1 MB and there are > only very few workloads for which you should run with a filesystem > blocksize below 1 MB. so if you don't know exactly what to pick, 1 > MB is a good starting point. > the general rule still applies that your filesystem blocksize > (metadata or data pool) should match your raid controller (or GNR > vdisk) stripe size of the particular pool. > > so if you use a 128k strip size(defaut in many midrange storage > controllers) in a 8+2p raid array, your stripe or track size is 1 MB > and therefore the blocksize of this pool should be 1 MB. i see many > customers in the field using 1MB or even smaller blocksize on RAID > stripes of 2 MB or above and your performance will be significant > impacted by that. > > Sven > > ------------------------------------------ > Sven Oehme > Scalable Storage Research > email: oehmes at us.ibm.com > Phone: +1 (408) 824-8904 > IBM Almaden Research Lab > ------------------------------------------ > > Stephen Ulmer ---09/23/2016 12:16:34 PM---Not to be too > pedantic, but I believe the the subblock size is 1/32 of the block > size (which strengt > > From: Stephen Ulmer > > To: gpfsug main discussion list > > Date: 09/23/2016 12:16 PM > Subject: Re: [gpfsug-discuss] Blocksize > Sent by: gpfsug-discuss-bounces at spectrumscale.org > > > > > Not to be too pedantic, but I believe the the subblock size is 1/32 > of the block size (which strengthens Luis?s arguments below). > > I thought the the original question was NOT about inode size, but > about metadata block size. You can specify that the system pool have > a different block size from the rest of the filesystem, providing > that it ONLY holds metadata (?metadata-block-size option to mmcrfs). > > So with 4K inodes (which should be used for all new filesystems > without some counter-indication), I would think that we?d want to > use a metadata block size of 4K*32=128K. This is independent of the > regular block size, which you can calculate based on the workload if > you?re lucky. > > There could be a great reason NOT to use 128K metadata block size, > but I don?t know what it is. I?d be happy to be corrected about this > if it?s out of whack. > > -- > Stephen > On Sep 22, 2016, at 3:37 PM, Luis Bolinches > wrote: > > Hi > > My 2 cents. > > Leave at least 4K inodes, then you get massive improvement on small > files (less 3.5K minus whatever you use on xattr) > > About blocksize for data, unless you have actual data that suggest > that you will actually benefit from smaller than 1MB block, leave > there. GPFS uses sublocks where 1/16th of the BS can be allocated to > different files, so the "waste" is much less than you think on 1MB > and you get the throughput and less structures of much more data blocks. > > No warranty at all but I try to do this when the BS talk comes in: > (might need some clean up it could not be last note but you get the idea) > > POSIX > find . 
-type f -name '*' -exec ls -l {} \; > find_ls_files.out > GPFS > cd /usr/lpp/mmfs/samples/ilm > gcc mmfindUtil_processOutputFile.c -o mmfindUtil_processOutputFile > ./mmfind /gpfs/shared -ls -type f > find_ls_files.out > CONVERT to CSV > > POSIX > cat find_ls_files.out | awk '{print $5","}' > find_ls_files.out.csv > GPFS > cat find_ls_files.out | awk '{print $7","}' > find_ls_files.out.csv > LOAD in octave > > FILESIZE = int32 (dlmread ("find_ls_files.out.csv", ",")); > Clean the second column (OPTIONAL as the next clean up will do the same) > > FILESIZE(:,[2]) = []; > If we are on 4K aligment we need to clean the files that go to > inodes (WELL not exactly ... extended attributes! so maybe use a > lower number!) > > FILESIZE(FILESIZE<=3584) =[]; > If we are not we need to clean the 0 size files > > FILESIZE(FILESIZE==0) =[]; > Median > > FILESIZEMEDIAN = int32 (median (FILESIZE)) > Mean > > FILESIZEMEAN = int32 (mean (FILESIZE)) > Variance > > int32 (var (FILESIZE)) > iqr interquartile range, i.e., the difference between the upper and > lower quartile, of the input data. > > int32 (iqr (FILESIZE)) > Standard deviation > > > For some FS with lots of files you might need a rather powerful > machine to run the calculations on octave, I never hit anything > could not manage on a 64GB RAM Power box. Most of the times it is > enough with my laptop. > > > > -- > Yst?v?llisin terveisin / Kind regards / Saludos cordiales / Salutations > > Luis Bolinches > Lab Services > http://www-03.ibm.com/systems/services/labservices/ > > IBM Laajalahdentie 23 (main Entrance) Helsinki, 00330 Finland > Phone: +358 503112585 > > "If you continually give you will continually have." Anonymous > > > ----- Original message ----- > From: Stef Coene > > Sent by: gpfsug-discuss-bounces at spectrumscale.org > To: gpfsug main discussion list > > Cc: > Subject: Re: [gpfsug-discuss] Blocksize > Date: Thu, Sep 22, 2016 10:30 PM > > On 09/22/2016 09:07 PM, J. Eric Wonderley wrote: > > It defaults to 4k: > > mmlsfs testbs8M -i > > flag value description > > ------------------- ------------------------ > > ----------------------------------- > > -i 4096 Inode size in bytes > > > > I think you can make as small as 512b. Gpfs will store very small > > files in the inode. > > > > Typically you want your average file size to be your blocksize and your > > filesystem has one blocksize and one inodesize. > > The files are not small, but around 20 MB on average. > So I calculated with IBM that a 1 MB or 2 MB block size is best. > > But I'm not sure if it's better to use a smaller block size for the > metadata. > > The file system is not that large (400 TB) and will hold backup data > from CommVault. > > > Stef > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > > > Ellei edell? 
ole toisin mainittu: / Unless stated otherwise above: > Oy IBM Finland Ab > PL 265, 00101 Helsinki, Finland > Business ID, Y-tunnus: 0195876-3 > Registered in Finland > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss -------------- next part -------------- An HTML attachment was scrubbed... URL: From Greg.Lehmann at csiro.au Wed Sep 28 23:54:36 2016 From: Greg.Lehmann at csiro.au (Greg.Lehmann at csiro.au) Date: Wed, 28 Sep 2016 22:54:36 +0000 Subject: [gpfsug-discuss] Blocksize In-Reply-To: References: <17781503-26B3-4448-B7B9-1EE27ABE6D1F@ulmer.org><13272A1B-425D-4DDD-A931-490604F92D61@ulmer.org><6C51B04A-9097-4598-8B4A-C484A3D98EE2@vanderbilt.edu> <4BDC0E5A-176A-48CC-8DE6-93C7B5A3F138@vanderbilt.edu> Message-ID: Are there any presentation available online that provide diagrams of the directory/file creation process and modifications in terms of how the blocks/inodes and indirect blocks etc are used. I would guess there are a few different cases that would need to be shown. This is the sort of thing that would great in a decent text book on GPFS (doesn't exist as far as I am aware.) Cheers, Greg From: gpfsug-discuss-bounces at spectrumscale.org [mailto:gpfsug-discuss-bounces at spectrumscale.org] On Behalf Of Marc A Kaplan Sent: Thursday, 29 September 2016 1:23 AM To: gpfsug main discussion list Subject: Re: [gpfsug-discuss] Blocksize OKAY, I'll say it again. inodes are PACKED into a single inode file. So a 4KB inode takes 4KB, REGARDLESS of metadata blocksize. There is no wasted space. (Of course if you have metadata replication = 2, then yes, double that. And yes, there overhead for indirect blocks (indices), allocation maps, etc, etc.) And your choice is not just 512 or 4096. Maybe 1KB or 2KB is a good choice for your data distribution, to optimize packing of data and/or directories into inodes... Hmmm... I don't know why the doc leaves out 2048, perhaps a typo... 
mmcrfs x2K -i 2048 [root at n2 charts]# mmlsfs x2K -i flag value description ------------------- ------------------------ ----------------------------------- -i 2048 Inode size in bytes Works for me! -------------- next part -------------- An HTML attachment was scrubbed... URL: From xhejtman at ics.muni.cz Wed Sep 28 23:58:15 2016 From: xhejtman at ics.muni.cz (Lukas Hejtmanek) Date: Thu, 29 Sep 2016 00:58:15 +0200 Subject: [gpfsug-discuss] Samba via CES In-Reply-To: References: <20160928220336.d5vdwp7fejbj2bzf@ics.muni.cz> <20160927193815.kpppiy76wpudg6cj@ics.muni.cz> <20160927214257.z4ezmssnpwhmm4rk@ics.muni.cz> Message-ID: <20160928225815.rpzdzjevro37ur7b@ics.muni.cz> On Wed, Sep 28, 2016 at 10:25:01PM +0000, Andrew Beattie wrote: > In that scenario, would you not be better off using a native Spectrum > Scale client installed on the workstation that the video editor is using > with a local mapped drive, rather than a SMB share? > ? > This would prevent this the scenario you have proposed occurring. indeed, it would be better, but why one would have CES at all? I would like to use CES but it seems that it is not quite ready yet for such a scenario. -- Luk?? Hejtm?nek From christof.schmitt at us.ibm.com Thu Sep 29 00:06:59 2016 From: christof.schmitt at us.ibm.com (Christof Schmitt) Date: Wed, 28 Sep 2016 16:06:59 -0700 Subject: [gpfsug-discuss] Samba via CES In-Reply-To: <20160928220336.d5vdwp7fejbj2bzf@ics.muni.cz> References: <20160927193815.kpppiy76wpudg6cj@ics.muni.cz><20160927214257.z4ezmssnpwhmm4rk@ics.muni.cz> <20160928220336.d5vdwp7fejbj2bzf@ics.muni.cz> Message-ID: The exact behavior depends on the client and the application. I would suggest explicit testing of the protocol failover if that is a concern. Samba does not support persistent handles, so that would be a completely new feature. There is some support available for durable handles which have weaker guarantees, and which are also disabled in CES Samba due to known issues in large deployments. In cases where SMB protocol failover becomes an issue and durable handles might help, that might be an approach to improve the failover behavior. Christof Schmitt || IBM || Spectrum Scale Development || Tucson, AZ christof.schmitt at us.ibm.com || +1-520-799-2469 (T/L: 321-2469) From: Lukas Hejtmanek To: gpfsug main discussion list Date: 09/28/2016 03:04 PM Subject: Re: [gpfsug-discuss] Samba via CES Sent by: gpfsug-discuss-bounces at spectrumscale.org On Wed, Sep 28, 2016 at 01:33:45PM -0700, Christof Schmitt wrote: > The client has to reconnect, open the file again and reissue request that > have not been completed. Without persistent handles, the main risk is that > another client can step in and access the same file in the meantime. With > persistent handles, access from other clients would be prevented for a > defined amount of time. well, I guess I cannot reconfigure the client so that reissuing request is done by OS and not rised up to the user? E.g., if user runs video encoding directly to Samba share and encoding runs for several hours, reissuing request, i.e., restart encoding, is not exactly what user accepts. -- Luk?? 
Hejtm?nek _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss From abeattie at au1.ibm.com Thu Sep 29 00:37:25 2016 From: abeattie at au1.ibm.com (Andrew Beattie) Date: Wed, 28 Sep 2016 23:37:25 +0000 Subject: [gpfsug-discuss] Samba via CES In-Reply-To: <20160928225815.rpzdzjevro37ur7b@ics.muni.cz> References: <20160928225815.rpzdzjevro37ur7b@ics.muni.cz>, <20160928220336.d5vdwp7fejbj2bzf@ics.muni.cz><20160927193815.kpppiy76wpudg6cj@ics.muni.cz><20160927214257.z4ezmssnpwhmm4rk@ics.muni.cz> Message-ID: An HTML attachment was scrubbed... URL: From aaron.knister at gmail.com Thu Sep 29 02:43:52 2016 From: aaron.knister at gmail.com (Aaron Knister) Date: Wed, 28 Sep 2016 21:43:52 -0400 Subject: [gpfsug-discuss] gpfs native raid In-Reply-To: References: <96282850-6bfa-73ae-8502-9e8df3a56390@nasa.gov> Message-ID: Thanks Everyone for your replies! (Quick disclaimer, these opinions are my own, and not those of my employer or NASA). Not knowing what's coming at the NDA session, it seems to boil down to "it ain't gonna happen" because of: - Perceived difficulty in supporting whatever creative hardware solutions customers may throw at it. I understand the support concerns, but I naively thought that assuming the hardware meets a basic set of requirements (e.g. redundant sas paths, x type of drives) it would be fairly supportable with GNR. The DS3700 shelves are re-branded NetApp E-series shelves and pretty vanilla I thought. - IBM would like to monetize the product and compete with the likes of DDN/Seagate This is admittedly a little disappointing. GPFS as long as I've known it has been largely hardware vendor agnostic. To see even a slight shift towards hardware vendor lockin and certain features only being supported and available on IBM hardware is concerning. It's not like the software itself is free. Perhaps GNR could be a paid add-on license for non-IBM hardware? Just thinking out-loud. The big things I was looking to GNR for are: - end-to-end checksums - implementing a software RAID layer on (in my case enterprise class) JBODs I can find a way to do the second thing, but the former I cannot. Requiring IBM hardware to get end-to-end checksums is a huge red flag for me. That's something Lustre will do today with ZFS on any hardware ZFS will run on (and for free, I might add). I would think GNR being openly available to customers would be important for GPFS to compete with Lustre. Furthermore, I had opened an RFE (#84523) a while back to implement checksumming of data for non-GNR environments. The RFE was declined because essentially it would be too hard and it already exists for GNR. Well, considering I don't have a GNR environment, and hardware vendor lock in is something many sites are not interested in, that's somewhat of a problem. I really hope IBM reconsiders their stance on opening up GNR. The current direction, while somewhat understandable, leaves a really bad taste in my mouth and is one of the (very few, in my opinion) features Lustre has over GPFS. -Aaron On 9/1/16 9:59 AM, Marc A Kaplan wrote: > I've been told that it is a big leap to go from supporting GSS and ESS > to allowing and supporting native raid for customers who may throw > together "any" combination of hardware they might choose. > > In particular the GNR "disk hospital" functions... 
> https://www.ibm.com/support/knowledgecenter/SSFKCN_3.5.0/com.ibm.cluster.gpfs.v3r5.gpfs200.doc/bl1adv_introdiskhospital.htm > will be tricky to support on umpteen different vendor boxes -- and keep > in mind, those will be from IBM competitors! > > That said, ESS and GSS show that IBM has some good tech in this area and > IBM has shown with the Spectrum Scale product (sans GNR) it can support > just about any semi-reasonable hardware configuration and a good slew of > OS versions and architectures... Heck I have a demo/test version of GPFS > running on a 5 year old Thinkpad laptop.... And we have some GSSs in the > lab... Not to mention Power hardware and mainframe System Z (think 360, > 370, 290, Z) > > > > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > From oehmes at us.ibm.com Thu Sep 29 03:28:03 2016 From: oehmes at us.ibm.com (Sven Oehme) Date: Wed, 28 Sep 2016 19:28:03 -0700 Subject: [gpfsug-discuss] gpfs native raid In-Reply-To: References: <96282850-6bfa-73ae-8502-9e8df3a56390@nasa.gov> Message-ID: Hi Aaron, the best way to express this 'need' is to vote and leave comments in the RFE's : this is an RFE for GNR as SW : http://www.ibm.com/developerworks/rfe/execute?use_case=viewRfe&CR_ID=95090 everybody who wants this to be one should vote for it and leave comments on what they expect. Sven From: Aaron Knister To: gpfsug-discuss at spectrumscale.org Date: 09/28/2016 06:44 PM Subject: Re: [gpfsug-discuss] gpfs native raid Sent by: gpfsug-discuss-bounces at spectrumscale.org Thanks Everyone for your replies! (Quick disclaimer, these opinions are my own, and not those of my employer or NASA). Not knowing what's coming at the NDA session, it seems to boil down to "it ain't gonna happen" because of: - Perceived difficulty in supporting whatever creative hardware solutions customers may throw at it. I understand the support concerns, but I naively thought that assuming the hardware meets a basic set of requirements (e.g. redundant sas paths, x type of drives) it would be fairly supportable with GNR. The DS3700 shelves are re-branded NetApp E-series shelves and pretty vanilla I thought. - IBM would like to monetize the product and compete with the likes of DDN/Seagate This is admittedly a little disappointing. GPFS as long as I've known it has been largely hardware vendor agnostic. To see even a slight shift towards hardware vendor lockin and certain features only being supported and available on IBM hardware is concerning. It's not like the software itself is free. Perhaps GNR could be a paid add-on license for non-IBM hardware? Just thinking out-loud. The big things I was looking to GNR for are: - end-to-end checksums - implementing a software RAID layer on (in my case enterprise class) JBODs I can find a way to do the second thing, but the former I cannot. Requiring IBM hardware to get end-to-end checksums is a huge red flag for me. That's something Lustre will do today with ZFS on any hardware ZFS will run on (and for free, I might add). I would think GNR being openly available to customers would be important for GPFS to compete with Lustre. Furthermore, I had opened an RFE (#84523) a while back to implement checksumming of data for non-GNR environments. The RFE was declined because essentially it would be too hard and it already exists for GNR. 
Well, considering I don't have a GNR environment, and hardware vendor lock in is something many sites are not interested in, that's somewhat of a problem. I really hope IBM reconsiders their stance on opening up GNR. The current direction, while somewhat understandable, leaves a really bad taste in my mouth and is one of the (very few, in my opinion) features Lustre has over GPFS. -Aaron On 9/1/16 9:59 AM, Marc A Kaplan wrote: > I've been told that it is a big leap to go from supporting GSS and ESS > to allowing and supporting native raid for customers who may throw > together "any" combination of hardware they might choose. > > In particular the GNR "disk hospital" functions... > https://www.ibm.com/support/knowledgecenter/SSFKCN_3.5.0/com.ibm.cluster.gpfs.v3r5.gpfs200.doc/bl1adv_introdiskhospital.htm > will be tricky to support on umpteen different vendor boxes -- and keep > in mind, those will be from IBM competitors! > > That said, ESS and GSS show that IBM has some good tech in this area and > IBM has shown with the Spectrum Scale product (sans GNR) it can support > just about any semi-reasonable hardware configuration and a good slew of > OS versions and architectures... Heck I have a demo/test version of GPFS > running on a 5 year old Thinkpad laptop.... And we have some GSSs in the > lab... Not to mention Power hardware and mainframe System Z (think 360, > 370, 290, Z) > > > > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: graycol.gif Type: image/gif Size: 105 bytes Desc: not available URL: From daniel.kidger at uk.ibm.com Thu Sep 29 10:04:03 2016 From: daniel.kidger at uk.ibm.com (Daniel Kidger) Date: Thu, 29 Sep 2016 09:04:03 +0000 Subject: [gpfsug-discuss] gpfs native raid In-Reply-To: Message-ID: An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: ATT1-graycol.gif Type: image/gif Size: 105 bytes Desc: not available URL: From daniel.kidger at uk.ibm.com Thu Sep 29 10:25:59 2016 From: daniel.kidger at uk.ibm.com (Daniel Kidger) Date: Thu, 29 Sep 2016 09:25:59 +0000 Subject: [gpfsug-discuss] AFM cacheset mounting from the same GPFS cluster ? Message-ID: An HTML attachment was scrubbed... URL: From Kevin.Buterbaugh at Vanderbilt.Edu Thu Sep 29 16:03:08 2016 From: Kevin.Buterbaugh at Vanderbilt.Edu (Buterbaugh, Kevin L) Date: Thu, 29 Sep 2016 15:03:08 +0000 Subject: [gpfsug-discuss] Fwd: Blocksize References: Message-ID: <423B687F-0B03-4C6F-9F16-E05F68491D67@vanderbilt.edu> Resending from the right e-mail address... Begin forwarded message: From: gpfsug-discuss-owner at spectrumscale.org Subject: Re: [gpfsug-discuss] Blocksize Date: September 29, 2016 at 10:00:36 AM CDT To: klb at accre.vanderbilt.edu You are not allowed to post to this mailing list, and your message has been automatically rejected. If you think that your messages are being rejected in error, contact the mailing list owner at gpfsug-discuss-owner at spectrumscale.org. From: "Kevin L. 
Buterbaugh" > Subject: Re: [gpfsug-discuss] Blocksize Date: September 29, 2016 at 10:00:29 AM CDT To: gpfsug main discussion list > Hi Marc and others, I understand ? I guess I did a poor job of wording my question, so I?ll try again. The IBM recommendation for metadata block size seems to be somewhere between 256K - 1 MB, depending on who responds to the question. If I were to hypothetically use a 256K metadata block size, does the ?1/32nd of a block? come into play like it does for ?not metadata?? I.e. 256 / 32 = 8K, so am I reading / writing *2* inodes (assuming 4K inode size) minimum? And here?s a really off the wall question ? yesterday we were discussing the fact that there is now a single inode file. Historically, we have always used RAID 1 mirrors (first with spinning disk, as of last fall now on SSD) for metadata and then use GPFS replication on top of that. But given that there is a single inode file is that ?old way? of doing things still the right way? In other words, could we potentially be better off by using a couple of 8+2P RAID 6 LUNs? One potential downside of that would be that we would then only have two NSD servers serving up metadata, so we discussed the idea of taking each RAID 6 LUN and splitting it up into multiple logical volumes (all that done on the storage array, of course) and then presenting those to GPFS as NSDs??? Or have I gone from merely asking stupid questions to Trump-level craziness???? ;-) Kevin On Sep 28, 2016, at 10:23 AM, Marc A Kaplan > wrote: OKAY, I'll say it again. inodes are PACKED into a single inode file. So a 4KB inode takes 4KB, REGARDLESS of metadata blocksize. There is no wasted space. (Of course if you have metadata replication = 2, then yes, double that. And yes, there overhead for indirect blocks (indices), allocation maps, etc, etc.) And your choice is not just 512 or 4096. Maybe 1KB or 2KB is a good choice for your data distribution, to optimize packing of data and/or directories into inodes... Hmmm... I don't know why the doc leaves out 2048, perhaps a typo... mmcrfs x2K -i 2048 [root at n2 charts]# mmlsfs x2K -i flag value description ------------------- ------------------------ ----------------------------------- -i 2048 Inode size in bytes Works for me! _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss -------------- next part -------------- An HTML attachment was scrubbed... URL: From makaplan at us.ibm.com Thu Sep 29 16:32:47 2016 From: makaplan at us.ibm.com (Marc A Kaplan) Date: Thu, 29 Sep 2016 11:32:47 -0400 Subject: [gpfsug-discuss] Fwd: Blocksize In-Reply-To: <423B687F-0B03-4C6F-9F16-E05F68491D67@vanderbilt.edu> References: <423B687F-0B03-4C6F-9F16-E05F68491D67@vanderbilt.edu> Message-ID: Frankly, I just don't "get" what it is you seem not to be "getting" - perhaps someone else who does "get" it can rephrase: FORGET about Subblocks when thinking about inodes being packed into the file of all inodes. Additional facts that may address some of the other concerns: I started working on GPFS at version 3.1 or so. AFAIK GPFS always had and has one file of inodes, "packed", with no wasted space between inodes. Period. Full Stop. RAID! Now we come to a mistake that I've seen made by more than a handful of customers! It is generally a mistake to use RAID with parity (such as classic RAID5) to store metadata. Why? 
Because metadata is often updated with "small writes" - for example suppose we have to update some fields in an inode, or an indirect block, or append a log record... For RAID with parity and large stripe sizes -- this means that updating just one disk sector can cost a full stripe read + writing the changed data and parity sectors. SO, if you want protection against storage failures for your metadata, use either RAID mirroring/replication and/or GPFS metadata replication. (belt and/or suspenders) (Arguments against relying solely on RAID mirroring: single enclosure/box failure (fire!), single hardware design (bugs or defects), single firmware/microcode(bugs.)) Yes, GPFS is part of "the cyber." We're making it stronger everyday. But it already is great. --marc From: "Buterbaugh, Kevin L" To: gpfsug main discussion list Date: 09/29/2016 11:03 AM Subject: [gpfsug-discuss] Fwd: Blocksize Sent by: gpfsug-discuss-bounces at spectrumscale.org Resending from the right e-mail address... Begin forwarded message: From: gpfsug-discuss-owner at spectrumscale.org Subject: Re: [gpfsug-discuss] Blocksize Date: September 29, 2016 at 10:00:36 AM CDT To: klb at accre.vanderbilt.edu You are not allowed to post to this mailing list, and your message has been automatically rejected. If you think that your messages are being rejected in error, contact the mailing list owner at gpfsug-discuss-owner at spectrumscale.org. From: "Kevin L. Buterbaugh" Subject: Re: [gpfsug-discuss] Blocksize Date: September 29, 2016 at 10:00:29 AM CDT To: gpfsug main discussion list Hi Marc and others, I understand ? I guess I did a poor job of wording my question, so I?ll try again. The IBM recommendation for metadata block size seems to be somewhere between 256K - 1 MB, depending on who responds to the question. If I were to hypothetically use a 256K metadata block size, does the ?1/32nd of a block? come into play like it does for ?not metadata?? I.e. 256 / 32 = 8K, so am I reading / writing *2* inodes (assuming 4K inode size) minimum? And here?s a really off the wall question ? yesterday we were discussing the fact that there is now a single inode file. Historically, we have always used RAID 1 mirrors (first with spinning disk, as of last fall now on SSD) for metadata and then use GPFS replication on top of that. But given that there is a single inode file is that ?old way? of doing things still the right way? In other words, could we potentially be better off by using a couple of 8+2P RAID 6 LUNs? One potential downside of that would be that we would then only have two NSD servers serving up metadata, so we discussed the idea of taking each RAID 6 LUN and splitting it up into multiple logical volumes (all that done on the storage array, of course) and then presenting those to GPFS as NSDs??? Or have I gone from merely asking stupid questions to Trump-level craziness???? ;-) Kevin On Sep 28, 2016, at 10:23 AM, Marc A Kaplan wrote: OKAY, I'll say it again. inodes are PACKED into a single inode file. So a 4KB inode takes 4KB, REGARDLESS of metadata blocksize. There is no wasted space. (Of course if you have metadata replication = 2, then yes, double that. And yes, there overhead for indirect blocks (indices), allocation maps, etc, etc.) And your choice is not just 512 or 4096. Maybe 1KB or 2KB is a good choice for your data distribution, to optimize packing of data and/or directories into inodes... Hmmm... I don't know why the doc leaves out 2048, perhaps a typo... 
mmcrfs x2K -i 2048 [root at n2 charts]# mmlsfs x2K -i flag value description ------------------- ------------------------ ----------------------------------- -i 2048 Inode size in bytes Works for me! _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss -------------- next part -------------- An HTML attachment was scrubbed... URL: From olaf.weiser at de.ibm.com Thu Sep 29 16:38:56 2016 From: olaf.weiser at de.ibm.com (Olaf Weiser) Date: Thu, 29 Sep 2016 17:38:56 +0200 Subject: [gpfsug-discuss] Fwd: Blocksize In-Reply-To: <423B687F-0B03-4C6F-9F16-E05F68491D67@vanderbilt.edu> References: <423B687F-0B03-4C6F-9F16-E05F68491D67@vanderbilt.edu> Message-ID: An HTML attachment was scrubbed... URL: From volobuev at us.ibm.com Thu Sep 29 19:00:40 2016 From: volobuev at us.ibm.com (Yuri L Volobuev) Date: Thu, 29 Sep 2016 11:00:40 -0700 Subject: [gpfsug-discuss] Fwd: Blocksize In-Reply-To: <423B687F-0B03-4C6F-9F16-E05F68491D67@vanderbilt.edu> References: <423B687F-0B03-4C6F-9F16-E05F68491D67@vanderbilt.edu> Message-ID: > to the question. If I were to hypothetically use a 256K metadata > block size, does the ?1/32nd of a block? come into play like it does > for ?not metadata?? I.e. 256 / 32 = 8K, so am I reading / writing > *2* inodes (assuming 4K inode size) minimum? I think the point of confusion here is minimum allocation size vs minimum IO size -- those two are not one and the same. In fact in GPFS those are largely unrelated values. For low-level metadata files where multiple records are packed into the same block, it is possible to read/write either an individual record (such as an inode), or an entire block of records (which is what happens, for example, during inode copy-on-write). The minimum IO size in GPFS is 512 bytes. On a "4K-aligned" file system, GPFS vows to only do IOs in multiples of 4KiB. For data, GPFS tracks what portion of a given block is valid/dirty using an in-memory bitmap, and if 4K in the middle of a 16M block are modified, only 4K get written, not 16M (although this is more complicated for sparse file writes and appends, when some areas need to be zeroed out). For metadata writes, entire metadata objects are written, using the actual object size, rounded up to the nearest 512B or 4K boundary, as needed. So a single modified inode results in a single inode write, regardless of the metadata block size. If you have snapshots, and the inode being modified needs to be copied to the previous snapshot, and happens to be the first inode in the block that needs a COW, an entire block of inodes is copied to the latest snapshot, as an optimization. > And here?s a really off the wall question ? yesterday we were > discussing the fact that there is now a single inode file. > Historically, we have always used RAID 1 mirrors (first with > spinning disk, as of last fall now on SSD) for metadata and then use > GPFS replication on top of that. But given that there is a single > inode file is that ?old way? of doing things still the right way? > In other words, could we potentially be better off by using a couple > of 8+2P RAID 6 LUNs? The old way is also the modern way in this case. Using RAID1 LUNs for GPFS metadata is still the right approach. 
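A minimal sketch of that layout -- RAID1-backed SSD LUNs carrying metadata only, split across two failure groups so that both metadata replicas can be placed -- assuming a 4.x code level where --metadata-block-size is available. All device, NSD, server and filesystem names below are made up:

# nsd.stanza
%nsd:
  device=/dev/mapper/ssd_mirror_01
  nsd=md_ssd_01
  servers=nsd01,nsd02
  usage=metadataOnly
  failureGroup=10
  pool=system
%nsd:
  device=/dev/mapper/ssd_mirror_02
  nsd=md_ssd_02
  servers=nsd02,nsd01
  usage=metadataOnly
  failureGroup=20
  pool=system
# ...plus dataOnly stanzas for the capacity LUNs in a separate pool

mmcrnsd -F nsd.stanza
# 4K inodes, a smaller block size for the metadata-only system pool,
# and two metadata replicas by default
mmcrfs fs1 -F nsd.stanza -B 1M --metadata-block-size 256K -i 4096 -m 2 -M 2 -r 1 -R 2
mmlsfs fs1 -i -B -m -M
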
You don't want to use RAID erasure codes that trigger read-modify-write for small IOs, which are typical for metadata (unless your RAID array has so much cache as to make RMW a moot point). > One potential downside of that would be that we would then only have > two NSD servers serving up metadata, so we discussed the idea of > taking each RAID 6 LUN and splitting it up into multiple logical > volumes (all that done on the storage array, of course) and then > presenting those to GPFS as NSDs??? Like most performance questions, this one can ultimately only be answered definitively by running tests, but offhand I would suspect that the performance impact of RAID6, combined with extra contention for physical disks, is going to more than offset the benefits of using more NSD servers. Keep in mind that you aren't limited to 2 NSD servers per LUN. If you actually have the connectivity for more than 2 nodes on your RAID controller, GPFS allows up to 8 simultaneously active NSD servers per NSD. yuri > On Sep 28, 2016, at 10:23 AM, Marc A Kaplan wrote: > > OKAY, I'll say it again. inodes are PACKED into a single inode > file. So a 4KB inode takes 4KB, REGARDLESS of metadata blocksize. > There is no wasted space. > > (Of course if you have metadata replication = 2, then yes, double > that. And yes, there overhead for indirect blocks (indices), > allocation maps, etc, etc.) > > And your choice is not just 512 or 4096. Maybe 1KB or 2KB is a good > choice for your data distribution, to optimize packing of data and/ > or directories into inodes... > > Hmmm... I don't know why the doc leaves out 2048, perhaps a typo... > > mmcrfs x2K -i 2048 > > [root at n2 charts]# mmlsfs x2K -i > flag value description > ------------------- ------------------------ > ----------------------------------- > -i 2048 Inode size in bytes > > Works for me! > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss -------------- next part -------------- An HTML attachment was scrubbed... URL: From volobuev at us.ibm.com Fri Sep 30 06:43:53 2016 From: volobuev at us.ibm.com (Yuri L Volobuev) Date: Thu, 29 Sep 2016 22:43:53 -0700 Subject: [gpfsug-discuss] gpfs native raid In-Reply-To: References: <96282850-6bfa-73ae-8502-9e8df3a56390@nasa.gov> Message-ID: The issue of "GNR as software" is a pretty convoluted mixture of technical, business, and resource constraints issues. While some of the technical issues can be discussed here, obviously the other considerations cannot be discussed in a public forum. So you won't be able to get a complete understanding of the situation by discussing it here. > I understand the support concerns, but I naively thought that assuming > the hardware meets a basic set of requirements (e.g. redundant sas > paths, x type of drives) it would be fairly supportable with GNR. The > DS3700 shelves are re-branded NetApp E-series shelves and pretty vanilla > I thought. Setting business issues aside, this is more complicated on the technical level than one may think. At present, GNR requires a set of twin-tailed external disk enclosures. This is not a particularly exotic kind of hardware, but it turns out that this corner of the storage world is quite insular. 
GNR has a very close relationship with physical disk devices, much more so than regular GPFS. In an ideal world, SCSI and SES standards are supposed to provide a framework which would allow software like GNR to operate on an arbitrary disk enclosure. In the real world, the actual SES implementations on various enclosures that we've been dealing with are, well, peculiar. Apparently SES is one of those standards where vendors feel a lot of freedom in "re-interpreting" the standard, and since typically enclosures talk to a small set of RAID controllers, there aren't bad enough consequences to force vendors to be religious about SES standard compliance. Furthermore, the SAS fabric topology in configurations with an external disk enclosures is surprisingly complex, and that complexity predictably leads to complex failures which don't exist in simpler configurations. Thus far, every single one of the five enclosures we've had a chance to run GNR on required some adjustments, workarounds, hacks, etc. And the consequences of a misbehaving SAS fabric can be quite dire. There are various approaches to dealing with those complications, from running a massive 3rd party hardware qualification program to basically declaring any complications from an unknown enclosure to be someone else's problem (how would ZFS deal with a SCSI reset storm due to a bad SAS expander?), but there's much debate on what is the right path to take. Customer input/feedback is obviously very valuable in tilting such discussions in the right direction. yuri From: Aaron Knister To: gpfsug-discuss at spectrumscale.org, Date: 09/28/2016 06:44 PM Subject: Re: [gpfsug-discuss] gpfs native raid Sent by: gpfsug-discuss-bounces at spectrumscale.org Thanks Everyone for your replies! (Quick disclaimer, these opinions are my own, and not those of my employer or NASA). Not knowing what's coming at the NDA session, it seems to boil down to "it ain't gonna happen" because of: - Perceived difficulty in supporting whatever creative hardware solutions customers may throw at it. I understand the support concerns, but I naively thought that assuming the hardware meets a basic set of requirements (e.g. redundant sas paths, x type of drives) it would be fairly supportable with GNR. The DS3700 shelves are re-branded NetApp E-series shelves and pretty vanilla I thought. - IBM would like to monetize the product and compete with the likes of DDN/Seagate This is admittedly a little disappointing. GPFS as long as I've known it has been largely hardware vendor agnostic. To see even a slight shift towards hardware vendor lockin and certain features only being supported and available on IBM hardware is concerning. It's not like the software itself is free. Perhaps GNR could be a paid add-on license for non-IBM hardware? Just thinking out-loud. The big things I was looking to GNR for are: - end-to-end checksums - implementing a software RAID layer on (in my case enterprise class) JBODs I can find a way to do the second thing, but the former I cannot. Requiring IBM hardware to get end-to-end checksums is a huge red flag for me. That's something Lustre will do today with ZFS on any hardware ZFS will run on (and for free, I might add). I would think GNR being openly available to customers would be important for GPFS to compete with Lustre. Furthermore, I had opened an RFE (#84523) a while back to implement checksumming of data for non-GNR environments. The RFE was declined because essentially it would be too hard and it already exists for GNR. 
Well, considering I don't have a GNR environment, and hardware vendor lock in is something many sites are not interested in, that's somewhat of a problem. I really hope IBM reconsiders their stance on opening up GNR. The current direction, while somewhat understandable, leaves a really bad taste in my mouth and is one of the (very few, in my opinion) features Lustre has over GPFS. -Aaron On 9/1/16 9:59 AM, Marc A Kaplan wrote: > I've been told that it is a big leap to go from supporting GSS and ESS > to allowing and supporting native raid for customers who may throw > together "any" combination of hardware they might choose. > > In particular the GNR "disk hospital" functions... > https://www.ibm.com/support/knowledgecenter/SSFKCN_3.5.0/com.ibm.cluster.gpfs.v3r5.gpfs200.doc/bl1adv_introdiskhospital.htm > will be tricky to support on umpteen different vendor boxes -- and keep > in mind, those will be from IBM competitors! > > That said, ESS and GSS show that IBM has some good tech in this area and > IBM has shown with the Spectrum Scale product (sans GNR) it can support > just about any semi-reasonable hardware configuration and a good slew of > OS versions and architectures... Heck I have a demo/test version of GPFS > running on a 5 year old Thinkpad laptop.... And we have some GSSs in the > lab... Not to mention Power hardware and mainframe System Z (think 360, > 370, 290, Z) > > > > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: graycol.gif Type: image/gif Size: 105 bytes Desc: not available URL: From stef.coene at docum.org Fri Sep 30 14:03:01 2016 From: stef.coene at docum.org (Stef Coene) Date: Fri, 30 Sep 2016 15:03:01 +0200 Subject: [gpfsug-discuss] Toolkit Message-ID: Hi, When using the toolkit, all config data is stored in clusterdefinition.txt When you modify the cluster with mm* commands, the toolkit is unaware of these changes. Is it possible to recreate the clusterdefinition.txt based on the current configuration? Stef From matthew at ellexus.com Fri Sep 30 16:30:11 2016 From: matthew at ellexus.com (Matthew Harris) Date: Fri, 30 Sep 2016 16:30:11 +0100 Subject: [gpfsug-discuss] Introduction from Ellexus Message-ID: Hello everyone, Ellexus is the IO profiling company with tools for load balancing shared storage, solving IO performance issues and detecting rogue jobs that have bad IO patterns. We have a good number of customers who use Spectrum Scale so we do a lot of work to support it. We have clients and partners working across the HPC space including semiconductor, life sciences, oil and gas, automotive and finance. We're based in Cambridge, England. Some of you will have already met our CEO, Rosemary Francis. Looking forward to meeting you at SC16. Matthew Harris Account Manager & Business Development - Ellexus Ltd *www.ellexus.com * *Ellexus Ltd is a limited company registered in England & Wales* *Company registration no. 
07166034* *Registered address: 198 High Street, Tonbridge, Kent TN9 1BE, UK* *Operating address: St John's Innovation Centre, Cowley Road, Cambridge CB4 0WS, UK* -------------- next part -------------- An HTML attachment was scrubbed... URL: From aaron.s.knister at nasa.gov Fri Sep 30 21:56:29 2016 From: aaron.s.knister at nasa.gov (Aaron Knister) Date: Fri, 30 Sep 2016 16:56:29 -0400 Subject: [gpfsug-discuss] gpfs native raid In-Reply-To: References: <96282850-6bfa-73ae-8502-9e8df3a56390@nasa.gov> Message-ID: <2f59d32a-fc0f-3f03-dd95-3465611dc841@nasa.gov> Thanks, Yuri. Your replies are always quite enjoyable to read. I didn't realize SES was such a loosely interpreted standard, I just assumed it was fairly straightforward. We've got a number of JBODs here we manage via SES using the linux enclosure module (e.g. /sys/class/enclosure) and they seem to "just work" but we're not doing anything terribly advanced, mostly just turning on/off various status LEDs. I should clarify, the newer SAS enclosures I've encountered seem quite good, some of the older enclosures (e.g. in particular the Xyratex enclosure used by DDN in it's S2A units) were a real treat to interact with and didn't seem to follow the SES standard in spirit. I can certainly accept the complexity argument here. I think for our purposes a "reasonable level" of support would be all we're after. I'm not sure how ZFS would deal with a SCSI reset storm, I suspect the pool would just offline itself if enough paths seemed to disappear or timeout. If I could make GPFS work well with ZFS as the underlying storage target I would be quite happy. So far I have struggled to make it performant. GPFS seems to assume once a block device accepts the write that it's committed to stable storage. With ZFS ZVOL's this isn't the case by default. Making it the case (setting the sync=always paremter) causes a *massive* degradation in performance. If GPFS were to issue sync commands at appropriate intervals then I think we could make this work well. I'm not sure how to go about this, though, and given frequent enough scsi sync commands to a given lun its performance would likely degrade to the current state of zfs with sync=always. At any rate, we'll see how things go. Thanks again. -Aaron On 9/30/16 1:43 AM, Yuri L Volobuev wrote: > The issue of "GNR as software" is a pretty convoluted mixture of > technical, business, and resource constraints issues. While some of the > technical issues can be discussed here, obviously the other > considerations cannot be discussed in a public forum. So you won't be > able to get a complete understanding of the situation by discussing it here. > >> I understand the support concerns, but I naively thought that assuming >> the hardware meets a basic set of requirements (e.g. redundant sas >> paths, x type of drives) it would be fairly supportable with GNR. The >> DS3700 shelves are re-branded NetApp E-series shelves and pretty vanilla >> I thought. > > Setting business issues aside, this is more complicated on the technical > level than one may think. At present, GNR requires a set of twin-tailed > external disk enclosures. This is not a particularly exotic kind of > hardware, but it turns out that this corner of the storage world is > quite insular. GNR has a very close relationship with physical disk > devices, much more so than regular GPFS. In an ideal world, SCSI and > SES standards are supposed to provide a framework which would allow > software like GNR to operate on an arbitrary disk enclosure. 
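For reference, the basic LED poking Aaron describes through the Linux enclosure class looks roughly like this; enclosure IDs and slot names vary by vendor (and by how literally the vendor read the SES spec), so the paths below are only an example:

ls /sys/class/enclosure/
# each drive-slot component exposes attributes such as status, fault,
# locate and active
ls /sys/class/enclosure/0:0:12:0/Slot01/
cat /sys/class/enclosure/0:0:12:0/Slot01/status
# turn the locate LED on and off again for that slot
echo 1 > /sys/class/enclosure/0:0:12:0/Slot01/locate
echo 0 > /sys/class/enclosure/0:0:12:0/Slot01/locate
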
In the > real world, the actual SES implementations on various enclosures that > we've been dealing with are, well, peculiar. Apparently SES is one of > those standards where vendors feel a lot of freedom in "re-interpreting" > the standard, and since typically enclosures talk to a small set of RAID > controllers, there aren't bad enough consequences to force vendors to be > religious about SES standard compliance. Furthermore, the SAS fabric > topology in configurations with an external disk enclosures is > surprisingly complex, and that complexity predictably leads to complex > failures which don't exist in simpler configurations. Thus far, every > single one of the five enclosures we've had a chance to run GNR on > required some adjustments, workarounds, hacks, etc. And the > consequences of a misbehaving SAS fabric can be quite dire. There are > various approaches to dealing with those complications, from running a > massive 3rd party hardware qualification program to basically declaring > any complications from an unknown enclosure to be someone else's problem > (how would ZFS deal with a SCSI reset storm due to a bad SAS expander?), > but there's much debate on what is the right path to take. Customer > input/feedback is obviously very valuable in tilting such discussions in > the right direction. > > yuri > > Inactive hide details for Aaron Knister ---09/28/2016 06:44:23 > PM---Thanks Everyone for your replies! (Quick disclaimer, these Aaron > Knister ---09/28/2016 06:44:23 PM---Thanks Everyone for your replies! > (Quick disclaimer, these opinions are my own, and not those of my > > From: Aaron Knister > To: gpfsug-discuss at spectrumscale.org, > Date: 09/28/2016 06:44 PM > Subject: Re: [gpfsug-discuss] gpfs native raid > Sent by: gpfsug-discuss-bounces at spectrumscale.org > > ------------------------------------------------------------------------ > > > > Thanks Everyone for your replies! (Quick disclaimer, these opinions are > my own, and not those of my employer or NASA). > > Not knowing what's coming at the NDA session, it seems to boil down to > "it ain't gonna happen" because of: > > - Perceived difficulty in supporting whatever creative hardware > solutions customers may throw at it. > > I understand the support concerns, but I naively thought that assuming > the hardware meets a basic set of requirements (e.g. redundant sas > paths, x type of drives) it would be fairly supportable with GNR. The > DS3700 shelves are re-branded NetApp E-series shelves and pretty vanilla > I thought. > > - IBM would like to monetize the product and compete with the likes of > DDN/Seagate > > This is admittedly a little disappointing. GPFS as long as I've known it > has been largely hardware vendor agnostic. To see even a slight shift > towards hardware vendor lockin and certain features only being supported > and available on IBM hardware is concerning. It's not like the software > itself is free. Perhaps GNR could be a paid add-on license for non-IBM > hardware? Just thinking out-loud. > > The big things I was looking to GNR for are: > > - end-to-end checksums > - implementing a software RAID layer on (in my case enterprise class) JBODs > > I can find a way to do the second thing, but the former I cannot. > Requiring IBM hardware to get end-to-end checksums is a huge red flag > for me. That's something Lustre will do today with ZFS on any hardware > ZFS will run on (and for free, I might add). 
I would think GNR being > openly available to customers would be important for GPFS to compete > with Lustre. Furthermore, I had opened an RFE (#84523) a while back to > implement checksumming of data for non-GNR environments. The RFE was > declined because essentially it would be too hard and it already exists > for GNR. Well, considering I don't have a GNR environment, and hardware > vendor lock in is something many sites are not interested in, that's > somewhat of a problem. > > I really hope IBM reconsiders their stance on opening up GNR. The > current direction, while somewhat understandable, leaves a really bad > taste in my mouth and is one of the (very few, in my opinion) features > Lustre has over GPFS. > > -Aaron > > > On 9/1/16 9:59 AM, Marc A Kaplan wrote: >> I've been told that it is a big leap to go from supporting GSS and ESS >> to allowing and supporting native raid for customers who may throw >> together "any" combination of hardware they might choose. >> >> In particular the GNR "disk hospital" functions... >> https://www.ibm.com/support/knowledgecenter/SSFKCN_3.5.0/com.ibm.cluster.gpfs.v3r5.gpfs200.doc/bl1adv_introdiskhospital.htm >> will be tricky to support on umpteen different vendor boxes -- and keep >> in mind, those will be from IBM competitors! >> >> That said, ESS and GSS show that IBM has some good tech in this area and >> IBM has shown with the Spectrum Scale product (sans GNR) it can support >> just about any semi-reasonable hardware configuration and a good slew of >> OS versions and architectures... Heck I have a demo/test version of GPFS >> running on a 5 year old Thinkpad laptop.... And we have some GSSs in the >> lab... Not to mention Power hardware and mainframe System Z (think 360, >> 370, 290, Z) >> >> >> >> >> _______________________________________________ >> gpfsug-discuss mailing list >> gpfsug-discuss at spectrumscale.org >> http://gpfsug.org/mailman/listinfo/gpfsug-discuss >> > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > > > > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > -- Aaron Knister NASA Center for Climate Simulation (Code 606.2) Goddard Space Flight Center (301) 286-2776 From makaplan at us.ibm.com Thu Sep 1 00:40:13 2016 From: makaplan at us.ibm.com (Marc A Kaplan) Date: Wed, 31 Aug 2016 19:40:13 -0400 Subject: [gpfsug-discuss] Data Replication In-Reply-To: References: Message-ID: You can leave out the WHERE ... AND POOL_NAME LIKE 'deep' - that is redundant with the FROM POOL 'deep' clause. In fact at a slight additional overhead in mmapplypolicy processing due to begin checked a little later in the game, you can leave out MISC_ATTRIBUTES NOT LIKE '%2%' since the code is smart enough to not operate on files already marked as replicate(2). I believe mmapplypolicy .... -I yes means do any necessary data movement and/or replication "now" Alternatively you can say -I defer, which will leave the files "ill-replicated" and then fix them up with mmrestripefs later. The -I yes vs -I defer choice is the same as for mmchattr. Think of mmapplypolicy as a fast, parallel way to do find ... | xargs mmchattr ... 
Advert: see also samples/ilm/mmfind -- the latest version should have an -xargs option From: Jan-Frode Myklebust To: gpfsug main discussion list Date: 08/31/2016 04:44 PM Subject: Re: [gpfsug-discuss] Data Replication Sent by: gpfsug-discuss-bounces at spectrumscale.org Assuming your DeepFlash pool is named "deep", something like the following should work: RULE 'deepreplicate' migrate from pool 'deep' to pool 'deep' replicate(2) where MISC_ATTRIBUTES NOT LIKE '%2%' and POOL_NAME LIKE 'deep' "mmapplypolicy gpfs0 -P replicate-policy.pol -I yes" and possibly "mmrestripefs gpfs0 -r" afterwards. -jf On Wed, Aug 31, 2016 at 8:01 PM, Brian Marshall wrote: Daniel, So here's my use case: I have a Sandisk IF150 (branded as DeepFlash recently) with 128TB of flash acting as a "fast tier" storage pool in our HPC scratch file system. Can I set the filesystem replication level to 1 then write a policy engine rule to send small and/or recent files to the IF150 with a replication of 2? Any other comments on the proposed usage strategy are helpful. Thank you, Brian Marshall On Wed, Aug 31, 2016 at 10:32 AM, Daniel Kidger wrote: The other 'Exception' is when a rule is used to convert a 1 way replicated file to 2 way, or when only one failure group is up due to HW problems. It that case the (re-replication) is done by whatever nodes are used for the rule or command-line, which may include an NSD server. Daniel IBM Spectrum Storage Software +44 (0)7818 522266 Sent from my iPad using IBM Verse On 30 Aug 2016, 19:53:31, mimarsh2 at vt.edu wrote: From: mimarsh2 at vt.edu To: gpfsug-discuss at spectrumscale.org Cc: Date: 30 Aug 2016 19:53:31 Subject: Re: [gpfsug-discuss] Data Replication Thanks. This confirms the numbers that I am seeing. Brian On Tue, Aug 30, 2016 at 2:50 PM, Laurence Horrocks-Barlow < laurence at qsplace.co.uk> wrote: Its the client that does all the synchronous replication, this way the cluster is able to scale as the clients do the leg work (so to speak). The somewhat "exception" is if a GPFS NSD server (or client with direct NSD) access uses a server bases protocol such as SMB, in this case the SMB server will do the replication as the SMB client doesn't know about GPFS or its replication; essentially the SMB server is the GPFS client. -- Lauz On 30 August 2016 17:03:38 CEST, Bryan Banister wrote: The NSD Client handles the replication and will, as you stated, write one copy to one NSD (using the primary server for this NSD) and one to a different NSD in a different GPFS failure group (using quite likely, but not necessarily, a different NSD server that is the primary server for this alternate NSD). Cheers, -Bryan From: gpfsug-discuss-bounces at spectrumscale.org [mailto: gpfsug-discuss-bounces at spectrumscale.org] On Behalf Of Brian Marshall Sent: Tuesday, August 30, 2016 9:59 AM To: gpfsug main discussion list Subject: [gpfsug-discuss] Data Replication All, If I setup a filesystem to have data replication of 2 (2 copies of data), does the data get replicated at the NSD Server or at the client? i.e. Does the client send 2 copies over the network or does the NSD Server get a single copy and then replicate on storage NSDs? I couldn't find a place in the docs that talked about this specific point. Thank you, Brian Marshall Note: This email is for the confidential use of the named addressee(s) only and may contain proprietary, confidential or privileged information. 
If you are not the intended recipient, you are hereby notified that any review, dissemination or copying of this email is strictly prohibited, and to please notify the sender immediately and destroy this email and any attachments. Email transmission cannot be guaranteed to be secure or error-free. The Company, therefore, does not make any guarantees as to the completeness or accuracy of this email or any attachments. This email is for informational purposes only and does not constitute a recommendation, offer, request or solicitation of any kind to buy, sell, subscribe, redeem or perform any type of transaction of a financial product. gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss -- Sent from my Android device with K-9 Mail. Please excuse my brevity. _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss Unless stated otherwise above: IBM United Kingdom Limited - Registered in England and Wales with number 741598. Registered office: PO Box 41, North Harbour, Portsmouth, Hampshire PO6 3AU _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss -------------- next part -------------- An HTML attachment was scrubbed... URL: From daniel.kidger at uk.ibm.com Thu Sep 1 11:29:48 2016 From: daniel.kidger at uk.ibm.com (Daniel Kidger) Date: Thu, 1 Sep 2016 10:29:48 +0000 Subject: [gpfsug-discuss] gpfs native raid In-Reply-To: <96282850-6bfa-73ae-8502-9e8df3a56390@nasa.gov> Message-ID: Aaron, GNR is a key differentiator for IBM's (and Lenovo's) Storage hardware appliance. ESS and GSS are otherwise commodity storage arrays connected to commodity NSD servers, albeit with a high degree of tuning and rigorous testing and validation. This competes with equivalent DDN and Seagate appliances as well other non s/w Raid offerings from other IBM partners. GNR only works for a small number of disk arrays and then only in certain configurations. GNR then might be thought of as 'firmware' for the hardware rather than part of a software defined products at is Spectrum Scale. If you beleive the viewpoint that hardware Raid 'is dead' then GNR will not be the only s/w Raid that will be used to underly Spectrum Scale. As well as vendor specific offerings from DDN, Seagate, etc. ZFS is likely to be a popular choice but is today not well understood or tested. This will change as more 3rd parties publish their experiences and tuning optimisations, and also as storage solution vendors bidding Spectrum Scale find they can't compete without a software Raid component in their offering. Disclaimer: the above are my own views and not necessarily an IBM official viewpoint. Daniel IBM Spectrum Storage Software +44 (0)7818 522266 Sent from my iPad using IBM Verse On 30 Aug 2016, 18:17:01, aaron.s.knister at nasa.gov wrote: From: aaron.s.knister at nasa.gov To: gpfsug-discuss at spectrumscale.org Cc: Date: 30 Aug 2016 18:17:01 Subject: Re: [gpfsug-discuss] gpfs native raid Thanks Christopher. 
I've tried GPFS on zvols a couple times and the write throughput I get is terrible because of the required sync=always parameter. Perhaps a couple of SSD's could help get the number up, though. -Aaron On 8/30/16 12:47 PM, Christopher Maestas wrote: > Interestingly enough, Spectrum Scale can run on zvols. Check out: > > http://files.gpfsug.org/presentations/2016/anl-june/LANL_GPFS_ZFS.pdf > > -cdm > > ------------------------------------------------------------------------ > On Aug 30, 2016, 9:17:05 AM, aaron.s.knister at nasa.gov wrote: > > From: aaron.s.knister at nasa.gov > To: gpfsug-discuss at spectrumscale.org > Cc: > Date: Aug 30, 2016 9:17:05 AM > Subject: [gpfsug-discuss] gpfs native raid > > Does anyone know if/when we might see gpfs native raid opened up for the > masses on non-IBM hardware? It's hard to answer the question of "why > can't GPFS do this? Lustre can" in regards to Lustre's integration with > ZFS and support for RAID on commodity hardware. > -Aaron > -- > Aaron Knister > NASA Center for Climate Simulation (Code 606.2) > Goddard Space Flight Center > (301) 286-2776 > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > -- Aaron Knister NASA Center for Climate Simulation (Code 606.2) Goddard Space Flight Center (301) 286-2776 _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discussUnless stated otherwise above: IBM United Kingdom Limited - Registered in England and Wales with number 741598. Registered office: PO Box 41, North Harbour, Portsmouth, Hampshire PO6 3AU -------------- next part -------------- An HTML attachment was scrubbed... URL: From daniel.kidger at uk.ibm.com Thu Sep 1 12:22:47 2016 From: daniel.kidger at uk.ibm.com (Daniel Kidger) Date: Thu, 1 Sep 2016 11:22:47 +0000 Subject: [gpfsug-discuss] Maximum value for data replication? In-Reply-To: References: , Message-ID: An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: Image.1__=0ABB0AB3DFD67DBA8f9e8a93df938 at us.ibm.com.gif Type: image/gif Size: 105 bytes Desc: not available URL: From bauer at cesnet.cz Thu Sep 1 14:30:23 2016 From: bauer at cesnet.cz (Miroslav Bauer) Date: Thu, 1 Sep 2016 15:30:23 +0200 Subject: [gpfsug-discuss] Migration to separate metadata and data disks Message-ID: <7927f34a-28e5-6fc2-a55d-62b2066a08da@cesnet.cz> Hello, I have a GPFS 3.5 filesystem (fs1) and I'm trying to migrate the filesystem metadata from state: -m = 2 (default metadata replicas) - SATA disks (dataAndMetadata, failGroup=1) - SSDs (metadataOnly, failGroup=3) to the desired state: -m = 1 - SATA disks (dataOnly, failGroup=1) - SSDs (metadataOnly, failGroup=3) I have done the following steps in the following order: 1) change SATA disks to dataOnly (stanza file modifies the 'usage' attribute only): # mmchdisk fs1 change -F dataOnly_disks.stanza Attention: Disk parameters were changed. Use the mmrestripefs command with the -r option to relocate data and metadata. Verifying file system configuration information ... mmchdisk: Propagating the cluster configuration data to all affected nodes. This is an asynchronous process. 
2) change default metadata replicas number 2->1 # mmchfs fs1 -m 1 3) run mmrestripefs as suggested by output of 1) # mmrestripefs fs1 -r Scanning file system metadata, phase 1 ... Error processing inodes. No space left on device mmrestripefs: Command failed. Examine previous error messages to determine cause. It is, however, still possible to create new files on the filesystem. When I return one of the SATA disks as a dataAndMetadata disk, the mmrestripefs command stops complaining about No space left on device. Both df and mmdf say that there is enough space both for data (SATA) and metadata (SSDs). Does anyone have an idea why is it complaining? Thanks, -- Miroslav Bauer -------------- next part -------------- A non-text attachment was scrubbed... Name: smime.p7s Type: application/pkcs7-signature Size: 3716 bytes Desc: S/MIME Cryptographic Signature URL: From aaron.s.knister at nasa.gov Thu Sep 1 14:36:32 2016 From: aaron.s.knister at nasa.gov (Aaron Knister) Date: Thu, 1 Sep 2016 09:36:32 -0400 Subject: [gpfsug-discuss] Migration to separate metadata and data disks In-Reply-To: <7927f34a-28e5-6fc2-a55d-62b2066a08da@cesnet.cz> References: <7927f34a-28e5-6fc2-a55d-62b2066a08da@cesnet.cz> Message-ID: I must admit, I'm curious as to the reason you're dropping the replication factor from 2 down to 1. There are some serious advantages we've seen to having multiple metadata replicas, as far as error recovery is concerned. Could you paste an output of mmlsdisk for the filesystem? -Aaron On 9/1/16 9:30 AM, Miroslav Bauer wrote: > Hello, > > I have a GPFS 3.5 filesystem (fs1) and I'm trying to migrate the > filesystem metadata from state: > -m = 2 (default metadata replicas) > - SATA disks (dataAndMetadata, failGroup=1) > - SSDs (metadataOnly, failGroup=3) > to the desired state: > -m = 1 > - SATA disks (dataOnly, failGroup=1) > - SSDs (metadataOnly, failGroup=3) > > I have done the following steps in the following order: > 1) change SATA disks to dataOnly (stanza file modifies the 'usage' > attribute only): > # mmchdisk fs1 change -F dataOnly_disks.stanza > Attention: Disk parameters were changed. > Use the mmrestripefs command with the -r option to relocate data and > metadata. > Verifying file system configuration information ... > mmchdisk: Propagating the cluster configuration data to all > affected nodes. This is an asynchronous process. > > 2) change default metadata replicas number 2->1 > # mmchfs fs1 -m 1 > > 3) run mmrestripefs as suggested by output of 1) > # mmrestripefs fs1 -r > Scanning file system metadata, phase 1 ... > Error processing inodes. > No space left on device > mmrestripefs: Command failed. Examine previous error messages to > determine cause. > > It is, however, still possible to create new files on the filesystem. > When I return one of the SATA disks as a dataAndMetadata disk, the > mmrestripefs > command stops complaining about No space left on device. Both df and mmdf > say that there is enough space both for data (SATA) and metadata (SSDs). > Does anyone have an idea why is it complaining? 
> > Thanks, > > -- > Miroslav Bauer > > > > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > -- Aaron Knister NASA Center for Climate Simulation (Code 606.2) Goddard Space Flight Center (301) 286-2776 From aaron.s.knister at nasa.gov Thu Sep 1 14:39:17 2016 From: aaron.s.knister at nasa.gov (Aaron Knister) Date: Thu, 1 Sep 2016 09:39:17 -0400 Subject: [gpfsug-discuss] Migration to separate metadata and data disks In-Reply-To: References: <7927f34a-28e5-6fc2-a55d-62b2066a08da@cesnet.cz> Message-ID: By the way, I suspect the no space on device errors are because GPFS believes for some reason that it is unable to maintain the metadata replication factor of 2 that's likely set on all previously created inodes. On 9/1/16 9:36 AM, Aaron Knister wrote: > I must admit, I'm curious as to the reason you're dropping the > replication factor from 2 down to 1. There are some serious advantages > we've seen to having multiple metadata replicas, as far as error > recovery is concerned. > > Could you paste an output of mmlsdisk for the filesystem? > > -Aaron > > On 9/1/16 9:30 AM, Miroslav Bauer wrote: >> Hello, >> >> I have a GPFS 3.5 filesystem (fs1) and I'm trying to migrate the >> filesystem metadata from state: >> -m = 2 (default metadata replicas) >> - SATA disks (dataAndMetadata, failGroup=1) >> - SSDs (metadataOnly, failGroup=3) >> to the desired state: >> -m = 1 >> - SATA disks (dataOnly, failGroup=1) >> - SSDs (metadataOnly, failGroup=3) >> >> I have done the following steps in the following order: >> 1) change SATA disks to dataOnly (stanza file modifies the 'usage' >> attribute only): >> # mmchdisk fs1 change -F dataOnly_disks.stanza >> Attention: Disk parameters were changed. >> Use the mmrestripefs command with the -r option to relocate data and >> metadata. >> Verifying file system configuration information ... >> mmchdisk: Propagating the cluster configuration data to all >> affected nodes. This is an asynchronous process. >> >> 2) change default metadata replicas number 2->1 >> # mmchfs fs1 -m 1 >> >> 3) run mmrestripefs as suggested by output of 1) >> # mmrestripefs fs1 -r >> Scanning file system metadata, phase 1 ... >> Error processing inodes. >> No space left on device >> mmrestripefs: Command failed. Examine previous error messages to >> determine cause. >> >> It is, however, still possible to create new files on the filesystem. >> When I return one of the SATA disks as a dataAndMetadata disk, the >> mmrestripefs >> command stops complaining about No space left on device. Both df and mmdf >> say that there is enough space both for data (SATA) and metadata (SSDs). >> Does anyone have an idea why is it complaining? 
>> >> Thanks, >> >> -- >> Miroslav Bauer >> >> >> >> >> _______________________________________________ >> gpfsug-discuss mailing list >> gpfsug-discuss at spectrumscale.org >> http://gpfsug.org/mailman/listinfo/gpfsug-discuss >> > -- Aaron Knister NASA Center for Climate Simulation (Code 606.2) Goddard Space Flight Center (301) 286-2776 From jonathan at buzzard.me.uk Thu Sep 1 14:49:11 2016 From: jonathan at buzzard.me.uk (Jonathan Buzzard) Date: Thu, 01 Sep 2016 14:49:11 +0100 Subject: [gpfsug-discuss] Migration to separate metadata and data disks In-Reply-To: References: <7927f34a-28e5-6fc2-a55d-62b2066a08da@cesnet.cz> Message-ID: <1472737751.25479.22.camel@buzzard.phy.strath.ac.uk> On Thu, 2016-09-01 at 09:39 -0400, Aaron Knister wrote: > By the way, I suspect the no space on device errors are because GPFS > believes for some reason that it is unable to maintain the metadata > replication factor of 2 that's likely set on all previously created inodes. > Hazarding a guess, but there is only one SSD NSD, so if all the metadata is going to go on SSD there is no point in replicating. It would also explain why it would believe it can't maintain the metadata replication factor. Though it could just be a simple metadata size is larger than the available SSD size. JAB. -- Jonathan A. Buzzard Email: jonathan (at) buzzard.me.uk Fife, United Kingdom. From makaplan at us.ibm.com Thu Sep 1 14:59:28 2016 From: makaplan at us.ibm.com (Marc A Kaplan) Date: Thu, 1 Sep 2016 09:59:28 -0400 Subject: [gpfsug-discuss] gpfs native raid In-Reply-To: References: <96282850-6bfa-73ae-8502-9e8df3a56390@nasa.gov> Message-ID: I've been told that it is a big leap to go from supporting GSS and ESS to allowing and supporting native raid for customers who may throw together "any" combination of hardware they might choose. In particular the GNR "disk hospital" functions... https://www.ibm.com/support/knowledgecenter/SSFKCN_3.5.0/com.ibm.cluster.gpfs.v3r5.gpfs200.doc/bl1adv_introdiskhospital.htm will be tricky to support on umpteen different vendor boxes -- and keep in mind, those will be from IBM competitors! That said, ESS and GSS show that IBM has some good tech in this area and IBM has shown with the Spectrum Scale product (sans GNR) it can support just about any semi-reasonable hardware configuration and a good slew of OS versions and architectures... Heck I have a demo/test version of GPFS running on a 5 year old Thinkpad laptop.... And we have some GSSs in the lab... Not to mention Power hardware and mainframe System Z (think 360, 370, 290, Z) -------------- next part -------------- An HTML attachment was scrubbed... URL: From aaron.s.knister at nasa.gov Thu Sep 1 15:02:50 2016 From: aaron.s.knister at nasa.gov (Aaron Knister) Date: Thu, 1 Sep 2016 10:02:50 -0400 Subject: [gpfsug-discuss] Migration to separate metadata and data disks In-Reply-To: References: <7927f34a-28e5-6fc2-a55d-62b2066a08da@cesnet.cz> Message-ID: <2ce7334b-28c1-7a14-814a-fbcf99d8049e@nasa.gov> Oh! I think you've already provided the info I was looking for :) I thought that failGroup=3 meant there were 3 failure groups within the SSDs. I suspect that's not at all what you meant and that actually is the failure group of all of those disks. That I think explains what's going on-- there's only one failure group's worth of metadata-capable disks available and as such GPFS can't place the 2nd replica for existing files. 
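A quick way to confirm that diagnosis (the file path below is only an example) is to compare the failure groups of the metadata-capable disks with the replication recorded in the inodes and in the filesystem defaults:

# the failure group column: disks holding metadata must span at least
# two failure groups for two metadata replicas to be placeable
mmlsdisk fs1
# per-file replication factors stored in the inode; files created while
# -m 2 was in effect still expect two metadata copies
mmlsattr -L /fs1/some/existing/file
# current default and maximum replication for the filesystem
mmlsfs fs1 -m -M -r -R
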
Here's what I would suggest: - Create at least 2 failure groups within the SSDs - Put the default metadata replication factor back to 2 - Run a restripefs -R to shuffle files around and restore the metadata replication factor of 2 to any files created while it was set to 1 If you're not interested in replication for metadata then perhaps all you need to do is the mmrestripefs -R. I think that should un-replicate the file from the SATA disks leaving the copy on the SSDs. Hope that helps. -Aaron On 9/1/16 9:39 AM, Aaron Knister wrote: > By the way, I suspect the no space on device errors are because GPFS > believes for some reason that it is unable to maintain the metadata > replication factor of 2 that's likely set on all previously created inodes. > > On 9/1/16 9:36 AM, Aaron Knister wrote: >> I must admit, I'm curious as to the reason you're dropping the >> replication factor from 2 down to 1. There are some serious advantages >> we've seen to having multiple metadata replicas, as far as error >> recovery is concerned. >> >> Could you paste an output of mmlsdisk for the filesystem? >> >> -Aaron >> >> On 9/1/16 9:30 AM, Miroslav Bauer wrote: >>> Hello, >>> >>> I have a GPFS 3.5 filesystem (fs1) and I'm trying to migrate the >>> filesystem metadata from state: >>> -m = 2 (default metadata replicas) >>> - SATA disks (dataAndMetadata, failGroup=1) >>> - SSDs (metadataOnly, failGroup=3) >>> to the desired state: >>> -m = 1 >>> - SATA disks (dataOnly, failGroup=1) >>> - SSDs (metadataOnly, failGroup=3) >>> >>> I have done the following steps in the following order: >>> 1) change SATA disks to dataOnly (stanza file modifies the 'usage' >>> attribute only): >>> # mmchdisk fs1 change -F dataOnly_disks.stanza >>> Attention: Disk parameters were changed. >>> Use the mmrestripefs command with the -r option to relocate data and >>> metadata. >>> Verifying file system configuration information ... >>> mmchdisk: Propagating the cluster configuration data to all >>> affected nodes. This is an asynchronous process. >>> >>> 2) change default metadata replicas number 2->1 >>> # mmchfs fs1 -m 1 >>> >>> 3) run mmrestripefs as suggested by output of 1) >>> # mmrestripefs fs1 -r >>> Scanning file system metadata, phase 1 ... >>> Error processing inodes. >>> No space left on device >>> mmrestripefs: Command failed. Examine previous error messages to >>> determine cause. >>> >>> It is, however, still possible to create new files on the filesystem. >>> When I return one of the SATA disks as a dataAndMetadata disk, the >>> mmrestripefs >>> command stops complaining about No space left on device. Both df and >>> mmdf >>> say that there is enough space both for data (SATA) and metadata (SSDs). >>> Does anyone have an idea why is it complaining? >>> >>> Thanks, >>> >>> -- >>> Miroslav Bauer >>> >>> >>> >>> >>> _______________________________________________ >>> gpfsug-discuss mailing list >>> gpfsug-discuss at spectrumscale.org >>> http://gpfsug.org/mailman/listinfo/gpfsug-discuss >>> >> > -- Aaron Knister NASA Center for Climate Simulation (Code 606.2) Goddard Space Flight Center (301) 286-2776 From makaplan at us.ibm.com Thu Sep 1 15:14:18 2016 From: makaplan at us.ibm.com (Marc A Kaplan) Date: Thu, 1 Sep 2016 10:14:18 -0400 Subject: [gpfsug-discuss] Migration to separate metadata and data disks In-Reply-To: <2ce7334b-28c1-7a14-814a-fbcf99d8049e@nasa.gov> References: <7927f34a-28e5-6fc2-a55d-62b2066a08da@cesnet.cz> <2ce7334b-28c1-7a14-814a-fbcf99d8049e@nasa.gov> Message-ID: I believe the OP left out a step. 
I am not saying this is a good idea, but ... One must change the replication factors marked in each inode for each file... This could be done using an mmapplypolicy rule: RULE 'one' MIGRATE FROM POOL 'yourdatapool' TO POOL 'yourdatapool' REPLICATE(1,1) (repeat rule for each POOL you have) Put that (those) rules in a file and do a "one time" run like mmapplypolicy yourfilesystem -P /path/to/rule -N nodelist-to-do-this-work -g /filesystem/bigtemp -I defer Then try your restripe again. -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: not available Type: image/gif Size: 21994 bytes Desc: not available URL: From bauer at cesnet.cz Thu Sep 1 15:28:36 2016 From: bauer at cesnet.cz (Miroslav Bauer) Date: Thu, 1 Sep 2016 16:28:36 +0200 Subject: [gpfsug-discuss] Migration to separate metadata and data disks In-Reply-To: <2ce7334b-28c1-7a14-814a-fbcf99d8049e@nasa.gov> References: <7927f34a-28e5-6fc2-a55d-62b2066a08da@cesnet.cz> <2ce7334b-28c1-7a14-814a-fbcf99d8049e@nasa.gov> Message-ID: <505ff859-d49a-04cc-bd9d-50f7b2a8df0b@cesnet.cz> Yes, failure group id is exactly what I meant :). Unfortunately, mmrestripefs with -R behaves the same as with -r. I also believed that mmrestripefs -R is the correct tool for fixing the replication settings on inodes (according to manpages), but I will try possible solutions you and Marc suggested and let you know how it went. Thank you, -- Miroslav Bauer On 09/01/2016 04:02 PM, Aaron Knister wrote: > Oh! I think you've already provided the info I was looking for :) I > thought that failGroup=3 meant there were 3 failure groups within the > SSDs. I suspect that's not at all what you meant and that actually is > the failure group of all of those disks. That I think explains what's > going on-- there's only one failure group's worth of metadata-capable > disks available and as such GPFS can't place the 2nd replica for > existing files. > > Here's what I would suggest: > > - Create at least 2 failure groups within the SSDs > - Put the default metadata replication factor back to 2 > - Run a restripefs -R to shuffle files around and restore the metadata > replication factor of 2 to any files created while it was set to 1 > > If you're not interested in replication for metadata then perhaps all > you need to do is the mmrestripefs -R. I think that should > un-replicate the file from the SATA disks leaving the copy on the SSDs. > > Hope that helps. > > -Aaron > > On 9/1/16 9:39 AM, Aaron Knister wrote: >> By the way, I suspect the no space on device errors are because GPFS >> believes for some reason that it is unable to maintain the metadata >> replication factor of 2 that's likely set on all previously created >> inodes. >> >> On 9/1/16 9:36 AM, Aaron Knister wrote: >>> I must admit, I'm curious as to the reason you're dropping the >>> replication factor from 2 down to 1. There are some serious advantages >>> we've seen to having multiple metadata replicas, as far as error >>> recovery is concerned. >>> >>> Could you paste an output of mmlsdisk for the filesystem? 
>>> >>> -Aaron >>> >>> On 9/1/16 9:30 AM, Miroslav Bauer wrote: >>>> Hello, >>>> >>>> I have a GPFS 3.5 filesystem (fs1) and I'm trying to migrate the >>>> filesystem metadata from state: >>>> -m = 2 (default metadata replicas) >>>> - SATA disks (dataAndMetadata, failGroup=1) >>>> - SSDs (metadataOnly, failGroup=3) >>>> to the desired state: >>>> -m = 1 >>>> - SATA disks (dataOnly, failGroup=1) >>>> - SSDs (metadataOnly, failGroup=3) >>>> >>>> I have done the following steps in the following order: >>>> 1) change SATA disks to dataOnly (stanza file modifies the 'usage' >>>> attribute only): >>>> # mmchdisk fs1 change -F dataOnly_disks.stanza >>>> Attention: Disk parameters were changed. >>>> Use the mmrestripefs command with the -r option to relocate data and >>>> metadata. >>>> Verifying file system configuration information ... >>>> mmchdisk: Propagating the cluster configuration data to all >>>> affected nodes. This is an asynchronous process. >>>> >>>> 2) change default metadata replicas number 2->1 >>>> # mmchfs fs1 -m 1 >>>> >>>> 3) run mmrestripefs as suggested by output of 1) >>>> # mmrestripefs fs1 -r >>>> Scanning file system metadata, phase 1 ... >>>> Error processing inodes. >>>> No space left on device >>>> mmrestripefs: Command failed. Examine previous error messages to >>>> determine cause. >>>> >>>> It is, however, still possible to create new files on the filesystem. >>>> When I return one of the SATA disks as a dataAndMetadata disk, the >>>> mmrestripefs >>>> command stops complaining about No space left on device. Both df and >>>> mmdf >>>> say that there is enough space both for data (SATA) and metadata >>>> (SSDs). >>>> Does anyone have an idea why is it complaining? >>>> >>>> Thanks, >>>> >>>> -- >>>> Miroslav Bauer >>>> >>>> >>>> >>>> >>>> _______________________________________________ >>>> gpfsug-discuss mailing list >>>> gpfsug-discuss at spectrumscale.org >>>> http://gpfsug.org/mailman/listinfo/gpfsug-discuss >>>> >>> >> > -------------- next part -------------- A non-text attachment was scrubbed... Name: smime.p7s Type: application/pkcs7-signature Size: 3716 bytes Desc: S/MIME Cryptographic Signature URL: From S.J.Thompson at bham.ac.uk Thu Sep 1 22:06:44 2016 From: S.J.Thompson at bham.ac.uk (Simon Thompson (Research Computing - IT Services)) Date: Thu, 1 Sep 2016 21:06:44 +0000 Subject: [gpfsug-discuss] Maximum value for data replication? In-Reply-To: References: , , Message-ID: I have two protocol node in each of two data centres. So four protocol nodes in the cluster. Plus I also have a quorum vm which is lockstep/ha so guaranteed to survive in one of the data centres should we lose power. The protocol servers being protocol servers don't have access to the fibre channel storage. And we've seen ces do bad things when the storage cluster it is remotely mounting (and the ces root is on) fails/is under load etc. So the four full copies is about guaranteeing there are two full copies in both data centres. And remember this is only for the cesroot, so lock data for the ces ips, the smb registry I think as well. I was hoping that by making the cesroot in the protocol node cluster rather than a fileset on a remote mounted filesysyem, that it would fix the ces weirdness we see as it would become a local gpfs file system. I guess three copies would maybe work. But also in another cluster, we have been thinking about adding NVMe into NSD servers for metadata and system.log and so I can se there are cases there where having higher numbers of copies would be useful. 
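A sketch of the small local CES-root filesystem being described, staying within the limit of three replicas mentioned later in this thread; the node names, NSD names, mount point and the cesSharedRoot step are all assumptions here:

# one local SSD per protocol node, each in its own failure group
%nsd:
  device=/dev/nvme0n1
  nsd=ces_ssd_pn1
  servers=pn1
  usage=dataAndMetadata
  failureGroup=1
  pool=system
# ...matching stanzas for pn2 (failureGroup=2) and pn3 (failureGroup=3)

mmcrnsd -F cesroot.stanza
mmcrfs cesfs -F cesroot.stanza -m 3 -M 3 -r 3 -R 3 -T /cesfs
# with protocol services stopped and the data copied across,
# point CES at the new location
mmchconfig cesSharedRoot=/cesfs/ces
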
Yes I take the point that more copies means more load for the client, but in these cases, we aren't thinking about gpfs as the fastest possible hpc file system, but for other infrastructure purposes, which is one of the ways the product seems to be moving. Simon ________________________________________ From: gpfsug-discuss-bounces at spectrumscale.org [gpfsug-discuss-bounces at spectrumscale.org] on behalf of Daniel Kidger [daniel.kidger at uk.ibm.com] Sent: 01 September 2016 12:22 To: gpfsug-discuss at spectrumscale.org Cc: gpfsug-discuss at spectrumscale.org Subject: Re: [gpfsug-discuss] Maximum value for data replication? Simon, Hi. Can you explain why you would like a full copy of all blocks on all 4 NSD servers ? Is there a particular use case, and hence an interest from product development? Otherwise remember that with 4 NSD servers, with one failure group per (storage rich) NSD server, then all 4 disk arrays will be loaded equally, as new files will get written to any 3 (or 2 or 1) of the 4 failure groups. Also remember that as you add more replication then there is more network load on the gpfs client as it has to perform all the writes itself. Perhaps someone technical can comment on the logic that determines which '3' out of 4 failure groups, a particular block is written to. Daniel [/spectrum_storage-banne] [Spectrum Scale Logo] Dr Daniel Kidger IBM Technical Sales Specialist Software Defined Solution Sales +44-07818 522 266 daniel.kidger at uk.ibm.com ----- Original message ----- From: Steve Duersch Sent by: gpfsug-discuss-bounces at spectrumscale.org To: gpfsug-discuss at spectrumscale.org Cc: Subject: Re: [gpfsug-discuss] Maximum value for data replication? Date: Wed, Aug 31, 2016 1:45 PM >>Is there a maximum value for data replication in Spectrum Scale? The maximum value for replication is 3. Steve Duersch Spectrum Scale RAID 845-433-7902 IBM Poughkeepsie, New York [Inactive hide details for gpfsug-discuss-request---08/30/2016 07:25:24 PM---Send gpfsug-discuss mailing list submissions to gp]gpfsug-discuss-request---08/30/2016 07:25:24 PM---Send gpfsug-discuss mailing list submissions to gpfsug-discuss at spectrumscale.org From: gpfsug-discuss-request at spectrumscale.org To: gpfsug-discuss at spectrumscale.org Date: 08/30/2016 07:25 PM Subject: gpfsug-discuss Digest, Vol 55, Issue 55 Sent by: gpfsug-discuss-bounces at spectrumscale.org ________________________________ Send gpfsug-discuss mailing list submissions to gpfsug-discuss at spectrumscale.org To subscribe or unsubscribe via the World Wide Web, visit http://gpfsug.org/mailman/listinfo/gpfsug-discuss or, via email, send a message with subject or body 'help' to gpfsug-discuss-request at spectrumscale.org You can reach the person managing the list at gpfsug-discuss-owner at spectrumscale.org When replying, please edit your Subject line so it is more specific than "Re: Contents of gpfsug-discuss digest..." Today's Topics: 1. Maximum value for data replication? (Simon Thompson (Research Computing - IT Services)) 2. greetings (Kevin D Johnson) 3. GPFS 3.5.0 on RHEL 6.8 (Lukas Hejtmanek) 4. Re: GPFS 3.5.0 on RHEL 6.8 (Kevin D Johnson) 5. Re: GPFS 3.5.0 on RHEL 6.8 (mark.bergman at uphs.upenn.edu) 6. Re: *New* IBM Spectrum Protect Whitepaper "Petascale Data Protection" (Lukas Hejtmanek) 7. 
Re: *New* IBM Spectrum Protect Whitepaper "Petascale Data Protection" (Sven Oehme) ---------------------------------------------------------------------- Message: 1 Date: Tue, 30 Aug 2016 19:09:05 +0000 From: "Simon Thompson (Research Computing - IT Services)" To: "gpfsug-discuss at spectrumscale.org" Subject: [gpfsug-discuss] Maximum value for data replication? Message-ID: Content-Type: text/plain; charset="us-ascii" Is there a maximum value for data replication in Spectrum Scale? I have a number of nsd servers which have local storage and Id like each node to have a full copy of all the data in the file-system, say this value is 4, can I set replication to 4 for data and metadata and have each server have a full copy? These are protocol nodes and multi cluster mount another file system (yes I know not supported) and the cesroot is in the remote file system. On several occasions where GPFS has wibbled a bit, this has caused issues with ces locks, so I was thinking of moving the cesroot to a local filesysyem which is replicated on the local ssds in the protocol nodes. I.e. Its a generally quiet file system as its only ces cluster config. I assume if I stop protocols, rsync the data and then change to the new ces root, I should be able to get this working? Thanks Simon ------------------------------ Message: 2 Date: Tue, 30 Aug 2016 19:43:39 +0000 From: "Kevin D Johnson" To: gpfsug-discuss at spectrumscale.org Subject: [gpfsug-discuss] greetings Message-ID: Content-Type: text/plain; charset="us-ascii" An HTML attachment was scrubbed... URL: ------------------------------ Message: 3 Date: Tue, 30 Aug 2016 22:39:18 +0200 From: Lukas Hejtmanek To: gpfsug-discuss at spectrumscale.org Subject: [gpfsug-discuss] GPFS 3.5.0 on RHEL 6.8 Message-ID: <20160830203917.qptfgqvlmdbzu6wr at ics.muni.cz> Content-Type: text/plain; charset=iso-8859-2 Hello, does it work for anyone? As of kernel 2.6.32-642, GPFS 3.5.0 (including the latest patch 32) does start but does not mount and file system. The internal mount cmd gets stucked. -- Luk?? Hejtm?nek ------------------------------ Message: 4 Date: Tue, 30 Aug 2016 20:51:39 +0000 From: "Kevin D Johnson" To: gpfsug-discuss at spectrumscale.org Subject: Re: [gpfsug-discuss] GPFS 3.5.0 on RHEL 6.8 Message-ID: Content-Type: text/plain; charset="us-ascii" An HTML attachment was scrubbed... URL: ------------------------------ Message: 5 Date: Tue, 30 Aug 2016 17:07:21 -0400 From: mark.bergman at uphs.upenn.edu To: gpfsug main discussion list Subject: Re: [gpfsug-discuss] GPFS 3.5.0 on RHEL 6.8 Message-ID: <24437-1472591241.445832 at bR6O.TofS.917u> Content-Type: text/plain; charset="UTF-8" In the message dated: Tue, 30 Aug 2016 22:39:18 +0200, The pithy ruminations from Lukas Hejtmanek on <[gpfsug-discuss] GPFS 3.5.0 on RHEL 6.8> were: => Hello, GPFS 3.5.0.[23..3-0] work for me under [CentOS|ScientificLinux] 6.8, but at kernel 2.6.32-573 and lower. I've found kernel bugs in blk_cloned_rq_check_limits() in later kernel revs that caused multipath errors, resulting in GPFS being unable to find all NSDs and mount the filesystem. I am not updating to a newer kernel until I'm certain this is resolved. I opened a bug with CentOS: https://bugs.centos.org/view.php?id=10997 and began an extended discussion with the (RH & SUSE) developers of that chunk of kernel code. I don't know if an upstream bug has been opened by RH, but see: https://patchwork.kernel.org/patch/9140337/ => => does it work for anyone? 
As of kernel 2.6.32-642, GPFS 3.5.0 (including the => latest patch 32) does start but does not mount and file system. The internal => mount cmd gets stucked. => => -- => Luk?? Hejtm?nek -- Mark Bergman voice: 215-746-4061 mark.bergman at uphs.upenn.edu fax: 215-614-0266 http://www.cbica.upenn.edu/ IT Technical Director, Center for Biomedical Image Computing and Analytics Department of Radiology University of Pennsylvania PGP Key: http://www.cbica.upenn.edu/sbia/bergman ------------------------------ Message: 6 Date: Wed, 31 Aug 2016 00:02:50 +0200 From: Lukas Hejtmanek To: gpfsug main discussion list Subject: Re: [gpfsug-discuss] *New* IBM Spectrum Protect Whitepaper "Petascale Data Protection" Message-ID: <20160830220250.yt6r7gvfq7rlvtcs at ics.muni.cz> Content-Type: text/plain; charset=iso-8859-2 Hello, On Mon, Aug 29, 2016 at 09:20:46AM +0200, Frank Kraemer wrote: > Find the paper here: > > https://www.ibm.com/developerworks/community/wikis/home?lang=en#!/wiki/Tivoli%20Storage%20Manager/page/Petascale%20Data%20Protection thank you for the paper, I appreciate it. However, I wonder whether it could be extended a little. As it has the title Petascale Data Protection, I think that in Peta scale, you have to deal with millions (well rather hundreds of millions) of files you store in and this is something where TSM does not scale well. Could you give some hints: On the backup site: mmbackup takes ages for: a) scan (try to scan 500M files even in parallel) b) backup - what if 10 % of files get changed - backup process can be blocked several days as mmbackup cannot run in several instances on the same file system, so you have to wait until one run of mmbackup finishes. How long could it take at petascale? On the restore site: how can I restore e.g. 40 millions of file efficiently? dsmc restore '/path/*' runs into serious troubles after say 20M files (maybe wrong internal structures used), however, scanning 1000 more files takes several minutes resulting the dsmc restore never reaches that 40M files. using filelists the situation is even worse. I run dsmc restore -filelist with a filelist consisting of 2.4M files. Running for *two* days without restoring even a single file. dsmc is consuming 100 % CPU. So any hints addressing these issues with really large number of files would be even more appreciated. -- Luk?? Hejtm?nek ------------------------------ Message: 7 Date: Tue, 30 Aug 2016 16:24:59 -0700 From: Sven Oehme To: gpfsug main discussion list Subject: Re: [gpfsug-discuss] *New* IBM Spectrum Protect Whitepaper "Petascale Data Protection" Message-ID: Content-Type: text/plain; charset="utf-8" so lets start with some simple questions. when you say mmbackup takes ages, what version of gpfs code are you running ? how do you execute the mmbackup command ? exact parameters would be useful . what HW are you using for the metadata disks ? how much capacity (df -h) and how many inodes (df -i) do you have in the filesystem you try to backup ? sven On Tue, Aug 30, 2016 at 3:02 PM, Lukas Hejtmanek wrote: > Hello, > > On Mon, Aug 29, 2016 at 09:20:46AM +0200, Frank Kraemer wrote: > > Find the paper here: > > > > https://www.ibm.com/developerworks/community/wikis/home?lang=en#!/wiki/ > Tivoli%20Storage%20Manager/page/Petascale%20Data%20Protection > > thank you for the paper, I appreciate it. > > However, I wonder whether it could be extended a little. 
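To make those questions concrete, the sort of data being asked for could be collected roughly like this (a sketch; fs1 is a placeholder filesystem name):

mmdiag --version
mmlsfs fs1 -V
df -h /fs1
df -i /fs1

plus the exact mmbackup command line that was used, for example something of the form mmbackup /fs1 -t incremental -N backupnodes, together with whatever -g/-s work directories are set. Hardware details for the metadata NSDs would come from mmlsnsd and mmlsdisk output.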
As it has the > title > Petascale Data Protection, I think that in Peta scale, you have to deal > with > millions (well rather hundreds of millions) of files you store in and this > is > something where TSM does not scale well. > > Could you give some hints: > > On the backup site: > mmbackup takes ages for: > a) scan (try to scan 500M files even in parallel) > b) backup - what if 10 % of files get changed - backup process can be > blocked > several days as mmbackup cannot run in several instances on the same file > system, so you have to wait until one run of mmbackup finishes. How long > could > it take at petascale? > > On the restore site: > how can I restore e.g. 40 millions of file efficiently? dsmc restore > '/path/*' > runs into serious troubles after say 20M files (maybe wrong internal > structures used), however, scanning 1000 more files takes several minutes > resulting the dsmc restore never reaches that 40M files. > > using filelists the situation is even worse. I run dsmc restore -filelist > with a filelist consisting of 2.4M files. Running for *two* days without > restoring even a single file. dsmc is consuming 100 % CPU. > > So any hints addressing these issues with really large number of files > would > be even more appreciated. > > -- > Luk?? Hejtm?nek > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > -------------- next part -------------- An HTML attachment was scrubbed... URL: ------------------------------ _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss End of gpfsug-discuss Digest, Vol 55, Issue 55 ********************************************** _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss Unless stated otherwise above: IBM United Kingdom Limited - Registered in England and Wales with number 741598. Registered office: PO Box 41, North Harbour, Portsmouth, Hampshire PO6 3AU -------------- next part -------------- A non-text attachment was scrubbed... Name: Image.1__=0ABB0AB3DFD67DBA8f9e8a93df938 at us.ibm.com.gif Type: image/gif Size: 105 bytes Desc: Image.1__=0ABB0AB3DFD67DBA8f9e8a93df938 at us.ibm.com.gif URL: From r.sobey at imperial.ac.uk Fri Sep 2 14:37:26 2016 From: r.sobey at imperial.ac.uk (Sobey, Richard A) Date: Fri, 2 Sep 2016 13:37:26 +0000 Subject: [gpfsug-discuss] CES node responding on system IP address Message-ID: Hi all, *Should* a CES node, 4.2.0 OR 4.2.1, be responding on its system IP address? The nodes in my cluster, seemingly randomly, either give me a list of shares, or prompt me to enter a username and password. For example, Start > Run \\cesnode.fqdn I get a prompt for a username and password. If I add the system IP into my hosts file and call it clustername.fqdn it responds normally i.e. no prompt for username or password. Should I be worried about the inconsistencies here? Richard Sobey Storage Area Network (SAN) Analyst Technical Operations, ICT Imperial College London South Kensington 403, City & Guilds Building London SW7 2AZ Tel: +44 (0)20 7594 6915 Email: r.sobey at imperial.ac.uk http://www.imperial.ac.uk/admin-services/ict/ -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From r.sobey at imperial.ac.uk Fri Sep 2 16:10:59 2016 From: r.sobey at imperial.ac.uk (Sobey, Richard A) Date: Fri, 2 Sep 2016 15:10:59 +0000 Subject: [gpfsug-discuss] Cannot stop SMB... stop NFS first? In-Reply-To: References: Message-ID: I?ve verified the upgrade has fixed this issue so thanks again. However I?ve noticed that stopping SMB doesn?t trigger an IP address failover. I think it should. mmces node suspend (or rebooting, or mmces address move, or etc..) seems to trigger the failover. Richard From: gpfsug-discuss-bounces at spectrumscale.org [mailto:gpfsug-discuss-bounces at spectrumscale.org] On Behalf Of Danny Alexander Calderon Rodriguez Sent: 27 August 2016 13:53 To: gpfsug-discuss at spectrumscale.org Subject: Re: [gpfsug-discuss] Cannot stop SMB... stop NFS first? Hi Richard This is fixed in release 4.2.1, if you cant upgrade now, you can fix this manuallly Just do this. edit file /usr/lpp/mmfs/lib/mmcesmon/SMBService.py Change if authType == 'ad' and not nodeState.nfsStopped: to nfsEnabled = utils.isProtocolEnabled("NFS", self.logger) if authType == 'ad' and not nodeState.nfsStopped and nfsEnabled: You need to stop the gpfs service in each node where you apply the change " after change the lines please use tap key" Enviado desde mi iPhone El 27/08/2016, a las 6:00 a.m., gpfsug-discuss-request at spectrumscale.org escribi?: Send gpfsug-discuss mailing list submissions to gpfsug-discuss at spectrumscale.org To subscribe or unsubscribe via the World Wide Web, visit http://gpfsug.org/mailman/listinfo/gpfsug-discuss or, via email, send a message with subject or body 'help' to gpfsug-discuss-request at spectrumscale.org You can reach the person managing the list at gpfsug-discuss-owner at spectrumscale.org When replying, please edit your Subject line so it is more specific than "Re: Contents of gpfsug-discuss digest..." Today's Topics: 1. Re: Cannot stop SMB... stop NFS first?(Christof Schmitt) 2. Re: CES and mmuserauth command (Christof Schmitt) ---------------------------------------------------------------------- Message: 1 Date: Fri, 26 Aug 2016 12:29:31 -0400 From: "Christof Schmitt" > To: gpfsug main discussion list > Subject: Re: [gpfsug-discuss] Cannot stop SMB... stop NFS first? Message-ID: > Content-Type: text/plain; charset="UTF-8" That would be the case when Active Directory is configured for authentication. In that case the SMB service includes two aspects: One is the actual SMB file server, and the second one is the service for the Active Directory integration. Since NFS depends on authentication and id mapping services, it requires SMB to be running. Regards, Christof Schmitt || IBM || Spectrum Scale Development || Tucson, AZ christof.schmitt at us.ibm.com || +1-520-799-2469 (T/L: 321-2469) From: "Sobey, Richard A" > To: "'gpfsug-discuss at spectrumscale.org'" > Date: 08/26/2016 04:48 AM Subject: [gpfsug-discuss] Cannot stop SMB... stop NFS first? Sent by: gpfsug-discuss-bounces at spectrumscale.org Sorry all, prepare for a deluge of emails like this, hopefully it?ll help other people implementing CES in the future. I?m trying to stop SMB on a node, but getting the following output: [root at cesnode ~]# mmces service stop smb smb: Request denied. Please stop NFS first [root at cesnode ~]# mmces service list Enabled services: SMB SMB is running As you can see there is no way to stop NFS when it?s not running but it seems to be blocking me. It?s happening on all the nodes in the cluster. SS version is 4.2.0 running on a fully up to date RHEL 7.1 server. 
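On the failover point: if the goal is to move the addresses off a node deliberately rather than rely on the health monitor, a sketch of the manual route (node names and the address are placeholders):

mmces node suspend -N cesnode1
mmces address list
mmces address move --ces-ip 10.0.0.10 --ces-node cesnode2
mmces node resume -N cesnode1

Suspending marks the node out of the pool so its CES addresses are redistributed, address list confirms where they landed, an explicit address move pins one IP to a chosen node, and resume puts the node back into service.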
Richard_______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss ------------------------------ Message: 2 Date: Fri, 26 Aug 2016 12:29:31 -0400 From: "Christof Schmitt" > To: gpfsug main discussion list > Subject: Re: [gpfsug-discuss] CES and mmuserauth command Message-ID: > Content-Type: text/plain; charset="ISO-2022-JP" The --user-name option applies to both, AD and LDAP authentication. In the LDAP case, this information is correct. I will try to get some clarification added for the AD case. The same applies to the information shown in "service list". There is a common field that holds the information and the parameter from the initial "service create" is stored there. The meaning is different for AD and LDAP: For LDAP it is the username being used to access the LDAP server, while in the AD case it was only the user initially used until the machine account was created. Regards, Christof Schmitt || IBM || Spectrum Scale Development || Tucson, AZ christof.schmitt at us.ibm.com || +1-520-799-2469 (T/L: 321-2469) From: Jan-Frode Myklebust > To: gpfsug main discussion list > Date: 08/26/2016 05:59 AM Subject: Re: [gpfsug-discuss] CES and mmuserauth command Sent by: gpfsug-discuss-bounces at spectrumscale.org On Fri, Aug 26, 2016 at 1:49 AM, Christof Schmitt < christof.schmitt at us.ibm.com> wrote: When joinging the AD domain, --user-name, --password and --server are only used to initially identify and logon to the AD and to create the machine account for the cluster. Once that is done, that information is no longer used, and e.g. the account from --user-name could be deleted, the password changed or the specified DC could be removed from the domain (as long as other DCs are remaining). That was my initial understanding of the --user-name, but when reading the man-page I get the impression that it's also used to do connect to AD to do user and group lookups: ------------------------------------------------------------------------------------------------------ ??user?name userName Specifies the user name to be used to perform operations against the authentication server. The specified user name must have sufficient permissions to read user and group attributes from the authentication server. ------------------------------------------------------------------------------------------------------- Also it's strange that "mmuserauth service list" would list the USER_NAME if it was only somthing that was used at configuration time..? -jf_______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss ------------------------------ _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss End of gpfsug-discuss Digest, Vol 55, Issue 44 ********************************************** -------------- next part -------------- An HTML attachment was scrubbed... URL: From S.J.Thompson at bham.ac.uk Fri Sep 2 16:15:30 2016 From: S.J.Thompson at bham.ac.uk (Simon Thompson (Research Computing - IT Services)) Date: Fri, 2 Sep 2016 15:15:30 +0000 Subject: [gpfsug-discuss] Cannot stop SMB... stop NFS first? In-Reply-To: References: , Message-ID: Should it? If you were running nfs and smb, would you necessarily want to fail the ip over? 
Simon ________________________________________ From: gpfsug-discuss-bounces at spectrumscale.org [gpfsug-discuss-bounces at spectrumscale.org] on behalf of Sobey, Richard A [r.sobey at imperial.ac.uk] Sent: 02 September 2016 16:10 To: gpfsug main discussion list Subject: Re: [gpfsug-discuss] Cannot stop SMB... stop NFS first? I?ve verified the upgrade has fixed this issue so thanks again. However I?ve noticed that stopping SMB doesn?t trigger an IP address failover. I think it should. mmces node suspend (or rebooting, or mmces address move, or etc..) seems to trigger the failover. Richard From: gpfsug-discuss-bounces at spectrumscale.org [mailto:gpfsug-discuss-bounces at spectrumscale.org] On Behalf Of Danny Alexander Calderon Rodriguez Sent: 27 August 2016 13:53 To: gpfsug-discuss at spectrumscale.org Subject: Re: [gpfsug-discuss] Cannot stop SMB... stop NFS first? Hi Richard This is fixed in release 4.2.1, if you cant upgrade now, you can fix this manuallly Just do this. edit file /usr/lpp/mmfs/lib/mmcesmon/SMBService.py Change if authType == 'ad' and not nodeState.nfsStopped: to nfsEnabled = utils.isProtocolEnabled("NFS", self.logger) if authType == 'ad' and not nodeState.nfsStopped and nfsEnabled: You need to stop the gpfs service in each node where you apply the change " after change the lines please use tap key" Enviado desde mi iPhone El 27/08/2016, a las 6:00 a.m., gpfsug-discuss-request at spectrumscale.org escribi?: Send gpfsug-discuss mailing list submissions to gpfsug-discuss at spectrumscale.org To subscribe or unsubscribe via the World Wide Web, visit http://gpfsug.org/mailman/listinfo/gpfsug-discuss or, via email, send a message with subject or body 'help' to gpfsug-discuss-request at spectrumscale.org You can reach the person managing the list at gpfsug-discuss-owner at spectrumscale.org When replying, please edit your Subject line so it is more specific than "Re: Contents of gpfsug-discuss digest..." Today's Topics: 1. Re: Cannot stop SMB... stop NFS first?(Christof Schmitt) 2. Re: CES and mmuserauth command (Christof Schmitt) ---------------------------------------------------------------------- Message: 1 Date: Fri, 26 Aug 2016 12:29:31 -0400 From: "Christof Schmitt" > To: gpfsug main discussion list > Subject: Re: [gpfsug-discuss] Cannot stop SMB... stop NFS first? Message-ID: > Content-Type: text/plain; charset="UTF-8" That would be the case when Active Directory is configured for authentication. In that case the SMB service includes two aspects: One is the actual SMB file server, and the second one is the service for the Active Directory integration. Since NFS depends on authentication and id mapping services, it requires SMB to be running. Regards, Christof Schmitt || IBM || Spectrum Scale Development || Tucson, AZ christof.schmitt at us.ibm.com || +1-520-799-2469 (T/L: 321-2469) From: "Sobey, Richard A" > To: "'gpfsug-discuss at spectrumscale.org'" > Date: 08/26/2016 04:48 AM Subject: [gpfsug-discuss] Cannot stop SMB... stop NFS first? Sent by: gpfsug-discuss-bounces at spectrumscale.org Sorry all, prepare for a deluge of emails like this, hopefully it?ll help other people implementing CES in the future. I?m trying to stop SMB on a node, but getting the following output: [root at cesnode ~]# mmces service stop smb smb: Request denied. Please stop NFS first [root at cesnode ~]# mmces service list Enabled services: SMB SMB is running As you can see there is no way to stop NFS when it?s not running but it seems to be blocking me. 
It?s happening on all the nodes in the cluster. SS version is 4.2.0 running on a fully up to date RHEL 7.1 server. Richard_______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss ------------------------------ Message: 2 Date: Fri, 26 Aug 2016 12:29:31 -0400 From: "Christof Schmitt" > To: gpfsug main discussion list > Subject: Re: [gpfsug-discuss] CES and mmuserauth command Message-ID: > Content-Type: text/plain; charset="ISO-2022-JP" The --user-name option applies to both, AD and LDAP authentication. In the LDAP case, this information is correct. I will try to get some clarification added for the AD case. The same applies to the information shown in "service list". There is a common field that holds the information and the parameter from the initial "service create" is stored there. The meaning is different for AD and LDAP: For LDAP it is the username being used to access the LDAP server, while in the AD case it was only the user initially used until the machine account was created. Regards, Christof Schmitt || IBM || Spectrum Scale Development || Tucson, AZ christof.schmitt at us.ibm.com || +1-520-799-2469 (T/L: 321-2469) From: Jan-Frode Myklebust > To: gpfsug main discussion list > Date: 08/26/2016 05:59 AM Subject: Re: [gpfsug-discuss] CES and mmuserauth command Sent by: gpfsug-discuss-bounces at spectrumscale.org On Fri, Aug 26, 2016 at 1:49 AM, Christof Schmitt < christof.schmitt at us.ibm.com> wrote: When joinging the AD domain, --user-name, --password and --server are only used to initially identify and logon to the AD and to create the machine account for the cluster. Once that is done, that information is no longer used, and e.g. the account from --user-name could be deleted, the password changed or the specified DC could be removed from the domain (as long as other DCs are remaining). That was my initial understanding of the --user-name, but when reading the man-page I get the impression that it's also used to do connect to AD to do user and group lookups: ------------------------------------------------------------------------------------------------------ ??user?name userName Specifies the user name to be used to perform operations against the authentication server. The specified user name must have sufficient permissions to read user and group attributes from the authentication server. ------------------------------------------------------------------------------------------------------- Also it's strange that "mmuserauth service list" would list the USER_NAME if it was only somthing that was used at configuration time..? -jf_______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss ------------------------------ _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss End of gpfsug-discuss Digest, Vol 55, Issue 44 ********************************************** From r.sobey at imperial.ac.uk Fri Sep 2 16:23:28 2016 From: r.sobey at imperial.ac.uk (Sobey, Richard A) Date: Fri, 2 Sep 2016 15:23:28 +0000 Subject: [gpfsug-discuss] Cannot stop SMB... stop NFS first? 
In-Reply-To: References: , Message-ID: A fair point, but since we're not running NFS, a failure of the only other service [SMB], whether it stops through user input or some other means, should cause the node to go unhealthy (in CTDB parlance) and trigger a failover. That would be my preference. Otoh if you were running NFS and SMB and one of those services crashed, do you still want a node in the cluster that could potentially respond and fails to do so? I guess it's a question for each organisation to answer themselves. -----Original Message----- From: gpfsug-discuss-bounces at spectrumscale.org [mailto:gpfsug-discuss-bounces at spectrumscale.org] On Behalf Of Simon Thompson (Research Computing - IT Services) Sent: 02 September 2016 16:16 To: gpfsug main discussion list Subject: Re: [gpfsug-discuss] Cannot stop SMB... stop NFS first? Should it? If you were running nfs and smb, would you necessarily want to fail the ip over? Simon ________________________________________ From: gpfsug-discuss-bounces at spectrumscale.org [gpfsug-discuss-bounces at spectrumscale.org] on behalf of Sobey, Richard A [r.sobey at imperial.ac.uk] Sent: 02 September 2016 16:10 To: gpfsug main discussion list Subject: Re: [gpfsug-discuss] Cannot stop SMB... stop NFS first? I've verified the upgrade has fixed this issue so thanks again. However I've noticed that stopping SMB doesn't trigger an IP address failover. I think it should. mmces node suspend (or rebooting, or mmces address move, or etc..) seems to trigger the failover. Richard From: gpfsug-discuss-bounces at spectrumscale.org [mailto:gpfsug-discuss-bounces at spectrumscale.org] On Behalf Of Danny Alexander Calderon Rodriguez Sent: 27 August 2016 13:53 To: gpfsug-discuss at spectrumscale.org Subject: Re: [gpfsug-discuss] Cannot stop SMB... stop NFS first? Hi Richard This is fixed in release 4.2.1, if you cant upgrade now, you can fix this manuallly Just do this. edit file /usr/lpp/mmfs/lib/mmcesmon/SMBService.py Change if authType == 'ad' and not nodeState.nfsStopped: to nfsEnabled = utils.isProtocolEnabled("NFS", self.logger) if authType == 'ad' and not nodeState.nfsStopped and nfsEnabled: You need to stop the gpfs service in each node where you apply the change " after change the lines please use tap key" Enviado desde mi iPhone El 27/08/2016, a las 6:00 a.m., gpfsug-discuss-request at spectrumscale.org escribi?: Send gpfsug-discuss mailing list submissions to gpfsug-discuss at spectrumscale.org To subscribe or unsubscribe via the World Wide Web, visit http://gpfsug.org/mailman/listinfo/gpfsug-discuss or, via email, send a message with subject or body 'help' to gpfsug-discuss-request at spectrumscale.org You can reach the person managing the list at gpfsug-discuss-owner at spectrumscale.org When replying, please edit your Subject line so it is more specific than "Re: Contents of gpfsug-discuss digest..." Today's Topics: 1. Re: Cannot stop SMB... stop NFS first?(Christof Schmitt) 2. Re: CES and mmuserauth command (Christof Schmitt) ---------------------------------------------------------------------- Message: 1 Date: Fri, 26 Aug 2016 12:29:31 -0400 From: "Christof Schmitt" > To: gpfsug main discussion list > Subject: Re: [gpfsug-discuss] Cannot stop SMB... stop NFS first? Message-ID: > Content-Type: text/plain; charset="UTF-8" That would be the case when Active Directory is configured for authentication. 
In that case the SMB service includes two aspects: One is the actual SMB file server, and the second one is the service for the Active Directory integration. Since NFS depends on authentication and id mapping services, it requires SMB to be running. Regards, Christof Schmitt || IBM || Spectrum Scale Development || Tucson, AZ christof.schmitt at us.ibm.com || +1-520-799-2469 (T/L: 321-2469) From: "Sobey, Richard A" > To: "'gpfsug-discuss at spectrumscale.org'" > Date: 08/26/2016 04:48 AM Subject: [gpfsug-discuss] Cannot stop SMB... stop NFS first? Sent by: gpfsug-discuss-bounces at spectrumscale.org Sorry all, prepare for a deluge of emails like this, hopefully it?ll help other people implementing CES in the future. I?m trying to stop SMB on a node, but getting the following output: [root at cesnode ~]# mmces service stop smb smb: Request denied. Please stop NFS first [root at cesnode ~]# mmces service list Enabled services: SMB SMB is running As you can see there is no way to stop NFS when it?s not running but it seems to be blocking me. It?s happening on all the nodes in the cluster. SS version is 4.2.0 running on a fully up to date RHEL 7.1 server. Richard_______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss ------------------------------ Message: 2 Date: Fri, 26 Aug 2016 12:29:31 -0400 From: "Christof Schmitt" > To: gpfsug main discussion list > Subject: Re: [gpfsug-discuss] CES and mmuserauth command Message-ID: > Content-Type: text/plain; charset="ISO-2022-JP" The --user-name option applies to both, AD and LDAP authentication. In the LDAP case, this information is correct. I will try to get some clarification added for the AD case. The same applies to the information shown in "service list". There is a common field that holds the information and the parameter from the initial "service create" is stored there. The meaning is different for AD and LDAP: For LDAP it is the username being used to access the LDAP server, while in the AD case it was only the user initially used until the machine account was created. Regards, Christof Schmitt || IBM || Spectrum Scale Development || Tucson, AZ christof.schmitt at us.ibm.com || +1-520-799-2469 (T/L: 321-2469) From: Jan-Frode Myklebust > To: gpfsug main discussion list > Date: 08/26/2016 05:59 AM Subject: Re: [gpfsug-discuss] CES and mmuserauth command Sent by: gpfsug-discuss-bounces at spectrumscale.org On Fri, Aug 26, 2016 at 1:49 AM, Christof Schmitt < christof.schmitt at us.ibm.com> wrote: When joinging the AD domain, --user-name, --password and --server are only used to initially identify and logon to the AD and to create the machine account for the cluster. Once that is done, that information is no longer used, and e.g. the account from --user-name could be deleted, the password changed or the specified DC could be removed from the domain (as long as other DCs are remaining). That was my initial understanding of the --user-name, but when reading the man-page I get the impression that it's also used to do connect to AD to do user and group lookups: ------------------------------------------------------------------------------------------------------ ??user?name userName Specifies the user name to be used to perform operations against the authentication server. The specified user name must have sufficient permissions to read user and group attributes from the authentication server. 
------------------------------------------------------------------------------------------------------- Also it's strange that "mmuserauth service list" would list the USER_NAME if it was only somthing that was used at configuration time..? -jf_______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss ------------------------------ _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss End of gpfsug-discuss Digest, Vol 55, Issue 44 ********************************************** _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss From ulmer at ulmer.org Fri Sep 2 17:02:44 2016 From: ulmer at ulmer.org (Stephen Ulmer) Date: Fri, 2 Sep 2016 12:02:44 -0400 Subject: [gpfsug-discuss] Cannot stop SMB... stop NFS first? In-Reply-To: References: Message-ID: I think that stopping SMB is an explicitly different assertion than suspending the node, et cetera. When you ask the service to stop, it should stop -- not start a game of whack-a-mole. If you wanted to move the service there are other other ways. If it fails, clearly it the IP address should move. Liberty, -- Stephen > On Sep 2, 2016, at 11:23 AM, Sobey, Richard A wrote: > > A fair point, but since we're not running NFS, a failure of the only other service [SMB], whether it stops through user input or some other means, should cause the node to go unhealthy (in CTDB parlance) and trigger a failover. That would be my preference. > > Otoh if you were running NFS and SMB and one of those services crashed, do you still want a node in the cluster that could potentially respond and fails to do so? I guess it's a question for each organisation to answer themselves. > > -----Original Message----- > From: gpfsug-discuss-bounces at spectrumscale.org [mailto:gpfsug-discuss-bounces at spectrumscale.org] On Behalf Of Simon Thompson (Research Computing - IT Services) > Sent: 02 September 2016 16:16 > To: gpfsug main discussion list > Subject: Re: [gpfsug-discuss] Cannot stop SMB... stop NFS first? > > > Should it? > > If you were running nfs and smb, would you necessarily want to fail the ip over? > > Simon > ________________________________________ > From: gpfsug-discuss-bounces at spectrumscale.org [gpfsug-discuss-bounces at spectrumscale.org] on behalf of Sobey, Richard A [r.sobey at imperial.ac.uk] > Sent: 02 September 2016 16:10 > To: gpfsug main discussion list > Subject: Re: [gpfsug-discuss] Cannot stop SMB... stop NFS first? > > I've verified the upgrade has fixed this issue so thanks again. > > However I've noticed that stopping SMB doesn't trigger an IP address failover. I think it should. mmces node suspend (or rebooting, or mmces address move, or etc..) seems to trigger the failover. > > Richard > > From: gpfsug-discuss-bounces at spectrumscale.org [mailto:gpfsug-discuss-bounces at spectrumscale.org] On Behalf Of Danny Alexander Calderon Rodriguez > Sent: 27 August 2016 13:53 > To: gpfsug-discuss at spectrumscale.org > Subject: Re: [gpfsug-discuss] Cannot stop SMB... stop NFS first? > > Hi Richard > > This is fixed in release 4.2.1, if you cant upgrade now, you can fix this manuallly > > > Just do this. 
> > edit file /usr/lpp/mmfs/lib/mmcesmon/SMBService.py > > > > Change > > if authType == 'ad' and not nodeState.nfsStopped: > > to > > > > nfsEnabled = utils.isProtocolEnabled("NFS", self.logger) > if authType == 'ad' and not nodeState.nfsStopped and nfsEnabled: > > > You need to stop the gpfs service in each node where you apply the change > > > " after change the lines please use tap key" > > > > Enviado desde mi iPhone > > El 27/08/2016, a las 6:00 a.m., gpfsug-discuss-request at spectrumscale.org escribi?: > Send gpfsug-discuss mailing list submissions to > gpfsug-discuss at spectrumscale.org > > To subscribe or unsubscribe via the World Wide Web, visit > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > or, via email, send a message with subject or body 'help' to > gpfsug-discuss-request at spectrumscale.org > > You can reach the person managing the list at > gpfsug-discuss-owner at spectrumscale.org > > When replying, please edit your Subject line so it is more specific than "Re: Contents of gpfsug-discuss digest..." > > > Today's Topics: > > 1. Re: Cannot stop SMB... stop NFS first?(Christof Schmitt) > 2. Re: CES and mmuserauth command (Christof Schmitt) > > > ---------------------------------------------------------------------- > > Message: 1 > Date: Fri, 26 Aug 2016 12:29:31 -0400 > From: "Christof Schmitt" > > To: gpfsug main discussion list > > Subject: Re: [gpfsug-discuss] Cannot stop SMB... stop NFS first? > Message-ID: > > > > Content-Type: text/plain; charset="UTF-8" > > That would be the case when Active Directory is configured for authentication. In that case the SMB service includes two aspects: One is the actual SMB file server, and the second one is the service for the Active Directory integration. Since NFS depends on authentication and id mapping services, it requires SMB to be running. > > Regards, > > Christof Schmitt || IBM || Spectrum Scale Development || Tucson, AZ > christof.schmitt at us.ibm.com || +1-520-799-2469 (T/L: 321-2469) > > > > From: "Sobey, Richard A" > > To: "'gpfsug-discuss at spectrumscale.org'" > > > Date: 08/26/2016 04:48 AM > Subject: [gpfsug-discuss] Cannot stop SMB... stop NFS first? > Sent by: gpfsug-discuss-bounces at spectrumscale.org > > > > Sorry all, prepare for a deluge of emails like this, hopefully it?ll help other people implementing CES in the future. > > I?m trying to stop SMB on a node, but getting the following output: > > [root at cesnode ~]# mmces service stop smb > smb: Request denied. Please stop NFS first > > [root at cesnode ~]# mmces service list > Enabled services: SMB > SMB is running > > As you can see there is no way to stop NFS when it?s not running but it seems to be blocking me. It?s happening on all the nodes in the cluster. > > SS version is 4.2.0 running on a fully up to date RHEL 7.1 server. > > Richard_______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > > > > > ------------------------------ > > Message: 2 > Date: Fri, 26 Aug 2016 12:29:31 -0400 > From: "Christof Schmitt" > > To: gpfsug main discussion list > > Subject: Re: [gpfsug-discuss] CES and mmuserauth command > Message-ID: > > > > Content-Type: text/plain; charset="ISO-2022-JP" > > The --user-name option applies to both, AD and LDAP authentication. In the LDAP case, this information is correct. I will try to get some clarification added for the AD case. > > The same applies to the information shown in "service list". 
There is a common field that holds the information and the parameter from the initial "service create" is stored there. The meaning is different for AD and > LDAP: For LDAP it is the username being used to access the LDAP server, while in the AD case it was only the user initially used until the machine account was created. > > Regards, > > Christof Schmitt || IBM || Spectrum Scale Development || Tucson, AZ > christof.schmitt at us.ibm.com || +1-520-799-2469 (T/L: 321-2469) > > > > From: Jan-Frode Myklebust > > To: gpfsug main discussion list > > Date: 08/26/2016 05:59 AM > Subject: Re: [gpfsug-discuss] CES and mmuserauth command > Sent by: gpfsug-discuss-bounces at spectrumscale.org > > > > > On Fri, Aug 26, 2016 at 1:49 AM, Christof Schmitt < christof.schmitt at us.ibm.com> wrote: > > When joinging the AD domain, --user-name, --password and --server are only used to initially identify and logon to the AD and to create the machine account for the cluster. Once that is done, that information is no longer used, and e.g. the account from --user-name could be deleted, the password changed or the specified DC could be removed from the domain (as long as other DCs are remaining). > > > That was my initial understanding of the --user-name, but when reading the man-page I get the impression that it's also used to do connect to AD to do user and group lookups: > > ------------------------------------------------------------------------------------------------------ > ??user?name userName > Specifies the user name to be used to perform operations > against the authentication server. The specified user > name must have sufficient permissions to read user and > group attributes from the authentication server. > ------------------------------------------------------------------------------------------------------- > > Also it's strange that "mmuserauth service list" would list the USER_NAME if it was only somthing that was used at configuration time..? > > > > -jf_______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > > > > > > ------------------------------ > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > > End of gpfsug-discuss Digest, Vol 55, Issue 44 > ********************************************** > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss From laurence at qsplace.co.uk Fri Sep 2 18:54:02 2016 From: laurence at qsplace.co.uk (Laurence Horrors-Barlow) Date: Fri, 2 Sep 2016 19:54:02 +0200 Subject: [gpfsug-discuss] Cannot stop SMB... stop NFS first? In-Reply-To: References: Message-ID: <721250E5-767B-4C44-A9E1-5DD255FD4F7D@qsplace.co.uk> I believe the services auto restart on a crash (or kill), a change I noticed between 4.1.1 and 4.2 hence no IP fail over. Suspending a node to force a fail over is possible the most sensible approach. -- Lauz Sent from my iPad > On 2 Sep 2016, at 18:02, Stephen Ulmer wrote: > > I think that stopping SMB is an explicitly different assertion than suspending the node, et cetera. 
When you ask the service to stop, it should stop -- not start a game of whack-a-mole. > > If you wanted to move the service there are other other ways. If it fails, clearly it the IP address should move. > > Liberty, > > -- > Stephen > > > >> On Sep 2, 2016, at 11:23 AM, Sobey, Richard A wrote: >> >> A fair point, but since we're not running NFS, a failure of the only other service [SMB], whether it stops through user input or some other means, should cause the node to go unhealthy (in CTDB parlance) and trigger a failover. That would be my preference. >> >> Otoh if you were running NFS and SMB and one of those services crashed, do you still want a node in the cluster that could potentially respond and fails to do so? I guess it's a question for each organisation to answer themselves. >> >> -----Original Message----- >> From: gpfsug-discuss-bounces at spectrumscale.org [mailto:gpfsug-discuss-bounces at spectrumscale.org] On Behalf Of Simon Thompson (Research Computing - IT Services) >> Sent: 02 September 2016 16:16 >> To: gpfsug main discussion list >> Subject: Re: [gpfsug-discuss] Cannot stop SMB... stop NFS first? >> >> >> Should it? >> >> If you were running nfs and smb, would you necessarily want to fail the ip over? >> >> Simon >> ________________________________________ >> From: gpfsug-discuss-bounces at spectrumscale.org [gpfsug-discuss-bounces at spectrumscale.org] on behalf of Sobey, Richard A [r.sobey at imperial.ac.uk] >> Sent: 02 September 2016 16:10 >> To: gpfsug main discussion list >> Subject: Re: [gpfsug-discuss] Cannot stop SMB... stop NFS first? >> >> I've verified the upgrade has fixed this issue so thanks again. >> >> However I've noticed that stopping SMB doesn't trigger an IP address failover. I think it should. mmces node suspend (or rebooting, or mmces address move, or etc..) seems to trigger the failover. >> >> Richard >> >> From: gpfsug-discuss-bounces at spectrumscale.org [mailto:gpfsug-discuss-bounces at spectrumscale.org] On Behalf Of Danny Alexander Calderon Rodriguez >> Sent: 27 August 2016 13:53 >> To: gpfsug-discuss at spectrumscale.org >> Subject: Re: [gpfsug-discuss] Cannot stop SMB... stop NFS first? >> >> Hi Richard >> >> This is fixed in release 4.2.1, if you cant upgrade now, you can fix this manuallly >> >> >> Just do this. >> >> edit file /usr/lpp/mmfs/lib/mmcesmon/SMBService.py >> >> >> >> Change >> >> if authType == 'ad' and not nodeState.nfsStopped: >> >> to >> >> >> >> nfsEnabled = utils.isProtocolEnabled("NFS", self.logger) >> if authType == 'ad' and not nodeState.nfsStopped and nfsEnabled: >> >> >> You need to stop the gpfs service in each node where you apply the change >> >> >> " after change the lines please use tap key" >> >> >> >> Enviado desde mi iPhone >> >> El 27/08/2016, a las 6:00 a.m., gpfsug-discuss-request at spectrumscale.org escribi?: >> Send gpfsug-discuss mailing list submissions to >> gpfsug-discuss at spectrumscale.org >> >> To subscribe or unsubscribe via the World Wide Web, visit >> http://gpfsug.org/mailman/listinfo/gpfsug-discuss >> or, via email, send a message with subject or body 'help' to >> gpfsug-discuss-request at spectrumscale.org >> >> You can reach the person managing the list at >> gpfsug-discuss-owner at spectrumscale.org >> >> When replying, please edit your Subject line so it is more specific than "Re: Contents of gpfsug-discuss digest..." >> >> >> Today's Topics: >> >> 1. Re: Cannot stop SMB... stop NFS first?(Christof Schmitt) >> 2. 
Re: CES and mmuserauth command (Christof Schmitt) >> >> >> ---------------------------------------------------------------------- >> >> Message: 1 >> Date: Fri, 26 Aug 2016 12:29:31 -0400 >> From: "Christof Schmitt" > >> To: gpfsug main discussion list > >> Subject: Re: [gpfsug-discuss] Cannot stop SMB... stop NFS first? >> Message-ID: >> > >> >> Content-Type: text/plain; charset="UTF-8" >> >> That would be the case when Active Directory is configured for authentication. In that case the SMB service includes two aspects: One is the actual SMB file server, and the second one is the service for the Active Directory integration. Since NFS depends on authentication and id mapping services, it requires SMB to be running. >> >> Regards, >> >> Christof Schmitt || IBM || Spectrum Scale Development || Tucson, AZ >> christof.schmitt at us.ibm.com || +1-520-799-2469 (T/L: 321-2469) >> >> >> >> From: "Sobey, Richard A" > >> To: "'gpfsug-discuss at spectrumscale.org'" >> > >> Date: 08/26/2016 04:48 AM >> Subject: [gpfsug-discuss] Cannot stop SMB... stop NFS first? >> Sent by: gpfsug-discuss-bounces at spectrumscale.org >> >> >> >> Sorry all, prepare for a deluge of emails like this, hopefully it?ll help other people implementing CES in the future. >> >> I?m trying to stop SMB on a node, but getting the following output: >> >> [root at cesnode ~]# mmces service stop smb >> smb: Request denied. Please stop NFS first >> >> [root at cesnode ~]# mmces service list >> Enabled services: SMB >> SMB is running >> >> As you can see there is no way to stop NFS when it?s not running but it seems to be blocking me. It?s happening on all the nodes in the cluster. >> >> SS version is 4.2.0 running on a fully up to date RHEL 7.1 server. >> >> Richard_______________________________________________ >> gpfsug-discuss mailing list >> gpfsug-discuss at spectrumscale.org >> http://gpfsug.org/mailman/listinfo/gpfsug-discuss >> >> >> >> >> >> ------------------------------ >> >> Message: 2 >> Date: Fri, 26 Aug 2016 12:29:31 -0400 >> From: "Christof Schmitt" > >> To: gpfsug main discussion list > >> Subject: Re: [gpfsug-discuss] CES and mmuserauth command >> Message-ID: >> > >> >> Content-Type: text/plain; charset="ISO-2022-JP" >> >> The --user-name option applies to both, AD and LDAP authentication. In the LDAP case, this information is correct. I will try to get some clarification added for the AD case. >> >> The same applies to the information shown in "service list". There is a common field that holds the information and the parameter from the initial "service create" is stored there. The meaning is different for AD and >> LDAP: For LDAP it is the username being used to access the LDAP server, while in the AD case it was only the user initially used until the machine account was created. >> >> Regards, >> >> Christof Schmitt || IBM || Spectrum Scale Development || Tucson, AZ >> christof.schmitt at us.ibm.com || +1-520-799-2469 (T/L: 321-2469) >> >> >> >> From: Jan-Frode Myklebust > >> To: gpfsug main discussion list > >> Date: 08/26/2016 05:59 AM >> Subject: Re: [gpfsug-discuss] CES and mmuserauth command >> Sent by: gpfsug-discuss-bounces at spectrumscale.org >> >> >> >> >> On Fri, Aug 26, 2016 at 1:49 AM, Christof Schmitt < christof.schmitt at us.ibm.com> wrote: >> >> When joinging the AD domain, --user-name, --password and --server are only used to initially identify and logon to the AD and to create the machine account for the cluster. Once that is done, that information is no longer used, and e.g. 
the account from --user-name could be deleted, the password changed or the specified DC could be removed from the domain (as long as other DCs are remaining). >> >> >> That was my initial understanding of the --user-name, but when reading the man-page I get the impression that it's also used to do connect to AD to do user and group lookups: >> >> ------------------------------------------------------------------------------------------------------ >> ??user?name userName >> Specifies the user name to be used to perform operations >> against the authentication server. The specified user >> name must have sufficient permissions to read user and >> group attributes from the authentication server. >> ------------------------------------------------------------------------------------------------------- >> >> Also it's strange that "mmuserauth service list" would list the USER_NAME if it was only somthing that was used at configuration time..? >> >> >> >> -jf_______________________________________________ >> gpfsug-discuss mailing list >> gpfsug-discuss at spectrumscale.org >> http://gpfsug.org/mailman/listinfo/gpfsug-discuss >> >> >> >> >> >> >> ------------------------------ >> >> _______________________________________________ >> gpfsug-discuss mailing list >> gpfsug-discuss at spectrumscale.org >> http://gpfsug.org/mailman/listinfo/gpfsug-discuss >> >> >> End of gpfsug-discuss Digest, Vol 55, Issue 44 >> ********************************************** >> >> _______________________________________________ >> gpfsug-discuss mailing list >> gpfsug-discuss at spectrumscale.org >> http://gpfsug.org/mailman/listinfo/gpfsug-discuss >> _______________________________________________ >> gpfsug-discuss mailing list >> gpfsug-discuss at spectrumscale.org >> http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > From christof.schmitt at us.ibm.com Fri Sep 2 19:20:45 2016 From: christof.schmitt at us.ibm.com (Christof Schmitt) Date: Fri, 2 Sep 2016 11:20:45 -0700 Subject: [gpfsug-discuss] CES and mmuserauth command In-Reply-To: References: Message-ID: After looking into this again, the source of confusion is probably from the fact that there are three different authentication schemes present here: When configuring a LDAP server for file or object authentication, then the specified server, user and password are used during normal operations for querying user data. The same applies for configuring object authentication with AD; AD is here treated as a LDAP server. Configuring AD for file authentication is different in that during the "mmuserauth service create", the machine account is created, and then that account is used to connect to a DC that is chosen from the DCs discovered through DNS and not necessarily the one used for the initial configuration. I submitted an internal request to explain this better in the mmuserauth manpage. Regards, Christof Schmitt || IBM || Spectrum Scale Development || Tucson, AZ christof.schmitt at us.ibm.com || +1-520-799-2469 (T/L: 321-2469) From: Christof Schmitt/Tucson/IBM at IBMUS To: gpfsug main discussion list Date: 08/26/2016 09:30 AM Subject: Re: [gpfsug-discuss] CES and mmuserauth command Sent by: gpfsug-discuss-bounces at spectrumscale.org The --user-name option applies to both, AD and LDAP authentication. In the LDAP case, this information is correct. 
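For completeness, a sketch of what the AD file-authentication setup being described here typically looks like; every value below is a placeholder for the local domain, and the exact options should be checked against the mmuserauth man page for the installed release:

mmuserauth service create --type ad --data-access-method file \
    --netbios-name cescluster --user-name ad-join-account \
    --servers dc1.example.com --idmap-role master \
    --unixmap-domains 'EXAMPLE(10000-299999)'
mmuserauth service list

After the join, the machine account created for the cluster is what talks to whichever DC is discovered via DNS, which is why the USER_NAME shown by service list is only a record of the account used at configuration time.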
I will try to get some clarification added for the AD case. The same applies to the information shown in "service list". There is a common field that holds the information and the parameter from the initial "service create" is stored there. The meaning is different for AD and LDAP: For LDAP it is the username being used to access the LDAP server, while in the AD case it was only the user initially used until the machine account was created. Regards, Christof Schmitt || IBM || Spectrum Scale Development || Tucson, AZ christof.schmitt at us.ibm.com || +1-520-799-2469 (T/L: 321-2469) From: Jan-Frode Myklebust To: gpfsug main discussion list Date: 08/26/2016 05:59 AM Subject: Re: [gpfsug-discuss] CES and mmuserauth command Sent by: gpfsug-discuss-bounces at spectrumscale.org On Fri, Aug 26, 2016 at 1:49 AM, Christof Schmitt < christof.schmitt at us.ibm.com> wrote: When joinging the AD domain, --user-name, --password and --server are only used to initially identify and logon to the AD and to create the machine account for the cluster. Once that is done, that information is no longer used, and e.g. the account from --user-name could be deleted, the password changed or the specified DC could be removed from the domain (as long as other DCs are remaining). That was my initial understanding of the --user-name, but when reading the man-page I get the impression that it's also used to do connect to AD to do user and group lookups: ------------------------------------------------------------------------------------------------------ ??user?name userName Specifies the user name to be used to perform operations against the authentication server. The specified user name must have sufficient permissions to read user and group attributes from the authentication server. ------------------------------------------------------------------------------------------------------- Also it's strange that "mmuserauth service list" would list the USER_NAME if it was only somthing that was used at configuration time..? -jf_______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss From r.sobey at imperial.ac.uk Fri Sep 2 22:02:03 2016 From: r.sobey at imperial.ac.uk (Sobey, Richard A) Date: Fri, 2 Sep 2016 21:02:03 +0000 Subject: [gpfsug-discuss] Cannot stop SMB... stop NFS first? In-Reply-To: References: , Message-ID: That makes more sense putting it that way. Cheers Richard Get Outlook for Android On Fri, Sep 2, 2016 at 5:04 PM +0100, "Stephen Ulmer" > wrote: I think that stopping SMB is an explicitly different assertion than suspending the node, et cetera. When you ask the service to stop, it should stop -- not start a game of whack-a-mole. If you wanted to move the service there are other other ways. If it fails, clearly it the IP address should move. Liberty, -- Stephen > On Sep 2, 2016, at 11:23 AM, Sobey, Richard A wrote: > > A fair point, but since we're not running NFS, a failure of the only other service [SMB], whether it stops through user input or some other means, should cause the node to go unhealthy (in CTDB parlance) and trigger a failover. That would be my preference. > > Otoh if you were running NFS and SMB and one of those services crashed, do you still want a node in the cluster that could potentially respond and fails to do so? 
I guess it's a question for each organisation to answer themselves. > > -----Original Message----- > From: gpfsug-discuss-bounces at spectrumscale.org [mailto:gpfsug-discuss-bounces at spectrumscale.org] On Behalf Of Simon Thompson (Research Computing - IT Services) > Sent: 02 September 2016 16:16 > To: gpfsug main discussion list > Subject: Re: [gpfsug-discuss] Cannot stop SMB... stop NFS first? > > > Should it? > > If you were running nfs and smb, would you necessarily want to fail the ip over? > > Simon > ________________________________________ > From: gpfsug-discuss-bounces at spectrumscale.org [gpfsug-discuss-bounces at spectrumscale.org] on behalf of Sobey, Richard A [r.sobey at imperial.ac.uk] > Sent: 02 September 2016 16:10 > To: gpfsug main discussion list > Subject: Re: [gpfsug-discuss] Cannot stop SMB... stop NFS first? > > I've verified the upgrade has fixed this issue so thanks again. > > However I've noticed that stopping SMB doesn't trigger an IP address failover. I think it should. mmces node suspend (or rebooting, or mmces address move, or etc..) seems to trigger the failover. > > Richard > > From: gpfsug-discuss-bounces at spectrumscale.org [mailto:gpfsug-discuss-bounces at spectrumscale.org] On Behalf Of Danny Alexander Calderon Rodriguez > Sent: 27 August 2016 13:53 > To: gpfsug-discuss at spectrumscale.org > Subject: Re: [gpfsug-discuss] Cannot stop SMB... stop NFS first? > > Hi Richard > > This is fixed in release 4.2.1, if you cant upgrade now, you can fix this manuallly > > > Just do this. > > edit file /usr/lpp/mmfs/lib/mmcesmon/SMBService.py > > > > Change > > if authType == 'ad' and not nodeState.nfsStopped: > > to > > > > nfsEnabled = utils.isProtocolEnabled("NFS", self.logger) > if authType == 'ad' and not nodeState.nfsStopped and nfsEnabled: > > > You need to stop the gpfs service in each node where you apply the change > > > " after change the lines please use tap key" > > > > Enviado desde mi iPhone > > El 27/08/2016, a las 6:00 a.m., gpfsug-discuss-request at spectrumscale.org escribi?: > Send gpfsug-discuss mailing list submissions to > gpfsug-discuss at spectrumscale.org > > To subscribe or unsubscribe via the World Wide Web, visit > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > or, via email, send a message with subject or body 'help' to > gpfsug-discuss-request at spectrumscale.org > > You can reach the person managing the list at > gpfsug-discuss-owner at spectrumscale.org > > When replying, please edit your Subject line so it is more specific than "Re: Contents of gpfsug-discuss digest..." > > > Today's Topics: > > 1. Re: Cannot stop SMB... stop NFS first?(Christof Schmitt) > 2. Re: CES and mmuserauth command (Christof Schmitt) > > > ---------------------------------------------------------------------- > > Message: 1 > Date: Fri, 26 Aug 2016 12:29:31 -0400 > From: "Christof Schmitt" > > To: gpfsug main discussion list > > Subject: Re: [gpfsug-discuss] Cannot stop SMB... stop NFS first? > Message-ID: > > > > Content-Type: text/plain; charset="UTF-8" > > That would be the case when Active Directory is configured for authentication. In that case the SMB service includes two aspects: One is the actual SMB file server, and the second one is the service for the Active Directory integration. Since NFS depends on authentication and id mapping services, it requires SMB to be running. 
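In practical terms, that dependency dictates the stop order on an AD-configured protocol node: take NFS down first, then SMB, because SMB is also carrying the authentication and ID-mapping service. A minimal sketch of the sequence (check the mmces man page for exact options; on levels before the 4.2.1 fix described above, the denial appears even when NFS is not enabled at all, which is the bug being discussed):

[root at cesnode ~]# mmces service list        # confirm what is enabled and running on this node
[root at cesnode ~]# mmces service stop nfs    # bring NFS down first, since it relies on SMB for auth/ID mapping
[root at cesnode ~]# mmces service stop smb    # should now succeed instead of "smb: Request denied. Please stop NFS first"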
> > Regards, > > Christof Schmitt || IBM || Spectrum Scale Development || Tucson, AZ > christof.schmitt at us.ibm.com || +1-520-799-2469 (T/L: 321-2469) > > > > From: "Sobey, Richard A" > > To: "'gpfsug-discuss at spectrumscale.org'" > > > Date: 08/26/2016 04:48 AM > Subject: [gpfsug-discuss] Cannot stop SMB... stop NFS first? > Sent by: gpfsug-discuss-bounces at spectrumscale.org > > > > Sorry all, prepare for a deluge of emails like this, hopefully it?ll help other people implementing CES in the future. > > I?m trying to stop SMB on a node, but getting the following output: > > [root at cesnode ~]# mmces service stop smb > smb: Request denied. Please stop NFS first > > [root at cesnode ~]# mmces service list > Enabled services: SMB > SMB is running > > As you can see there is no way to stop NFS when it?s not running but it seems to be blocking me. It?s happening on all the nodes in the cluster. > > SS version is 4.2.0 running on a fully up to date RHEL 7.1 server. > > Richard_______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > > > > > ------------------------------ > > Message: 2 > Date: Fri, 26 Aug 2016 12:29:31 -0400 > From: "Christof Schmitt" > > To: gpfsug main discussion list > > Subject: Re: [gpfsug-discuss] CES and mmuserauth command > Message-ID: > > > > Content-Type: text/plain; charset="ISO-2022-JP" > > The --user-name option applies to both, AD and LDAP authentication. In the LDAP case, this information is correct. I will try to get some clarification added for the AD case. > > The same applies to the information shown in "service list". There is a common field that holds the information and the parameter from the initial "service create" is stored there. The meaning is different for AD and > LDAP: For LDAP it is the username being used to access the LDAP server, while in the AD case it was only the user initially used until the machine account was created. > > Regards, > > Christof Schmitt || IBM || Spectrum Scale Development || Tucson, AZ > christof.schmitt at us.ibm.com || +1-520-799-2469 (T/L: 321-2469) > > > > From: Jan-Frode Myklebust > > To: gpfsug main discussion list > > Date: 08/26/2016 05:59 AM > Subject: Re: [gpfsug-discuss] CES and mmuserauth command > Sent by: gpfsug-discuss-bounces at spectrumscale.org > > > > > On Fri, Aug 26, 2016 at 1:49 AM, Christof Schmitt < christof.schmitt at us.ibm.com> wrote: > > When joinging the AD domain, --user-name, --password and --server are only used to initially identify and logon to the AD and to create the machine account for the cluster. Once that is done, that information is no longer used, and e.g. the account from --user-name could be deleted, the password changed or the specified DC could be removed from the domain (as long as other DCs are remaining). > > > That was my initial understanding of the --user-name, but when reading the man-page I get the impression that it's also used to do connect to AD to do user and group lookups: > > ------------------------------------------------------------------------------------------------------ > ??user?name userName > Specifies the user name to be used to perform operations > against the authentication server. The specified user > name must have sufficient permissions to read user and > group attributes from the authentication server. 
> ------------------------------------------------------------------------------------------------------- > > Also it's strange that "mmuserauth service list" would list the USER_NAME if it was only somthing that was used at configuration time..? > > > > -jf_______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > > > > > > ------------------------------ > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > > End of gpfsug-discuss Digest, Vol 55, Issue 44 > ********************************************** > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss -------------- next part -------------- An HTML attachment was scrubbed... URL: From bauer at cesnet.cz Mon Sep 5 14:30:54 2016 From: bauer at cesnet.cz (Miroslav Bauer) Date: Mon, 5 Sep 2016 15:30:54 +0200 Subject: [gpfsug-discuss] DMAPI - Unmigrate file to Regular state Message-ID: <9bfb2882-10de-5f2c-98c1-35ac2ac958a2@cesnet.cz> Hello, is there any way to recall a migrated file back to a regular state (other than renaming a file)? I would like to free some space on an external pool (TSM), that is being used by migrated files. And it would be desirable to prevent repeated backups of an already backed-up data (due to changed ctime/inode). I guess that you can acheive only premigrated state with dsmrecall tool (two copies of file data - one on GPFS pool and one on external pool). Maybe deleting 'dmapi.IBMPMig' xattr will do the trick but I don't think it's safe, nor clean :). Thank you in advance, -- Miroslav Bauer -------------- next part -------------- A non-text attachment was scrubbed... Name: smime.p7s Type: application/pkcs7-signature Size: 3716 bytes Desc: S/MIME Cryptographic Signature URL: From janfrode at tanso.net Mon Sep 5 14:51:44 2016 From: janfrode at tanso.net (Jan-Frode Myklebust) Date: Mon, 05 Sep 2016 13:51:44 +0000 Subject: [gpfsug-discuss] DMAPI - Unmigrate file to Regular state In-Reply-To: <9bfb2882-10de-5f2c-98c1-35ac2ac958a2@cesnet.cz> References: <9bfb2882-10de-5f2c-98c1-35ac2ac958a2@cesnet.cz> Message-ID: I believe what you're looking for is dsmrecall -RESident. Plus reconcile on tsm-server to free up the space. Ref: http://www.ibm.com/support/knowledgecenter/SSSR2R_7.1.2/com.ibm.itsm.hsmul.doc/r_cmd_dsmrecall.html -jf man. 5. sep. 2016 kl. 15.30 skrev Miroslav Bauer : > Hello, > > is there any way to recall a migrated file back to a regular state > (other than renaming a file)? I would like to free some space > on an external pool (TSM), that is being used by migrated files. > And it would be desirable to prevent repeated backups of an > already backed-up data (due to changed ctime/inode). > > I guess that you can acheive only premigrated state with dsmrecall tool > (two copies of file data - one on GPFS pool and one on external pool). > Maybe deleting 'dmapi.IBMPMig' xattr will do the trick but I don't think > it's safe, nor clean :). 
> > Thank you in advance, > > -- > Miroslav Bauer > > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > -------------- next part -------------- An HTML attachment was scrubbed... URL: From bauer at cesnet.cz Mon Sep 5 15:13:42 2016 From: bauer at cesnet.cz (Miroslav Bauer) Date: Mon, 5 Sep 2016 16:13:42 +0200 Subject: [gpfsug-discuss] DMAPI - Unmigrate file to Regular state In-Reply-To: References: <9bfb2882-10de-5f2c-98c1-35ac2ac958a2@cesnet.cz> Message-ID: <55108548-0515-49c8-0e76-fca9b247d337@cesnet.cz> That's right, I must have totally overlooked that! Many thanks! :) -- Miroslav Bauer On 09/05/2016 03:51 PM, Jan-Frode Myklebust wrote: > I believe what you're looking for is dsmrecall -RESident. Plus > reconcile on tsm-server to free up the space. > > Ref: > > http://www.ibm.com/support/knowledgecenter/SSSR2R_7.1.2/com.ibm.itsm.hsmul.doc/r_cmd_dsmrecall.html > > > -jf > man. 5. sep. 2016 kl. 15.30 skrev Miroslav Bauer >: > > Hello, > > is there any way to recall a migrated file back to a regular state > (other than renaming a file)? I would like to free some space > on an external pool (TSM), that is being used by migrated files. > And it would be desirable to prevent repeated backups of an > already backed-up data (due to changed ctime/inode). > > I guess that you can acheive only premigrated state with dsmrecall > tool > (two copies of file data - one on GPFS pool and one on external pool). > Maybe deleting 'dmapi.IBMPMig' xattr will do the trick but I don't > think > it's safe, nor clean :). > > Thank you in advance, > > -- > Miroslav Bauer > > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: smime.p7s Type: application/pkcs7-signature Size: 3716 bytes Desc: S/MIME Cryptographic Signature URL: From mark.birmingham at stfc.ac.uk Mon Sep 5 15:27:29 2016 From: mark.birmingham at stfc.ac.uk (mark.birmingham at stfc.ac.uk) Date: Mon, 5 Sep 2016 14:27:29 +0000 Subject: [gpfsug-discuss] DMAPI - Unmigrate file to Regular state In-Reply-To: <55108548-0515-49c8-0e76-fca9b247d337@cesnet.cz> References: <9bfb2882-10de-5f2c-98c1-35ac2ac958a2@cesnet.cz> <55108548-0515-49c8-0e76-fca9b247d337@cesnet.cz> Message-ID: <47B8D67E32CC2D44A587CD18636BECC82BB3A610@exchmbx01> Yes, that's fine. Just submit the request through SBS. Mark Mark Birmingham Development Team Leader High Performance Systems Group STFC Daresbury Laboratory Phone: +44 (0)1925 603381 Email: mark.birmingham at stfc.ac.uk From: gpfsug-discuss-bounces at spectrumscale.org [mailto:gpfsug-discuss-bounces at spectrumscale.org] On Behalf Of Miroslav Bauer Sent: 05 September 2016 15:14 To: gpfsug main discussion list Subject: Re: [gpfsug-discuss] DMAPI - Unmigrate file to Regular state That's right, I must have totally overlooked that! Many thanks! :) -- Miroslav Bauer On 09/05/2016 03:51 PM, Jan-Frode Myklebust wrote: I believe what you're looking for is dsmrecall -RESident. Plus reconcile on tsm-server to free up the space. 
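To make that concrete, a small sketch of the recall-to-resident flow. The path is invented for illustration, and option spellings such as -filelist should be checked against your Spectrum Protect for Space Management client level; see also the dsmrecall reference linked below:

dsmrecall -resident -detail /gpfs/fs1/project/bigfile.dat    # recall a single file back to resident state
dsmrecall -resident -filelist=/tmp/recall.list               # or drive it from a file list, e.g. built by a policy LIST rule
dsmls /gpfs/fs1/project/bigfile.dat                          # dsmls reports whether the file is now resident rather than migrated/premigrated

Space on the Spectrum Protect server side is only reclaimed once reconcile/expiration runs for that file system, as noted above.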
Ref: http://www.ibm.com/support/knowledgecenter/SSSR2R_7.1.2/com.ibm.itsm.hsmul.doc/r_cmd_dsmrecall.html -jf man. 5. sep. 2016 kl. 15.30 skrev Miroslav Bauer >: Hello, is there any way to recall a migrated file back to a regular state (other than renaming a file)? I would like to free some space on an external pool (TSM), that is being used by migrated files. And it would be desirable to prevent repeated backups of an already backed-up data (due to changed ctime/inode). I guess that you can acheive only premigrated state with dsmrecall tool (two copies of file data - one on GPFS pool and one on external pool). Maybe deleting 'dmapi.IBMPMig' xattr will do the trick but I don't think it's safe, nor clean :). Thank you in advance, -- Miroslav Bauer _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss -------------- next part -------------- An HTML attachment was scrubbed... URL: From mark.birmingham at stfc.ac.uk Mon Sep 5 15:30:53 2016 From: mark.birmingham at stfc.ac.uk (mark.birmingham at stfc.ac.uk) Date: Mon, 5 Sep 2016 14:30:53 +0000 Subject: [gpfsug-discuss] DMAPI - Unmigrate file to Regular state In-Reply-To: <47B8D67E32CC2D44A587CD18636BECC82BB3A610@exchmbx01> References: <9bfb2882-10de-5f2c-98c1-35ac2ac958a2@cesnet.cz> <55108548-0515-49c8-0e76-fca9b247d337@cesnet.cz> <47B8D67E32CC2D44A587CD18636BECC82BB3A610@exchmbx01> Message-ID: <47B8D67E32CC2D44A587CD18636BECC82BB3A62A@exchmbx01> Sorry All! Noob error - replied to the wrong email!!! Mark Mark Birmingham Development Team Leader High Performance Systems Group STFC Daresbury Laboratory Phone: +44 (0)1925 603381 Email: mark.birmingham at stfc.ac.uk From: gpfsug-discuss-bounces at spectrumscale.org [mailto:gpfsug-discuss-bounces at spectrumscale.org] On Behalf Of mark.birmingham at stfc.ac.uk Sent: 05 September 2016 15:27 To: gpfsug-discuss at spectrumscale.org Subject: Re: [gpfsug-discuss] DMAPI - Unmigrate file to Regular state Yes, that's fine. Just submit the request through SBS. Mark Mark Birmingham Development Team Leader High Performance Systems Group STFC Daresbury Laboratory Phone: +44 (0)1925 603381 Email: mark.birmingham at stfc.ac.uk From: gpfsug-discuss-bounces at spectrumscale.org [mailto:gpfsug-discuss-bounces at spectrumscale.org] On Behalf Of Miroslav Bauer Sent: 05 September 2016 15:14 To: gpfsug main discussion list Subject: Re: [gpfsug-discuss] DMAPI - Unmigrate file to Regular state That's right, I must have totally overlooked that! Many thanks! :) -- Miroslav Bauer On 09/05/2016 03:51 PM, Jan-Frode Myklebust wrote: I believe what you're looking for is dsmrecall -RESident. Plus reconcile on tsm-server to free up the space. Ref: http://www.ibm.com/support/knowledgecenter/SSSR2R_7.1.2/com.ibm.itsm.hsmul.doc/r_cmd_dsmrecall.html -jf man. 5. sep. 2016 kl. 15.30 skrev Miroslav Bauer >: Hello, is there any way to recall a migrated file back to a regular state (other than renaming a file)? I would like to free some space on an external pool (TSM), that is being used by migrated files. And it would be desirable to prevent repeated backups of an already backed-up data (due to changed ctime/inode). I guess that you can acheive only premigrated state with dsmrecall tool (two copies of file data - one on GPFS pool and one on external pool). 
Maybe deleting 'dmapi.IBMPMig' xattr will do the trick but I don't think it's safe, nor clean :). Thank you in advance, -- Miroslav Bauer _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss -------------- next part -------------- An HTML attachment was scrubbed... URL: From dominic.mueller at de.ibm.com Tue Sep 6 13:04:36 2016 From: dominic.mueller at de.ibm.com (Dominic Mueller-Wicke01) Date: Tue, 6 Sep 2016 14:04:36 +0200 Subject: [gpfsug-discuss] DMAPI - Unmigrate file to Regular state In-Reply-To: References: Message-ID: Hi Miroslav, please use the command: > dsmrecall -resident -detail or use it with file lists Greetings, Dominic. From: gpfsug-discuss-request at spectrumscale.org To: gpfsug-discuss at spectrumscale.org Date: 06.09.2016 13:00 Subject: gpfsug-discuss Digest, Vol 56, Issue 10 Sent by: gpfsug-discuss-bounces at spectrumscale.org Send gpfsug-discuss mailing list submissions to gpfsug-discuss at spectrumscale.org To subscribe or unsubscribe via the World Wide Web, visit http://gpfsug.org/mailman/listinfo/gpfsug-discuss or, via email, send a message with subject or body 'help' to gpfsug-discuss-request at spectrumscale.org You can reach the person managing the list at gpfsug-discuss-owner at spectrumscale.org When replying, please edit your Subject line so it is more specific than "Re: Contents of gpfsug-discuss digest..." Today's Topics: 1. Re: DMAPI - Unmigrate file to Regular state (mark.birmingham at stfc.ac.uk) ----- Message from on Mon, 5 Sep 2016 14:30:53 +0000 ----- To: Subject: Re: [gpfsug-discuss] DMAPI - Unmigrate file to Regular state Sorry All! Noob error ? replied to the wrong email!!! Mark Mark Birmingham Development Team Leader High Performance Systems Group STFC Daresbury Laboratory Phone: +44 (0)1925 603381 Email: mark.birmingham at stfc.ac.uk From: gpfsug-discuss-bounces at spectrumscale.org [ mailto:gpfsug-discuss-bounces at spectrumscale.org] On Behalf Of mark.birmingham at stfc.ac.uk Sent: 05 September 2016 15:27 To: gpfsug-discuss at spectrumscale.org Subject: Re: [gpfsug-discuss] DMAPI - Unmigrate file to Regular state Yes, that?s fine. Just submit the request through SBS. Mark Mark Birmingham Development Team Leader High Performance Systems Group STFC Daresbury Laboratory Phone: +44 (0)1925 603381 Email: mark.birmingham at stfc.ac.uk From: gpfsug-discuss-bounces at spectrumscale.org [ mailto:gpfsug-discuss-bounces at spectrumscale.org] On Behalf Of Miroslav Bauer Sent: 05 September 2016 15:14 To: gpfsug main discussion list Subject: Re: [gpfsug-discuss] DMAPI - Unmigrate file to Regular state That's right, I must have totally overlooked that! Many thanks! :) -- Miroslav Bauer On 09/05/2016 03:51 PM, Jan-Frode Myklebust wrote: I believe what you're looking for is dsmrecall -RESident. Plus reconcile on tsm-server to free up the space. Ref: http://www.ibm.com/support/knowledgecenter/SSSR2R_7.1.2/com.ibm.itsm.hsmul.doc/r_cmd_dsmrecall.html -jf man. 5. sep. 2016 kl. 15.30 skrev Miroslav Bauer : Hello, is there any way to recall a migrated file back to a regular state (other than renaming a file)? I would like to free some space on an external pool (TSM), that is being used by migrated files. 
And it would be desirable to prevent repeated backups of an already backed-up data (due to changed ctime/inode). I guess that you can acheive only premigrated state with dsmrecall tool (two copies of file data - one on GPFS pool and one on external pool). Maybe deleting 'dmapi.IBMPMig' xattr will do the trick but I don't think it's safe, nor clean :). Thank you in advance, -- Miroslav Bauer _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: graycol.gif Type: image/gif Size: 105 bytes Desc: not available URL: From volobuev at us.ibm.com Tue Sep 6 20:06:32 2016 From: volobuev at us.ibm.com (Yuri L Volobuev) Date: Tue, 6 Sep 2016 12:06:32 -0700 Subject: [gpfsug-discuss] Migration to separate metadata and data disks In-Reply-To: <505ff859-d49a-04cc-bd9d-50f7b2a8df0b@cesnet.cz> References: <7927f34a-28e5-6fc2-a55d-62b2066a08da@cesnet.cz><2ce7334b-28c1-7a14-814a-fbcf99d8049e@nasa.gov> <505ff859-d49a-04cc-bd9d-50f7b2a8df0b@cesnet.cz> Message-ID: The correct way to accomplish what you're looking for (in particular, changing the fs-wide level of replication) is mmrestripefs -R. This command also takes care of moving data off disks now marked metadataOnly. The restripe job hits an error trying to move blocks of the inode file, i.e. before it gets to actual user data blocks. Note that at this point the metadata replication factor is still 2. This suggests one of two possibilities: (1) there isn't enough actual free space on the remaining metadataOnly disks, (2) there isn't enough space in some failure groups to allocate two replicas. All of this assumes you're operating within a single storage pool. If multiple storage pools are in play, there are other possibilities. 'mmdf' output would be helpful in providing more helpful advice. With the information at hand, I can only suggest trying to accomplish the task in two phases: (a) deallocated extra metadata replicas, by doing mmchfs -m 1 + mmrestripefs -R (b) move metadata off SATA disks. I do want to point out that metadata replication is a highly recommended insurance policy to have for your file system. As with other kinds of insurance, you may or may not need it, but if you do end up needing it, you'll be very glad you have it. The costs, in terms of extra metadata space and performance overhead, are very reasonable. yuri From: Miroslav Bauer To: gpfsug-discuss at spectrumscale.org, Date: 09/01/2016 07:29 AM Subject: Re: [gpfsug-discuss] Migration to separate metadata and data disks Sent by: gpfsug-discuss-bounces at spectrumscale.org Yes, failure group id is exactly what I meant :). Unfortunately, mmrestripefs with -R behaves the same as with -r. I also believed that mmrestripefs -R is the correct tool for fixing the replication settings on inodes (according to manpages), but I will try possible solutions you and Marc suggested and let you know how it went. Thank you, -- Miroslav Bauer On 09/01/2016 04:02 PM, Aaron Knister wrote: > Oh! 
I think you've already provided the info I was looking for :) I > thought that failGroup=3 meant there were 3 failure groups within the > SSDs. I suspect that's not at all what you meant and that actually is > the failure group of all of those disks. That I think explains what's > going on-- there's only one failure group's worth of metadata-capable > disks available and as such GPFS can't place the 2nd replica for > existing files. > > Here's what I would suggest: > > - Create at least 2 failure groups within the SSDs > - Put the default metadata replication factor back to 2 > - Run a restripefs -R to shuffle files around and restore the metadata > replication factor of 2 to any files created while it was set to 1 > > If you're not interested in replication for metadata then perhaps all > you need to do is the mmrestripefs -R. I think that should > un-replicate the file from the SATA disks leaving the copy on the SSDs. > > Hope that helps. > > -Aaron > > On 9/1/16 9:39 AM, Aaron Knister wrote: >> By the way, I suspect the no space on device errors are because GPFS >> believes for some reason that it is unable to maintain the metadata >> replication factor of 2 that's likely set on all previously created >> inodes. >> >> On 9/1/16 9:36 AM, Aaron Knister wrote: >>> I must admit, I'm curious as to the reason you're dropping the >>> replication factor from 2 down to 1. There are some serious advantages >>> we've seen to having multiple metadata replicas, as far as error >>> recovery is concerned. >>> >>> Could you paste an output of mmlsdisk for the filesystem? >>> >>> -Aaron >>> >>> On 9/1/16 9:30 AM, Miroslav Bauer wrote: >>>> Hello, >>>> >>>> I have a GPFS 3.5 filesystem (fs1) and I'm trying to migrate the >>>> filesystem metadata from state: >>>> -m = 2 (default metadata replicas) >>>> - SATA disks (dataAndMetadata, failGroup=1) >>>> - SSDs (metadataOnly, failGroup=3) >>>> to the desired state: >>>> -m = 1 >>>> - SATA disks (dataOnly, failGroup=1) >>>> - SSDs (metadataOnly, failGroup=3) >>>> >>>> I have done the following steps in the following order: >>>> 1) change SATA disks to dataOnly (stanza file modifies the 'usage' >>>> attribute only): >>>> # mmchdisk fs1 change -F dataOnly_disks.stanza >>>> Attention: Disk parameters were changed. >>>> Use the mmrestripefs command with the -r option to relocate data and >>>> metadata. >>>> Verifying file system configuration information ... >>>> mmchdisk: Propagating the cluster configuration data to all >>>> affected nodes. This is an asynchronous process. >>>> >>>> 2) change default metadata replicas number 2->1 >>>> # mmchfs fs1 -m 1 >>>> >>>> 3) run mmrestripefs as suggested by output of 1) >>>> # mmrestripefs fs1 -r >>>> Scanning file system metadata, phase 1 ... >>>> Error processing inodes. >>>> No space left on device >>>> mmrestripefs: Command failed. Examine previous error messages to >>>> determine cause. >>>> >>>> It is, however, still possible to create new files on the filesystem. >>>> When I return one of the SATA disks as a dataAndMetadata disk, the >>>> mmrestripefs >>>> command stops complaining about No space left on device. Both df and >>>> mmdf >>>> say that there is enough space both for data (SATA) and metadata >>>> (SSDs). >>>> Does anyone have an idea why is it complaining? 
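As an aside, Aaron's first suggestion (two failure groups within the SSDs plus -m 2) would look roughly like the sketch below. The stanza syntax, and the idea of changing failureGroup through mmchdisk change, are assumptions to verify against the man pages; the NSD names are the ssd_* disks that appear in the mmdf output later in the thread:

%nsd: nsd=ssd_1_1 usage=metadataOnly failureGroup=3
%nsd: nsd=ssd_2_2 usage=metadataOnly failureGroup=3
%nsd: nsd=ssd_3_3 usage=metadataOnly failureGroup=4
%nsd: nsd=ssd_4_4 usage=metadataOnly failureGroup=4
%nsd: nsd=ssd_5_5 usage=metadataOnly failureGroup=4

# mmchdisk fs1 change -F ssd_failgroups.stanza
# mmchfs fs1 -m 2
# mmrestripefs fs1 -R

The thread ultimately goes the other way (a single metadata replica), so treat this as the alternative path only.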
>>>> >>>> Thanks, >>>> >>>> -- >>>> Miroslav Bauer >>>> >>>> >>>> >>>> >>>> _______________________________________________ >>>> gpfsug-discuss mailing list >>>> gpfsug-discuss at spectrumscale.org >>>> http://gpfsug.org/mailman/listinfo/gpfsug-discuss >>>> >>> >> > [attachment "smime.p7s" deleted by Yuri L Volobuev/Austin/IBM] _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: graycol.gif Type: image/gif Size: 105 bytes Desc: not available URL: From bauer at cesnet.cz Wed Sep 7 10:40:19 2016 From: bauer at cesnet.cz (Miroslav Bauer) Date: Wed, 7 Sep 2016 11:40:19 +0200 Subject: [gpfsug-discuss] Migration to separate metadata and data disks In-Reply-To: References: <7927f34a-28e5-6fc2-a55d-62b2066a08da@cesnet.cz> <2ce7334b-28c1-7a14-814a-fbcf99d8049e@nasa.gov> <505ff859-d49a-04cc-bd9d-50f7b2a8df0b@cesnet.cz> Message-ID: Hello Yuri, here goes the actual mmdf output of filesystem in question: disk disk size failure holds holds free free name group metadata data in full blocks in fragments --------------- ------------- -------- -------- ----- -------------------- ------------------- Disks in storage pool: system (Maximum disk size allowed is 40 TB) dcsh_10C 5T 1 Yes Yes 1.661T ( 33%) 68.48G ( 1%) dcsh_10D 6.828T 1 Yes Yes 2.809T ( 41%) 83.82G ( 1%) dcsh_11C 5T 1 Yes Yes 1.659T ( 33%) 69.01G ( 1%) dcsh_11D 6.828T 1 Yes Yes 2.81T ( 41%) 83.33G ( 1%) dcsh_12C 5T 1 Yes Yes 1.659T ( 33%) 69.48G ( 1%) dcsh_12D 6.828T 1 Yes Yes 2.807T ( 41%) 83.14G ( 1%) dcsh_13C 5T 1 Yes Yes 1.659T ( 33%) 69.35G ( 1%) dcsh_13D 6.828T 1 Yes Yes 2.81T ( 41%) 82.97G ( 1%) dcsh_14C 5T 1 Yes Yes 1.66T ( 33%) 69.06G ( 1%) dcsh_14D 6.828T 1 Yes Yes 2.811T ( 41%) 83.61G ( 1%) dcsh_15C 5T 1 Yes Yes 1.658T ( 33%) 69.38G ( 1%) dcsh_15D 6.828T 1 Yes Yes 2.814T ( 41%) 83.69G ( 1%) dcsd_15D 6.828T 1 Yes Yes 2.811T ( 41%) 83.98G ( 1%) dcsd_15C 5T 1 Yes Yes 1.66T ( 33%) 68.66G ( 1%) dcsd_14D 6.828T 1 Yes Yes 2.81T ( 41%) 84.18G ( 1%) dcsd_14C 5T 1 Yes Yes 1.659T ( 33%) 69.43G ( 1%) dcsd_13D 6.828T 1 Yes Yes 2.81T ( 41%) 83.27G ( 1%) dcsd_13C 5T 1 Yes Yes 1.66T ( 33%) 69.1G ( 1%) dcsd_12D 6.828T 1 Yes Yes 2.81T ( 41%) 83.61G ( 1%) dcsd_12C 5T 1 Yes Yes 1.66T ( 33%) 69.42G ( 1%) dcsd_11D 6.828T 1 Yes Yes 2.811T ( 41%) 83.59G ( 1%) dcsh_10B 5T 1 Yes Yes 1.633T ( 33%) 76.97G ( 2%) dcsh_11A 5T 1 Yes Yes 1.632T ( 33%) 77.29G ( 2%) dcsh_11B 5T 1 Yes Yes 1.633T ( 33%) 76.73G ( 1%) dcsh_12A 5T 1 Yes Yes 1.634T ( 33%) 76.49G ( 1%) dcsd_11C 5T 1 Yes Yes 1.66T ( 33%) 69.25G ( 1%) dcsd_10D 6.828T 1 Yes Yes 2.811T ( 41%) 83.39G ( 1%) dcsh_10A 5T 1 Yes Yes 1.633T ( 33%) 77.06G ( 2%) dcsd_10C 5T 1 Yes Yes 1.66T ( 33%) 69.83G ( 1%) dcsd_15B 5T 1 Yes Yes 1.635T ( 33%) 76.52G ( 1%) dcsd_15A 5T 1 Yes Yes 1.634T ( 33%) 76.24G ( 1%) dcsd_14B 5T 1 Yes Yes 1.634T ( 33%) 76.31G ( 1%) dcsd_14A 5T 1 Yes Yes 1.634T ( 33%) 76.23G ( 1%) dcsd_13B 5T 1 Yes Yes 1.634T ( 33%) 76.13G ( 1%) dcsd_13A 5T 1 Yes Yes 1.634T ( 33%) 76.22G ( 1%) dcsd_12B 5T 1 Yes Yes 1.635T ( 33%) 77.49G ( 2%) dcsd_12A 5T 1 Yes Yes 1.633T ( 33%) 77.13G ( 2%) dcsd_11B 5T 1 Yes Yes 1.633T ( 33%) 76.86G ( 2%) dcsd_11A 5T 1 Yes Yes 1.632T ( 33%) 76.22G ( 1%) dcsd_10B 5T 1 Yes Yes 1.633T ( 33%) 76.79G ( 1%) dcsd_10A 5T 1 Yes Yes 1.633T ( 33%) 77.21G ( 2%) dcsh_15B 5T 1 Yes Yes 1.635T ( 33%) 76.04G ( 1%) dcsh_15A 5T 1 Yes Yes 
1.634T ( 33%) 76.84G ( 2%) dcsh_14B 5T 1 Yes Yes 1.635T ( 33%) 76.75G ( 1%) dcsh_14A 5T 1 Yes Yes 1.633T ( 33%) 76.05G ( 1%) dcsh_13B 5T 1 Yes Yes 1.634T ( 33%) 76.35G ( 1%) dcsh_13A 5T 1 Yes Yes 1.634T ( 33%) 76.68G ( 1%) dcsh_12B 5T 1 Yes Yes 1.635T ( 33%) 76.74G ( 1%) ssd_5_5 80G 3 Yes No 22.31G ( 28%) 7.155G ( 9%) ssd_4_4 80G 3 Yes No 22.21G ( 28%) 7.196G ( 9%) ssd_3_3 80G 3 Yes No 22.2G ( 28%) 7.239G ( 9%) ssd_2_2 80G 3 Yes No 22.24G ( 28%) 7.146G ( 9%) ssd_1_1 80G 3 Yes No 22.29G ( 28%) 7.134G ( 9%) ------------- -------------------- ------------------- (pool total) 262.3T 92.96T ( 35%) 3.621T ( 1%) Disks in storage pool: maid4 (Maximum disk size allowed is 466 TB) ...... ------------- -------------------- ------------------- (pool total) 291T 126.5T ( 43%) 562.6G ( 0%) Disks in storage pool: maid5 (Maximum disk size allowed is 466 TB) ...... ------------- -------------------- ------------------- (pool total) 436.6T 120.8T ( 28%) 25.23G ( 0%) Disks in storage pool: maid6 (Maximum disk size allowed is 466 TB) ....... ------------- -------------------- ------------------- (pool total) 582.1T 358.7T ( 62%) 9.458G ( 0%) ============= ==================== =================== (data) 1.535P 698.9T ( 44%) 4.17T ( 0%) (metadata) 262.3T 92.96T ( 35%) 3.621T ( 1%) ============= ==================== =================== (total) 1.535P 699T ( 44%) 4.205T ( 0%) Inode Information ----------------- Number of used inodes: 79607225 Number of free inodes: 82340423 Number of allocated inodes: 161947648 Maximum number of inodes: 1342177280 I have a smaller testing FS with the same setup (with plenty of free space), and the actual sequence of commands that worked for me was: mmchfs fs1 -m1 mmrestripefs fs1 -R mmrestripefs fs1 -b mmchdisk fs1 change -F ~/nsd_metadata_test (dataAndMetadata -> dataOnly) mmrestripefs fs1 -r Could you please evaluate more on the performance overhead with having metadata on SSD+SATA? Are the read operations automatically directed to faster disks by GPFS? Is each write operation waiting for write to be finished by SATA disks? Thank you, -- Miroslav Bauer On 09/06/2016 09:06 PM, Yuri L Volobuev wrote: > > The correct way to accomplish what you're looking for (in particular, > changing the fs-wide level of replication) is mmrestripefs -R. This > command also takes care of moving data off disks now marked metadataOnly. > > The restripe job hits an error trying to move blocks of the inode > file, i.e. before it gets to actual user data blocks. Note that at > this point the metadata replication factor is still 2. This suggests > one of two possibilities: (1) there isn't enough actual free space on > the remaining metadataOnly disks, (2) there isn't enough space in some > failure groups to allocate two replicas. > > All of this assumes you're operating within a single storage pool. If > multiple storage pools are in play, there are other possibilities. > > 'mmdf' output would be helpful in providing more helpful advice. With > the information at hand, I can only suggest trying to accomplish the > task in two phases: (a) deallocated extra metadata replicas, by doing > mmchfs -m 1 + mmrestripefs -R (b) move metadata off SATA disks. I do > want to point out that metadata replication is a highly recommended > insurance policy to have for your file system. As with other kinds of > insurance, you may or may not need it, but if you do end up needing > it, you'll be very glad you have it. The costs, in terms of extra > metadata space and performance overhead, are very reasonable. 
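Condensing the advice above, and the sequence that worked on the test filesystem, into one sketch for the production case (the filesystem name fs1 and the stanza file name are illustrative; the stanza only flips the SATA NSDs' usage to dataOnly):

# phase 1: drop the second metadata replica everywhere
mmchfs fs1 -m 1
mmrestripefs fs1 -R

# phase 2: stop placing metadata on the SATA disks, then migrate what is already there
mmchdisk fs1 change -F dataOnly_disks.stanza
mmrestripefs fs1 -r

# check free metadata space in the system pool before and after each phase
mmdf fs1 -P system

Whether this completes still depends on the SSDs having room for all of the now single-copy metadata, which is exactly the concern raised in the follow-up later in the thread.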
> > yuri > > > Miroslav Bauer ---09/01/2016 07:29:06 AM---Yes, failure group id is > exactly what I meant :). Unfortunately, mmrestripefs with -R > > From: Miroslav Bauer > To: gpfsug-discuss at spectrumscale.org, > Date: 09/01/2016 07:29 AM > Subject: Re: [gpfsug-discuss] Migration to separate metadata and data > disks > Sent by: gpfsug-discuss-bounces at spectrumscale.org > > ------------------------------------------------------------------------ > > > > Yes, failure group id is exactly what I meant :). Unfortunately, > mmrestripefs with -R > behaves the same as with -r. I also believed that mmrestripefs -R is the > correct tool for > fixing the replication settings on inodes (according to manpages), but I > will try possible > solutions you and Marc suggested and let you know how it went. > > Thank you, > -- > Miroslav Bauer > > On 09/01/2016 04:02 PM, Aaron Knister wrote: > > Oh! I think you've already provided the info I was looking for :) I > > thought that failGroup=3 meant there were 3 failure groups within the > > SSDs. I suspect that's not at all what you meant and that actually is > > the failure group of all of those disks. That I think explains what's > > going on-- there's only one failure group's worth of metadata-capable > > disks available and as such GPFS can't place the 2nd replica for > > existing files. > > > > Here's what I would suggest: > > > > - Create at least 2 failure groups within the SSDs > > - Put the default metadata replication factor back to 2 > > - Run a restripefs -R to shuffle files around and restore the metadata > > replication factor of 2 to any files created while it was set to 1 > > > > If you're not interested in replication for metadata then perhaps all > > you need to do is the mmrestripefs -R. I think that should > > un-replicate the file from the SATA disks leaving the copy on the SSDs. > > > > Hope that helps. > > > > -Aaron > > > > On 9/1/16 9:39 AM, Aaron Knister wrote: > >> By the way, I suspect the no space on device errors are because GPFS > >> believes for some reason that it is unable to maintain the metadata > >> replication factor of 2 that's likely set on all previously created > >> inodes. > >> > >> On 9/1/16 9:36 AM, Aaron Knister wrote: > >>> I must admit, I'm curious as to the reason you're dropping the > >>> replication factor from 2 down to 1. There are some serious advantages > >>> we've seen to having multiple metadata replicas, as far as error > >>> recovery is concerned. > >>> > >>> Could you paste an output of mmlsdisk for the filesystem? > >>> > >>> -Aaron > >>> > >>> On 9/1/16 9:30 AM, Miroslav Bauer wrote: > >>>> Hello, > >>>> > >>>> I have a GPFS 3.5 filesystem (fs1) and I'm trying to migrate the > >>>> filesystem metadata from state: > >>>> -m = 2 (default metadata replicas) > >>>> - SATA disks (dataAndMetadata, failGroup=1) > >>>> - SSDs (metadataOnly, failGroup=3) > >>>> to the desired state: > >>>> -m = 1 > >>>> - SATA disks (dataOnly, failGroup=1) > >>>> - SSDs (metadataOnly, failGroup=3) > >>>> > >>>> I have done the following steps in the following order: > >>>> 1) change SATA disks to dataOnly (stanza file modifies the 'usage' > >>>> attribute only): > >>>> # mmchdisk fs1 change -F dataOnly_disks.stanza > >>>> Attention: Disk parameters were changed. > >>>> Use the mmrestripefs command with the -r option to relocate > data and > >>>> metadata. > >>>> Verifying file system configuration information ... > >>>> mmchdisk: Propagating the cluster configuration data to all > >>>> affected nodes. 
This is an asynchronous process. > >>>> > >>>> 2) change default metadata replicas number 2->1 > >>>> # mmchfs fs1 -m 1 > >>>> > >>>> 3) run mmrestripefs as suggested by output of 1) > >>>> # mmrestripefs fs1 -r > >>>> Scanning file system metadata, phase 1 ... > >>>> Error processing inodes. > >>>> No space left on device > >>>> mmrestripefs: Command failed. Examine previous error messages to > >>>> determine cause. > >>>> > >>>> It is, however, still possible to create new files on the filesystem. > >>>> When I return one of the SATA disks as a dataAndMetadata disk, the > >>>> mmrestripefs > >>>> command stops complaining about No space left on device. Both df and > >>>> mmdf > >>>> say that there is enough space both for data (SATA) and metadata > >>>> (SSDs). > >>>> Does anyone have an idea why is it complaining? > >>>> > >>>> Thanks, > >>>> > >>>> -- > >>>> Miroslav Bauer > >>>> > >>>> > >>>> > >>>> > >>>> _______________________________________________ > >>>> gpfsug-discuss mailing list > >>>> gpfsug-discuss at spectrumscale.org > >>>> http://gpfsug.org/mailman/listinfo/gpfsug-discuss > >>>> > >>> > >> > > > > > [attachment "smime.p7s" deleted by Yuri L Volobuev/Austin/IBM] > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > > > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: smime.p7s Type: application/pkcs7-signature Size: 3716 bytes Desc: S/MIME Cryptographic Signature URL: From S.J.Thompson at bham.ac.uk Wed Sep 7 13:36:48 2016 From: S.J.Thompson at bham.ac.uk (Simon Thompson (Research Computing - IT Services)) Date: Wed, 7 Sep 2016 12:36:48 +0000 Subject: [gpfsug-discuss] Remote cluster mount failing Message-ID: Hi All, I'm trying to get some multi cluster thing working between two of our GPFS clusters. In the "client" cluster, when trying to mount the "remote" cluster, I get: # mmmount gpfs Wed 7 Sep 13:33:06 BST 2016: mmmount: Mounting file systems ... mount: mount /dev/gpfs on /gpfs failed: Connection timed out mmmount: Command failed. Examine previous error messages to determine cause. And in the log file: Wed Sep 7 13:33:07.481 2016: [N] The client side TLS handshake with node 10.0.0.182 was cancelled: connection reset by peer (return code 420). Wed Sep 7 13:33:07.486 2016: [N] The client side TLS handshake with node 10.0.0.181 was cancelled: connection reset by peer (return code 420). Wed Sep 7 13:33:07.487 2016: [E] Failed to join remote cluster GPFS_STORAGE.CLUSTER Wed Sep 7 13:33:07.488 2016: [W] Command: err 78: mount GPFS_STORAGE.CLUSTER:gpfs Wed Sep 7 13:33:07.489 2016: Connection timed out In the remote cluster, I see: Wed Sep 7 13:33:07.487 2016: [W] The TLS handshake with node 10.0.0.222 failed with error 447 (server side). Wed Sep 7 13:33:07.488 2016: [X] Connection from 10.10.0.35 refused, authentication failed Wed Sep 7 13:33:07.489 2016: [E] Killing connection from 10.10.0.35, err 703 Wed Sep 7 13:33:07.490 2016: Operation not permitted Weirdly though on other nodes in the client cluster this succeeds fine and can mount, so I think I got all the bits in the mmauth and mmremotecluster configured correctly. Any suggestions? 
Thanks Simon From volobuev at us.ibm.com Wed Sep 7 17:38:03 2016 From: volobuev at us.ibm.com (Yuri L Volobuev) Date: Wed, 7 Sep 2016 09:38:03 -0700 Subject: [gpfsug-discuss] Migration to separate metadata and data disks In-Reply-To: References: <7927f34a-28e5-6fc2-a55d-62b2066a08da@cesnet.cz><2ce7334b-28c1-7a14-814a-fbcf99d8049e@nasa.gov><505ff859-d49a-04cc-bd9d-50f7b2a8df0b@cesnet.cz> Message-ID: Hi Miroslav, The mmdf output is very helpful. It suggests very strongly what the problem is: > ssd_5_5?????????????????? 80G??????? 3 Yes????? No?????????? 22.31G ( 28%)??????? 7.155G ( 9%) > ssd_4_4?????????????????? 80G??????? 3 Yes????? No?????????? 22.21G ( 28%)??????? 7.196G ( 9%) > ssd_3_3?????????????????? 80G??????? 3 Yes????? No??????????? 22.2G ( 28%)??????? 7.239G ( 9%) > ssd_2_2?????????????????? 80G??????? 3 Yes????? No?????????? 22.24G ( 28%)??????? 7.146G ( 9%) > ssd_1_1?????????????????? 80G??????? 3 Yes????? No?????????? 22.29G ( 28%)??????? 7.134G ( 9%) >... > ==================== =================== > (data)???????????????? 1.535P??????????????????????????????? 698.9T ( 44%)???????? 4.17T ( 0%) > (metadata)???????????? 262.3T??????????????????????????????? 92.96T ( 35%)??????? 3.621T ( 1%) >... > Number of allocated inodes:? 161947648 > Maximum number of inodes:?? 1342177280 You have 5 80G SSDs. That's not enough. Even with metadata spread across a couple dozen more SATA disks, SSDs are over 3/4 full. There's no way to accurately estimate the amount of metadata in this file system with the data at hand, but if we (very conservatively) assume that each SATA disk has only as much metadata as each SSD, i.e. ~57G, that would greatly exceed the amount of free space available on your SSDs. You need more free metadata space. Another way to look at this: you got 1.5PB of data under management. A reasonable rule-of-thumb estimate for the amount of metadata is 1-2% of the data (this is a typical ratio, but of course every file system is different, and large deviations are possible. A degenerate case is an fs containing nothing but directories, and in this case metadata usage is 100%). So you have to have at least a few TB of metadata storage. 5 80G SSDs aren't enough for an fs of this size. > Could you please evaluate more on the performance overhead with > having metadata > on SSD+SATA? Are the read operations automatically directed to > faster disks by GPFS? > Is each write operation waiting for write to be finished by SATA disks? Mixing disks with sharply different performance characteristics within a single storage pool is detrimental to performance. GPFS stripes blocks across all disks in a storage pool, expecting all of them to be equally suitable. If SSDs are mixed with SATA disks, the overall metadata write performance is going to be bottlenecked by SATA drives. On reads, given a choice of two replicas, GPFS V4.1.1+ picks the the replica residing on the fastest disk, but given that SSDs represent only a small fraction of your total metadata usage, this likely doesn't help a whole lot. You're on the right track in trying to shift all metadata to SSDs and away from SATA, the overall file system performance will improve as the result. yuri > > Thank you, > -- > Miroslav Bauer > On 09/06/2016 09:06 PM, Yuri L Volobuev wrote: > The correct way to accomplish what you're looking for (in > particular, changing the fs-wide level of replication) is > mmrestripefs -R. This command also takes care of moving data off > disks now marked metadataOnly. 
> > The restripe job hits an error trying to move blocks of the inode > file, i.e. before it gets to actual user data blocks. Note that at > this point the metadata replication factor is still 2. This suggests > one of two possibilities: (1) there isn't enough actual free space > on the remaining metadataOnly disks, (2) there isn't enough space in > some failure groups to allocate two replicas. > > All of this assumes you're operating within a single storage pool. > If multiple storage pools are in play, there are other possibilities. > > 'mmdf' output would be helpful in providing more helpful advice. > With the information at hand, I can only suggest trying to > accomplish the task in two phases: (a) deallocated extra metadata > replicas, by doing mmchfs -m 1 + mmrestripefs -R (b) move metadata > off SATA disks. I do want to point out that metadata replication is > a highly recommended insurance policy to have for your file system. > As with other kinds of insurance, you may or may not need it, but if > you do end up needing it, you'll be very glad you have it. The > costs, in terms of extra metadata space and performance overhead, > are very reasonable. > > yuri > > > Miroslav Bauer ---09/01/2016 07:29:06 AM---Yes, failure group id is > exactly what I meant :). Unfortunately, mmrestripefs with -R > > From: Miroslav Bauer > To: gpfsug-discuss at spectrumscale.org, > Date: 09/01/2016 07:29 AM > Subject: Re: [gpfsug-discuss] Migration to separate metadata and data disks > Sent by: gpfsug-discuss-bounces at spectrumscale.org > > > > Yes, failure group id is exactly what I meant :). Unfortunately, > mmrestripefs with -R > behaves the same as with -r. I also believed that mmrestripefs -R is the > correct tool for > fixing the replication settings on inodes (according to manpages), but I > will try possible > solutions you and Marc suggested and let you know how it went. > > Thank you, > -- > Miroslav Bauer > > On 09/01/2016 04:02 PM, Aaron Knister wrote: > > Oh! I think you've already provided the info I was looking for :) I > > thought that failGroup=3 meant there were 3 failure groups within the > > SSDs. I suspect that's not at all what you meant and that actually is > > the failure group of all of those disks. That I think explains what's > > going on-- there's only one failure group's worth of metadata-capable > > disks available and as such GPFS can't place the 2nd replica for > > existing files. > > > > Here's what I would suggest: > > > > - Create at least 2 failure groups within the SSDs > > - Put the default metadata replication factor back to 2 > > - Run a restripefs -R to shuffle files around and restore the metadata > > replication factor of 2 to any files created while it was set to 1 > > > > If you're not interested in replication for metadata then perhaps all > > you need to do is the mmrestripefs -R. I think that should > > un-replicate the file from the SATA disks leaving the copy on the SSDs. > > > > Hope that helps. > > > > -Aaron > > > > On 9/1/16 9:39 AM, Aaron Knister wrote: > >> By the way, I suspect the no space on device errors are because GPFS > >> believes for some reason that it is unable to maintain the metadata > >> replication factor of 2 that's likely set on all previously created > >> inodes. > >> > >> On 9/1/16 9:36 AM, Aaron Knister wrote: > >>> I must admit, I'm curious as to the reason you're dropping the > >>> replication factor from 2 down to 1. 
There are some serious advantages > >>> we've seen to having multiple metadata replicas, as far as error > >>> recovery is concerned. > >>> > >>> Could you paste an output of mmlsdisk for the filesystem? > >>> > >>> -Aaron > >>> > >>> On 9/1/16 9:30 AM, Miroslav Bauer wrote: > >>>> Hello, > >>>> > >>>> I have a GPFS 3.5 filesystem (fs1) and I'm trying to migrate the > >>>> filesystem metadata from state: > >>>> -m = 2 (default metadata replicas) > >>>> - SATA disks (dataAndMetadata, failGroup=1) > >>>> - SSDs (metadataOnly, failGroup=3) > >>>> to the desired state: > >>>> -m = 1 > >>>> - SATA disks (dataOnly, failGroup=1) > >>>> - SSDs (metadataOnly, failGroup=3) > >>>> > >>>> I have done the following steps in the following order: > >>>> 1) change SATA disks to dataOnly (stanza file modifies the 'usage' > >>>> attribute only): > >>>> # mmchdisk fs1 change -F dataOnly_disks.stanza > >>>> Attention: Disk parameters were changed. > >>>> ? Use the mmrestripefs command with the -r option to relocate data and > >>>> metadata. > >>>> Verifying file system configuration information ... > >>>> mmchdisk: Propagating the cluster configuration data to all > >>>> ? affected nodes. ?This is an asynchronous process. > >>>> > >>>> 2) change default metadata replicas number 2->1 > >>>> # mmchfs fs1 -m 1 > >>>> > >>>> 3) run mmrestripefs as suggested by output of 1) > >>>> # mmrestripefs fs1 -r > >>>> Scanning file system metadata, phase 1 ... > >>>> Error processing inodes. > >>>> No space left on device > >>>> mmrestripefs: Command failed. ?Examine previous error messages to > >>>> determine cause. > >>>> > >>>> It is, however, still possible to create new files on the filesystem. > >>>> When I return one of the SATA disks as a dataAndMetadata disk, the > >>>> mmrestripefs > >>>> command stops complaining about No space left on device. Both df and > >>>> mmdf > >>>> say that there is enough space both for data (SATA) and metadata > >>>> (SSDs). > >>>> Does anyone have an idea why is it complaining? > >>>> > >>>> Thanks, > >>>> > >>>> -- > >>>> Miroslav Bauer > >>>> > >>>> > >>>> > >>>> > >>>> _______________________________________________ > >>>> gpfsug-discuss mailing list > >>>> gpfsug-discuss at spectrumscale.org > >>>> http://gpfsug.org/mailman/listinfo/gpfsug-discuss > >>>> > >>> > >> > > > > > [attachment "smime.p7s" deleted by Yuri L Volobuev/Austin/IBM] > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > [attachment "smime.p7s" deleted by Yuri L Volobuev/Austin/IBM] > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss -------------- next part -------------- An HTML attachment was scrubbed... URL: From volobuev at us.ibm.com Wed Sep 7 17:58:07 2016 From: volobuev at us.ibm.com (Yuri L Volobuev) Date: Wed, 7 Sep 2016 09:58:07 -0700 Subject: [gpfsug-discuss] Remote cluster mount failing In-Reply-To: References: Message-ID: It's unclear what's wrong. I'd have two main suspects: (1) TLS protocol version confusion, due to a difference in GSKit version and/or configuration (e.g. NIST SP800 compliance) on two sides (2) firewall. TLS issues are usually messy and tedious to work though. 
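A few quick comparisons between a client node that mounts successfully and the one that fails can help narrow down those two suspects before (or alongside) the PMR; this is only a first-pass checklist, not a full diagnosis:

# on the failing and a working client node: are the security settings identical?
mmlsconfig cipherList         # security mode / cipher in use
mmlsconfig nistCompliance     # a mismatch here (or in GSKit level) is a classic cause of handshake failures

# on the storage cluster: which client clusters and keys have actually been granted access?
mmauth show all

# on the client cluster: contact nodes and key file being used for the remote cluster
mmremotecluster show all

# also confirm the failing node can reach the storage cluster's daemon port (1191 by default)

If the working and failing client nodes differ in GPFS/GSKit level or nistCompliance setting, or a firewall between the failing node and the contact nodes drops port 1191, either would produce this kind of TLS handshake failure.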
I'd recommend opening a PMR to facilitate debug data collection and analysis. A lot of gory detail may be needed to figure out what's going on. yuri From: "Simon Thompson (Research Computing - IT Services)" To: "gpfsug-discuss at spectrumscale.org" , Date: 09/07/2016 05:37 AM Subject: [gpfsug-discuss] Remote cluster mount failing Sent by: gpfsug-discuss-bounces at spectrumscale.org Hi All, I'm trying to get some multi cluster thing working between two of our GPFS clusters. In the "client" cluster, when trying to mount the "remote" cluster, I get: # mmmount gpfs Wed 7 Sep 13:33:06 BST 2016: mmmount: Mounting file systems ... mount: mount /dev/gpfs on /gpfs failed: Connection timed out mmmount: Command failed. Examine previous error messages to determine cause. And in the log file: Wed Sep 7 13:33:07.481 2016: [N] The client side TLS handshake with node 10.0.0.182 was cancelled: connection reset by peer (return code 420). Wed Sep 7 13:33:07.486 2016: [N] The client side TLS handshake with node 10.0.0.181 was cancelled: connection reset by peer (return code 420). Wed Sep 7 13:33:07.487 2016: [E] Failed to join remote cluster GPFS_STORAGE.CLUSTER Wed Sep 7 13:33:07.488 2016: [W] Command: err 78: mount GPFS_STORAGE.CLUSTER:gpfs Wed Sep 7 13:33:07.489 2016: Connection timed out In the remote cluster, I see: Wed Sep 7 13:33:07.487 2016: [W] The TLS handshake with node 10.0.0.222 failed with error 447 (server side). Wed Sep 7 13:33:07.488 2016: [X] Connection from 10.10.0.35 refused, authentication failed Wed Sep 7 13:33:07.489 2016: [E] Killing connection from 10.10.0.35, err 703 Wed Sep 7 13:33:07.490 2016: Operation not permitted Weirdly though on other nodes in the client cluster this succeeds fine and can mount, so I think I got all the bits in the mmauth and mmremotecluster configured correctly. Any suggestions? Thanks Simon _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: graycol.gif Type: image/gif Size: 105 bytes Desc: not available URL: From Valdis.Kletnieks at vt.edu Wed Sep 7 19:45:43 2016 From: Valdis.Kletnieks at vt.edu (Valdis Kletnieks) Date: Wed, 07 Sep 2016 14:45:43 -0400 Subject: [gpfsug-discuss] Weirdness with 'mmces address add' Message-ID: <27691.1473273943@turing-police.cc.vt.edu> We're in the middle of deploying Spectrum Archive, and I've hit a snag. We assigned some floating IP addresses, which now need to be changed. So I look at the mmces manpage, and it looks like I need to add the new addresses, and delete the old ones. We're on GPFS 4.2.1.0, if that matters... What 'man mmces' says: 1. To add an address to a specified node, issue this command: mmces address add --ces-node node1 --ces-ip 10.1.2.3 (and at least 6 or 8 more uses of an IP address). What happens when I try it: (And yes, we have an 'isb' ces-group defined with addresses in it already) # mmces address add --ces-group isb --ces-ip 172.28.45.72 Cannot resolve 172.28.45.72; Name or service not known mmces address add: Incorrect value for --ces-ip option Usage: mmces address add [--ces-node Node] [--attribute Attribute] [--ces-group Group] {--ces-ip {IP[,IP...]} Am I missing some special sauce? 
(My first guess is that it's complaining because there's no PTR in the DNS for that address yet - but if it was going to do DNS lookups, it should be valid to give a hostname rather than an IP address (and nowhere in the manpage does it even *hint* that --ces-ip can be anything other than a list of IP addresses). Or is it time for me to file a PMR? From xhejtman at ics.muni.cz Wed Sep 7 21:11:11 2016 From: xhejtman at ics.muni.cz (Lukas Hejtmanek) Date: Wed, 7 Sep 2016 22:11:11 +0200 Subject: [gpfsug-discuss] DMAPI - Unmigrate file to Regular state In-Reply-To: References: Message-ID: <20160907201111.xmksazqjekk2ihsy@ics.muni.cz> On Tue, Sep 06, 2016 at 02:04:36PM +0200, Dominic Mueller-Wicke01 wrote: > Hi Miroslav, > > please use the command: > dsmrecall -resident -detail > or use it with file lists well, it looks like Client Version 7, Release 1, Level 4.4 leaks file descriptors: 09/07/2016 21:03:07 ANS1587W Unable to read extended attributes for object /exports/tape_tape/VO_metacentrum/home/jfeit/atlases/atlases/novo3/atlases/images/.svn/prop-base due to errno: 24, reason: Too many open files after about 15 minutes of run, I can see 88 opened files in /proc/$PID/fd when using: dsmrecall -R -RESid -D /path/* is it something known fixed in newer versions? -- Luk?? Hejtm?nek From taylorm at us.ibm.com Wed Sep 7 21:40:13 2016 From: taylorm at us.ibm.com (Michael L Taylor) Date: Wed, 7 Sep 2016 13:40:13 -0700 Subject: [gpfsug-discuss] Weirdness with 'mmces address add' In-Reply-To: References: Message-ID: Can't be for certain this is what you're hitting but reverse DNS lookup is documented the KC: http://www.ibm.com/support/knowledgecenter/STXKQY_4.2.1/com.ibm.spectrum.scale.v4r21.doc/bl1ins_protocolnodeipfurtherconfig.htm Note: All CES IPs must have an associated hostname and reverse DNS lookup must be configured for each. For more information, see Adding export IPs in Deploying protocols. http://www.ibm.com/support/knowledgecenter/STXKQY_4.2.1/com.ibm.spectrum.scale.v4r21.doc/bl1ins_deployingprotocolstasks.htm Note: Export IPs must have an associated hostname and reverse DNS lookup must be configured for each. Can you make sure the IPs have reverse DNS lookup and try again? Will get the mmces man page updated for address add -------------- next part -------------- An HTML attachment was scrubbed... URL: From Valdis.Kletnieks at vt.edu Wed Sep 7 22:23:30 2016 From: Valdis.Kletnieks at vt.edu (Valdis.Kletnieks at vt.edu) Date: Wed, 07 Sep 2016 17:23:30 -0400 Subject: [gpfsug-discuss] Weirdness with 'mmces address add' In-Reply-To: References: Message-ID: <41089.1473283410@turing-police.cc.vt.edu> On Wed, 07 Sep 2016 13:40:13 -0700, "Michael L Taylor" said: > Can't be for certain this is what you're hitting but reverse DNS lookup is > documented the KC: > Note: All CES IPs must have an associated hostname and reverse DNS lookup > must be configured for each. For more information, see Adding export IPs in > Deploying protocols. Bingo. That was it. Since the DNS will take a while to fix, I fed the appropriate entries to /etc/hosts and it worked fine. I got thrown for a loop because if there is enough code to do that checking, it should be able to accept a hostname as well (RFE time? 
:) From ulmer at ulmer.org Wed Sep 7 22:34:07 2016 From: ulmer at ulmer.org (Stephen Ulmer) Date: Wed, 7 Sep 2016 17:34:07 -0400 Subject: [gpfsug-discuss] Weirdness with 'mmces address add' In-Reply-To: <41089.1473283410@turing-police.cc.vt.edu> References: <41089.1473283410@turing-police.cc.vt.edu> Message-ID: Hostnames can have many A records. IPs *generally* only have one PTR (though it?s not restricted, multiple PTRs is not recommended). Just knowing that you can see why allowing names would create more questions than it answers. So if it did take names instead of IP addresses, it would usually only do what you meant part of the time -- and sometimes none of the time. :) -- Stephen > On Sep 7, 2016, at 5:23 PM, Valdis.Kletnieks at vt.edu wrote: > > On Wed, 07 Sep 2016 13:40:13 -0700, "Michael L Taylor" said: > >> Can't be for certain this is what you're hitting but reverse DNS lookup is >> documented the KC: > >> Note: All CES IPs must have an associated hostname and reverse DNS lookup >> must be configured for each. For more information, see Adding export IPs in >> Deploying protocols. > > Bingo. That was it. Since the DNS will take a while to fix, I fed > the appropriate entries to /etc/hosts and it worked fine. > > I got thrown for a loop because if there is enough code to do that checking, > it should be able to accept a hostname as well (RFE time? :) > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss From Valdis.Kletnieks at vt.edu Wed Sep 7 22:54:05 2016 From: Valdis.Kletnieks at vt.edu (Valdis.Kletnieks at vt.edu) Date: Wed, 07 Sep 2016 17:54:05 -0400 Subject: [gpfsug-discuss] Weirdness with 'mmces address add' In-Reply-To: References: <41089.1473283410@turing-police.cc.vt.edu> Message-ID: <43934.1473285245@turing-police.cc.vt.edu> On Wed, 07 Sep 2016 17:34:07 -0400, Stephen Ulmer said: > Hostnames can have many A records. And quad-A records. :) (Despite our best efforts, we're still one of the 100 biggest IPv6 deployments according to http://www.worldipv6launch.org/measurements/ - were's sitting at 84th in traffic volume and 18th by percent penetration, mostly because we deployed it in production literally last century...) From janfrode at tanso.net Thu Sep 8 06:08:47 2016 From: janfrode at tanso.net (Jan-Frode Myklebust) Date: Thu, 08 Sep 2016 05:08:47 +0000 Subject: [gpfsug-discuss] Weirdness with 'mmces address add' In-Reply-To: <27691.1473273943@turing-police.cc.vt.edu> References: <27691.1473273943@turing-police.cc.vt.edu> Message-ID: I believe your first guess is correct. The ces-ip needs to be resolvable for some reason... Just put a name for it in /etc/hosts, if you can't add it to your dns. -jf ons. 7. sep. 2016 kl. 20.45 skrev Valdis Kletnieks : > We're in the middle of deploying Spectrum Archive, and I've hit a > snag. We assigned some floating IP addresses, which now need to > be changed. So I look at the mmces manpage, and it looks like I need > to add the new addresses, and delete the old ones. > > We're on GPFS 4.2.1.0, if that matters... > > What 'man mmces' says: > > 1. To add an address to a specified node, issue this command: > > mmces address add --ces-node node1 --ces-ip 10.1.2.3 > > (and at least 6 or 8 more uses of an IP address). 
> > What happens when I try it: (And yes, we have an 'isb' ces-group defined > with > addresses in it already) > > # mmces address add --ces-group isb --ces-ip 172.28.45.72 > Cannot resolve 172.28.45.72; Name or service not known > mmces address add: Incorrect value for --ces-ip option > Usage: > mmces address add [--ces-node Node] [--attribute Attribute] [--ces-group > Group] > {--ces-ip {IP[,IP...]} > > Am I missing some special sauce? (My first guess is that it's complaining > because there's no PTR in the DNS for that address yet - but if it was > going > to do DNS lookups, it should be valid to give a hostname rather than an IP > address (and nowhere in the manpage does it even *hint* that --ces-ip can > be anything other than a list of IP addresses). > > Or is it time for me to file a PMR? > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > -------------- next part -------------- An HTML attachment was scrubbed... URL: From dominic.mueller at de.ibm.com Thu Sep 8 06:35:55 2016 From: dominic.mueller at de.ibm.com (Dominic Mueller-Wicke01) Date: Thu, 8 Sep 2016 07:35:55 +0200 Subject: [gpfsug-discuss] DMAPI - Unmigrate file to Regular state In-Reply-To: References: Message-ID: Please open a PMR for the not working "recall to resident". Some investigation is needed here. Thanks. Greetings, Dominic. From: gpfsug-discuss-request at spectrumscale.org To: gpfsug-discuss at spectrumscale.org Date: 07.09.2016 23:23 Subject: gpfsug-discuss Digest, Vol 56, Issue 14 Sent by: gpfsug-discuss-bounces at spectrumscale.org Send gpfsug-discuss mailing list submissions to gpfsug-discuss at spectrumscale.org To subscribe or unsubscribe via the World Wide Web, visit http://gpfsug.org/mailman/listinfo/gpfsug-discuss or, via email, send a message with subject or body 'help' to gpfsug-discuss-request at spectrumscale.org You can reach the person managing the list at gpfsug-discuss-owner at spectrumscale.org When replying, please edit your Subject line so it is more specific than "Re: Contents of gpfsug-discuss digest..." Today's Topics: 1. Re: Remote cluster mount failing (Yuri L Volobuev) 2. Weirdness with 'mmces address add' (Valdis Kletnieks) 3. Re: DMAPI - Unmigrate file to Regular state (Lukas Hejtmanek) 4. Weirdness with 'mmces address add' (Michael L Taylor) 5. Re: Weirdness with 'mmces address add' (Valdis.Kletnieks at vt.edu) ----- Message from "Yuri L Volobuev" on Wed, 7 Sep 2016 09:58:07 -0700 ----- To: gpfsug main discussion list Subject: Re: [gpfsug-discuss] Remote cluster mount failing It's unclear what's wrong. I'd have two main suspects: (1) TLS protocol version confusion, due to a difference in GSKit version and/or configuration (e.g. NIST SP800 compliance) on two sides (2) firewall. TLS issues are usually messy and tedious to work though. I'd recommend opening a PMR to facilitate debug data collection and analysis. A lot of gory detail may be needed to figure out what's going on. 
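For the debug data collection itself, the usual starting point on the GPFS side is a gpfs.snap from the nodes involved, plus a quick check that the daemon port is open between the clusters. A minimal sketch (the address 10.0.0.181 is one of the storage-cluster nodes from the log excerpts above; everything else here is generic):

  gpfs.snap                 # collect support data on the failing client node and on a storage-cluster contact node
  nc -zv 10.0.0.181 1191    # verify the GPFS daemon port (1191 by default) is reachable, to rule out a firewall

The resulting archives are what typically get attached to the PMR.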
yuri Inactive hide details for "Simon Thompson (Research Computing - IT Services)" ---09/07/2016 05:37:11 AM---Hi All, I'm trying to"Simon Thompson (Research Computing - IT Services)" ---09/07/2016 05:37:11 AM---Hi All, I'm trying to get some multi cluster thing working between two of our GPFS From: "Simon Thompson (Research Computing - IT Services)" To: "gpfsug-discuss at spectrumscale.org" , Date: 09/07/2016 05:37 AM Subject: [gpfsug-discuss] Remote cluster mount failing Sent by: gpfsug-discuss-bounces at spectrumscale.org Hi All, I'm trying to get some multi cluster thing working between two of our GPFS clusters. In the "client" cluster, when trying to mount the "remote" cluster, I get: # mmmount gpfs Wed 7 Sep 13:33:06 BST 2016: mmmount: Mounting file systems ... mount: mount /dev/gpfs on /gpfs failed: Connection timed out mmmount: Command failed. Examine previous error messages to determine cause. And in the log file: Wed Sep 7 13:33:07.481 2016: [N] The client side TLS handshake with node 10.0.0.182 was cancelled: connection reset by peer (return code 420). Wed Sep 7 13:33:07.486 2016: [N] The client side TLS handshake with node 10.0.0.181 was cancelled: connection reset by peer (return code 420). Wed Sep 7 13:33:07.487 2016: [E] Failed to join remote cluster GPFS_STORAGE.CLUSTER Wed Sep 7 13:33:07.488 2016: [W] Command: err 78: mount GPFS_STORAGE.CLUSTER:gpfs Wed Sep 7 13:33:07.489 2016: Connection timed out In the remote cluster, I see: Wed Sep 7 13:33:07.487 2016: [W] The TLS handshake with node 10.0.0.222 failed with error 447 (server side). Wed Sep 7 13:33:07.488 2016: [X] Connection from 10.10.0.35 refused, authentication failed Wed Sep 7 13:33:07.489 2016: [E] Killing connection from 10.10.0.35, err 703 Wed Sep 7 13:33:07.490 2016: Operation not permitted Weirdly though on other nodes in the client cluster this succeeds fine and can mount, so I think I got all the bits in the mmauth and mmremotecluster configured correctly. Any suggestions? Thanks Simon _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss ----- Message from Valdis Kletnieks on Wed, 07 Sep 2016 14:45:43 -0400 ----- To: gpfsug-discuss at spectrumscale.org Subject: [gpfsug-discuss] Weirdness with 'mmces address add' We're in the middle of deploying Spectrum Archive, and I've hit a snag. We assigned some floating IP addresses, which now need to be changed. So I look at the mmces manpage, and it looks like I need to add the new addresses, and delete the old ones. We're on GPFS 4.2.1.0, if that matters... What 'man mmces' says: 1. To add an address to a specified node, issue this command: mmces address add --ces-node node1 --ces-ip 10.1.2.3 (and at least 6 or 8 more uses of an IP address). What happens when I try it: (And yes, we have an 'isb' ces-group defined with addresses in it already) # mmces address add --ces-group isb --ces-ip 172.28.45.72 Cannot resolve 172.28.45.72; Name or service not known mmces address add: Incorrect value for --ces-ip option Usage: mmces address add [--ces-node Node] [--attribute Attribute] [--ces-group Group] {--ces-ip {IP[,IP...]} Am I missing some special sauce? (My first guess is that it's complaining because there's no PTR in the DNS for that address yet - but if it was going to do DNS lookups, it should be valid to give a hostname rather than an IP address (and nowhere in the manpage does it even *hint* that --ces-ip can be anything other than a list of IP addresses). 
Or is it time for me to file a PMR? ----- Message from Lukas Hejtmanek on Wed, 7 Sep 2016 22:11:11 +0200 ----- To: gpfsug main discussion list Subject: Re: [gpfsug-discuss] DMAPI - Unmigrate file to Regular state On Tue, Sep 06, 2016 at 02:04:36PM +0200, Dominic Mueller-Wicke01 wrote: > Hi Miroslav, > > please use the command: > dsmrecall -resident -detail > or use it with file lists well, it looks like Client Version 7, Release 1, Level 4.4 leaks file descriptors: 09/07/2016 21:03:07 ANS1587W Unable to read extended attributes for object /exports/tape_tape/VO_metacentrum/home/jfeit/atlases/atlases/novo3/atlases/images/.svn/prop-base due to errno: 24, reason: Too many open files after about 15 minutes of run, I can see 88 opened files in /proc/$PID/fd when using: dsmrecall -R -RESid -D /path/* is it something known fixed in newer versions? -- Luk?? Hejtm?nek ----- Message from "Michael L Taylor" on Wed, 7 Sep 2016 13:40:13 -0700 ----- To: gpfsug-discuss at spectrumscale.org Subject: [gpfsug-discuss] Weirdness with 'mmces address add' Can't be for certain this is what you're hitting but reverse DNS lookup is documented the KC: http://www.ibm.com/support/knowledgecenter/STXKQY_4.2.1/com.ibm.spectrum.scale.v4r21.doc/bl1ins_protocolnodeipfurtherconfig.htm Note: All CES IPs must have an associated hostname and reverse DNS lookup must be configured for each. For more information, see Adding export IPs in Deploying protocols. http://www.ibm.com/support/knowledgecenter/STXKQY_4.2.1/com.ibm.spectrum.scale.v4r21.doc/bl1ins_deployingprotocolstasks.htm Note: Export IPs must have an associated hostname and reverse DNS lookup must be configured for each. Can you make sure the IPs have reverse DNS lookup and try again? Will get the mmces man page updated for address add ----- Message from Valdis.Kletnieks at vt.edu on Wed, 07 Sep 2016 17:23:30 -0400 ----- To: gpfsug main discussion list Subject: Re: [gpfsug-discuss] Weirdness with 'mmces address add' On Wed, 07 Sep 2016 13:40:13 -0700, "Michael L Taylor" said: > Can't be for certain this is what you're hitting but reverse DNS lookup is > documented the KC: > Note: All CES IPs must have an associated hostname and reverse DNS lookup > must be configured for each. For more information, see Adding export IPs in > Deploying protocols. Bingo. That was it. Since the DNS will take a while to fix, I fed the appropriate entries to /etc/hosts and it worked fine. I got thrown for a loop because if there is enough code to do that checking, it should be able to accept a hostname as well (RFE time? :) _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: graycol.gif Type: image/gif Size: 105 bytes Desc: not available URL: From S.J.Thompson at bham.ac.uk Fri Sep 9 15:37:28 2016 From: S.J.Thompson at bham.ac.uk (Simon Thompson (Research Computing - IT Services)) Date: Fri, 9 Sep 2016 14:37:28 +0000 Subject: [gpfsug-discuss] Remote cluster mount failing In-Reply-To: References: Message-ID: That?s sorta what I was expecting. Though I was hoping someone might have said 'oh just run mmchconfig ....' or something easy. PMR on its way in. Thanks! 
Simon From: > on behalf of Yuri L Volobuev > Reply-To: "gpfsug-discuss at spectrumscale.org" > Date: Wednesday, 7 September 2016 at 17:58 To: "gpfsug-discuss at spectrumscale.org" > Subject: Re: [gpfsug-discuss] Remote cluster mount failing It's unclear what's wrong. I'd have two main suspects: (1) TLS protocol version confusion, due to a difference in GSKit version and/or configuration (e.g. NIST SP800 compliance) on two sides (2) firewall. TLS issues are usually messy and tedious to work though. I'd recommend opening a PMR to facilitate debug data collection and analysis. A lot of gory detail may be needed to figure out what's going on. yuri [Inactive hide details for "Simon Thompson (Research Computing - IT Services)" ---09/07/2016 05:37:11 AM---Hi All, I'm trying to]"Simon Thompson (Research Computing - IT Services)" ---09/07/2016 05:37:11 AM---Hi All, I'm trying to get some multi cluster thing working between two of our GPFS From: "Simon Thompson (Research Computing - IT Services)" > To: "gpfsug-discuss at spectrumscale.org" >, Date: 09/07/2016 05:37 AM Subject: [gpfsug-discuss] Remote cluster mount failing Sent by: gpfsug-discuss-bounces at spectrumscale.org ________________________________ Hi All, I'm trying to get some multi cluster thing working between two of our GPFS clusters. In the "client" cluster, when trying to mount the "remote" cluster, I get: # mmmount gpfs Wed 7 Sep 13:33:06 BST 2016: mmmount: Mounting file systems ... mount: mount /dev/gpfs on /gpfs failed: Connection timed out mmmount: Command failed. Examine previous error messages to determine cause. And in the log file: Wed Sep 7 13:33:07.481 2016: [N] The client side TLS handshake with node 10.0.0.182 was cancelled: connection reset by peer (return code 420). Wed Sep 7 13:33:07.486 2016: [N] The client side TLS handshake with node 10.0.0.181 was cancelled: connection reset by peer (return code 420). Wed Sep 7 13:33:07.487 2016: [E] Failed to join remote cluster GPFS_STORAGE.CLUSTER Wed Sep 7 13:33:07.488 2016: [W] Command: err 78: mount GPFS_STORAGE.CLUSTER:gpfs Wed Sep 7 13:33:07.489 2016: Connection timed out In the remote cluster, I see: Wed Sep 7 13:33:07.487 2016: [W] The TLS handshake with node 10.0.0.222 failed with error 447 (server side). Wed Sep 7 13:33:07.488 2016: [X] Connection from 10.10.0.35 refused, authentication failed Wed Sep 7 13:33:07.489 2016: [E] Killing connection from 10.10.0.35, err 703 Wed Sep 7 13:33:07.490 2016: Operation not permitted Weirdly though on other nodes in the client cluster this succeeds fine and can mount, so I think I got all the bits in the mmauth and mmremotecluster configured correctly. Any suggestions? Thanks Simon _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: graycol.gif Type: image/gif Size: 105 bytes Desc: graycol.gif URL: From volobuev at us.ibm.com Fri Sep 9 17:29:35 2016 From: volobuev at us.ibm.com (Yuri L Volobuev) Date: Fri, 9 Sep 2016 09:29:35 -0700 Subject: [gpfsug-discuss] Remote cluster mount failing In-Reply-To: References: Message-ID: It could be "easy" in the end, e.g. regenerating the key ("mmauth genkey new") may fix the issue. 
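A minimal sketch of that key-regeneration sequence, in case it helps (the cluster name GPFS_STORAGE.CLUSTER is the one from the log excerpts; the key file path used on the client side is illustrative):

  # on the storage cluster: generate a new key pair; the old key remains in effect until committed
  mmauth genkey new
  # copy the new public key (/var/mmfs/ssl/id_rsa.pub) to the client cluster, then on the client cluster:
  mmremotecluster update GPFS_STORAGE.CLUSTER -k /tmp/storage_id_rsa.pub
  # back on the storage cluster, once every remote cluster has picked up the new key:
  mmauth genkey commit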
Figuring out exactly what is going wrong is messy though, and requires looking at a number of debug data points, something that's awkward to do on a public mailing list. I don't think you want to post certificates et al on a mailing list. The PMR channel is more appropriate for this kind of thing. yuri From: "Simon Thompson (Research Computing - IT Services)" To: gpfsug main discussion list , Date: 09/09/2016 07:37 AM Subject: Re: [gpfsug-discuss] Remote cluster mount failing Sent by: gpfsug-discuss-bounces at spectrumscale.org That?s sorta what I was expecting. Though I was hoping someone might have said 'oh just run mmchconfig ....' or something easy. PMR on its way in. Thanks! Simon From: on behalf of Yuri L Volobuev Reply-To: "gpfsug-discuss at spectrumscale.org" < gpfsug-discuss at spectrumscale.org> Date: Wednesday, 7 September 2016 at 17:58 To: "gpfsug-discuss at spectrumscale.org" Subject: Re: [gpfsug-discuss] Remote cluster mount failing It's unclear what's wrong. I'd have two main suspects: (1) TLS protocol version confusion, due to a difference in GSKit version and/or configuration (e.g. NIST SP800 compliance) on two sides (2) firewall. TLS issues are usually messy and tedious to work though. I'd recommend opening a PMR to facilitate debug data collection and analysis. A lot of gory detail may be needed to figure out what's going on. yuri Inactive hide details for "Simon Thompson (Research Computing - IT Services)" ---09/07/2016 05:37:11 AM---Hi All, I'm trying to"Simon Thompson (Research Computing - IT Services)" ---09/07/2016 05:37:11 AM---Hi All, I'm trying to get some multi cluster thing working between two of our GPFS From: "Simon Thompson (Research Computing - IT Services)" < S.J.Thompson at bham.ac.uk> To: "gpfsug-discuss at spectrumscale.org" , Date: 09/07/2016 05:37 AM Subject: [gpfsug-discuss] Remote cluster mount failing Sent by: gpfsug-discuss-bounces at spectrumscale.org Hi All, I'm trying to get some multi cluster thing working between two of our GPFS clusters. In the "client" cluster, when trying to mount the "remote" cluster, I get: # mmmount gpfs Wed 7 Sep 13:33:06 BST 2016: mmmount: Mounting file systems ... mount: mount /dev/gpfs on /gpfs failed: Connection timed out mmmount: Command failed. Examine previous error messages to determine cause. And in the log file: Wed Sep 7 13:33:07.481 2016: [N] The client side TLS handshake with node 10.0.0.182 was cancelled: connection reset by peer (return code 420). Wed Sep 7 13:33:07.486 2016: [N] The client side TLS handshake with node 10.0.0.181 was cancelled: connection reset by peer (return code 420). Wed Sep 7 13:33:07.487 2016: [E] Failed to join remote cluster GPFS_STORAGE.CLUSTER Wed Sep 7 13:33:07.488 2016: [W] Command: err 78: mount GPFS_STORAGE.CLUSTER:gpfs Wed Sep 7 13:33:07.489 2016: Connection timed out In the remote cluster, I see: Wed Sep 7 13:33:07.487 2016: [W] The TLS handshake with node 10.0.0.222 failed with error 447 (server side). Wed Sep 7 13:33:07.488 2016: [X] Connection from 10.10.0.35 refused, authentication failed Wed Sep 7 13:33:07.489 2016: [E] Killing connection from 10.10.0.35, err 703 Wed Sep 7 13:33:07.490 2016: Operation not permitted Weirdly though on other nodes in the client cluster this succeeds fine and can mount, so I think I got all the bits in the mmauth and mmremotecluster configured correctly. Any suggestions? 
Thanks Simon _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss [attachment "graycol.gif" deleted by Yuri L Volobuev/Austin/IBM] -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: graycol.gif Type: image/gif Size: 105 bytes Desc: not available URL: From bbanister at jumptrading.com Sat Sep 10 22:50:25 2016 From: bbanister at jumptrading.com (Bryan Banister) Date: Sat, 10 Sep 2016 21:50:25 +0000 Subject: [gpfsug-discuss] Edge Attendees In-Reply-To: References: Message-ID: <21BC488F0AEA2245B2C3E83FC0B33DBB063297AB@CHI-EXCHANGEW1.w2k.jumptrading.com> Hi Doug, Found out that I get to attend this year. Please put me down for the SS NDA round-table, -Bryan From: gpfsug-discuss-bounces at spectrumscale.org [mailto:gpfsug-discuss-bounces at spectrumscale.org] On Behalf Of Douglas O'flaherty Sent: Monday, August 29, 2016 12:34 AM To: gpfsug-discuss at spectrumscale.org Subject: [gpfsug-discuss] Edge Attendees Greetings: I am organizing an NDA round-table with the IBM Offering Managers at IBM Edge on Tuesday, September 20th at 1pm. The subject will be "The Future of IBM Spectrum Scale." IBM Offering Managers are the Product Owners at IBM. There will be discussions covering licensing, the roadmap for IBM Spectrum Scale RAID (aka GNR), new hardware platforms, etc. This is a unique opportunity to get feedback to the drivers of the IBM Spectrum Scale business plans. It should be a great companion to the content we get from Engineering and Research at most User Group meetings. To get an invitation, please email me privately at douglasof us.ibm.com. All who have a valid NDA are invited. I only need an approximate headcount of attendees. Try not to spam the mailing list. I am pushing to get the Offering Managers to have a similar session at SC16 as an IBM Multi-client Briefing. You can add your voice to that call on this thread, or email me directly. Spectrum Scale User Group at SC16 will once again take place on Sunday afternoon with cocktails to follow. I hope we can blow out the attendance numbers and the number of site speakers we had last year! I know Simon, Bob, and Kristy are already working the agenda. Get your ideas in to them or to me. See you in Vegas, Vegas, SLC, Vegas this Fall... Maybe Australia in between? doug Douglas O'Flaherty IBM Spectrum Storage Marketing ________________________________ Note: This email is for the confidential use of the named addressee(s) only and may contain proprietary, confidential or privileged information. If you are not the intended recipient, you are hereby notified that any review, dissemination or copying of this email is strictly prohibited, and to please notify the sender immediately and destroy this email and any attachments. Email transmission cannot be guaranteed to be secure or error-free. The Company, therefore, does not make any guarantees as to the completeness or accuracy of this email or any attachments. This email is for informational purposes only and does not constitute a recommendation, offer, request or solicitation of any kind to buy, sell, subscribe, redeem or perform any type of transaction of a financial product. 
-------------- next part -------------- An HTML attachment was scrubbed... URL: From aaron.s.knister at nasa.gov Sun Sep 11 22:02:48 2016 From: aaron.s.knister at nasa.gov (Knister, Aaron S. (GSFC-606.2)[COMPUTER SCIENCE CORP]) Date: Sun, 11 Sep 2016 21:02:48 +0000 Subject: [gpfsug-discuss] GPFS Routers Message-ID: <5F910253243E6A47B81A9A2EB424BBA101D137D1@NDMSMBX404.ndc.nasa.gov> Hi Everyone, A while back I seem to recall hearing about a mechanism being developed that would function similarly to Lustre's LNET routers and effectively allow a single set of NSD servers to talk to multiple RDMA fabrics without requiring the NSD servers to have infiniband interfaces on each RDMA fabric. Rather, one would have a set of GPFS gateway nodes on each fabric that would in effect proxy the RDMA requests to the NSD server. Does anyone know what I'm talking about? Just curious if it's still on the roadmap. -Aaron -------------- next part -------------- An HTML attachment was scrubbed... URL: From Robert.Oesterlin at nuance.com Sun Sep 11 23:31:56 2016 From: Robert.Oesterlin at nuance.com (Oesterlin, Robert) Date: Sun, 11 Sep 2016 22:31:56 +0000 Subject: [gpfsug-discuss] Grafana Bridge Code - for GPFS Performance Sensors - Now on the IBM Wiki Message-ID: <2B003708-B2E3-474B-8035-F3A080CB2EAF@nuance.com> IBM has formally published this bridge code - and you can get the details and download it here: IBM Spectrum Scale Performance Monitoring Bridge https://www.ibm.com/developerworks/community/wikis/home?lang=en#!/wiki/General%20Parallel%20File%20System%20(GPFS)/page/IBM%20Spectrum%20Scale%20Performance Monitoring%20Bridge Also, see this Storage Community Blog Post (it references version 4.2.2, but I think they mean 4.2.1) http://storagecommunity.org/easyblog/entry/performance-data-graphical-visualization-for-ibm-spectrum-scale-environment I've been using it for a while - if you have any questions, let me know. Bob Oesterlin Sr Storage Engineer, Nuance HPC Grid -------------- next part -------------- An HTML attachment was scrubbed... URL: From aaron.s.knister at nasa.gov Mon Sep 12 01:00:32 2016 From: aaron.s.knister at nasa.gov (Aaron Knister) Date: Sun, 11 Sep 2016 20:00:32 -0400 Subject: [gpfsug-discuss] GPFS Routers In-Reply-To: <5F910253243E6A47B81A9A2EB424BBA101D137D1@NDMSMBX404.ndc.nasa.gov> References: <5F910253243E6A47B81A9A2EB424BBA101D137D1@NDMSMBX404.ndc.nasa.gov> Message-ID: <712e5024-e6eb-9195-d9cd-f59a9b145e60@nasa.gov> After some googling around, I wonder if perhaps what I'm thinking of was an I/O forwarding layer that I understood was being developed for x86_64 type machines rather than some type of GPFS protocol router or proxy. -Aaron On 9/11/16 5:02 PM, Knister, Aaron S. (GSFC-606.2)[COMPUTER SCIENCE CORP] wrote: > Hi Everyone, > > A while back I seem to recall hearing about a mechanism being developed > that would function similarly to Lustre's LNET routers and effectively > allow a single set of NSD servers to talk to multiple RDMA fabrics > without requiring the NSD servers to have infiniband interfaces on each > RDMA fabric. Rather, one would have a set of GPFS gateway nodes on each > fabric that would in effect proxy the RDMA requests to the NSD server. > Does anyone know what I'm talking about? Just curious if it's still on > the roadmap. 
> > -Aaron > > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > -- Aaron Knister NASA Center for Climate Simulation (Code 606.2) Goddard Space Flight Center (301) 286-2776 From douglasof at us.ibm.com Mon Sep 12 02:38:08 2016 From: douglasof at us.ibm.com (Douglas O'flaherty) Date: Sun, 11 Sep 2016 21:38:08 -0400 Subject: [gpfsug-discuss] gpfsug-discuss Digest, Vol 56, Issue 17 In-Reply-To: References: Message-ID: See you... and anyone else who can make it in Vegas in a couple weeks! From: gpfsug-discuss-request at spectrumscale.org To: gpfsug-discuss at spectrumscale.org Date: 09/11/2016 07:00 AM Subject: gpfsug-discuss Digest, Vol 56, Issue 17 Sent by: gpfsug-discuss-bounces at spectrumscale.org Send gpfsug-discuss mailing list submissions to gpfsug-discuss at spectrumscale.org To subscribe or unsubscribe via the World Wide Web, visit http://gpfsug.org/mailman/listinfo/gpfsug-discuss or, via email, send a message with subject or body 'help' to gpfsug-discuss-request at spectrumscale.org You can reach the person managing the list at gpfsug-discuss-owner at spectrumscale.org When replying, please edit your Subject line so it is more specific than "Re: Contents of gpfsug-discuss digest..." Today's Topics: 1. Re: Edge Attendees (Bryan Banister) ----- Message from Bryan Banister on Sat, 10 Sep 2016 21:50:25 +0000 ----- To: gpfsug main discussion list Subject: Re: [gpfsug-discuss] Edge Attendees Hi Doug, Found out that I get to attend this year. Please put me down for the SS NDA round-table, -Bryan From: gpfsug-discuss-bounces at spectrumscale.org [ mailto:gpfsug-discuss-bounces at spectrumscale.org] On Behalf Of Douglas O'flaherty Sent: Monday, August 29, 2016 12:34 AM To: gpfsug-discuss at spectrumscale.org Subject: [gpfsug-discuss] Edge Attendees Greetings: I am organizing an NDA round-table with the IBM Offering Managers at IBM Edge on Tuesday, September 20th at 1pm. The subject will be "The Future of IBM Spectrum Scale." IBM Offering Managers are the Product Owners at IBM. There will be discussions covering licensing, the roadmap for IBM Spectrum Scale RAID (aka GNR), new hardware platforms, etc. This is a unique opportunity to get feedback to the drivers of the IBM Spectrum Scale business plans. It should be a great companion to the content we get from Engineering and Research at most User Group meetings. To get an invitation, please email me privately at douglasof us.ibm.com. All who have a valid NDA are invited. I only need an approximate headcount of attendees. Try not to spam the mailing list. I am pushing to get the Offering Managers to have a similar session at SC16 as an IBM Multi-client Briefing. You can add your voice to that call on this thread, or email me directly. Spectrum Scale User Group at SC16 will once again take place on Sunday afternoon with cocktails to follow. I hope we can blow out the attendance numbers and the number of site speakers we had last year! I know Simon, Bob, and Kristy are already working the agenda. Get your ideas in to them or to me. See you in Vegas, Vegas, SLC, Vegas this Fall... Maybe Australia in between? doug Douglas O'Flaherty IBM Spectrum Storage Marketing Note: This email is for the confidential use of the named addressee(s) only and may contain proprietary, confidential or privileged information. 
If you are not the intended recipient, you are hereby notified that any review, dissemination or copying of this email is strictly prohibited, and to please notify the sender immediately and destroy this email and any attachments. Email transmission cannot be guaranteed to be secure or error-free. The Company, therefore, does not make any guarantees as to the completeness or accuracy of this email or any attachments. This email is for informational purposes only and does not constitute a recommendation, offer, request or solicitation of any kind to buy, sell, subscribe, redeem or perform any type of transaction of a financial product. _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss -------------- next part -------------- An HTML attachment was scrubbed... URL: From knop at us.ibm.com Mon Sep 12 06:17:05 2016 From: knop at us.ibm.com (Felipe Knop) Date: Mon, 12 Sep 2016 01:17:05 -0400 Subject: [gpfsug-discuss] Remote cluster mount failing In-Reply-To: References: Message-ID: There is a chance the problem might be related to an upgrade from 3.5 to 4.1, or perhaps a remote mount between versions 3.5 and 4.1. It would be useful to know details related to any such migration and different releases when the PMR is opened. Thanks, Felipe ---- Felipe Knop knop at us.ibm.com GPFS Development and Security IBM Systems IBM Building 008 2455 South Rd, Poughkeepsie, NY 12601 (845) 433-9314 T/L 293-9314 From: Yuri L Volobuev/Austin/IBM at IBMUS To: gpfsug main discussion list Date: 09/09/2016 12:30 PM Subject: Re: [gpfsug-discuss] Remote cluster mount failing Sent by: gpfsug-discuss-bounces at spectrumscale.org It could be "easy" in the end, e.g. regenerating the key ("mmauth genkey new") may fix the issue. Figuring out exactly what is going wrong is messy though, and requires looking at a number of debug data points, something that's awkward to do on a public mailing list. I don't think you want to post certificates et al on a mailing list. The PMR channel is more appropriate for this kind of thing. yuri "Simon Thompson (Research Computing - IT Services)" ---09/09/2016 07:37:52 AM---That?s sorta what I was expecting. Though I was hoping someone might have said 'oh just run mmchconf From: "Simon Thompson (Research Computing - IT Services)" To: gpfsug main discussion list , Date: 09/09/2016 07:37 AM Subject: Re: [gpfsug-discuss] Remote cluster mount failing Sent by: gpfsug-discuss-bounces at spectrumscale.org That?s sorta what I was expecting. Though I was hoping someone might have said 'oh just run mmchconfig ....' or something easy. PMR on its way in. Thanks! Simon From: on behalf of Yuri L Volobuev Reply-To: "gpfsug-discuss at spectrumscale.org" < gpfsug-discuss at spectrumscale.org> Date: Wednesday, 7 September 2016 at 17:58 To: "gpfsug-discuss at spectrumscale.org" Subject: Re: [gpfsug-discuss] Remote cluster mount failing It's unclear what's wrong. I'd have two main suspects: (1) TLS protocol version confusion, due to a difference in GSKit version and/or configuration (e.g. NIST SP800 compliance) on two sides (2) firewall. TLS issues are usually messy and tedious to work though. I'd recommend opening a PMR to facilitate debug data collection and analysis. A lot of gory detail may be needed to figure out what's going on. 
yuri "Simon Thompson (Research Computing - IT Services)" ---09/07/2016 05:37:11 AM---Hi All, I'm trying to get some multi cluster thing working between two of our GPFS From: "Simon Thompson (Research Computing - IT Services)" < S.J.Thompson at bham.ac.uk> To: "gpfsug-discuss at spectrumscale.org" , Date: 09/07/2016 05:37 AM Subject: [gpfsug-discuss] Remote cluster mount failing Sent by: gpfsug-discuss-bounces at spectrumscale.org Hi All, I'm trying to get some multi cluster thing working between two of our GPFS clusters. In the "client" cluster, when trying to mount the "remote" cluster, I get: # mmmount gpfs Wed 7 Sep 13:33:06 BST 2016: mmmount: Mounting file systems ... mount: mount /dev/gpfs on /gpfs failed: Connection timed out mmmount: Command failed. Examine previous error messages to determine cause. And in the log file: Wed Sep 7 13:33:07.481 2016: [N] The client side TLS handshake with node 10.0.0.182 was cancelled: connection reset by peer (return code 420). Wed Sep 7 13:33:07.486 2016: [N] The client side TLS handshake with node 10.0.0.181 was cancelled: connection reset by peer (return code 420). Wed Sep 7 13:33:07.487 2016: [E] Failed to join remote cluster GPFS_STORAGE.CLUSTER Wed Sep 7 13:33:07.488 2016: [W] Command: err 78: mount GPFS_STORAGE.CLUSTER:gpfs Wed Sep 7 13:33:07.489 2016: Connection timed out In the remote cluster, I see: Wed Sep 7 13:33:07.487 2016: [W] The TLS handshake with node 10.0.0.222 failed with error 447 (server side). Wed Sep 7 13:33:07.488 2016: [X] Connection from 10.10.0.35 refused, authentication failed Wed Sep 7 13:33:07.489 2016: [E] Killing connection from 10.10.0.35, err 703 Wed Sep 7 13:33:07.490 2016: Operation not permitted Weirdly though on other nodes in the client cluster this succeeds fine and can mount, so I think I got all the bits in the mmauth and mmremotecluster configured correctly. Any suggestions? Thanks Simon _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss [attachment "graycol.gif" deleted by Yuri L Volobuev/Austin/IBM] _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: not available Type: image/gif Size: 105 bytes Desc: not available URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: not available Type: image/gif Size: 105 bytes Desc: not available URL: From makaplan at us.ibm.com Mon Sep 12 15:48:56 2016 From: makaplan at us.ibm.com (Marc A Kaplan) Date: Mon, 12 Sep 2016 10:48:56 -0400 Subject: [gpfsug-discuss] GPFS Routers In-Reply-To: <712e5024-e6eb-9195-d9cd-f59a9b145e60@nasa.gov> References: <5F910253243E6A47B81A9A2EB424BBA101D137D1@NDMSMBX404.ndc.nasa.gov> <712e5024-e6eb-9195-d9cd-f59a9b145e60@nasa.gov> Message-ID: Perhaps if you clearly describe what equipment and connections you have in place and what you're trying to accomplish, someone on this board can propose a solution. In principle, it's always possible to insert proxies/routers to "fake" any two endpoints into "believing" they are communicating directly. 
From: Aaron Knister To: Date: 09/11/2016 08:01 PM Subject: Re: [gpfsug-discuss] GPFS Routers Sent by: gpfsug-discuss-bounces at spectrumscale.org After some googling around, I wonder if perhaps what I'm thinking of was an I/O forwarding layer that I understood was being developed for x86_64 type machines rather than some type of GPFS protocol router or proxy. -Aaron On 9/11/16 5:02 PM, Knister, Aaron S. (GSFC-606.2)[COMPUTER SCIENCE CORP] wrote: > Hi Everyone, > > A while back I seem to recall hearing about a mechanism being developed > that would function similarly to Lustre's LNET routers and effectively > allow a single set of NSD servers to talk to multiple RDMA fabrics > without requiring the NSD servers to have infiniband interfaces on each > RDMA fabric. Rather, one would have a set of GPFS gateway nodes on each > fabric that would in effect proxy the RDMA requests to the NSD server. > Does anyone know what I'm talking about? Just curious if it's still on > the roadmap. > > -Aaron > > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > -- Aaron Knister NASA Center for Climate Simulation (Code 606.2) Goddard Space Flight Center (301) 286-2776 _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss -------------- next part -------------- An HTML attachment was scrubbed... URL: From xhejtman at ics.muni.cz Mon Sep 12 15:57:55 2016 From: xhejtman at ics.muni.cz (Lukas Hejtmanek) Date: Mon, 12 Sep 2016 16:57:55 +0200 Subject: [gpfsug-discuss] gpfs 4.2.1 and samba export Message-ID: <20160912145755.xhx2du4c3aimkkxt@ics.muni.cz> Hello, I have GPFS version 4.2.1 on Centos 7.2 (kernel 3.10.0-327.22.2.el7.x86_64) and I have got some weird behavior of samba. Windows clients get stucked for almost 1 minute when copying files. I traced down the problematic syscall: 27887 16:39:28.000401 utimensat(AT_FDCWD, "000000-My_Documents/Windows/InfusedApps/Packages/Microsoft.Messaging_1.10.22012.0_x86__8wekyb3d8bbwe/SkypeApp/View/HomePage.xaml", {{1473691167, 940424000}, {1473691168, 295355}}, 0) = 0 <74.999775> [...] 27887 16:44:24.000310 utimensat(AT_FDCWD, "000000-My_Documents/Windows/InfusedApps/Packages/Microsoft.Windows.Photos_15.1001.16470.0_x64__8wekyb3d8bbwe/Assets/PhotosAppList.contrast-white_targetsize-16.png", {{1473691463, 931319000}, {1473691464, 96608}}, 0) = 0 <74.999841> [...] 27887 16:50:34.002274 utimensat(AT_FDCWD, "000000-My_Documents/Windows/InfusedApps/Packages/Microsoft.XboxApp_9.9.30030.0_x64__8wekyb3d8bbwe/_Resources/50.rsrc", {{1473691833, 952166000}, {1473691834, 2166223}}, 0) = 0 <74.997877> [...] 27887 16:53:11.000240 utimensat(AT_FDCWD, "000000-My_Documents/Windows/InfusedApps/Packages/Microsoft.ZuneVideo_3.6.13251.0_x64__8wekyb3d8bbwe/Styles/CommonBrushes.xbf", {{1473691990, 948668000}, {1473691991, 131221}}, 0) = 0 <74.999540> it seems that from time to time, utimensat(2) call takes over 70 (!!) seconds. Normal utimensat syscall looks like: 27887 16:55:16.238132 utimensat(AT_FDCWD, "000000-My_Documents/Windows/Installer/$PatchCache$/Managed/00004109210000000000000000F01FEC/14.0.7015/ACEODDBS.DLL", {{1473692116, 196458000}, {1351702318, 0}}, 0) = 0 <0.000065> At the same time, there is untar running. When samba freezes at utimensat call, untar continues to write data to GPFS (same fs as samba), so it does not seem to me as buffers flush. 
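For reference, per-call timings like the <74.999775> figures above can be captured with something along these lines (27887 is the smbd PID from the trace; the output path is arbitrary):

  strace -f -tt -T -e trace=utimensat -p 27887 -o /tmp/smbd-utimensat.trace

The -T flag prints the time spent inside each syscall, which is what exposes the ~75 second stalls.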
When the syscall is stucked, I/O utilization of all GPFS disks is below 10 %. mmfsadm dump waiters shows nothing waiting and any cluster node. So any ideas? Or should I just fire PMR? This is cluster config: clusterId 2745894253048382857 autoload no dmapiFileHandleSize 32 minReleaseLevel 4.2.1.0 ccrEnabled yes maxMBpS 20000 maxblocksize 8M cipherList AUTHONLY maxFilesToCache 10000 nsdSmallThreadRatio 1 nsdMaxWorkerThreads 480 ignorePrefetchLUNCount yes pagepool 48G prefetchThreads 320 worker1Threads 320 writebehindThreshhold 10485760 cifsBypassShareLocksOnRename yes cifsBypassTraversalChecking yes allowWriteWithDeleteChild yes adminMode central And this is file system config: flag value description ------------------- ------------------------ ----------------------------------- -f 65536 Minimum fragment size in bytes -i 4096 Inode size in bytes -I 32768 Indirect block size in bytes -m 1 Default number of metadata replicas -M 2 Maximum number of metadata replicas -r 1 Default number of data replicas -R 2 Maximum number of data replicas -j cluster Block allocation type -D nfs4 File locking semantics in effect -k all ACL semantics in effect -n 32 Estimated number of nodes that will mount file system -B 2097152 Block size -Q user;group;fileset Quotas accounting enabled user;group;fileset Quotas enforced none Default quotas enabled --perfileset-quota Yes Per-fileset quota enforcement --filesetdf Yes Fileset df enabled? -V 15.01 (4.2.0.0) File system version --create-time Wed Aug 24 17:38:40 2016 File system creation time -z No Is DMAPI enabled? -L 4194304 Logfile size -E Yes Exact mtime mount option -S No Suppress atime mount option -K whenpossible Strict replica allocation option --fastea Yes Fast external attributes enabled? --encryption No Encryption enabled? --inode-limit 402653184 Maximum number of inodes in all inode spaces --log-replicas 0 Number of log replicas --is4KAligned Yes is4KAligned? --rapid-repair Yes rapidRepair enabled? --write-cache-threshold 0 HAWC Threshold (max 65536) -P system Disk storage pools in file system -d nsd_A_m;nsd_B_m;nsd_C_m;nsd_D_m;nsd_A_LV1_d;nsd_A_LV2_d;nsd_A_LV3_d;nsd_A_LV4_d;nsd_B_LV1_d;nsd_B_LV2_d;nsd_B_LV3_d;nsd_B_LV4_d;nsd_C_LV1_d;nsd_C_LV2_d;nsd_C_LV3_d; -d nsd_C_LV4_d;nsd_D_LV1_d;nsd_D_LV2_d;nsd_D_LV3_d;nsd_D_LV4_d Disks in file system -A yes Automatic mount option -o none Additional mount options -T /gpfs/vol1 Default mount point --mount-priority 1 Mount priority -- Luk?? Hejtm?nek From chekh at stanford.edu Mon Sep 12 20:03:15 2016 From: chekh at stanford.edu (Alex Chekholko) Date: Mon, 12 Sep 2016 12:03:15 -0700 Subject: [gpfsug-discuss] big difference between output of 'mmlsquota' and 'du'? Message-ID: Hi, For a fileset with a quota on it, we have mmlsquota reporting 39TB utilization (out of 50TB quota), with 0 in_doubt. Running a 'du' on the same directory (where the fileset is junctioned) shows 21TB usage. I looked for sparse files (files that report different size via ls vs du). I looked at 'du --apparent-size ...' https://en.wikipedia.org/wiki/Sparse_file What else could it be? Is there some attribute I can scan for inside GPFS? Maybe where FILE_SIZE does not equal KB_ALLOCATED? 
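One way to do that scan is an mmapplypolicy list rule along these lines (a sketch: the rule name, list name, output prefix and the 1 GiB threshold are made up; the path is the fileset junction from the thread):

  RULE EXTERNAL LIST 'sizecheck' EXEC ''
  RULE 'mismatch' LIST 'sizecheck'
       SHOW(VARCHAR(KB_ALLOCATED) || ' KB allocated, ' || VARCHAR(FILE_SIZE) || ' bytes')
       WHERE (KB_ALLOCATED * 1024) > (FILE_SIZE + 1073741824)

  mmapplypolicy /srv/gsfs0/projects/gbsc -P sizecheck.pol -I defer -f /tmp/sizecheck

That should write out the files whose allocation exceeds their apparent size by more than 1 GiB to a list file under the /tmp/sizecheck prefix; flipping the comparison would catch the sparse-file direction instead.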
https://www.ibm.com/support/knowledgecenter/STXKQY_4.2.0/com.ibm.spectrum.scale.v4r2.adv.doc/bl1adv_usngfileattrbts.htm [root at scg-gs0 ~]# du -sm --apparent-size /srv/gsfs0/projects/gbsc/* 3977 /srv/gsfs0/projects/gbsc/Backups 1 /srv/gsfs0/projects/gbsc/benchmark 13109 /srv/gsfs0/projects/gbsc/Billing 198719 /srv/gsfs0/projects/gbsc/Clinical 1 /srv/gsfs0/projects/gbsc/Clinical_Vendors 1206523 /srv/gsfs0/projects/gbsc/Data 1 /srv/gsfs0/projects/gbsc/iPoP 123165 /srv/gsfs0/projects/gbsc/Macrogen 58676 /srv/gsfs0/projects/gbsc/Misc 6625890 /srv/gsfs0/projects/gbsc/mva 1 /srv/gsfs0/projects/gbsc/Proj 17 /srv/gsfs0/projects/gbsc/Projects 3290502 /srv/gsfs0/projects/gbsc/Resources 1 /srv/gsfs0/projects/gbsc/SeqCenter 1 /srv/gsfs0/projects/gbsc/share 514041 /srv/gsfs0/projects/gbsc/SNAP_Scoring 1 /srv/gsfs0/projects/gbsc/TCGA_Variants 267873 /srv/gsfs0/projects/gbsc/tools 9597797 /srv/gsfs0/projects/gbsc/workspace (adds up to about 21TB) [root at scg-gs0 ~]# mmlsquota -j projects.gbsc --block-size=G gsfs0 Block Limits | File Limits Filesystem type GB quota limit in_doubt grace | files quota limit in_doubt grace Remarks gsfs0 FILESET 39889 51200 51200 0 none | 1663212 0 0 4 none [root at scg-gs0 ~]# mmlsfileset gsfs0 |grep gbsc projects.gbsc Linked /srv/gsfs0/projects/gbsc Regards, -- Alex Chekholko chekh at stanford.edu From bbanister at jumptrading.com Mon Sep 12 20:06:59 2016 From: bbanister at jumptrading.com (Bryan Banister) Date: Mon, 12 Sep 2016 19:06:59 +0000 Subject: [gpfsug-discuss] big difference between output of 'mmlsquota' and 'du'? In-Reply-To: References: Message-ID: <21BC488F0AEA2245B2C3E83FC0B33DBB0632A645@CHI-EXCHANGEW1.w2k.jumptrading.com> I'd recommend running a mmcheckquota and then check mmlsquota again, -Bryan -----Original Message----- From: gpfsug-discuss-bounces at spectrumscale.org [mailto:gpfsug-discuss-bounces at spectrumscale.org] On Behalf Of Alex Chekholko Sent: Monday, September 12, 2016 2:03 PM To: gpfsug-discuss at spectrumscale.org Subject: [gpfsug-discuss] big difference between output of 'mmlsquota' and 'du'? Hi, For a fileset with a quota on it, we have mmlsquota reporting 39TB utilization (out of 50TB quota), with 0 in_doubt. Running a 'du' on the same directory (where the fileset is junctioned) shows 21TB usage. I looked for sparse files (files that report different size via ls vs du). I looked at 'du --apparent-size ...' https://en.wikipedia.org/wiki/Sparse_file What else could it be? Is there some attribute I can scan for inside GPFS? Maybe where FILE_SIZE does not equal KB_ALLOCATED? 
https://www.ibm.com/support/knowledgecenter/STXKQY_4.2.0/com.ibm.spectrum.scale.v4r2.adv.doc/bl1adv_usngfileattrbts.htm [root at scg-gs0 ~]# du -sm --apparent-size /srv/gsfs0/projects/gbsc/* 3977 /srv/gsfs0/projects/gbsc/Backups 1 /srv/gsfs0/projects/gbsc/benchmark 13109 /srv/gsfs0/projects/gbsc/Billing 198719 /srv/gsfs0/projects/gbsc/Clinical 1 /srv/gsfs0/projects/gbsc/Clinical_Vendors 1206523 /srv/gsfs0/projects/gbsc/Data 1 /srv/gsfs0/projects/gbsc/iPoP 123165 /srv/gsfs0/projects/gbsc/Macrogen 58676 /srv/gsfs0/projects/gbsc/Misc 6625890 /srv/gsfs0/projects/gbsc/mva 1 /srv/gsfs0/projects/gbsc/Proj 17 /srv/gsfs0/projects/gbsc/Projects 3290502 /srv/gsfs0/projects/gbsc/Resources 1 /srv/gsfs0/projects/gbsc/SeqCenter 1 /srv/gsfs0/projects/gbsc/share 514041 /srv/gsfs0/projects/gbsc/SNAP_Scoring 1 /srv/gsfs0/projects/gbsc/TCGA_Variants 267873 /srv/gsfs0/projects/gbsc/tools 9597797 /srv/gsfs0/projects/gbsc/workspace (adds up to about 21TB) [root at scg-gs0 ~]# mmlsquota -j projects.gbsc --block-size=G gsfs0 Block Limits | File Limits Filesystem type GB quota limit in_doubt grace | files quota limit in_doubt grace Remarks gsfs0 FILESET 39889 51200 51200 0 none | 1663212 0 0 4 none [root at scg-gs0 ~]# mmlsfileset gsfs0 |grep gbsc projects.gbsc Linked /srv/gsfs0/projects/gbsc Regards, -- Alex Chekholko chekh at stanford.edu _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss ________________________________ Note: This email is for the confidential use of the named addressee(s) only and may contain proprietary, confidential or privileged information. If you are not the intended recipient, you are hereby notified that any review, dissemination or copying of this email is strictly prohibited, and to please notify the sender immediately and destroy this email and any attachments. Email transmission cannot be guaranteed to be secure or error-free. The Company, therefore, does not make any guarantees as to the completeness or accuracy of this email or any attachments. This email is for informational purposes only and does not constitute a recommendation, offer, request or solicitation of any kind to buy, sell, subscribe, redeem or perform any type of transaction of a financial product. From Kevin.Buterbaugh at Vanderbilt.Edu Mon Sep 12 20:08:28 2016 From: Kevin.Buterbaugh at Vanderbilt.Edu (Buterbaugh, Kevin L) Date: Mon, 12 Sep 2016 19:08:28 +0000 Subject: [gpfsug-discuss] big difference between output of 'mmlsquota' and 'du'? In-Reply-To: References: Message-ID: Hi Alex, While the numbers don?t match exactly, they?re close enough to prompt me to ask if data replication is possibly set to two? Thanks? Kevin On Sep 12, 2016, at 2:03 PM, Alex Chekholko > wrote: Hi, For a fileset with a quota on it, we have mmlsquota reporting 39TB utilization (out of 50TB quota), with 0 in_doubt. Running a 'du' on the same directory (where the fileset is junctioned) shows 21TB usage. I looked for sparse files (files that report different size via ls vs du). I looked at 'du --apparent-size ...' https://en.wikipedia.org/wiki/Sparse_file What else could it be? Is there some attribute I can scan for inside GPFS? Maybe where FILE_SIZE does not equal KB_ALLOCATED? 
https://www.ibm.com/support/knowledgecenter/STXKQY_4.2.0/com.ibm.spectrum.scale.v4r2.adv.doc/bl1adv_usngfileattrbts.htm [root at scg-gs0 ~]# du -sm --apparent-size /srv/gsfs0/projects/gbsc/* 3977 /srv/gsfs0/projects/gbsc/Backups 1 /srv/gsfs0/projects/gbsc/benchmark 13109 /srv/gsfs0/projects/gbsc/Billing 198719 /srv/gsfs0/projects/gbsc/Clinical 1 /srv/gsfs0/projects/gbsc/Clinical_Vendors 1206523 /srv/gsfs0/projects/gbsc/Data 1 /srv/gsfs0/projects/gbsc/iPoP 123165 /srv/gsfs0/projects/gbsc/Macrogen 58676 /srv/gsfs0/projects/gbsc/Misc 6625890 /srv/gsfs0/projects/gbsc/mva 1 /srv/gsfs0/projects/gbsc/Proj 17 /srv/gsfs0/projects/gbsc/Projects 3290502 /srv/gsfs0/projects/gbsc/Resources 1 /srv/gsfs0/projects/gbsc/SeqCenter 1 /srv/gsfs0/projects/gbsc/share 514041 /srv/gsfs0/projects/gbsc/SNAP_Scoring 1 /srv/gsfs0/projects/gbsc/TCGA_Variants 267873 /srv/gsfs0/projects/gbsc/tools 9597797 /srv/gsfs0/projects/gbsc/workspace (adds up to about 21TB) [root at scg-gs0 ~]# mmlsquota -j projects.gbsc --block-size=G gsfs0 Block Limits | File Limits Filesystem type GB quota limit in_doubt grace | files quota limit in_doubt grace Remarks gsfs0 FILESET 39889 51200 51200 0 none | 1663212 0 0 4 none [root at scg-gs0 ~]# mmlsfileset gsfs0 |grep gbsc projects.gbsc Linked /srv/gsfs0/projects/gbsc Regards, -- Alex Chekholko chekh at stanford.edu _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss ? Kevin Buterbaugh - Senior System Administrator Vanderbilt University - Advanced Computing Center for Research and Education Kevin.Buterbaugh at vanderbilt.edu - (615)875-9633 -------------- next part -------------- An HTML attachment was scrubbed... URL: From r.sobey at imperial.ac.uk Mon Sep 12 21:26:51 2016 From: r.sobey at imperial.ac.uk (Sobey, Richard A) Date: Mon, 12 Sep 2016 20:26:51 +0000 Subject: [gpfsug-discuss] big difference between output of 'mmlsquota' and 'du'? In-Reply-To: References: Message-ID: My thoughts exactly. Richard From: gpfsug-discuss-bounces at spectrumscale.org [mailto:gpfsug-discuss-bounces at spectrumscale.org] On Behalf Of Buterbaugh, Kevin L Sent: 12 September 2016 20:08 To: gpfsug main discussion list Subject: Re: [gpfsug-discuss] big difference between output of 'mmlsquota' and 'du'? Hi Alex, While the numbers don?t match exactly, they?re close enough to prompt me to ask if data replication is possibly set to two? Thanks? Kevin On Sep 12, 2016, at 2:03 PM, Alex Chekholko > wrote: Hi, For a fileset with a quota on it, we have mmlsquota reporting 39TB utilization (out of 50TB quota), with 0 in_doubt. Running a 'du' on the same directory (where the fileset is junctioned) shows 21TB usage. I looked for sparse files (files that report different size via ls vs du). I looked at 'du --apparent-size ...' https://en.wikipedia.org/wiki/Sparse_file What else could it be? Is there some attribute I can scan for inside GPFS? Maybe where FILE_SIZE does not equal KB_ALLOCATED? 
https://www.ibm.com/support/knowledgecenter/STXKQY_4.2.0/com.ibm.spectrum.scale.v4r2.adv.doc/bl1adv_usngfileattrbts.htm [root at scg-gs0 ~]# du -sm --apparent-size /srv/gsfs0/projects/gbsc/* 3977 /srv/gsfs0/projects/gbsc/Backups 1 /srv/gsfs0/projects/gbsc/benchmark 13109 /srv/gsfs0/projects/gbsc/Billing 198719 /srv/gsfs0/projects/gbsc/Clinical 1 /srv/gsfs0/projects/gbsc/Clinical_Vendors 1206523 /srv/gsfs0/projects/gbsc/Data 1 /srv/gsfs0/projects/gbsc/iPoP 123165 /srv/gsfs0/projects/gbsc/Macrogen 58676 /srv/gsfs0/projects/gbsc/Misc 6625890 /srv/gsfs0/projects/gbsc/mva 1 /srv/gsfs0/projects/gbsc/Proj 17 /srv/gsfs0/projects/gbsc/Projects 3290502 /srv/gsfs0/projects/gbsc/Resources 1 /srv/gsfs0/projects/gbsc/SeqCenter 1 /srv/gsfs0/projects/gbsc/share 514041 /srv/gsfs0/projects/gbsc/SNAP_Scoring 1 /srv/gsfs0/projects/gbsc/TCGA_Variants 267873 /srv/gsfs0/projects/gbsc/tools 9597797 /srv/gsfs0/projects/gbsc/workspace (adds up to about 21TB) [root at scg-gs0 ~]# mmlsquota -j projects.gbsc --block-size=G gsfs0 Block Limits | File Limits Filesystem type GB quota limit in_doubt grace | files quota limit in_doubt grace Remarks gsfs0 FILESET 39889 51200 51200 0 none | 1663212 0 0 4 none [root at scg-gs0 ~]# mmlsfileset gsfs0 |grep gbsc projects.gbsc Linked /srv/gsfs0/projects/gbsc Regards, -- Alex Chekholko chekh at stanford.edu _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss ? Kevin Buterbaugh - Senior System Administrator Vanderbilt University - Advanced Computing Center for Research and Education Kevin.Buterbaugh at vanderbilt.edu - (615)875-9633 -------------- next part -------------- An HTML attachment was scrubbed... URL: From laurence at qsplace.co.uk Mon Sep 12 21:46:55 2016 From: laurence at qsplace.co.uk (Laurence Horrocks-Barlow) Date: Mon, 12 Sep 2016 21:46:55 +0100 Subject: [gpfsug-discuss] big difference between output of 'mmlsquota' and 'du'? In-Reply-To: References: Message-ID: <2C38B1C8-66DB-45C6-AA5D-E612F5BFE935@qsplace.co.uk> However replicated files should show up with ls as taking about double the space. I.e. "ls -lash" 49G -r-------- 1 root root 25G Sep 12 21:11 Somefile I know you've said you checked ls vs du for allocated space it might be worth a double check. Also check that you haven't got a load of snapshots, especially if you have high file churn which will create new blocks; although with your figures it'd have to be very high file churn. -- Lauz On 12 September 2016 21:26:51 BST, "Sobey, Richard A" wrote: >My thoughts exactly. > >Richard > >From: gpfsug-discuss-bounces at spectrumscale.org >[mailto:gpfsug-discuss-bounces at spectrumscale.org] On Behalf Of >Buterbaugh, Kevin L >Sent: 12 September 2016 20:08 >To: gpfsug main discussion list >Subject: Re: [gpfsug-discuss] big difference between output of >'mmlsquota' and 'du'? > >Hi Alex, > >While the numbers don?t match exactly, they?re close enough to prompt >me to ask if data replication is possibly set to two? Thanks? > >Kevin > >On Sep 12, 2016, at 2:03 PM, Alex Chekholko >> wrote: > >Hi, > >For a fileset with a quota on it, we have mmlsquota reporting 39TB >utilization (out of 50TB quota), with 0 in_doubt. > >Running a 'du' on the same directory (where the fileset is junctioned) >shows 21TB usage. > >I looked for sparse files (files that report different size via ls vs >du). I looked at 'du --apparent-size ...' > >https://en.wikipedia.org/wiki/Sparse_file > >What else could it be? 
> >Is there some attribute I can scan for inside GPFS? >Maybe where FILE_SIZE does not equal KB_ALLOCATED? >https://www.ibm.com/support/knowledgecenter/STXKQY_4.2.0/com.ibm.spectrum.scale.v4r2.adv.doc/bl1adv_usngfileattrbts.htm > > >[root at scg-gs0 ~]# du -sm --apparent-size /srv/gsfs0/projects/gbsc/* >3977 /srv/gsfs0/projects/gbsc/Backups >1 /srv/gsfs0/projects/gbsc/benchmark >13109 /srv/gsfs0/projects/gbsc/Billing >198719 /srv/gsfs0/projects/gbsc/Clinical >1 /srv/gsfs0/projects/gbsc/Clinical_Vendors >1206523 /srv/gsfs0/projects/gbsc/Data >1 /srv/gsfs0/projects/gbsc/iPoP >123165 /srv/gsfs0/projects/gbsc/Macrogen >58676 /srv/gsfs0/projects/gbsc/Misc >6625890 /srv/gsfs0/projects/gbsc/mva >1 /srv/gsfs0/projects/gbsc/Proj >17 /srv/gsfs0/projects/gbsc/Projects >3290502 /srv/gsfs0/projects/gbsc/Resources >1 /srv/gsfs0/projects/gbsc/SeqCenter >1 /srv/gsfs0/projects/gbsc/share >514041 /srv/gsfs0/projects/gbsc/SNAP_Scoring >1 /srv/gsfs0/projects/gbsc/TCGA_Variants >267873 /srv/gsfs0/projects/gbsc/tools >9597797 /srv/gsfs0/projects/gbsc/workspace > >(adds up to about 21TB) > >[root at scg-gs0 ~]# mmlsquota -j projects.gbsc --block-size=G gsfs0 > Block Limits | File Limits >Filesystem type GB quota limit in_doubt >grace | files quota limit in_doubt grace Remarks >gsfs0 FILESET 39889 51200 51200 0 >none | 1663212 0 0 4 none > > >[root at scg-gs0 ~]# mmlsfileset gsfs0 |grep gbsc >projects.gbsc Linked /srv/gsfs0/projects/gbsc > >Regards, >-- >Alex Chekholko chekh at stanford.edu > >_______________________________________________ >gpfsug-discuss mailing list >gpfsug-discuss at spectrumscale.org >http://gpfsug.org/mailman/listinfo/gpfsug-discuss > >? >Kevin Buterbaugh - Senior System Administrator >Vanderbilt University - Advanced Computing Center for Research and >Education >Kevin.Buterbaugh at vanderbilt.edu >- (615)875-9633 > > > > > >------------------------------------------------------------------------ > >_______________________________________________ >gpfsug-discuss mailing list >gpfsug-discuss at spectrumscale.org >http://gpfsug.org/mailman/listinfo/gpfsug-discuss -- Sent from my Android device with K-9 Mail. Please excuse my brevity. -------------- next part -------------- An HTML attachment was scrubbed... URL: From janfrode at tanso.net Mon Sep 12 22:37:08 2016 From: janfrode at tanso.net (Jan-Frode Myklebust) Date: Mon, 12 Sep 2016 21:37:08 +0000 Subject: [gpfsug-discuss] big difference between output of 'mmlsquota' and 'du'? In-Reply-To: References: Message-ID: Maybe you have a huge file open, that's been unlinked and still growing? -jf -------------- next part -------------- An HTML attachment was scrubbed... URL: From volobuev at us.ibm.com Mon Sep 12 22:59:36 2016 From: volobuev at us.ibm.com (Yuri L Volobuev) Date: Mon, 12 Sep 2016 14:59:36 -0700 Subject: [gpfsug-discuss] big difference between output of 'mmlsquota' and'du'? In-Reply-To: References: Message-ID: 'du' tallies up 'blocks allocated', not file sizes. So it shouldn't matter whether any sparse files are present. GPFS doesn't charge quota for data in snapshots (whether it should is a separate question). The observed discrepancy has two plausible causes: 1) Inaccuracy in quota accounting (more likely) 2) Artefacts of data replication (less likely) Running mmcheckquota in this situation would be a good idea. yuri From: Alex Chekholko To: gpfsug-discuss at spectrumscale.org, Date: 09/12/2016 12:04 PM Subject: [gpfsug-discuss] big difference between output of 'mmlsquota' and 'du'? 
Sent by: gpfsug-discuss-bounces at spectrumscale.org Hi, For a fileset with a quota on it, we have mmlsquota reporting 39TB utilization (out of 50TB quota), with 0 in_doubt. Running a 'du' on the same directory (where the fileset is junctioned) shows 21TB usage. I looked for sparse files (files that report different size via ls vs du). I looked at 'du --apparent-size ...' https://en.wikipedia.org/wiki/Sparse_file What else could it be? Is there some attribute I can scan for inside GPFS? Maybe where FILE_SIZE does not equal KB_ALLOCATED? https://www.ibm.com/support/knowledgecenter/STXKQY_4.2.0/com.ibm.spectrum.scale.v4r2.adv.doc/bl1adv_usngfileattrbts.htm [root at scg-gs0 ~]# du -sm --apparent-size /srv/gsfs0/projects/gbsc/* 3977 /srv/gsfs0/projects/gbsc/Backups 1 /srv/gsfs0/projects/gbsc/benchmark 13109 /srv/gsfs0/projects/gbsc/Billing 198719 /srv/gsfs0/projects/gbsc/Clinical 1 /srv/gsfs0/projects/gbsc/Clinical_Vendors 1206523 /srv/gsfs0/projects/gbsc/Data 1 /srv/gsfs0/projects/gbsc/iPoP 123165 /srv/gsfs0/projects/gbsc/Macrogen 58676 /srv/gsfs0/projects/gbsc/Misc 6625890 /srv/gsfs0/projects/gbsc/mva 1 /srv/gsfs0/projects/gbsc/Proj 17 /srv/gsfs0/projects/gbsc/Projects 3290502 /srv/gsfs0/projects/gbsc/Resources 1 /srv/gsfs0/projects/gbsc/SeqCenter 1 /srv/gsfs0/projects/gbsc/share 514041 /srv/gsfs0/projects/gbsc/SNAP_Scoring 1 /srv/gsfs0/projects/gbsc/TCGA_Variants 267873 /srv/gsfs0/projects/gbsc/tools 9597797 /srv/gsfs0/projects/gbsc/workspace (adds up to about 21TB) [root at scg-gs0 ~]# mmlsquota -j projects.gbsc --block-size=G gsfs0 Block Limits | File Limits Filesystem type GB quota limit in_doubt grace | files quota limit in_doubt grace Remarks gsfs0 FILESET 39889 51200 51200 0 none | 1663212 0 0 4 none [root at scg-gs0 ~]# mmlsfileset gsfs0 |grep gbsc projects.gbsc Linked /srv/gsfs0/projects/gbsc Regards, -- Alex Chekholko chekh at stanford.edu _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: graycol.gif Type: image/gif Size: 105 bytes Desc: not available URL: From chekh at stanford.edu Mon Sep 12 23:11:12 2016 From: chekh at stanford.edu (Alex Chekholko) Date: Mon, 12 Sep 2016 15:11:12 -0700 Subject: [gpfsug-discuss] big difference between output of 'mmlsquota' and 'du'? In-Reply-To: References: Message-ID: Thanks for all the responses. I will look through the filesystem clients for open file handles; we have definitely had deleted open log files of multi-TB size before. The filesystem has replication set to 1. We don't use snapshots. I'm running a 'mmrestripefs -r' (some files were ill-placed from aborted pool migrations) and then I will run an 'mmcheckquota'. On 9/12/16 2:37 PM, Jan-Frode Myklebust wrote: > Maybe you have a huge file open, that's been unlinked and still growing? 
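(A rough, untested sketch of that kind of scan -- the fileset name, filesystem name and scratch paths below are assumptions. A policy LIST rule can print KB_ALLOCATED next to FILE_SIZE for everything in the fileset, so the allocated total can be compared against what the quota subsystem reports; the last line is one way to check the unlinked-but-still-open theory on a client.)

cat > /tmp/alloc.pol <<'EOF'
RULE EXTERNAL LIST 'alloc' EXEC ''
RULE 'listalloc' LIST 'alloc'
     SHOW(varchar(KB_ALLOCATED) || ' ' || varchar(FILE_SIZE))
     FOR FILESET ('projects.gbsc')
EOF
mmapplypolicy gsfs0 -P /tmp/alloc.pol -I defer -f /tmp/scan
# assumed layout of the generated list: inode gen snapid KB_ALLOCATED FILE_SIZE -- path
awk '{kb += $4} END {printf "%.1f TiB allocated\n", kb/2^30}' /tmp/scan.list.alloc
# on each client, large files that are unlinked but still held open:
lsof +L1 /srv/gsfs0 2>/dev/null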
> > > > -jf > > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > -- Alex Chekholko chekh at stanford.edu From xhejtman at ics.muni.cz Mon Sep 12 23:30:19 2016 From: xhejtman at ics.muni.cz (Lukas Hejtmanek) Date: Tue, 13 Sep 2016 00:30:19 +0200 Subject: [gpfsug-discuss] gpfs snapshots Message-ID: <20160912223019.s665s3ccoltdkq3l@ics.muni.cz> Hello, using gpfs 4.2.1, I do about 60 snapshots per day (one snapshot per 15 minutes during working hours). It seems that snapid is increasing only number. Should I be fine with such a number of snapshots per day? I guess we could reach snapid 100,000. I remove all these snapshots during night so I do not keep huge number of snapshots. -- Luk?? Hejtm?nek From volobuev at us.ibm.com Mon Sep 12 23:42:00 2016 From: volobuev at us.ibm.com (Yuri L Volobuev) Date: Mon, 12 Sep 2016 15:42:00 -0700 Subject: [gpfsug-discuss] gpfs snapshots In-Reply-To: <20160912223019.s665s3ccoltdkq3l@ics.muni.cz> References: <20160912223019.s665s3ccoltdkq3l@ics.muni.cz> Message-ID: The increasing value of snapId is not a problem. Creating snapshots every 15 min is somewhat more frequent than what is customary, but as long as you're able to delete filesets at the same rate you're creating them, this should work OK. yuri From: Lukas Hejtmanek To: gpfsug-discuss at spectrumscale.org, Date: 09/12/2016 03:30 PM Subject: [gpfsug-discuss] gpfs snapshots Sent by: gpfsug-discuss-bounces at spectrumscale.org Hello, using gpfs 4.2.1, I do about 60 snapshots per day (one snapshot per 15 minutes during working hours). It seems that snapid is increasing only number. Should I be fine with such a number of snapshots per day? I guess we could reach snapid 100,000. I remove all these snapshots during night so I do not keep huge number of snapshots. -- Luk?? Hejtm?nek _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: graycol.gif Type: image/gif Size: 105 bytes Desc: not available URL: From r.sobey at imperial.ac.uk Tue Sep 13 04:19:30 2016 From: r.sobey at imperial.ac.uk (Sobey, Richard A) Date: Tue, 13 Sep 2016 03:19:30 +0000 Subject: [gpfsug-discuss] gpfs snapshots In-Reply-To: <20160912223019.s665s3ccoltdkq3l@ics.muni.cz> References: <20160912223019.s665s3ccoltdkq3l@ics.muni.cz> Message-ID: Don't worry. We do 400+ snapshots every 4 hours and that number is only getting bigger. Don't know what our current snapid count is mind you, can find out when in the office. Get Outlook for Android On Mon, Sep 12, 2016 at 11:30 PM +0100, "Lukas Hejtmanek" > wrote: Hello, using gpfs 4.2.1, I do about 60 snapshots per day (one snapshot per 15 minutes during working hours). It seems that snapid is increasing only number. Should I be fine with such a number of snapshots per day? I guess we could reach snapid 100,000. I remove all these snapshots during night so I do not keep huge number of snapshots. -- Luk?? Hejtm?nek _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From laurence at qsplace.co.uk Tue Sep 13 05:06:42 2016 From: laurence at qsplace.co.uk (Laurence Horrocks-Barlow) Date: Tue, 13 Sep 2016 05:06:42 +0100 Subject: [gpfsug-discuss] gpfs snapshots In-Reply-To: References: <20160912223019.s665s3ccoltdkq3l@ics.muni.cz> Message-ID: <7EAC0DD4-6FC1-4DF5-825E-9E2DD966BA4E@qsplace.co.uk> There are many people doing the same thing so nothing to worry about. As your using 4.2.1 you can at least bulk delete the snapshots using a comma separated list, making life just that little bit easier. -- Lauz On 13 September 2016 04:19:30 BST, "Sobey, Richard A" wrote: >Don't worry. We do 400+ snapshots every 4 hours and that number is only >getting bigger. Don't know what our current snapid count is mind you, >can find out when in the office. > >Get Outlook for Android > > > >On Mon, Sep 12, 2016 at 11:30 PM +0100, "Lukas Hejtmanek" >> wrote: > >Hello, > >using gpfs 4.2.1, I do about 60 snapshots per day (one snapshot per 15 >minutes >during working hours). It seems that snapid is increasing only number. >Should >I be fine with such a number of snapshots per day? I guess we could >reach >snapid 100,000. I remove all these snapshots during night so I do not >keep >huge number of snapshots. > >-- >Luk?? Hejtm?nek >_______________________________________________ >gpfsug-discuss mailing list >gpfsug-discuss at spectrumscale.org >http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > >------------------------------------------------------------------------ > >_______________________________________________ >gpfsug-discuss mailing list >gpfsug-discuss at spectrumscale.org >http://gpfsug.org/mailman/listinfo/gpfsug-discuss -- Sent from my Android device with K-9 Mail. Please excuse my brevity. -------------- next part -------------- An HTML attachment was scrubbed... URL: From Valdis.Kletnieks at vt.edu Tue Sep 13 05:32:24 2016 From: Valdis.Kletnieks at vt.edu (Valdis.Kletnieks at vt.edu) Date: Tue, 13 Sep 2016 00:32:24 -0400 Subject: [gpfsug-discuss] gpfs snapshots In-Reply-To: <20160912223019.s665s3ccoltdkq3l@ics.muni.cz> References: <20160912223019.s665s3ccoltdkq3l@ics.muni.cz> Message-ID: <20635.1473741144@turing-police.cc.vt.edu> On Tue, 13 Sep 2016 00:30:19 +0200, Lukas Hejtmanek said: > I guess we could reach snapid 100,000. It probably stores the snap ID as a 32 or 64 bit int, so 100K is peanuts. What you *do* want to do is make the snap *name* meaningful, using a timestamp or something to keep your sanity. mmcrsnapshot fs923 `date +%y%m%d-%H%M` or similar. From jtucker at pixitmedia.com Tue Sep 13 10:10:02 2016 From: jtucker at pixitmedia.com (Jez Tucker) Date: Tue, 13 Sep 2016 10:10:02 +0100 Subject: [gpfsug-discuss] gpfs snapshots In-Reply-To: <20635.1473741144@turing-police.cc.vt.edu> References: <20160912223019.s665s3ccoltdkq3l@ics.muni.cz> <20635.1473741144@turing-police.cc.vt.edu> Message-ID: <00e74006-8f99-280f-8832-1cee019c4b07@pixitmedia.com> Hey Yuri, Perhaps an RFE here, but could I suggest there is much value in adding a -c option to mmcrsnapshot? Use cases: mmcrsnapshot myfsname @GMT-2016.09.13-10.00.00 -j myfilesetname -c "Before phase 2" and mmcrsnapshot myfsname @GMT-2016.09.13-10.00.00 -j myfilesetname -c "expire:GMT-2017.04.21-16.00.00" Ideally also: mmcrsnapshot fs1 fset1:snapA:expirestr,fset2:snapB:expirestr,fset3:snapC:expirestr Then it's easy to iterate over snapshots and subsequently mmdelsnapshot snaps which are no longer required. 
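(A minimal sketch of that iterate-and-expire loop, keyed purely off timestamped snapshot names rather than the proposed comment field -- the filesystem name and retention window are assumptions, and fileset-level snapshots would need the fileset qualifier added to mmdelsnapshot.)

#!/bin/bash
FS=mmfs1
KEEP_DAYS=7
# cutoff in the same fixed-width, year-first format as @GMT- snapshot names,
# so a plain string comparison is enough
cutoff=$(date -u -d "$KEEP_DAYS days ago" +%Y.%m.%d-%H.%M.%S)
/usr/lpp/mmfs/bin/mmlssnapshot $FS | awk '/^@GMT-/ {print $1}' | while read snap; do
    stamp=${snap#@GMT-}
    if [[ "$stamp" < "$cutoff" ]]; then
        /usr/lpp/mmfs/bin/mmdelsnapshot $FS "$snap"
    fi
done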
There are lots of methods to achieve this, but without external databases / suchlike, this is rather simple and effective for end users. Alternatively a second comment like -expire flag as user metadata may be preferential. Thoughts? Jez On 13/09/16 05:32, Valdis.Kletnieks at vt.edu wrote: > On Tue, 13 Sep 2016 00:30:19 +0200, Lukas Hejtmanek said: >> I guess we could reach snapid 100,000. > It probably stores the snap ID as a 32 or 64 bit int, so 100K is peanuts. > > What you *do* want to do is make the snap *name* meaningful, using > a timestamp or something to keep your sanity. > > mmcrsnapshot fs923 `date +%y%m%d-%H%M` or similar. > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss -- Jez Tucker Head of Research & Product Development Pixit Media www.pixitmedia.com -- This email is confidential in that it is intended for the exclusive attention of the addressee(s) indicated. If you are not the intended recipient, this email should not be read or disclosed to any other person. Please notify the sender immediately and delete this email from your computer system. Any opinions expressed are not necessarily those of the company from which this email was sent and, whilst to the best of our knowledge no viruses or defects exist, no responsibility can be accepted for any loss or damage arising from its receipt or subsequent use of this email. -------------- next part -------------- An HTML attachment was scrubbed... URL: From volobuev at us.ibm.com Tue Sep 13 21:51:16 2016 From: volobuev at us.ibm.com (Yuri L Volobuev) Date: Tue, 13 Sep 2016 13:51:16 -0700 Subject: [gpfsug-discuss] gpfs snapshots In-Reply-To: <00e74006-8f99-280f-8832-1cee019c4b07@pixitmedia.com> References: <20160912223019.s665s3ccoltdkq3l@ics.muni.cz><20635.1473741144@turing-police.cc.vt.edu> <00e74006-8f99-280f-8832-1cee019c4b07@pixitmedia.com> Message-ID: Hi Jez, It sounds to me like the functionality that you're _really_ looking for is an ability to to do automated snapshot management, similar to what's available on other storage systems. For example, "create a new snapshot of filesets X, Y, Z every 30 min, keep the last 16 snapshots". I've seen many examples of sysadmins rolling their own snapshot management system along those lines, and an ability to add an expiration string as a snapshot "comment" appears to be merely an aid in keeping such DIY snapshot management scripts a bit simpler -- not by much though. The end user would still be on the hook for some heavy lifting, in particular figuring out a way to run an equivalent of a cluster-aware cron with acceptable fault tolerance semantics. That is, if a snapshot creation is scheduled, only one node in the cluster should attempt to create the snapshot, but if that node fails, another node needs to step in (as opposed to skipping the scheduled snapshot creation). This is doable outside of GPFS, of course, but is not trivial. Architecturally, the right place to implement a fault-tolerant cluster-aware scheduling framework is GPFS itself, as the most complex pieces are already there. We have some plans for work along those lines, but if you want to reinforce the point with an RFE, that would be fine, too. yuri From: Jez Tucker To: gpfsug-discuss at spectrumscale.org, Date: 09/13/2016 02:10 AM Subject: Re: [gpfsug-discuss] gpfs snapshots Sent by: gpfsug-discuss-bounces at spectrumscale.org Hey Yuri, ? 
Perhaps an RFE here, but could I suggest there is much value in adding a -c option to mmcrsnapshot? Use cases: mmcrsnapshot myfsname @GMT-2016.09.13-10.00.00 -j myfilesetname -c "Before phase 2" and mmcrsnapshot myfsname @GMT-2016.09.13-10.00.00 -j myfilesetname -c "expire:GMT-2017.04.21-16.00.00" Ideally also: mmcrsnapshot fs1 fset1:snapA:expirestr,fset2:snapB:expirestr,fset3:snapC:expirestr Then it's easy to iterate over snapshots and subsequently mmdelsnapshot snaps which are no longer required. There are lots of methods to achieve this, but without external databases / suchlike, this is rather simple and effective for end users. Alternatively a second comment like -expire flag as user metadata may be preferential. Thoughts? Jez On 13/09/16 05:32, Valdis.Kletnieks at vt.edu wrote: On Tue, 13 Sep 2016 00:30:19 +0200, Lukas Hejtmanek said: I guess we could reach snapid 100,000. It probably stores the snap ID as a 32 or 64 bit int, so 100K is peanuts. What you *do* want to do is make the snap *name* meaningful, using a timestamp or something to keep your sanity. mmcrsnapshot fs923 `date +%y%m%d-%H%M` or similar. _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss -- Jez Tucker Head of Research & Product Development Pixit Media www.pixitmedia.com This email is confidential in that it is intended for the exclusive attention of the addressee(s) indicated. If you are not the intended recipient, this email should not be read or disclosed to any other person. Please notify the sender immediately and delete this email from your computer system. Any opinions expressed are not necessarily those of the company from which this email was sent and, whilst to the best of our knowledge no viruses or defects exist, no responsibility can be accepted for any loss or damage arising from its receipt or subsequent use of this email._______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: graycol.gif Type: image/gif Size: 105 bytes Desc: not available URL: From xhejtman at ics.muni.cz Tue Sep 13 21:57:52 2016 From: xhejtman at ics.muni.cz (Lukas Hejtmanek) Date: Tue, 13 Sep 2016 22:57:52 +0200 Subject: [gpfsug-discuss] gpfs snapshots In-Reply-To: References: <20160912223019.s665s3ccoltdkq3l@ics.muni.cz> Message-ID: <20160913205752.3lmmfbhm25mu77j4@ics.muni.cz> Yuri et al. thank you for answers, I should be fine with snapshots as you suggest. On Mon, Sep 12, 2016 at 03:42:00PM -0700, Yuri L Volobuev wrote: > The increasing value of snapId is not a problem. Creating snapshots every > 15 min is somewhat more frequent than what is customary, but as long as > you're able to delete filesets at the same rate you're creating them, this > should work OK. > > yuri > > > > From: Lukas Hejtmanek > To: gpfsug-discuss at spectrumscale.org, > Date: 09/12/2016 03:30 PM > Subject: [gpfsug-discuss] gpfs snapshots > Sent by: gpfsug-discuss-bounces at spectrumscale.org > > > > Hello, > > using gpfs 4.2.1, I do about 60 snapshots per day (one snapshot per 15 > minutes > during working hours). It seems that snapid is increasing only number. > Should > I be fine with such a number of snapshots per day? I guess we could reach > snapid 100,000. 
I remove all these snapshots during night so I do not keep > huge number of snapshots. > > -- > Luk?? Hejtm?nek > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss -- Luk?? Hejtm?nek From S.J.Thompson at bham.ac.uk Tue Sep 13 22:21:59 2016 From: S.J.Thompson at bham.ac.uk (Simon Thompson (Research Computing - IT Services)) Date: Tue, 13 Sep 2016 21:21:59 +0000 Subject: [gpfsug-discuss] gpfs snapshots In-Reply-To: References: <20160912223019.s665s3ccoltdkq3l@ics.muni.cz><20635.1473741144@turing-police.cc.vt.edu> <00e74006-8f99-280f-8832-1cee019c4b07@pixitmedia.com>, Message-ID: I thought the GUI implemented some form of snapshot scheduler. Personal opinion is that is the wrong place and I agree that is should be core functionality to ensure that the scheduler is running properly. But I would suggest that it might be more than just snapshots people might want to schedule. E.g. An ilm pool flush. Simon ________________________________________ From: gpfsug-discuss-bounces at spectrumscale.org [gpfsug-discuss-bounces at spectrumscale.org] on behalf of Yuri L Volobuev [volobuev at us.ibm.com] Sent: 13 September 2016 21:51 To: gpfsug main discussion list Subject: Re: [gpfsug-discuss] gpfs snapshots Hi Jez, It sounds to me like the functionality that you're _really_ looking for is an ability to to do automated snapshot management, similar to what's available on other storage systems. For example, "create a new snapshot of filesets X, Y, Z every 30 min, keep the last 16 snapshots". I've seen many examples of sysadmins rolling their own snapshot management system along those lines, and an ability to add an expiration string as a snapshot "comment" appears to be merely an aid in keeping such DIY snapshot management scripts a bit simpler -- not by much though. The end user would still be on the hook for some heavy lifting, in particular figuring out a way to run an equivalent of a cluster-aware cron with acceptable fault tolerance semantics. That is, if a snapshot creation is scheduled, only one node in the cluster should attempt to create the snapshot, but if that node fails, another node needs to step in (as opposed to skipping the scheduled snapshot creation). This is doable outside of GPFS, of course, but is not trivial. Architecturally, the right place to implement a fault-tolerant cluster-aware scheduling framework is GPFS itself, as the most complex pieces are already there. We have some plans for work along those lines, but if you want to reinforce the point with an RFE, that would be fine, too. yuri [Inactive hide details for Jez Tucker ---09/13/2016 02:10:31 AM---Hey Yuri, Perhaps an RFE here, but could I suggest there is]Jez Tucker ---09/13/2016 02:10:31 AM---Hey Yuri, Perhaps an RFE here, but could I suggest there is much value in From: Jez Tucker To: gpfsug-discuss at spectrumscale.org, Date: 09/13/2016 02:10 AM Subject: Re: [gpfsug-discuss] gpfs snapshots Sent by: gpfsug-discuss-bounces at spectrumscale.org ________________________________ Hey Yuri, Perhaps an RFE here, but could I suggest there is much value in adding a -c option to mmcrsnapshot? 
Use cases: mmcrsnapshot myfsname @GMT-2016.09.13-10.00.00 -j myfilesetname -c "Before phase 2" and mmcrsnapshot myfsname @GMT-2016.09.13-10.00.00 -j myfilesetname -c "expire:GMT-2017.04.21-16.00.00" Ideally also: mmcrsnapshot fs1 fset1:snapA:expirestr,fset2:snapB:expirestr,fset3:snapC:expirestr Then it's easy to iterate over snapshots and subsequently mmdelsnapshot snaps which are no longer required. There are lots of methods to achieve this, but without external databases / suchlike, this is rather simple and effective for end users. Alternatively a second comment like -expire flag as user metadata may be preferential. Thoughts? Jez On 13/09/16 05:32, Valdis.Kletnieks at vt.edu wrote: On Tue, 13 Sep 2016 00:30:19 +0200, Lukas Hejtmanek said: I guess we could reach snapid 100,000. It probably stores the snap ID as a 32 or 64 bit int, so 100K is peanuts. What you *do* want to do is make the snap *name* meaningful, using a timestamp or something to keep your sanity. mmcrsnapshot fs923 `date +%y%m%d-%H%M` or similar. _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss -- Jez Tucker Head of Research & Product Development Pixit Media www.pixitmedia.com This email is confidential in that it is intended for the exclusive attention of the addressee(s) indicated. If you are not the intended recipient, this email should not be read or disclosed to any other person. Please notify the sender immediately and delete this email from your computer system. Any opinions expressed are not necessarily those of the company from which this email was sent and, whilst to the best of our knowledge no viruses or defects exist, no responsibility can be accepted for any loss or damage arising from its receipt or subsequent use of this email._______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss -------------- next part -------------- A non-text attachment was scrubbed... Name: graycol.gif Type: image/gif Size: 105 bytes Desc: graycol.gif URL: From mark.bergman at uphs.upenn.edu Tue Sep 13 22:23:57 2016 From: mark.bergman at uphs.upenn.edu (mark.bergman at uphs.upenn.edu) Date: Tue, 13 Sep 2016 17:23:57 -0400 Subject: [gpfsug-discuss] gpfs snapshots In-Reply-To: Your message of "Tue, 13 Sep 2016 13:51:16 -0700." References: <20160912223019.s665s3ccoltdkq3l@ics.muni.cz><20635.1473741144@turing-police.cc.vt.edu> <00e74006-8f99-280f-8832-1cee019c4b07@pixitmedia.com> Message-ID: <19294-1473801837.563347@J_5h.TM7K.YXzn> In the message dated: Tue, 13 Sep 2016 13:51:16 -0700, The pithy ruminations from Yuri L Volobuev on were: => => Hi Jez, => => It sounds to me like the functionality that you're _really_ looking for is => an ability to to do automated snapshot management, similar to what's Yep. => available on other storage systems. For example, "create a new snapshot of => filesets X, Y, Z every 30 min, keep the last 16 snapshots". I've seen many Or, take a snapshot every 15min, keep the 4 most recent, expire all except 4 that were created within 6hrs, only 4 created between 6:01-24:00 hh:mm ago, and expire all-but-2 created between 24:01-48:00, etc, as we do. => examples of sysadmins rolling their own snapshot management system along => those lines, and an ability to add an expiration string as a snapshot I'd be glad to distribute our local example of this exercise. 
=> "comment" appears to be merely an aid in keeping such DIY snapshot => management scripts a bit simpler -- not by much though. The end user would => still be on the hook for some heavy lifting, in particular figuring out a => way to run an equivalent of a cluster-aware cron with acceptable fault => tolerance semantics. That is, if a snapshot creation is scheduled, only => one node in the cluster should attempt to create the snapshot, but if that => node fails, another node needs to step in (as opposed to skipping the => scheduled snapshot creation). This is doable outside of GPFS, of course, => but is not trivial. Architecturally, the right place to implement a Ah, that part really is trivial....In our case, the snapshot program takes the filesystem name as an argument... we simply rely on the GPFS fault detection/failover. The job itself runs (via cron) on every GPFS server node, but only creates the snapshot on the server that is the active manager for the specified filesystem: ############################################################################## # Check if the node where this script is running is the GPFS manager node for the # specified filesystem manager=`/usr/lpp/mmfs/bin/mmlsmgr $filesys | grep -w "^$filesys" |awk '{print $2}'` ip addr list | grep -qw "$manager" if [ $? != 0 ] ; then # This node is not the manager...exit exit fi # else ... continue and create the snapshot ################################################################################################### => => yuri => => -- Mark Bergman voice: 215-746-4061 mark.bergman at uphs.upenn.edu fax: 215-614-0266 http://www.cbica.upenn.edu/ IT Technical Director, Center for Biomedical Image Computing and Analytics Department of Radiology University of Pennsylvania PGP Key: http://www.cbica.upenn.edu/sbia/bergman From jtolson at us.ibm.com Tue Sep 13 22:47:02 2016 From: jtolson at us.ibm.com (John T Olson) Date: Tue, 13 Sep 2016 14:47:02 -0700 Subject: [gpfsug-discuss] gpfs snapshots In-Reply-To: References: <20160912223019.s665s3ccoltdkq3l@ics.muni.cz><20635.1473741144@turing-police.cc.vt.edu><00e74006-8f99-280f-8832-1cee019c4b07@pixitmedia.com>, Message-ID: We do have a general-purpose scheduler on the books as an item that is needed for a future release and as Yuri mentioned it would be cluster wide to avoid the single point of failure with tools like Cron. However, it's one of many things we want to try to get into the product and so we don't have any definite timeline yet. Thanks, John John T. Olson, Ph.D., MI.C., K.EY. Master Inventor, Software Defined Storage 957/9032-1 Tucson, AZ, 85744 (520) 799-5185, tie 321-5185 (FAX: 520-799-4237) Email: jtolson at us.ibm.com "Do or do not. There is no try." - Yoda Olson's Razor: Any situation that we, as humans, can encounter in life can be modeled by either an episode of The Simpsons or Seinfeld. From: "Simon Thompson (Research Computing - IT Services)" To: gpfsug main discussion list Date: 09/13/2016 02:22 PM Subject: Re: [gpfsug-discuss] gpfs snapshots Sent by: gpfsug-discuss-bounces at spectrumscale.org I thought the GUI implemented some form of snapshot scheduler. Personal opinion is that is the wrong place and I agree that is should be core functionality to ensure that the scheduler is running properly. But I would suggest that it might be more than just snapshots people might want to schedule. E.g. An ilm pool flush. 
Simon ________________________________________ From: gpfsug-discuss-bounces at spectrumscale.org [gpfsug-discuss-bounces at spectrumscale.org] on behalf of Yuri L Volobuev [volobuev at us.ibm.com] Sent: 13 September 2016 21:51 To: gpfsug main discussion list Subject: Re: [gpfsug-discuss] gpfs snapshots Hi Jez, It sounds to me like the functionality that you're _really_ looking for is an ability to to do automated snapshot management, similar to what's available on other storage systems. For example, "create a new snapshot of filesets X, Y, Z every 30 min, keep the last 16 snapshots". I've seen many examples of sysadmins rolling their own snapshot management system along those lines, and an ability to add an expiration string as a snapshot "comment" appears to be merely an aid in keeping such DIY snapshot management scripts a bit simpler -- not by much though. The end user would still be on the hook for some heavy lifting, in particular figuring out a way to run an equivalent of a cluster-aware cron with acceptable fault tolerance semantics. That is, if a snapshot creation is scheduled, only one node in the cluster should attempt to create the snapshot, but if that node fails, another node needs to step in (as opposed to skipping the scheduled snapshot creation). This is doable outside of GPFS, of course, but is not trivial. Architecturally, the right place to implement a fault-tolerant cluster-aware scheduling framework is GPFS itself, as the most complex pieces are already there. We have some plans for work along those lines, but if you want to reinforce the point with an RFE, that would be fine, too. yuri [Inactive hide details for Jez Tucker ---09/13/2016 02:10:31 AM---Hey Yuri, Perhaps an RFE here, but could I suggest there is]Jez Tucker ---09/13/2016 02:10:31 AM---Hey Yuri, Perhaps an RFE here, but could I suggest there is much value in From: Jez Tucker To: gpfsug-discuss at spectrumscale.org, Date: 09/13/2016 02:10 AM Subject: Re: [gpfsug-discuss] gpfs snapshots Sent by: gpfsug-discuss-bounces at spectrumscale.org ________________________________ Hey Yuri, Perhaps an RFE here, but could I suggest there is much value in adding a -c option to mmcrsnapshot? Use cases: mmcrsnapshot myfsname @GMT-2016.09.13-10.00.00 -j myfilesetname -c "Before phase 2" and mmcrsnapshot myfsname @GMT-2016.09.13-10.00.00 -j myfilesetname -c "expire:GMT-2017.04.21-16.00.00" Ideally also: mmcrsnapshot fs1 fset1:snapA:expirestr,fset2:snapB:expirestr,fset3:snapC:expirestr Then it's easy to iterate over snapshots and subsequently mmdelsnapshot snaps which are no longer required. There are lots of methods to achieve this, but without external databases / suchlike, this is rather simple and effective for end users. Alternatively a second comment like -expire flag as user metadata may be preferential. Thoughts? Jez On 13/09/16 05:32, Valdis.Kletnieks at vt.edu wrote: On Tue, 13 Sep 2016 00:30:19 +0200, Lukas Hejtmanek said: I guess we could reach snapid 100,000. It probably stores the snap ID as a 32 or 64 bit int, so 100K is peanuts. What you *do* want to do is make the snap *name* meaningful, using a timestamp or something to keep your sanity. mmcrsnapshot fs923 `date +%y%m%d-%H%M` or similar. 
_______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss -- Jez Tucker Head of Research & Product Development Pixit Media www.pixitmedia.com This email is confidential in that it is intended for the exclusive attention of the addressee(s) indicated. If you are not the intended recipient, this email should not be read or disclosed to any other person. Please notify the sender immediately and delete this email from your computer system. Any opinions expressed are not necessarily those of the company from which this email was sent and, whilst to the best of our knowledge no viruses or defects exist, no responsibility can be accepted for any loss or damage arising from its receipt or subsequent use of this email._______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss (See attached file: graycol.gif) _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: graycol.gif Type: image/gif Size: 105 bytes Desc: not available URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: graycol.gif Type: image/gif Size: 105 bytes Desc: not available URL: From jtucker at pixitmedia.com Tue Sep 13 23:28:22 2016 From: jtucker at pixitmedia.com (Jez Tucker) Date: Tue, 13 Sep 2016 23:28:22 +0100 Subject: [gpfsug-discuss] gpfs snapshots In-Reply-To: References: <20160912223019.s665s3ccoltdkq3l@ics.muni.cz> <20635.1473741144@turing-police.cc.vt.edu> <00e74006-8f99-280f-8832-1cee019c4b07@pixitmedia.com> Message-ID: <2336bbd5-39ca-dc0d-e1b4-7a301c6b9f2e@pixitmedia.com> Hey So yes, you're quite right - we have higher order fault tolerant cluster wide methods of dealing with such requirements already. However, I still think the end user should be empowered to be able construct such methods themselves if needs be. Yes, the comment is merely an aid [but also useful as a generic comment field] and as such could be utilised to encode basic metadata into the comment field. I'll log an RFE and see where we go from here. Cheers Jez On 13/09/16 21:51, Yuri L Volobuev wrote: > > Hi Jez, > > It sounds to me like the functionality that you're _really_ looking > for is an ability to to do automated snapshot management, similar to > what's available on other storage systems. For example, "create a new > snapshot of filesets X, Y, Z every 30 min, keep the last 16 > snapshots". I've seen many examples of sysadmins rolling their own > snapshot management system along those lines, and an ability to add an > expiration string as a snapshot "comment" appears to be merely an aid > in keeping such DIY snapshot management scripts a bit simpler -- not > by much though. The end user would still be on the hook for some heavy > lifting, in particular figuring out a way to run an equivalent of a > cluster-aware cron with acceptable fault tolerance semantics. That is, > if a snapshot creation is scheduled, only one node in the cluster > should attempt to create the snapshot, but if that node fails, another > node needs to step in (as opposed to skipping the scheduled snapshot > creation). 
This is doable outside of GPFS, of course, but is not > trivial. Architecturally, the right place to implement a > fault-tolerant cluster-aware scheduling framework is GPFS itself, as > the most complex pieces are already there. We have some plans for work > along those lines, but if you want to reinforce the point with an RFE, > that would be fine, too. > > yuri > > Inactive hide details for Jez Tucker ---09/13/2016 02:10:31 AM---Hey > Yuri, Perhaps an RFE here, but could I suggest there isJez Tucker > ---09/13/2016 02:10:31 AM---Hey Yuri, Perhaps an RFE here, but could I > suggest there is much value in > > From: Jez Tucker > To: gpfsug-discuss at spectrumscale.org, > Date: 09/13/2016 02:10 AM > Subject: Re: [gpfsug-discuss] gpfs snapshots > Sent by: gpfsug-discuss-bounces at spectrumscale.org > > ------------------------------------------------------------------------ > > > > Hey Yuri, > > Perhaps an RFE here, but could I suggest there is much value in > adding a -c option to mmcrsnapshot? > > Use cases: > > mmcrsnapshot myfsname @GMT-2016.09.13-10.00.00 -j myfilesetname -c > "Before phase 2" > > and > > mmcrsnapshot myfsname @GMT-2016.09.13-10.00.00 -j myfilesetname -c > "expire:GMT-2017.04.21-16.00.00" > > Ideally also: mmcrsnapshot fs1 > fset1:snapA:expirestr,fset2:snapB:expirestr,fset3:snapC:expirestr > > Then it's easy to iterate over snapshots and subsequently > mmdelsnapshot snaps which are no longer required. > There are lots of methods to achieve this, but without external > databases / suchlike, this is rather simple and effective for end users. > > Alternatively a second comment like -expire flag as user metadata may > be preferential. > > Thoughts? > > Jez > > > On 13/09/16 05:32, _Valdis.Kletnieks at vt.edu_ > wrote: > > On Tue, 13 Sep 2016 00:30:19 +0200, Lukas Hejtmanek said: > I guess we could reach snapid 100,000. > > It probably stores the snap ID as a 32 or 64 bit int, so 100K > is peanuts. > > What you *do* want to do is make the snap *name* meaningful, using > a timestamp or something to keep your sanity. > > mmcrsnapshot fs923 `date +%y%m%d-%H%M` or similar. > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > _http://gpfsug.org/mailman/listinfo/gpfsug-discuss_ > > > > -- > Jez Tucker > Head of Research & Product Development > Pixit Media_ > __www.pixitmedia.com_ > > > This email is confidential in that it is intended for the exclusive > attention of the addressee(s) indicated. If you are not the intended > recipient, this email should not be read or disclosed to any other > person. Please notify the sender immediately and delete this email > from your computer system. 
Any opinions expressed are not necessarily > those of the company from which this email was sent and, whilst to the > best of our knowledge no viruses or defects exist, no responsibility > can be accepted for any loss or damage arising from its receipt or > subsequent use of this > email._______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > > > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss -- Jez Tucker Head of Research & Product Development Pixit Media Mobile: +44 (0) 776 419 3820 www.pixitmedia.com -- This email is confidential in that it is intended for the exclusive attention of the addressee(s) indicated. If you are not the intended recipient, this email should not be read or disclosed to any other person. Please notify the sender immediately and delete this email from your computer system. Any opinions expressed are not necessarily those of the company from which this email was sent and, whilst to the best of our knowledge no viruses or defects exist, no responsibility can be accepted for any loss or damage arising from its receipt or subsequent use of this email. -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: not available Type: image/gif Size: 105 bytes Desc: not available URL: From service at metamodul.com Wed Sep 14 19:10:37 2016 From: service at metamodul.com (service at metamodul.com) Date: Wed, 14 Sep 2016 20:10:37 +0200 Subject: [gpfsug-discuss] gpfs snapshots Message-ID: Why not use a GPFS user extented attribut for that ? In a certain way i see GPFS as a database. ^_^ Hajo Von Samsung Mobile gesendet
-------- Original Message --------
From: Jez Tucker
Date: 2016.09.13 11:10 (GMT+01:00)
To: gpfsug-discuss at spectrumscale.org
Subject: Re: [gpfsug-discuss] gpfs snapshots
Hey Yuri, Perhaps an RFE here, but could I suggest there is much value in adding a -c option to mmcrsnapshot? Use cases: mmcrsnapshot myfsname @GMT-2016.09.13-10.00.00 -j myfilesetname -c "Before phase 2" and mmcrsnapshot myfsname @GMT-2016.09.13-10.00.00 -j myfilesetname -c "expire:GMT-2017.04.21-16.00.00" Ideally also: mmcrsnapshot fs1 fset1:snapA:expirestr,fset2:snapB:expirestr,fset3:snapC:expirestr Then it's easy to iterate over snapshots and subsequently mmdelsnapshot snaps which are no longer required. There are lots of methods to achieve this, but without external databases / suchlike, this is rather simple and effective for end users. Alternatively a second comment like -expire flag as user metadata may be preferential. Thoughts? Jez On 13/09/16 05:32, Valdis.Kletnieks at vt.edu wrote: On Tue, 13 Sep 2016 00:30:19 +0200, Lukas Hejtmanek said: I guess we could reach snapid 100,000. It probably stores the snap ID as a 32 or 64 bit int, so 100K is peanuts. What you *do* want to do is make the snap *name* meaningful, using a timestamp or something to keep your sanity. mmcrsnapshot fs923 `date +%y%m%d-%H%M` or similar. _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss -- Jez Tucker Head of Research & Product Development Pixit Media www.pixitmedia.com This email is confidential in that it is intended for the exclusive attention of the addressee(s) indicated. If you are not the intended recipient, this email should not be read or disclosed to any other person. Please notify the sender immediately and delete this email from your computer system. Any opinions expressed are not necessarily those of the company from which this email was sent and, whilst to the best of our knowledge no viruses or defects exist, no responsibility can be accepted for any loss or damage arising from its receipt or subsequent use of this email. -------------- next part -------------- An HTML attachment was scrubbed... URL: From service at metamodul.com Wed Sep 14 19:21:20 2016 From: service at metamodul.com (service at metamodul.com) Date: Wed, 14 Sep 2016 20:21:20 +0200 Subject: [gpfsug-discuss] gpfs snapshots Message-ID: <4fojjlpuwqoalkffaahy7snf.1473877280415@email.android.com> I am missing since ages such a framework. I had my simple one devoloped on the gpfs callbacks which allowed me to have a centralized cron (HA) up to oracle also ?high available and ha nfs on Aix. Hajo Universal Inventor? -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From jtucker at pixitmedia.com Wed Sep 14 19:49:36 2016 From: jtucker at pixitmedia.com (Jez Tucker) Date: Wed, 14 Sep 2016 19:49:36 +0100 Subject: [gpfsug-discuss] gpfs snapshots In-Reply-To: References: Message-ID: Hi I still think I'm coming down on the side of simplistic ease of use: Example: [jtucker at pixstor ~]# mmlssnapshot mmfs1 Snapshots in file system mmfs1: Directory SnapId Status Created Fileset Comment @GMT-2016.09.13-23.00.14 551 Valid Wed Sep 14 00:00:02 2016 myproject Prior to phase 1 @GMT-2016.09.14-05.00.14 552 Valid Wed Sep 14 06:00:01 2016 myproject Added this and that @GMT-2016.09.14-11.00.14 553 Valid Wed Sep 14 12:00:01 2016 myproject Merged project2 @GMT-2016.09.14-17.00.14 554 Valid Wed Sep 14 18:00:02 2016 myproject Before clean of .xmp @GMT-2016.09.14-17.05.30 555 Valid Wed Sep 14 18:05:03 2016 myproject Archival Jez On 14/09/16 19:10, service at metamodul.com wrote: > Why not use a GPFS user extented attribut for that ? > In a certain way i see GPFS as a database. ^_^ > Hajo > > > > Von Samsung Mobile gesendet > > > -------- Urspr?ngliche Nachricht -------- > Von: Jez Tucker > Datum:2016.09.13 11:10 (GMT+01:00) > An: gpfsug-discuss at spectrumscale.org > Betreff: Re: [gpfsug-discuss] gpfs snapshots > > Hey Yuri, > > Perhaps an RFE here, but could I suggest there is much value in > adding a -c option to mmcrsnapshot? > > Use cases: > > mmcrsnapshot myfsname @GMT-2016.09.13-10.00.00 -j myfilesetname -c > "Before phase 2" > > and > > mmcrsnapshot myfsname @GMT-2016.09.13-10.00.00 -j myfilesetname -c > "expire:GMT-2017.04.21-16.00.00" > > Ideally also: mmcrsnapshot fs1 > fset1:snapA:expirestr,fset2:snapB:expirestr,fset3:snapC:expirestr > > Then it's easy to iterate over snapshots and subsequently > mmdelsnapshot snaps which are no longer required. > There are lots of methods to achieve this, but without external > databases / suchlike, this is rather simple and effective for end users. > > Alternatively a second comment like -expire flag as user metadata may > be preferential. > > Thoughts? > > Jez > > > On 13/09/16 05:32, Valdis.Kletnieks at vt.edu wrote: >> On Tue, 13 Sep 2016 00:30:19 +0200, Lukas Hejtmanek said: >>> I guess we could reach snapid 100,000. >> It probably stores the snap ID as a 32 or 64 bit int, so 100K is peanuts. >> >> What you *do* want to do is make the snap *name* meaningful, using >> a timestamp or something to keep your sanity. >> >> mmcrsnapshot fs923 `date +%y%m%d-%H%M` or similar. >> _______________________________________________ >> gpfsug-discuss mailing list >> gpfsug-discuss at spectrumscale.org >> http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > > -- > Jez Tucker > Head of Research & Product Development > Pixit Media > www.pixitmedia.com > -- Jez Tucker Head of Research & Product Development Pixit Media www.pixitmedia.com -- This email is confidential in that it is intended for the exclusive attention of the addressee(s) indicated. If you are not the intended recipient, this email should not be read or disclosed to any other person. Please notify the sender immediately and delete this email from your computer system. Any opinions expressed are not necessarily those of the company from which this email was sent and, whilst to the best of our knowledge no viruses or defects exist, no responsibility can be accepted for any loss or damage arising from its receipt or subsequent use of this email. -------------- next part -------------- An HTML attachment was scrubbed... 
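(For what it's worth, the extended-attribute idea could be prototyped today with ordinary user xattrs, which GPFS exposes through the standard Linux tools. A very rough sketch follows; since snapshots themselves are read-only, the attribute has to live on a marker file in the live filesystem, and every path and attribute name below is made up.)

REG=/gpfs/mmfs1/.snapmeta          # writable registry directory, one marker file per snapshot
snap=@GMT-2016.09.14-17.05.30
mkdir -p "$REG" && touch "$REG/$snap"
setfattr -n user.snap.expire -v "GMT-2017.04.21-16.00.00" "$REG/$snap"
# an expiry sweep can later read it back and decide whether to mmdelsnapshot:
getfattr --only-values -n user.snap.expire "$REG/$snap"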
URL: From secretary at gpfsug.org Thu Sep 15 09:42:54 2016 From: secretary at gpfsug.org (Secretary GPFS UG) Date: Thu, 15 Sep 2016 09:42:54 +0100 Subject: [gpfsug-discuss] SSUG Meet the Devs - Cloud Workshop Message-ID: Hi everyone, Back by popular demand! We are holding a UK 'Meet the Developers' event to focus on Cloud topics. We are very lucky to have Dean Hildebrand, Master Inventor, Cloud Storage Software from IBM over in the UK to lead this session. IS IT FOR ME? Slightly different to the past meet the devs format, this is a cloud workshop aimed at looking how Spectrum Scale fits in the world of cloud. Rather than being a series of presentations and discussions led by IBM, this workshop aims to look at how Spectrum Scale can be used in cloud environments. This will include using Spectrum Scale as an infrastructure tool to power private cloud deployments. We will also look at the challenges of accessing data from cloud deployments and discuss ways in which this might be accomplished. If you are currently deploying OpenStack on Spectrum Scale, or plan to in the near future, then this workshop is for you. Also if you currently have Spectrum Scale and are wondering how you might get that data into cloud-enabled workloads or are currently doing so, then again you should attend. To ensure that the workshop is focused, numbers are limited and we will initially be limiting to 2 people per organisation/project/site. WHAT WILL BE DISCUSSED? Our topics for the day will include on-premise (private) clouds, on-premise self-service (public) clouds, off-premise clouds (Amazon etc.) as well as covering technologies including OpenStack, Docker, Kubernetes and security requirements around multi-tenancy. We probably don't have all the answers for these, but we'd like to understand the requirements and hear people's ideas. Please let us know what you would like to discuss when you register. Arrival is from 10:00 with discussion kicking off from 10:30. The agenda is open discussion though we do aim to talk over a number of key topics. We hope to have the ever popular (though usually late!) pizza for lunch. WHEN Thursday 20th October 2016 from 10:00 AM to 3:30 PM WHERE IT Services, University of Birmingham - Elms Road Edgbaston, Birmingham, B15 2TT REGISTER Please register for the event in advance: https://www.eventbrite.com/e/ssug-meet-the-devs-cloud-workshop-tickets-27725390389 [1] Numbers are limited and we will initially be limiting to 2 people per organisation/project/site. We look forward to seeing you there! -- Claire O'Toole Spectrum Scale/GPFS User Group Secretary +44 (0)7508 033896 www.spectrumscaleug.org Links: ------ [1] https://www.eventbrite.com/e/ssug-meet-the-devs-cloud-workshop-tickets-27725390389 -------------- next part -------------- An HTML attachment was scrubbed... URL: From peter.botcherby at kcl.ac.uk Thu Sep 15 09:45:47 2016 From: peter.botcherby at kcl.ac.uk (Botcherby, Peter) Date: Thu, 15 Sep 2016 08:45:47 +0000 Subject: [gpfsug-discuss] SSUG Meet the Devs - Cloud Workshop In-Reply-To: References: Message-ID: Hi Claire, Hope you are well - I will be away for this as going to Indonesia on the 18th October for my nephew?s wedding. Regards Peter From: gpfsug-discuss-bounces at spectrumscale.org [mailto:gpfsug-discuss-bounces at spectrumscale.org] On Behalf Of Secretary GPFS UG Sent: 15 September 2016 09:43 To: gpfsug main discussion list Subject: [gpfsug-discuss] SSUG Meet the Devs - Cloud Workshop Hi everyone, Back by popular demand! 
We are holding a UK 'Meet the Developers' event to focus on Cloud topics. We are very lucky to have Dean Hildebrand, Master Inventor, Cloud Storage Software from IBM over in the UK to lead this session. IS IT FOR ME? Slightly different to the past meet the devs format, this is a cloud workshop aimed at looking how Spectrum Scale fits in the world of cloud. Rather than being a series of presentations and discussions led by IBM, this workshop aims to look at how Spectrum Scale can be used in cloud environments. This will include using Spectrum Scale as an infrastructure tool to power private cloud deployments. We will also look at the challenges of accessing data from cloud deployments and discuss ways in which this might be accomplished. If you are currently deploying OpenStack on Spectrum Scale, or plan to in the near future, then this workshop is for you. Also if you currently have Spectrum Scale and are wondering how you might get that data into cloud-enabled workloads or are currently doing so, then again you should attend. To ensure that the workshop is focused, numbers are limited and we will initially be limiting to 2 people per organisation/project/site. WHAT WILL BE DISCUSSED? Our topics for the day will include on-premise (private) clouds, on-premise self-service (public) clouds, off-premise clouds (Amazon etc.) as well as covering technologies including OpenStack, Docker, Kubernetes and security requirements around multi-tenancy. We probably don't have all the answers for these, but we'd like to understand the requirements and hear people's ideas. Please let us know what you would like to discuss when you register. Arrival is from 10:00 with discussion kicking off from 10:30. The agenda is open discussion though we do aim to talk over a number of key topics. We hope to have the ever popular (though usually late!) pizza for lunch. WHEN Thursday 20th October 2016 from 10:00 AM to 3:30 PM WHERE IT Services, University of Birmingham - Elms Road Edgbaston, Birmingham, B15 2TT REGISTER Please register for the event in advance: https://www.eventbrite.com/e/ssug-meet-the-devs-cloud-workshop-tickets-27725390389 Numbers are limited and we will initially be limiting to 2 people per organisation/project/site. We look forward to seeing you there! -- Claire O'Toole Spectrum Scale/GPFS User Group Secretary +44 (0)7508 033896 www.spectrumscaleug.org -------------- next part -------------- An HTML attachment was scrubbed... URL: From mimarsh2 at vt.edu Thu Sep 15 17:49:27 2016 From: mimarsh2 at vt.edu (Brian Marshall) Date: Thu, 15 Sep 2016 12:49:27 -0400 Subject: [gpfsug-discuss] EDR and omnipath Message-ID: All, I see in the GPFS FAQ A6.3 the statement below. Is it possible to have GPFS do RDMA over EDR infiniband and non-RDMA communication over omnipath (IP over fabric) when each NSD server has an EDR card and a OPA card installed? RDMA is not supported on a node when both Mellanox HCAs and Intel Omni-Path HFIs are enabled for RDMA. -------------- next part -------------- An HTML attachment was scrubbed... URL: From r.sobey at imperial.ac.uk Thu Sep 15 16:33:17 2016 From: r.sobey at imperial.ac.uk (Sobey, Richard A) Date: Thu, 15 Sep 2016 15:33:17 +0000 Subject: [gpfsug-discuss] Bit of a rant about snapshot command syntax changes Message-ID: Why was mmdelsnapshot (and possibly other snapshot related commands) changed from: Mmdelsnapshot device snapshotname -j filesetname TO Mmdelsnapshot device filesetname:snapshotname ..between 4.2.0 and 4.2.1? It's mildly irritating to say the least! 
Richard -------------- next part -------------- An HTML attachment was scrubbed... URL: From makaplan at us.ibm.com Fri Sep 16 15:21:58 2016 From: makaplan at us.ibm.com (Marc A Kaplan) Date: Fri, 16 Sep 2016 10:21:58 -0400 Subject: [gpfsug-discuss] Bit of a rant about snapshot command syntax changes In-Reply-To: References: Message-ID: I think at least the most popular old forms still work, even if the documentation and usage messages were scrubbed. So generally, for example, neither your scripts nor your fingers will break. ;-) From: "Sobey, Richard A" To: "'gpfsug-discuss at spectrumscale.org'" Date: 09/16/2016 07:02 AM Subject: [gpfsug-discuss] Bit of a rant about snapshot command syntax changes Sent by: gpfsug-discuss-bounces at spectrumscale.org Why was mmdelsnapshot (and possibly other snapshot related commands) changed from: Mmdelsnapshot device snapshotname ?j filesetname TO Mmdelsnapshot device filesetname:snapshotname ..between 4.2.0 and 4.2.1? It?s mildly irritating to say the least! Richard _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss -------------- next part -------------- An HTML attachment was scrubbed... URL: From r.sobey at imperial.ac.uk Fri Sep 16 15:40:52 2016 From: r.sobey at imperial.ac.uk (Sobey, Richard A) Date: Fri, 16 Sep 2016 14:40:52 +0000 Subject: [gpfsug-discuss] Bit of a rant about snapshot command syntax changes In-Reply-To: References: Message-ID: Thanks Marc. Regrettably in this case, the only way I knew to delete a snapshot (listed below) has broken going from 3.5 to 4.2.1. Creating snaps has suffered the same fate. From: gpfsug-discuss-bounces at spectrumscale.org [mailto:gpfsug-discuss-bounces at spectrumscale.org] On Behalf Of Marc A Kaplan Sent: 16 September 2016 15:22 To: gpfsug main discussion list Subject: Re: [gpfsug-discuss] Bit of a rant about snapshot command syntax changes I think at least the most popular old forms still work, even if the documentation and usage messages were scrubbed. So generally, for example, neither your scripts nor your fingers will break. ;-) From: "Sobey, Richard A" > To: "'gpfsug-discuss at spectrumscale.org'" > Date: 09/16/2016 07:02 AM Subject: [gpfsug-discuss] Bit of a rant about snapshot command syntax changes Sent by: gpfsug-discuss-bounces at spectrumscale.org ________________________________ Why was mmdelsnapshot (and possibly other snapshot related commands) changed from: Mmdelsnapshot device snapshotname ?j filesetname TO Mmdelsnapshot device filesetname:snapshotname ..between 4.2.0 and 4.2.1? It?s mildly irritating to say the least! Richard _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss -------------- next part -------------- An HTML attachment was scrubbed... URL: From Paul.Sanchez at deshaw.com Fri Sep 16 20:49:14 2016 From: Paul.Sanchez at deshaw.com (Sanchez, Paul) Date: Fri, 16 Sep 2016 19:49:14 +0000 Subject: [gpfsug-discuss] Bit of a rant about snapshot command syntax changes In-Reply-To: References: Message-ID: <3e1f02b30e1a49ef950de7910801f5d1@mbxtoa1.winmail.deshaw.com> The old syntax works unless have a colon in your snapshot names. In that case, the portion before the first colon will be interpreted as a fileset name. 
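(For scripts that have to survive the transition, a small compatibility wrapper along these lines -- the function and variable names are just examples, untested -- can try the new fileset:snapshot form first and fall back to the old -j form.)

del_fileset_snap() {
    local fs=$1 fileset=$2 snap=$3
    /usr/lpp/mmfs/bin/mmdelsnapshot "$fs" "${fileset}:${snap}" 2>/dev/null \
        || /usr/lpp/mmfs/bin/mmdelsnapshot "$fs" "$snap" -j "$fileset"
}
# e.g. del_fileset_snap gpfs0 myfileset @GMT-2016.09.16-12.00.00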
So if you use RFC 3339/ISO 8601 date/times, that?s a problem: The syntax for creating and deleting snapshots goes from this: mm{cr|del}snapshot fs100 SNAP at 2016-07-31T13:00:07Z ?j 1000466 to this: mm{cr|del}snapshot fs100 1000466:SNAP at 2016-07-31T13:00:07Z If you are dealing with filesystem level snapshots then you just need a leading colon: mm{cr|del}snapshot fs100 :SNAP at 2016-07-31T13:00:07Z Thx Paul From: gpfsug-discuss-bounces at spectrumscale.org [mailto:gpfsug-discuss-bounces at spectrumscale.org] On Behalf Of Marc A Kaplan Sent: Friday, September 16, 2016 10:22 AM To: gpfsug main discussion list Subject: Re: [gpfsug-discuss] Bit of a rant about snapshot command syntax changes I think at least the most popular old forms still work, even if the documentation and usage messages were scrubbed. So generally, for example, neither your scripts nor your fingers will break. ;-) From: "Sobey, Richard A" > To: "'gpfsug-discuss at spectrumscale.org'" > Date: 09/16/2016 07:02 AM Subject: [gpfsug-discuss] Bit of a rant about snapshot command syntax changes Sent by: gpfsug-discuss-bounces at spectrumscale.org ________________________________ Why was mmdelsnapshot (and possibly other snapshot related commands) changed from: Mmdelsnapshot device snapshotname ?j filesetname TO Mmdelsnapshot device filesetname:snapshotname ..between 4.2.0 and 4.2.1? It?s mildly irritating to say the least! Richard _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss -------------- next part -------------- An HTML attachment was scrubbed... URL: From viccornell at gmail.com Mon Sep 19 08:11:38 2016 From: viccornell at gmail.com (Vic Cornell) Date: Mon, 19 Sep 2016 08:11:38 +0100 Subject: [gpfsug-discuss] EDR and omnipath In-Reply-To: References: Message-ID: <87E46193-4D65-41A3-AB0E-B12987F6FFC3@gmail.com> Bump I can see no reason why that wouldn't work. But it would be nice to a have an official answer or evidence that it works. Vic > On 15 Sep 2016, at 5:49 pm, Brian Marshall wrote: > > All, > > I see in the GPFS FAQ A6.3 the statement below. Is it possible to have GPFS do RDMA over EDR infiniband and non-RDMA communication over omnipath (IP over fabric) when each NSD server has an EDR card and a OPA card installed? > > > > RDMA is not supported on a node when both Mellanox HCAs and Intel Omni-Path HFIs are enabled for RDMA. > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss -------------- next part -------------- An HTML attachment was scrubbed... URL: From mweil at wustl.edu Mon Sep 19 20:18:18 2016 From: mweil at wustl.edu (Matt Weil) Date: Mon, 19 Sep 2016 14:18:18 -0500 Subject: [gpfsug-discuss] increasing inode Message-ID: All, What exactly happens that makes the clients hang when a file set inodes are increased? ________________________________ The materials in this message are private and may contain Protected Healthcare Information or other information of a sensitive nature. If you are not the intended recipient, be advised that any unauthorized use, disclosure, copying or the taking of any action in reliance on the contents of this information is strictly prohibited. If you have received this email in error, please immediately notify the sender via telephone or return mail. 
From aaron.s.knister at nasa.gov Mon Sep 19 21:34:53 2016 From: aaron.s.knister at nasa.gov (Aaron Knister) Date: Mon, 19 Sep 2016 16:34:53 -0400 Subject: [gpfsug-discuss] EDR and omnipath In-Reply-To: <87E46193-4D65-41A3-AB0E-B12987F6FFC3@gmail.com> References: <87E46193-4D65-41A3-AB0E-B12987F6FFC3@gmail.com> Message-ID: <6f33c20f-ff8f-9ccc-4609-920b35058138@nasa.gov> I must admit, I'm curious as to why one cannot use GPFS with IB and OPA both in RDMA mode. Granted, I know very little about OPA but if it just presents as another verbs device I wonder why it wouldn't "Just work" as long as GPFS is configured correctly. -Aaron On 9/19/16 3:11 AM, Vic Cornell wrote: > Bump > > I can see no reason why that wouldn't work. But it would be nice to a > have an official answer or evidence that it works. > > Vic > > >> On 15 Sep 2016, at 5:49 pm, Brian Marshall > > wrote: >> >> All, >> >> I see in the GPFS FAQ A6.3 the statement below. Is it possible to >> have GPFS do RDMA over EDR infiniband and non-RDMA communication over >> omnipath (IP over fabric) when each NSD server has an EDR card and a >> OPA card installed? >> >> >> >> RDMA is not supported on a node when both Mellanox HCAs and Intel >> Omni-Path HFIs are enabled for RDMA. >> >> _______________________________________________ >> gpfsug-discuss mailing list >> gpfsug-discuss at spectrumscale.org >> http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > -- Aaron Knister NASA Center for Climate Simulation (Code 606.2) Goddard Space Flight Center (301) 286-2776 From oehmes at us.ibm.com Mon Sep 19 21:43:31 2016 From: oehmes at us.ibm.com (Sven Oehme) Date: Mon, 19 Sep 2016 20:43:31 +0000 Subject: [gpfsug-discuss] EDR and omnipath In-Reply-To: <6f33c20f-ff8f-9ccc-4609-920b35058138@nasa.gov> Message-ID: Because they both require a different distribution of OFED, which are mutual exclusive to install. in theory if you deploy plain OFED it might work, but that will be hard to find somebody to support. Sent from IBM Verse Aaron Knister --- Re: [gpfsug-discuss] EDR and omnipath --- From:"Aaron Knister" To:gpfsug-discuss at spectrumscale.orgDate:Mon, Sep 19, 2016 1:35 PMSubject:Re: [gpfsug-discuss] EDR and omnipath I must admit, I'm curious as to why one cannot use GPFS with IB and OPA both in RDMA mode. Granted, I know very little about OPA but if it just presents as another verbs device I wonder why it wouldn't "Just work" as long as GPFS is configured correctly.-AaronOn 9/19/16 3:11 AM, Vic Cornell wrote:> Bump>> I can see no reason why that wouldn't work. But it would be nice to a> have an official answer or evidence that it works.>> Vic>>>> On 15 Sep 2016, at 5:49 pm, Brian Marshall > > wrote:>>>> All,>>>> I see in the GPFS FAQ A6.3 the statement below. 
Is it possible to>> have GPFS do RDMA over EDR infiniband and non-RDMA communication over>> omnipath (IP over fabric) when each NSD server has an EDR card and a>> OPA card installed?>>>>>>>> RDMA is not supported on a node when both Mellanox HCAs and Intel>> Omni-Path HFIs are enabled for RDMA.>>>> _______________________________________________>> gpfsug-discuss mailing list>> gpfsug-discuss at spectrumscale.org >> http://gpfsug.org/mailman/listinfo/gpfsug-discuss>>>> _______________________________________________> gpfsug-discuss mailing list> gpfsug-discuss at spectrumscale.org> http://gpfsug.org/mailman/listinfo/gpfsug-discuss>-- Aaron KnisterNASA Center for Climate Simulation (Code 606.2)Goddard Space Flight Center(301) 286-2776_______________________________________________gpfsug-discuss mailing listgpfsug-discuss at spectrumscale.orghttp://gpfsug.org/mailman/listinfo/gpfsug-discuss -------------- next part -------------- An HTML attachment was scrubbed... URL: From aaron.s.knister at nasa.gov Mon Sep 19 21:55:32 2016 From: aaron.s.knister at nasa.gov (Aaron Knister) Date: Mon, 19 Sep 2016 16:55:32 -0400 Subject: [gpfsug-discuss] EDR and omnipath In-Reply-To: References: Message-ID: Ah, that makes complete sense. Thanks! I had been doing some reading about OmniPath and for some reason was under the impression the OmniPath adapter could load itself as a driver under the verbs stack of OFED. Even so, that raises support concerns as you say. I wonder what folks are doing who have IB-based block storage fabrics but wanting to connect to OmniPath-based fabrics? I'm also curious how GNR customers would be able to serve both IB-based and an OmniPath-based fabrics over RDMA where performance is best? This is is along the lines of my GPFS protocol router question from the other day. -Aaron On 9/19/16 4:43 PM, Sven Oehme wrote: > Because they both require a different distribution of OFED, which are > mutual exclusive to install. > in theory if you deploy plain OFED it might work, but that will be hard > to find somebody to support. > > > Sent from IBM Verse > > Aaron Knister --- Re: [gpfsug-discuss] EDR and omnipath --- > > From: "Aaron Knister" > To: gpfsug-discuss at spectrumscale.org > Date: Mon, Sep 19, 2016 1:35 PM > Subject: Re: [gpfsug-discuss] EDR and omnipath > > ------------------------------------------------------------------------ > > I must admit, I'm curious as to why one cannot use GPFS with IB and OPA > both in RDMA mode. Granted, I know very little about OPA but if it just > presents as another verbs device I wonder why it wouldn't "Just work" as > long as GPFS is configured correctly. > > -Aaron > > On 9/19/16 3:11 AM, Vic Cornell wrote: >> Bump >> >> I can see no reason why that wouldn't work. But it would be nice to a >> have an official answer or evidence that it works. >> >> Vic >> >> >>> On 15 Sep 2016, at 5:49 pm, Brian Marshall >> > wrote: >>> >>> All, >>> >>> I see in the GPFS FAQ A6.3 the statement below. Is it possible to >>> have GPFS do RDMA over EDR infiniband and non-RDMA communication over >>> omnipath (IP over fabric) when each NSD server has an EDR card and a >>> OPA card installed? >>> >>> >>> >>> RDMA is not supported on a node when both Mellanox HCAs and Intel >>> Omni-Path HFIs are enabled for RDMA. 
>>> >>> _______________________________________________ >>> gpfsug-discuss mailing list >>> gpfsug-discuss at spectrumscale.org >>> http://gpfsug.org/mailman/listinfo/gpfsug-discuss >> >> >> >> _______________________________________________ >> gpfsug-discuss mailing list >> gpfsug-discuss at spectrumscale.org >> http://gpfsug.org/mailman/listinfo/gpfsug-discuss >> > > -- > Aaron Knister > NASA Center for Climate Simulation (Code 606.2) > Goddard Space Flight Center > (301) 286-2776 > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > > > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > -- Aaron Knister NASA Center for Climate Simulation (Code 606.2) Goddard Space Flight Center (301) 286-2776 From aaron.s.knister at nasa.gov Mon Sep 19 22:03:51 2016 From: aaron.s.knister at nasa.gov (Aaron Knister) Date: Mon, 19 Sep 2016 17:03:51 -0400 Subject: [gpfsug-discuss] EDR and omnipath In-Reply-To: References: Message-ID: <99103c73-baf0-f421-f64d-1d5ee916d340@nasa.gov> Here's where I read about the inter-operability of the two: http://www.intel.com/content/dam/www/public/us/en/documents/white-papers/omni-path-storage-white-paper.pdf This is what Intel says: > In a multi-homed file system server, or in a Lustre Networking (LNet) or IP router, a single OpenFabrics Al- liance (OFA) software environment supporting both an Intel OPA HFI and a Mellanox* InfiniBand HCA is required. The OFA software stack is architected to support multiple tar- geted network types. Currently, the OFA stack simultaneously supports iWARP for Ethernet, RDMA over Converged Ethernet (RoCE), and InfiniBand networks, and the Intel OPA network has been added to that list. As the OS distributions implement their OFA stacks, it will be validated to simultaneously support both Intel OPA Host > Intel is working closely with the major Linux distributors, including Red Hat* and SUSE*, to ensure that Intel OPA support is integrated into their OFA implementation. Once this is accomplished, then simultaneous Mellanox InfiniBand and Intel OPA support will be present in the standard Linux distributions. So it seems as though Intel is relying on the OS vendors to bridge the support gap between them and Mellanox. -Aaron On 9/19/16 4:55 PM, Aaron Knister wrote: > Ah, that makes complete sense. Thanks! > > I had been doing some reading about OmniPath and for some reason was > under the impression the OmniPath adapter could load itself as a driver > under the verbs stack of OFED. Even so, that raises support concerns as > you say. > > I wonder what folks are doing who have IB-based block storage fabrics > but wanting to connect to OmniPath-based fabrics? > > I'm also curious how GNR customers would be able to serve both IB-based > and an OmniPath-based fabrics over RDMA where performance is best? This > is is along the lines of my GPFS protocol router question from the other > day. > > -Aaron > > On 9/19/16 4:43 PM, Sven Oehme wrote: >> Because they both require a different distribution of OFED, which are >> mutual exclusive to install. >> in theory if you deploy plain OFED it might work, but that will be hard >> to find somebody to support. 
>> >> >> Sent from IBM Verse >> >> Aaron Knister --- Re: [gpfsug-discuss] EDR and omnipath --- >> >> From: "Aaron Knister" >> To: gpfsug-discuss at spectrumscale.org >> Date: Mon, Sep 19, 2016 1:35 PM >> Subject: Re: [gpfsug-discuss] EDR and omnipath >> >> ------------------------------------------------------------------------ >> >> I must admit, I'm curious as to why one cannot use GPFS with IB and OPA >> both in RDMA mode. Granted, I know very little about OPA but if it just >> presents as another verbs device I wonder why it wouldn't "Just work" as >> long as GPFS is configured correctly. >> >> -Aaron >> >> On 9/19/16 3:11 AM, Vic Cornell wrote: >>> Bump >>> >>> I can see no reason why that wouldn't work. But it would be nice to a >>> have an official answer or evidence that it works. >>> >>> Vic >>> >>> >>>> On 15 Sep 2016, at 5:49 pm, Brian Marshall >>> > wrote: >>>> >>>> All, >>>> >>>> I see in the GPFS FAQ A6.3 the statement below. Is it possible to >>>> have GPFS do RDMA over EDR infiniband and non-RDMA communication over >>>> omnipath (IP over fabric) when each NSD server has an EDR card and a >>>> OPA card installed? >>>> >>>> >>>> >>>> RDMA is not supported on a node when both Mellanox HCAs and Intel >>>> Omni-Path HFIs are enabled for RDMA. >>>> >>>> _______________________________________________ >>>> gpfsug-discuss mailing list >>>> gpfsug-discuss at spectrumscale.org >>>> http://gpfsug.org/mailman/listinfo/gpfsug-discuss >>> >>> >>> >>> _______________________________________________ >>> gpfsug-discuss mailing list >>> gpfsug-discuss at spectrumscale.org >>> http://gpfsug.org/mailman/listinfo/gpfsug-discuss >>> >> >> -- >> Aaron Knister >> NASA Center for Climate Simulation (Code 606.2) >> Goddard Space Flight Center >> (301) 286-2776 >> _______________________________________________ >> gpfsug-discuss mailing list >> gpfsug-discuss at spectrumscale.org >> http://gpfsug.org/mailman/listinfo/gpfsug-discuss >> >> >> >> >> _______________________________________________ >> gpfsug-discuss mailing list >> gpfsug-discuss at spectrumscale.org >> http://gpfsug.org/mailman/listinfo/gpfsug-discuss >> > -- Aaron Knister NASA Center for Climate Simulation (Code 606.2) Goddard Space Flight Center (301) 286-2776 From aaron.s.knister at nasa.gov Tue Sep 20 14:22:51 2016 From: aaron.s.knister at nasa.gov (Aaron Knister) Date: Tue, 20 Sep 2016 09:22:51 -0400 Subject: [gpfsug-discuss] GPFS Routers In-Reply-To: References: <5F910253243E6A47B81A9A2EB424BBA101D137D1@NDMSMBX404.ndc.nasa.gov> <712e5024-e6eb-9195-d9cd-f59a9b145e60@nasa.gov> Message-ID: Hi Marc, Currently we serve three disparate infiniband fabrics with three separate sets of NSD servers all connected via FC to backend storage. I was exploring the idea of flipping that on its head and having one set of NSD servers but would like something akin to Lustre LNET routers to connect each fabric to the back-end NSD servers over IB. I know there's IB routers out there now but I'm quite drawn to the idea of a GPFS equivalent of Lustre LNET routers, having used them in the past. I suppose I could always smush some extra HCAs in the NSD servers and do it that way but that got really ugly when I started factoring in omnipath. Something like an LNET router would also be useful for GNR users who would like to present to both an IB and an OmniPath fabric over RDMA. 
-Aaron On 9/12/16 10:48 AM, Marc A Kaplan wrote: > Perhaps if you clearly describe what equipment and connections you have > in place and what you're trying to accomplish, someone on this board can > propose a solution. > > In principle, it's always possible to insert proxies/routers to "fake" > any two endpoints into "believing" they are communicating directly. > > > > > > From: Aaron Knister > To: > Date: 09/11/2016 08:01 PM > Subject: Re: [gpfsug-discuss] GPFS Routers > Sent by: gpfsug-discuss-bounces at spectrumscale.org > ------------------------------------------------------------------------ > > > > After some googling around, I wonder if perhaps what I'm thinking of was > an I/O forwarding layer that I understood was being developed for x86_64 > type machines rather than some type of GPFS protocol router or proxy. > > -Aaron > > On 9/11/16 5:02 PM, Knister, Aaron S. (GSFC-606.2)[COMPUTER SCIENCE > CORP] wrote: >> Hi Everyone, >> >> A while back I seem to recall hearing about a mechanism being developed >> that would function similarly to Lustre's LNET routers and effectively >> allow a single set of NSD servers to talk to multiple RDMA fabrics >> without requiring the NSD servers to have infiniband interfaces on each >> RDMA fabric. Rather, one would have a set of GPFS gateway nodes on each >> fabric that would in effect proxy the RDMA requests to the NSD server. >> Does anyone know what I'm talking about? Just curious if it's still on >> the roadmap. >> >> -Aaron >> >> >> _______________________________________________ >> gpfsug-discuss mailing list >> gpfsug-discuss at spectrumscale.org >> http://gpfsug.org/mailman/listinfo/gpfsug-discuss >> > > -- > Aaron Knister > NASA Center for Climate Simulation (Code 606.2) > Goddard Space Flight Center > (301) 286-2776 > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > > > > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > -- Aaron Knister NASA Center for Climate Simulation (Code 606.2) Goddard Space Flight Center (301) 286-2776 From makaplan at us.ibm.com Tue Sep 20 15:01:49 2016 From: makaplan at us.ibm.com (Marc A Kaplan) Date: Tue, 20 Sep 2016 10:01:49 -0400 Subject: [gpfsug-discuss] GPFS Routers In-Reply-To: References: <5F910253243E6A47B81A9A2EB424BBA101D137D1@NDMSMBX404.ndc.nasa.gov><712e5024-e6eb-9195-d9cd-f59a9b145e60@nasa.gov> Message-ID: Thanks for spelling out the situation more clearly. This is beyond my knowledge and expertise. But perhaps some other participants on this forum will chime in! I may be missing something, but asking "What is Lustre LNET?" via google does not yield good answers. It would be helpful to have some graphics (pictures!) of typical, useful configurations. Limiting myself to a few minutes of searching, I couldn't find any. I "get" that Lustre users/admin with lots of nodes and several switching fabrics find it useful, but beyond that... I guess the answer will be "Performance!" -- but the obvious question is: Why not "just" use IP - that is the Internetworking Protocol! So rather than sweat over LNET, why not improve IP to work better over several IBs? >From a user/customer point of view where "I needed this yesterday", short of having an "LNET for GPFS", I suggest considering reconfiguring your nodes, switches, storage to get better performance. 
If you need to buy some more hardware, so be it. --marc From: Aaron Knister To: Date: 09/20/2016 09:23 AM Subject: Re: [gpfsug-discuss] GPFS Routers Sent by: gpfsug-discuss-bounces at spectrumscale.org Hi Marc, Currently we serve three disparate infiniband fabrics with three separate sets of NSD servers all connected via FC to backend storage. I was exploring the idea of flipping that on its head and having one set of NSD servers but would like something akin to Lustre LNET routers to connect each fabric to the back-end NSD servers over IB. I know there's IB routers out there now but I'm quite drawn to the idea of a GPFS equivalent of Lustre LNET routers, having used them in the past. I suppose I could always smush some extra HCAs in the NSD servers and do it that way but that got really ugly when I started factoring in omnipath. Something like an LNET router would also be useful for GNR users who would like to present to both an IB and an OmniPath fabric over RDMA. -Aaron On 9/12/16 10:48 AM, Marc A Kaplan wrote: > Perhaps if you clearly describe what equipment and connections you have > in place and what you're trying to accomplish, someone on this board can > propose a solution. > > In principle, it's always possible to insert proxies/routers to "fake" > any two endpoints into "believing" they are communicating directly. > > > > > > From: Aaron Knister > To: > Date: 09/11/2016 08:01 PM > Subject: Re: [gpfsug-discuss] GPFS Routers > Sent by: gpfsug-discuss-bounces at spectrumscale.org > ------------------------------------------------------------------------ > > > > After some googling around, I wonder if perhaps what I'm thinking of was > an I/O forwarding layer that I understood was being developed for x86_64 > type machines rather than some type of GPFS protocol router or proxy. > > -Aaron > > On 9/11/16 5:02 PM, Knister, Aaron S. (GSFC-606.2)[COMPUTER SCIENCE > CORP] wrote: >> Hi Everyone, >> >> A while back I seem to recall hearing about a mechanism being developed >> that would function similarly to Lustre's LNET routers and effectively >> allow a single set of NSD servers to talk to multiple RDMA fabrics >> without requiring the NSD servers to have infiniband interfaces on each >> RDMA fabric. Rather, one would have a set of GPFS gateway nodes on each >> fabric that would in effect proxy the RDMA requests to the NSD server. >> Does anyone know what I'm talking about? Just curious if it's still on >> the roadmap. >> >> -Aaron >> >> >> _______________________________________________ >> gpfsug-discuss mailing list >> gpfsug-discuss at spectrumscale.org >> http://gpfsug.org/mailman/listinfo/gpfsug-discuss >> > > -- > Aaron Knister > NASA Center for Climate Simulation (Code 606.2) > Goddard Space Flight Center > (301) 286-2776 > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > > > > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > -- Aaron Knister NASA Center for Climate Simulation (Code 606.2) Goddard Space Flight Center (301) 286-2776 _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From aaron.s.knister at nasa.gov Tue Sep 20 15:07:38 2016 From: aaron.s.knister at nasa.gov (Knister, Aaron S. (GSFC-606.2)[COMPUTER SCIENCE CORP]) Date: Tue, 20 Sep 2016 14:07:38 +0000 Subject: [gpfsug-discuss] GPFS Routers References: [gpfsug-discuss] GPFS Routers Message-ID: <5F910253243E6A47B81A9A2EB424BBA101D24844@NDMSMBX404.ndc.nasa.gov> Not sure if this image will go through but here's one I found: [X] The "Routers" are LNET routers. LNET is just the name of lustre's network stack. The LNET routers "route" the Lustre protocol between disparate network types (quadrics, Ethernet, myrinet, carrier pigeon). Packet loss on carrier pigeon is particularly brutal, though. From: Marc A Kaplan Sent: 9/20/16, 10:02 AM To: gpfsug main discussion list Subject: Re: [gpfsug-discuss] GPFS Routers Thanks for spelling out the situation more clearly. This is beyond my knowledge and expertise. But perhaps some other participants on this forum will chime in! I may be missing something, but asking "What is Lustre LNET?" via google does not yield good answers. It would be helpful to have some graphics (pictures!) of typical, useful configurations. Limiting myself to a few minutes of searching, I couldn't find any. I "get" that Lustre users/admin with lots of nodes and several switching fabrics find it useful, but beyond that... I guess the answer will be "Performance!" -- but the obvious question is: Why not "just" use IP - that is the Internetworking Protocol! So rather than sweat over LNET, why not improve IP to work better over several IBs? >From a user/customer point of view where "I needed this yesterday", short of having an "LNET for GPFS", I suggest considering reconfiguring your nodes, switches, storage to get better performance. If you need to buy some more hardware, so be it. --marc From: Aaron Knister To: Date: 09/20/2016 09:23 AM Subject: Re: [gpfsug-discuss] GPFS Routers Sent by: gpfsug-discuss-bounces at spectrumscale.org ________________________________ Hi Marc, Currently we serve three disparate infiniband fabrics with three separate sets of NSD servers all connected via FC to backend storage. I was exploring the idea of flipping that on its head and having one set of NSD servers but would like something akin to Lustre LNET routers to connect each fabric to the back-end NSD servers over IB. I know there's IB routers out there now but I'm quite drawn to the idea of a GPFS equivalent of Lustre LNET routers, having used them in the past. I suppose I could always smush some extra HCAs in the NSD servers and do it that way but that got really ugly when I started factoring in omnipath. Something like an LNET router would also be useful for GNR users who would like to present to both an IB and an OmniPath fabric over RDMA. -Aaron On 9/12/16 10:48 AM, Marc A Kaplan wrote: > Perhaps if you clearly describe what equipment and connections you have > in place and what you're trying to accomplish, someone on this board can > propose a solution. > > In principle, it's always possible to insert proxies/routers to "fake" > any two endpoints into "believing" they are communicating directly. 
> > > > > > From: Aaron Knister > To: > Date: 09/11/2016 08:01 PM > Subject: Re: [gpfsug-discuss] GPFS Routers > Sent by: gpfsug-discuss-bounces at spectrumscale.org > ------------------------------------------------------------------------ > > > > After some googling around, I wonder if perhaps what I'm thinking of was > an I/O forwarding layer that I understood was being developed for x86_64 > type machines rather than some type of GPFS protocol router or proxy. > > -Aaron > > On 9/11/16 5:02 PM, Knister, Aaron S. (GSFC-606.2)[COMPUTER SCIENCE > CORP] wrote: >> Hi Everyone, >> >> A while back I seem to recall hearing about a mechanism being developed >> that would function similarly to Lustre's LNET routers and effectively >> allow a single set of NSD servers to talk to multiple RDMA fabrics >> without requiring the NSD servers to have infiniband interfaces on each >> RDMA fabric. Rather, one would have a set of GPFS gateway nodes on each >> fabric that would in effect proxy the RDMA requests to the NSD server. >> Does anyone know what I'm talking about? Just curious if it's still on >> the roadmap. >> >> -Aaron >> >> >> _______________________________________________ >> gpfsug-discuss mailing list >> gpfsug-discuss at spectrumscale.org >> http://gpfsug.org/mailman/listinfo/gpfsug-discuss >> > > -- > Aaron Knister > NASA Center for Climate Simulation (Code 606.2) > Goddard Space Flight Center > (301) 286-2776 > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > > > > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > -- Aaron Knister NASA Center for Climate Simulation (Code 606.2) Goddard Space Flight Center (301) 286-2776 _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss -------------- next part -------------- An HTML attachment was scrubbed... URL: From aaron.s.knister at nasa.gov Tue Sep 20 15:08:46 2016 From: aaron.s.knister at nasa.gov (Knister, Aaron S. (GSFC-606.2)[COMPUTER SCIENCE CORP]) Date: Tue, 20 Sep 2016 14:08:46 +0000 Subject: [gpfsug-discuss] GPFS Routers References: [gpfsug-discuss] GPFS Routers Message-ID: <5F910253243E6A47B81A9A2EB424BBA101D24881@NDMSMBX404.ndc.nasa.gov> Looks like the attachment got scrubbed. Here's the link http://docplayer.net/docs-images/39/19199001/images/7-0.png[X] From: aaron.s.knister at nasa.gov Sent: 9/20/16, 10:07 AM To: gpfsug main discussion list, gpfsug main discussion list Subject: Re: [gpfsug-discuss] GPFS Routers Not sure if this image will go through but here's one I found: [X] The "Routers" are LNET routers. LNET is just the name of lustre's network stack. The LNET routers "route" the Lustre protocol between disparate network types (quadrics, Ethernet, myrinet, carrier pigeon). Packet loss on carrier pigeon is particularly brutal, though. From: Marc A Kaplan Sent: 9/20/16, 10:02 AM To: gpfsug main discussion list Subject: Re: [gpfsug-discuss] GPFS Routers Thanks for spelling out the situation more clearly. This is beyond my knowledge and expertise. But perhaps some other participants on this forum will chime in! I may be missing something, but asking "What is Lustre LNET?" via google does not yield good answers. It would be helpful to have some graphics (pictures!) 
of typical, useful configurations. Limiting myself to a few minutes of searching, I couldn't find any. I "get" that Lustre users/admin with lots of nodes and several switching fabrics find it useful, but beyond that... I guess the answer will be "Performance!" -- but the obvious question is: Why not "just" use IP - that is the Internetworking Protocol! So rather than sweat over LNET, why not improve IP to work better over several IBs? >From a user/customer point of view where "I needed this yesterday", short of having an "LNET for GPFS", I suggest considering reconfiguring your nodes, switches, storage to get better performance. If you need to buy some more hardware, so be it. --marc From: Aaron Knister To: Date: 09/20/2016 09:23 AM Subject: Re: [gpfsug-discuss] GPFS Routers Sent by: gpfsug-discuss-bounces at spectrumscale.org ________________________________ Hi Marc, Currently we serve three disparate infiniband fabrics with three separate sets of NSD servers all connected via FC to backend storage. I was exploring the idea of flipping that on its head and having one set of NSD servers but would like something akin to Lustre LNET routers to connect each fabric to the back-end NSD servers over IB. I know there's IB routers out there now but I'm quite drawn to the idea of a GPFS equivalent of Lustre LNET routers, having used them in the past. I suppose I could always smush some extra HCAs in the NSD servers and do it that way but that got really ugly when I started factoring in omnipath. Something like an LNET router would also be useful for GNR users who would like to present to both an IB and an OmniPath fabric over RDMA. -Aaron On 9/12/16 10:48 AM, Marc A Kaplan wrote: > Perhaps if you clearly describe what equipment and connections you have > in place and what you're trying to accomplish, someone on this board can > propose a solution. > > In principle, it's always possible to insert proxies/routers to "fake" > any two endpoints into "believing" they are communicating directly. > > > > > > From: Aaron Knister > To: > Date: 09/11/2016 08:01 PM > Subject: Re: [gpfsug-discuss] GPFS Routers > Sent by: gpfsug-discuss-bounces at spectrumscale.org > ------------------------------------------------------------------------ > > > > After some googling around, I wonder if perhaps what I'm thinking of was > an I/O forwarding layer that I understood was being developed for x86_64 > type machines rather than some type of GPFS protocol router or proxy. > > -Aaron > > On 9/11/16 5:02 PM, Knister, Aaron S. (GSFC-606.2)[COMPUTER SCIENCE > CORP] wrote: >> Hi Everyone, >> >> A while back I seem to recall hearing about a mechanism being developed >> that would function similarly to Lustre's LNET routers and effectively >> allow a single set of NSD servers to talk to multiple RDMA fabrics >> without requiring the NSD servers to have infiniband interfaces on each >> RDMA fabric. Rather, one would have a set of GPFS gateway nodes on each >> fabric that would in effect proxy the RDMA requests to the NSD server. >> Does anyone know what I'm talking about? Just curious if it's still on >> the roadmap. 
>> >> -Aaron >> >> >> _______________________________________________ >> gpfsug-discuss mailing list >> gpfsug-discuss at spectrumscale.org >> http://gpfsug.org/mailman/listinfo/gpfsug-discuss >> > > -- > Aaron Knister > NASA Center for Climate Simulation (Code 606.2) > Goddard Space Flight Center > (301) 286-2776 > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > > > > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > -- Aaron Knister NASA Center for Climate Simulation (Code 606.2) Goddard Space Flight Center (301) 286-2776 _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss -------------- next part -------------- An HTML attachment was scrubbed... URL: From makaplan at us.ibm.com Tue Sep 20 15:30:43 2016 From: makaplan at us.ibm.com (Marc A Kaplan) Date: Tue, 20 Sep 2016 10:30:43 -0400 Subject: [gpfsug-discuss] GPFS Routers In-Reply-To: <5F910253243E6A47B81A9A2EB424BBA101D24881@NDMSMBX404.ndc.nasa.gov> References: [gpfsug-discuss] GPFS Routers <5F910253243E6A47B81A9A2EB424BBA101D24881@NDMSMBX404.ndc.nasa.gov> Message-ID: Thanks. That example is simpler than I imagined. Question: If that was indeed your situation and you could afford it, why not just go totally with infiniband switching/routing? Are not the routers just a hack to connect Intel OPA to IB? Ref: https://community.mellanox.com/docs/DOC-2384#jive_content_id_Network_Topology_Design -------------- next part -------------- An HTML attachment was scrubbed... URL: From xhejtman at ics.muni.cz Tue Sep 20 16:07:12 2016 From: xhejtman at ics.muni.cz (Lukas Hejtmanek) Date: Tue, 20 Sep 2016 17:07:12 +0200 Subject: [gpfsug-discuss] CES and nfs pseudo root Message-ID: <20160920150712.2v73hsf7pzrqb3g4@ics.muni.cz> Hello, ganesha allows to specify pseudo root for each export using Pseudo="path". mmnfs export sets pseudo path the same as export dir, e.g., I want to export /mnt/nfs, Pseudo is set to '/mnt/nfs' as well. Can I set somehow Pseudo to '/'? -- Luk?? Hejtm?nek From stef.coene at docum.org Tue Sep 20 18:42:57 2016 From: stef.coene at docum.org (Stef Coene) Date: Tue, 20 Sep 2016 19:42:57 +0200 Subject: [gpfsug-discuss] Ubuntu client Message-ID: <5985d614-ebe5-8c85-ec4b-02961e074502@docum.org> Hi, I just installed 4.2.1 on 2 RHEL 7.2 servers without any issue. But I also need 2 clients on Ubuntu 14.04. I installed the GPFS client on the Ubuntu server and used mmbuildgpl to build the required kernel modules. ssh keys are exchanged between GPFS servers and the client. But I can't add the node: [root at gpfs01 ~]# mmaddnode -N client1 Tue Sep 20 19:40:09 CEST 2016: mmaddnode: Processing node client1 mmremote: The CCR environment could not be initialized on node client1. mmaddnode: The CCR environment could not be initialized on node client1. mmaddnode: mmaddnode quitting. None of the specified nodes are valid. mmaddnode: Command failed. Examine previous error messages to determine cause. I don't see any error in /var/mmfs on client and server. What can I try to debug this error? 
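A quick sanity check for this kind of mmaddnode/CCR failure is name resolution and passwordless root ssh in both directions before anything else. A minimal sketch, reusing the node names gpfs01 and client1 from this thread (adjust for your own hosts):

# forward resolution of the new client from an existing cluster node, and of the cluster node from the client
getent hosts client1
ssh client1 'getent hosts client1; getent hosts gpfs01'

# passwordless root ssh has to work in both directions for mmaddnode and CCR
ssh client1 /bin/true && echo "gpfs01 -> client1 OK"
ssh client1 "ssh gpfs01 /bin/true && echo 'client1 -> gpfs01 OK'"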
Stef From stef.coene at docum.org Tue Sep 20 18:47:47 2016 From: stef.coene at docum.org (Stef Coene) Date: Tue, 20 Sep 2016 19:47:47 +0200 Subject: [gpfsug-discuss] Ubuntu client In-Reply-To: <5985d614-ebe5-8c85-ec4b-02961e074502@docum.org> References: <5985d614-ebe5-8c85-ec4b-02961e074502@docum.org> Message-ID: <3727524d-aa94-a09e-ebf7-a5d4e1c6f301@docum.org> On 09/20/2016 07:42 PM, Stef Coene wrote: > Hi, > > I just installed 4.2.1 on 2 RHEL 7.2 servers without any issue. > But I also need 2 clients on Ubuntu 14.04. > I installed the GPFS client on the Ubuntu server and used mmbuildgpl to > build the required kernel modules. > ssh keys are exchanged between GPFS servers and the client. > > But I can't add the node: > [root at gpfs01 ~]# mmaddnode -N client1 > Tue Sep 20 19:40:09 CEST 2016: mmaddnode: Processing node client1 > mmremote: The CCR environment could not be initialized on node client1. > mmaddnode: The CCR environment could not be initialized on node client1. > mmaddnode: mmaddnode quitting. None of the specified nodes are valid. > mmaddnode: Command failed. Examine previous error messages to determine > cause. > > I don't see any error in /var/mmfs on client and server. > > What can I try to debug this error? Pfff, problem solved. I tailed the logs in /var/adm/ras and found out there was a type in /etc/hosts so the hostname of the client was unresolvable. Stef From YARD at il.ibm.com Tue Sep 20 20:03:39 2016 From: YARD at il.ibm.com (Yaron Daniel) Date: Tue, 20 Sep 2016 22:03:39 +0300 Subject: [gpfsug-discuss] Ubuntu client In-Reply-To: <5985d614-ebe5-8c85-ec4b-02961e074502@docum.org> References: <5985d614-ebe5-8c85-ec4b-02961e074502@docum.org> Message-ID: Hi Check that kernel symbols are installed too Regards Yaron Daniel 94 Em Ha'Moshavot Rd Server, Storage and Data Services - Team Leader Petach Tiqva, 49527 Global Technology Services Israel Phone: +972-3-916-5672 Fax: +972-3-916-5672 Mobile: +972-52-8395593 e-mail: yard at il.ibm.com IBM Israel From: Stef Coene To: gpfsug main discussion list Date: 09/20/2016 08:43 PM Subject: [gpfsug-discuss] Ubuntu client Sent by: gpfsug-discuss-bounces at spectrumscale.org Hi, I just installed 4.2.1 on 2 RHEL 7.2 servers without any issue. But I also need 2 clients on Ubuntu 14.04. I installed the GPFS client on the Ubuntu server and used mmbuildgpl to build the required kernel modules. ssh keys are exchanged between GPFS servers and the client. But I can't add the node: [root at gpfs01 ~]# mmaddnode -N client1 Tue Sep 20 19:40:09 CEST 2016: mmaddnode: Processing node client1 mmremote: The CCR environment could not be initialized on node client1. mmaddnode: The CCR environment could not be initialized on node client1. mmaddnode: mmaddnode quitting. None of the specified nodes are valid. mmaddnode: Command failed. Examine previous error messages to determine cause. I don't see any error in /var/mmfs on client and server. What can I try to debug this error? Stef _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... 
Name: not available Type: image/gif Size: 1851 bytes Desc: not available URL: From olaf.weiser at de.ibm.com Wed Sep 21 04:35:57 2016 From: olaf.weiser at de.ibm.com (Olaf Weiser) Date: Wed, 21 Sep 2016 05:35:57 +0200 Subject: [gpfsug-discuss] Ubuntu client In-Reply-To: References: <5985d614-ebe5-8c85-ec4b-02961e074502@docum.org> Message-ID: An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: not available Type: image/gif Size: 1851 bytes Desc: not available URL: From stef.coene at docum.org Wed Sep 21 07:03:05 2016 From: stef.coene at docum.org (Stef Coene) Date: Wed, 21 Sep 2016 08:03:05 +0200 Subject: [gpfsug-discuss] Ubuntu client In-Reply-To: References: <5985d614-ebe5-8c85-ec4b-02961e074502@docum.org> Message-ID: <01a37d7a-b5ef-cb3e-5ccb-d5f942df6487@docum.org> On 09/21/2016 05:35 AM, Olaf Weiser wrote: > CCR issues are often related to DNS issues, so check, that you Ubuntu > nodes can resolve the existing nodes accordingly and vise versa > in one line: .. all nodes must be resolvable on every node It was a type in the hostname and /etc/hosts. So problem solved. Stef From xhejtman at ics.muni.cz Wed Sep 21 20:09:32 2016 From: xhejtman at ics.muni.cz (Lukas Hejtmanek) Date: Wed, 21 Sep 2016 21:09:32 +0200 Subject: [gpfsug-discuss] CES NFS with Kerberos Message-ID: <20160921190932.fibmmccccs5kit6x@ics.muni.cz> Hello, does nfs server (ganesha) work for someone with Kerberos authentication? I got random permission denied: :/mnt/nfs-test/tmp# for i in `seq 1 20`; do rm testf; dd if=/dev/zero of=testf bs=1M count=100000; done 100000+0 records in 100000+0 records out 104857600000 bytes (105 GB) copied, 642.849 s, 163 MB/s 100000+0 records in 100000+0 records out 104857600000 bytes (105 GB) copied, 925.326 s, 113 MB/s 100000+0 records in 100000+0 records out 104857600000 bytes (105 GB) copied, 762.749 s, 137 MB/s 100000+0 records in 100000+0 records out 104857600000 bytes (105 GB) copied, 860.608 s, 122 MB/s 100000+0 records in 100000+0 records out 104857600000 bytes (105 GB) copied, 788.62 s, 133 MB/s dd: error writing ?testf?: Permission denied 51949+0 records in 51948+0 records out 54471426048 bytes (54 GB) copied, 566.667 s, 96.1 MB/s 100000+0 records in 100000+0 records out 104857600000 bytes (105 GB) copied, 1082.63 s, 96.9 MB/s 100000+0 records in 100000+0 records out 104857600000 bytes (105 GB) copied, 1080.65 s, 97.0 MB/s 100000+0 records in 100000+0 records out 104857600000 bytes (105 GB) copied, 949.683 s, 110 MB/s dd: error writing ?testf?: Permission denied 30076+0 records in 30075+0 records out 31535923200 bytes (32 GB) copied, 308.009 s, 102 MB/s dd: error writing ?testf?: Permission denied 89837+0 records in 89836+0 records out 94199873536 bytes (94 GB) copied, 976.368 s, 96.5 MB/s It seems that it is a bug in ganesha: http://permalink.gmane.org/gmane.comp.file-systems.nfs.ganesha.devel/2000 but it is still not resolved. -- Luk?? Hejtm?nek From Greg.Lehmann at csiro.au Wed Sep 21 23:34:09 2016 From: Greg.Lehmann at csiro.au (Greg.Lehmann at csiro.au) Date: Wed, 21 Sep 2016 22:34:09 +0000 Subject: [gpfsug-discuss] CES NFS with Kerberos In-Reply-To: <20160921190932.fibmmccccs5kit6x@ics.muni.cz> References: <20160921190932.fibmmccccs5kit6x@ics.muni.cz> Message-ID: <94bedafe10a9473c93b0bcc5d34cbea6@exch1-cdc.nexus.csiro.au> It may not be NFS. Check your GPFS logs too. 
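For a reproducible case like the dd loop above, a few quick checks on the CES protocol node while the copy is running can help narrow down which layer is failing. A sketch, assuming the usual default log locations (they may differ on your install):

# GPFS daemon log on the protocol node serving the client
tail -f /var/adm/ras/mmfs.log.latest

# Ganesha log (default location on CES protocol nodes)
tail -f /var/log/ganesha.log

# confirm the export definition and its options
mmnfs export list

# on the NFS client: do the permission-denied errors line up with Kerberos
# ticket or context expiry during the long-running writes?
klist -e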
-----Original Message----- From: gpfsug-discuss-bounces at spectrumscale.org [mailto:gpfsug-discuss-bounces at spectrumscale.org] On Behalf Of Lukas Hejtmanek Sent: Thursday, 22 September 2016 5:10 AM To: gpfsug-discuss at spectrumscale.org Subject: [gpfsug-discuss] CES NFS with Kerberos Hello, does nfs server (ganesha) work for someone with Kerberos authentication? I got random permission denied: :/mnt/nfs-test/tmp# for i in `seq 1 20`; do rm testf; dd if=/dev/zero of=testf bs=1M count=100000; done 100000+0 records in 100000+0 records out 104857600000 bytes (105 GB) copied, 642.849 s, 163 MB/s 100000+0 records in 100000+0 records out 104857600000 bytes (105 GB) copied, 925.326 s, 113 MB/s 100000+0 records in 100000+0 records out 104857600000 bytes (105 GB) copied, 762.749 s, 137 MB/s 100000+0 records in 100000+0 records out 104857600000 bytes (105 GB) copied, 860.608 s, 122 MB/s 100000+0 records in 100000+0 records out 104857600000 bytes (105 GB) copied, 788.62 s, 133 MB/s dd: error writing ?testf?: Permission denied 51949+0 records in 51948+0 records out 54471426048 bytes (54 GB) copied, 566.667 s, 96.1 MB/s 100000+0 records in 100000+0 records out 104857600000 bytes (105 GB) copied, 1082.63 s, 96.9 MB/s 100000+0 records in 100000+0 records out 104857600000 bytes (105 GB) copied, 1080.65 s, 97.0 MB/s 100000+0 records in 100000+0 records out 104857600000 bytes (105 GB) copied, 949.683 s, 110 MB/s dd: error writing ?testf?: Permission denied 30076+0 records in 30075+0 records out 31535923200 bytes (32 GB) copied, 308.009 s, 102 MB/s dd: error writing ?testf?: Permission denied 89837+0 records in 89836+0 records out 94199873536 bytes (94 GB) copied, 976.368 s, 96.5 MB/s It seems that it is a bug in ganesha: http://permalink.gmane.org/gmane.comp.file-systems.nfs.ganesha.devel/2000 but it is still not resolved. -- Luk?? Hejtm?nek _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss From xhejtman at ics.muni.cz Thu Sep 22 09:25:09 2016 From: xhejtman at ics.muni.cz (Lukas Hejtmanek) Date: Thu, 22 Sep 2016 10:25:09 +0200 Subject: [gpfsug-discuss] CES NFS with Kerberos In-Reply-To: <94bedafe10a9473c93b0bcc5d34cbea6@exch1-cdc.nexus.csiro.au> References: <20160921190932.fibmmccccs5kit6x@ics.muni.cz> <94bedafe10a9473c93b0bcc5d34cbea6@exch1-cdc.nexus.csiro.au> Message-ID: <20160922082509.rc53tseeovjnixtz@ics.muni.cz> Hello, thanks, I do not see any error in GPFS logs. The link, I posted below is not related to GPFS at all, it seems that it is bug in ganesha. On Wed, Sep 21, 2016 at 10:34:09PM +0000, Greg.Lehmann at csiro.au wrote: > It may not be NFS. Check your GPFS logs too. > > -----Original Message----- > From: gpfsug-discuss-bounces at spectrumscale.org [mailto:gpfsug-discuss-bounces at spectrumscale.org] On Behalf Of Lukas Hejtmanek > Sent: Thursday, 22 September 2016 5:10 AM > To: gpfsug-discuss at spectrumscale.org > Subject: [gpfsug-discuss] CES NFS with Kerberos > > Hello, > > does nfs server (ganesha) work for someone with Kerberos authentication? 
> > I got random permission denied: > :/mnt/nfs-test/tmp# for i in `seq 1 20`; do rm testf; dd if=/dev/zero of=testf bs=1M count=100000; done > 100000+0 records in > 100000+0 records out > 104857600000 bytes (105 GB) copied, 642.849 s, 163 MB/s > 100000+0 records in > 100000+0 records out > 104857600000 bytes (105 GB) copied, 925.326 s, 113 MB/s > 100000+0 records in > 100000+0 records out > 104857600000 bytes (105 GB) copied, 762.749 s, 137 MB/s > 100000+0 records in > 100000+0 records out > 104857600000 bytes (105 GB) copied, 860.608 s, 122 MB/s > 100000+0 records in > 100000+0 records out > 104857600000 bytes (105 GB) copied, 788.62 s, 133 MB/s > dd: error writing ?testf?: Permission denied > 51949+0 records in > 51948+0 records out > 54471426048 bytes (54 GB) copied, 566.667 s, 96.1 MB/s > 100000+0 records in > 100000+0 records out > 104857600000 bytes (105 GB) copied, 1082.63 s, 96.9 MB/s > 100000+0 records in > 100000+0 records out > 104857600000 bytes (105 GB) copied, 1080.65 s, 97.0 MB/s > 100000+0 records in > 100000+0 records out > 104857600000 bytes (105 GB) copied, 949.683 s, 110 MB/s > dd: error writing ?testf?: Permission denied > 30076+0 records in > 30075+0 records out > 31535923200 bytes (32 GB) copied, 308.009 s, 102 MB/s > dd: error writing ?testf?: Permission denied > 89837+0 records in > 89836+0 records out > 94199873536 bytes (94 GB) copied, 976.368 s, 96.5 MB/s > > It seems that it is a bug in ganesha: > http://permalink.gmane.org/gmane.comp.file-systems.nfs.ganesha.devel/2000 > > but it is still not resolved. > > -- > Luk?? Hejtm?nek > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss -- Luk?? Hejtm?nek From stef.coene at docum.org Thu Sep 22 19:36:48 2016 From: stef.coene at docum.org (Stef Coene) Date: Thu, 22 Sep 2016 20:36:48 +0200 Subject: [gpfsug-discuss] Blocksize Message-ID: <4890d1b0-2f12-f8b6-684d-c98ca2b71ab7@docum.org> Hi, Is it needed to specify a different blocksize for the system pool that holds the metadata? IBM recommends a 1 MB blocksize for the file system. But I wonder a smaller blocksize (256 KB or so) for metadata is a good idea or not... Stef From eric.wonderley at vt.edu Thu Sep 22 20:07:30 2016 From: eric.wonderley at vt.edu (J. Eric Wonderley) Date: Thu, 22 Sep 2016 15:07:30 -0400 Subject: [gpfsug-discuss] Blocksize In-Reply-To: <4890d1b0-2f12-f8b6-684d-c98ca2b71ab7@docum.org> References: <4890d1b0-2f12-f8b6-684d-c98ca2b71ab7@docum.org> Message-ID: It defaults to 4k: mmlsfs testbs8M -i flag value description ------------------- ------------------------ ----------------------------------- -i 4096 Inode size in bytes I think you can make as small as 512b. Gpfs will store very small files in the inode. Typically you want your average file size to be your blocksize and your filesystem has one blocksize and one inodesize. On Thu, Sep 22, 2016 at 2:36 PM, Stef Coene wrote: > Hi, > > Is it needed to specify a different blocksize for the system pool that > holds the metadata? > > IBM recommends a 1 MB blocksize for the file system. > But I wonder a smaller blocksize (256 KB or so) for metadata is a good > idea or not... 
> > > Stef > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > -------------- next part -------------- An HTML attachment was scrubbed... URL: From ewahl at osc.edu Thu Sep 22 20:19:00 2016 From: ewahl at osc.edu (Wahl, Edward) Date: Thu, 22 Sep 2016 19:19:00 +0000 Subject: [gpfsug-discuss] Blocksize In-Reply-To: <4890d1b0-2f12-f8b6-684d-c98ca2b71ab7@docum.org> References: <4890d1b0-2f12-f8b6-684d-c98ca2b71ab7@docum.org> Message-ID: <9DA9EC7A281AC7428A9618AFDC49049958EFBB06@CIO-KRC-D1MBX02.osuad.osu.edu> This is a great idea. However there are quite a few other things to consider: -max file count? If you need say a couple of billion files, this will affect things. -wish to store small files in the system pool in late model SS/GPFS? -encryption? No data will be stored in the system pool so large blocks for small file storage in system is pointless. -system pool replication? -HDD vs SSD for system pool? -xxD or array tuning recommendations from your vendor? -streaming vs random IO? Do you have a single dedicated app that has performance like xxx? -probably more I can't think of off the top of my head. etc etc Ed ________________________________________ From: gpfsug-discuss-bounces at spectrumscale.org [gpfsug-discuss-bounces at spectrumscale.org] on behalf of Stef Coene [stef.coene at docum.org] Sent: Thursday, September 22, 2016 2:36 PM To: gpfsug main discussion list Subject: [gpfsug-discuss] Blocksize Hi, Is it needed to specify a different blocksize for the system pool that holds the metadata? IBM recommends a 1 MB blocksize for the file system. But I wonder a smaller blocksize (256 KB or so) for metadata is a good idea or not... Stef _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss From janfrode at tanso.net Thu Sep 22 20:25:03 2016 From: janfrode at tanso.net (Jan-Frode Myklebust) Date: Thu, 22 Sep 2016 21:25:03 +0200 Subject: [gpfsug-discuss] Blocksize In-Reply-To: <4890d1b0-2f12-f8b6-684d-c98ca2b71ab7@docum.org> References: <4890d1b0-2f12-f8b6-684d-c98ca2b71ab7@docum.org> Message-ID: https://www.ibm.com/developerworks/community/forums/html/topic?id=77777777-0000-0000-0000-000014774266 "Use 256K. Anything smaller makes allocation blocks for the inode file inefficient. Anything larger wastes space for directories. These are the two largest consumers of metadata space." --dlmcnabb A bit old, but I would assume it still applies. -jf On Thu, Sep 22, 2016 at 8:36 PM, Stef Coene wrote: > Hi, > > Is it needed to specify a different blocksize for the system pool that > holds the metadata? > > IBM recommends a 1 MB blocksize for the file system. > But I wonder a smaller blocksize (256 KB or so) for metadata is a good > idea or not... > > > Stef > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > -------------- next part -------------- An HTML attachment was scrubbed... URL: From stef.coene at docum.org Thu Sep 22 20:29:43 2016 From: stef.coene at docum.org (Stef Coene) Date: Thu, 22 Sep 2016 21:29:43 +0200 Subject: [gpfsug-discuss] Blocksize In-Reply-To: References: <4890d1b0-2f12-f8b6-684d-c98ca2b71ab7@docum.org> Message-ID: On 09/22/2016 09:07 PM, J. 
Eric Wonderley wrote: > It defaults to 4k: > mmlsfs testbs8M -i > flag value description > ------------------- ------------------------ > ----------------------------------- > -i 4096 Inode size in bytes > > I think you can make as small as 512b. Gpfs will store very small > files in the inode. > > Typically you want your average file size to be your blocksize and your > filesystem has one blocksize and one inodesize. The files are not small, but around 20 MB on average. So I calculated with IBM that a 1 MB or 2 MB block size is best. But I'm not sure if it's better to use a smaller block size for the metadata. The file system is not that large (400 TB) and will hold backup data from CommVault. Stef From luis.bolinches at fi.ibm.com Thu Sep 22 20:37:02 2016 From: luis.bolinches at fi.ibm.com (Luis Bolinches) Date: Thu, 22 Sep 2016 19:37:02 +0000 Subject: [gpfsug-discuss] Blocksize In-Reply-To: References: , <4890d1b0-2f12-f8b6-684d-c98ca2b71ab7@docum.org> Message-ID: An HTML attachment was scrubbed... URL: From S.J.Thompson at bham.ac.uk Thu Sep 22 21:02:24 2016 From: S.J.Thompson at bham.ac.uk (Simon Thompson (Research Computing - IT Services)) Date: Thu, 22 Sep 2016 20:02:24 +0000 Subject: [gpfsug-discuss] SSUG Meet the Devs - Cloud Workshop In-Reply-To: References: Message-ID: We are down to our last few places, so if you do intend to attend, I encourage you to register now! Simon ________________________________ From: gpfsug-discuss-bounces at spectrumscale.org [gpfsug-discuss-bounces at spectrumscale.org] on behalf of Secretary GPFS UG [secretary at gpfsug.org] Sent: 15 September 2016 09:42 To: gpfsug main discussion list Subject: [gpfsug-discuss] SSUG Meet the Devs - Cloud Workshop Hi everyone, Back by popular demand! We are holding a UK 'Meet the Developers' event to focus on Cloud topics. We are very lucky to have Dean Hildebrand, Master Inventor, Cloud Storage Software from IBM over in the UK to lead this session. IS IT FOR ME? Slightly different to the past meet the devs format, this is a cloud workshop aimed at looking how Spectrum Scale fits in the world of cloud. Rather than being a series of presentations and discussions led by IBM, this workshop aims to look at how Spectrum Scale can be used in cloud environments. This will include using Spectrum Scale as an infrastructure tool to power private cloud deployments. We will also look at the challenges of accessing data from cloud deployments and discuss ways in which this might be accomplished. If you are currently deploying OpenStack on Spectrum Scale, or plan to in the near future, then this workshop is for you. Also if you currently have Spectrum Scale and are wondering how you might get that data into cloud-enabled workloads or are currently doing so, then again you should attend. To ensure that the workshop is focused, numbers are limited and we will initially be limiting to 2 people per organisation/project/site. WHAT WILL BE DISCUSSED? Our topics for the day will include on-premise (private) clouds, on-premise self-service (public) clouds, off-premise clouds (Amazon etc.) as well as covering technologies including OpenStack, Docker, Kubernetes and security requirements around multi-tenancy. We probably don't have all the answers for these, but we'd like to understand the requirements and hear people's ideas. Please let us know what you would like to discuss when you register. Arrival is from 10:00 with discussion kicking off from 10:30. The agenda is open discussion though we do aim to talk over a number of key topics. 
We hope to have the ever popular (though usually late!) pizza for lunch. WHEN Thursday 20th October 2016 from 10:00 AM to 3:30 PM WHERE IT Services, University of Birmingham - Elms Road Edgbaston, Birmingham, B15 2TT REGISTER Please register for the event in advance: https://www.eventbrite.com/e/ssug-meet-the-devs-cloud-workshop-tickets-27725390389 Numbers are limited and we will initially be limiting to 2 people per organisation/project/site. We look forward to seeing you there! -- Claire O'Toole Spectrum Scale/GPFS User Group Secretary +44 (0)7508 033896 www.spectrumscaleug.org
-------------- next part -------------- An HTML attachment was scrubbed... URL:

From makaplan at us.ibm.com Thu Sep 22 21:25:10 2016 From: makaplan at us.ibm.com (Marc A Kaplan) Date: Thu, 22 Sep 2016 16:25:10 -0400 Subject: [gpfsug-discuss] Blocksize and space and performance for Metadata, release 4.2.x In-Reply-To: References: <4890d1b0-2f12-f8b6-684d-c98ca2b71ab7@docum.org> Message-ID:

There have been a few changes over the years that may invalidate some of the old advice about metadata and disk allocations there for. These have been phased in over the last few years, I am discussing the present situation for release 4.2.x

1) Inode size. Used to be 512. Now you can set the inodesize at mmcrfs time. Defaults to 4096.
2) Data in inode. If it fits, then the inode holds the data. Since a 512 byte inode still works, you can have more than 3.5KB of data in a 4KB inode.
3) Extended Attributes in Inode. Again, if it fits... Extended attributes used to be stored in a separate file of metadata. So extended attributes performance is way better than the old days.
4) (small) Directories in Inode. If it fits, the inode of a directory can hold the directory entries. That gives you about 2x performance on directory reads, for smallish directories.
5) Big directory blocks. Directories used to use a maximum of 32KB per block, potentially wasting a lot of space and yielding poor performance for large directories. Now directory blocks are the lesser of metadata-blocksize and 256KB.
6) Big directories are shrinkable. Used to be directories would grow in 32KB chunks but never shrink. Yup, even an almost(?) "empty" directory would remain the size the directory had to be at its lifetime maximum. That means just a few remaining entries could be "sprinkled" over many directory blocks. (See also 5.) But now directories autoshrink to avoid wasteful sparsity. Last I looked, the implementation just stopped short of "pushing" tiny directories back into the inode. But a huge directory can be shrunk down to a single (meta)data block. (See --compact in the docs.)

--marc of GPFS
-------------- next part -------------- An HTML attachment was scrubbed... URL:

From volobuev at us.ibm.com Thu Sep 22 21:49:32 2016 From: volobuev at us.ibm.com (Yuri L Volobuev) Date: Thu, 22 Sep 2016 13:49:32 -0700 Subject: [gpfsug-discuss] Blocksize In-Reply-To: References: <4890d1b0-2f12-f8b6-684d-c98ca2b71ab7@docum.org> Message-ID:

The current (V4.2+) levels of code support bigger directory block sizes, so it's no longer an issue with something like 1M metadata block size. In fact, there isn't a whole lot of difference between 256K and 1M metadata block sizes, either would work fine. There isn't really a downside in selecting a different block size for metadata though. Inode size (mmcrfs -i option) is orthogonal to the metadata block size selection. We do strongly recommend using 4K inodes to anyone. There's the obvious downside of needing more metadata storage for inodes, but the advantages are significant.
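To make that concrete for a case like the CommVault file system discussed above (~20 MB average files), a minimal mmcrfs sketch with a 2M data block size, 256K metadata block size and 4K inodes might look like the following. The device name, NSD names, stanza file and replication settings are illustrative placeholders, not a recommendation for any particular storage:

# backupfs.stanza -- metadata NSDs in the system pool, data NSDs in a separate pool
%nsd: nsd=md01 usage=metadataOnly failureGroup=1 pool=system
%nsd: nsd=da01 usage=dataOnly failureGroup=2 pool=data

mmcrfs backupfs -F backupfs.stanza -B 2M --metadata-block-size 256K -i 4096 -m 1 -M 2 -r 1 -R 2

# a default placement rule (installed with mmchpolicy) is still needed to land file data in the 'data' pool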
yuri From: Jan-Frode Myklebust To: gpfsug main discussion list , Date: 09/22/2016 12:25 PM Subject: Re: [gpfsug-discuss] Blocksize Sent by: gpfsug-discuss-bounces at spectrumscale.org https://www.ibm.com/developerworks/community/forums/html/topic?id=77777777-0000-0000-0000-000014774266 "Use 256K. Anything smaller makes allocation blocks for the inode file inefficient. Anything larger wastes space for directories. These are the two largest consumers of metadata space." --dlmcnabb A bit old, but I would assume it still applies. ? -jf On Thu, Sep 22, 2016 at 8:36 PM, Stef Coene wrote: Hi, Is it needed to specify a different blocksize for the system pool that holds the metadata? IBM recommends a 1 MB blocksize for the file system. But I wonder a smaller blocksize (256 KB or so) for metadata is a good idea or not... Stef _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: graycol.gif Type: image/gif Size: 105 bytes Desc: not available URL: From Mark.Bush at siriuscom.com Fri Sep 23 02:48:44 2016 From: Mark.Bush at siriuscom.com (Mark.Bush at siriuscom.com) Date: Fri, 23 Sep 2016 01:48:44 +0000 Subject: [gpfsug-discuss] Learn a new cluster Message-ID: <7E08F93E-F018-4C00-A78E-71EFDBEAC87C@siriuscom.com> What commands would you run to learn all you need to know about a cluster you?ve never seen before? Captain Obvious (me) says: mmlscluster mmlsconfig mmlsnode mmlsnsd mmlsfs all What others? Mark R. Bush | Solutions Architect This message (including any attachments) is intended only for the use of the individual or entity to which it is addressed and may contain information that is non-public, proprietary, privileged, confidential, and exempt from disclosure under applicable law. If you are not the intended recipient, you are hereby notified that any use, dissemination, distribution, or copying of this communication is strictly prohibited. This message may be viewed by parties at Sirius Computer Solutions other than those named in the message header. This message does not contain an official representation of Sirius Computer Solutions. If you have received this communication in error, notify Sirius Computer Solutions immediately and (i) destroy this message if a facsimile or (ii) delete this message immediately if this is an electronic communication. Thank you. Sirius Computer Solutions -------------- next part -------------- An HTML attachment was scrubbed... URL: From aaron.s.knister at nasa.gov Fri Sep 23 02:50:52 2016 From: aaron.s.knister at nasa.gov (Aaron Knister) Date: Thu, 22 Sep 2016 21:50:52 -0400 Subject: [gpfsug-discuss] Learn a new cluster In-Reply-To: <7E08F93E-F018-4C00-A78E-71EFDBEAC87C@siriuscom.com> References: <7E08F93E-F018-4C00-A78E-71EFDBEAC87C@siriuscom.com> Message-ID: <1ff27b42-aead-fafa-5415-520334d299c1@nasa.gov> Perhaps a gpfs.snap? This could tell you a *lot* about a cluster. -Aaron On 9/22/16 9:48 PM, Mark.Bush at siriuscom.com wrote: > What commands would you run to learn all you need to know about a > cluster you?ve never seen before? 
> > Captain Obvious (me) says: > > mmlscluster > > mmlsconfig > > mmlsnode > > mmlsnsd > > mmlsfs all > > > > What others? > > > > > > Mark R. Bush | Solutions Architect > > > > This message (including any attachments) is intended only for the use of > the individual or entity to which it is addressed and may contain > information that is non-public, proprietary, privileged, confidential, > and exempt from disclosure under applicable law. If you are not the > intended recipient, you are hereby notified that any use, dissemination, > distribution, or copying of this communication is strictly prohibited. > This message may be viewed by parties at Sirius Computer Solutions other > than those named in the message header. This message does not contain an > official representation of Sirius Computer Solutions. If you have > received this communication in error, notify Sirius Computer Solutions > immediately and (i) destroy this message if a facsimile or (ii) delete > this message immediately if this is an electronic communication. Thank you. > > Sirius Computer Solutions > > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > -- Aaron Knister NASA Center for Climate Simulation (Code 606.2) Goddard Space Flight Center (301) 286-2776 From Greg.Lehmann at csiro.au Fri Sep 23 02:53:14 2016 From: Greg.Lehmann at csiro.au (Greg.Lehmann at csiro.au) Date: Fri, 23 Sep 2016 01:53:14 +0000 Subject: [gpfsug-discuss] Learn a new cluster In-Reply-To: <7E08F93E-F018-4C00-A78E-71EFDBEAC87C@siriuscom.com> References: <7E08F93E-F018-4C00-A78E-71EFDBEAC87C@siriuscom.com> Message-ID: <40b22b40d6ed4e38be115e9f6ae8d48d@exch1-cdc.nexus.csiro.au> Nice question. I?d also look at the non-GPFS settings IBM recommend in various places like the FAQ for things like ssh, network, etc. The importance of these is variable depending on cluster size/network configuration etc. Cheers, Greg From: gpfsug-discuss-bounces at spectrumscale.org [mailto:gpfsug-discuss-bounces at spectrumscale.org] On Behalf Of Mark.Bush at siriuscom.com Sent: Friday, 23 September 2016 11:49 AM To: gpfsug main discussion list Subject: [gpfsug-discuss] Learn a new cluster What commands would you run to learn all you need to know about a cluster you?ve never seen before? Captain Obvious (me) says: mmlscluster mmlsconfig mmlsnode mmlsnsd mmlsfs all What others? Mark R. Bush | Solutions Architect This message (including any attachments) is intended only for the use of the individual or entity to which it is addressed and may contain information that is non-public, proprietary, privileged, confidential, and exempt from disclosure under applicable law. If you are not the intended recipient, you are hereby notified that any use, dissemination, distribution, or copying of this communication is strictly prohibited. This message may be viewed by parties at Sirius Computer Solutions other than those named in the message header. This message does not contain an official representation of Sirius Computer Solutions. If you have received this communication in error, notify Sirius Computer Solutions immediately and (i) destroy this message if a facsimile or (ii) delete this message immediately if this is an electronic communication. Thank you. Sirius Computer Solutions -------------- next part -------------- An HTML attachment was scrubbed... 
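Pulling the commands suggested in this thread together, a rough read-only survey script might look like the sketch below (the output path is arbitrary, and mmgetstate -a and mmlsmount all -L are additions beyond the list above; gpfs.snap is left commented out since it collects far more and takes correspondingly longer):

#!/bin/bash
# quick, read-only survey of an unfamiliar Spectrum Scale cluster
out=/tmp/gpfs-survey-$(date +%Y%m%d)
mkdir -p "$out"
for cmd in "mmlscluster" "mmlsconfig" "mmlsnode" "mmgetstate -a" \
           "mmlsnsd" "mmlsfs all" "mmlsmount all -L"; do
    echo "### $cmd" >> "$out/summary.txt"
    $cmd           >> "$out/summary.txt" 2>&1
done
# gpfs.snap    # far more thorough, but heavyweight -- run it separately if needed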
URL: From ulmer at ulmer.org Fri Sep 23 17:31:59 2016 From: ulmer at ulmer.org (Stephen Ulmer) Date: Fri, 23 Sep 2016 12:31:59 -0400 Subject: [gpfsug-discuss] Learn a new cluster In-Reply-To: <1ff27b42-aead-fafa-5415-520334d299c1@nasa.gov> References: <7E08F93E-F018-4C00-A78E-71EFDBEAC87C@siriuscom.com> <1ff27b42-aead-fafa-5415-520334d299c1@nasa.gov> Message-ID: <078081B8-E50E-46BE-B3AC-4C1DB6D963E1@ulmer.org> This was going to be my exact suggestion. My short to-learn list includes learn how to look inside a gpfs.snap for what I want to know. I?ve found the ability to do this with other snapshot bundles very useful in the past (for example I?ve used snap on AIX rather than my own scripts in some cases). Do be aware the gpfs.snap (and actually most ?create a bundle for support? commands on most platforms) are a little heavy. Liberty, -- Stephen > On Sep 22, 2016, at 9:50 PM, Aaron Knister wrote: > > Perhaps a gpfs.snap? This could tell you a *lot* about a cluster. > > -Aaron > > On 9/22/16 9:48 PM, Mark.Bush at siriuscom.com wrote: >> What commands would you run to learn all you need to know about a >> cluster you?ve never seen before? >> >> Captain Obvious (me) says: >> >> mmlscluster >> >> mmlsconfig >> >> mmlsnode >> >> mmlsnsd >> >> mmlsfs all >> >> >> >> What others? >> >> >> >> >> >> Mark R. Bush | Solutions Architect >> >> >> >> This message (including any attachments) is intended only for the use of >> the individual or entity to which it is addressed and may contain >> information that is non-public, proprietary, privileged, confidential, >> and exempt from disclosure under applicable law. If you are not the >> intended recipient, you are hereby notified that any use, dissemination, >> distribution, or copying of this communication is strictly prohibited. >> This message may be viewed by parties at Sirius Computer Solutions other >> than those named in the message header. This message does not contain an >> official representation of Sirius Computer Solutions. If you have >> received this communication in error, notify Sirius Computer Solutions >> immediately and (i) destroy this message if a facsimile or (ii) delete >> this message immediately if this is an electronic communication. Thank you. >> >> Sirius Computer Solutions > >> >> >> _______________________________________________ >> gpfsug-discuss mailing list >> gpfsug-discuss at spectrumscale.org >> http://gpfsug.org/mailman/listinfo/gpfsug-discuss >> > > -- > Aaron Knister > NASA Center for Climate Simulation (Code 606.2) > Goddard Space Flight Center > (301) 286-2776 > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss -------------- next part -------------- An HTML attachment was scrubbed... URL: From ulmer at ulmer.org Fri Sep 23 20:16:06 2016 From: ulmer at ulmer.org (Stephen Ulmer) Date: Fri, 23 Sep 2016 15:16:06 -0400 Subject: [gpfsug-discuss] Blocksize In-Reply-To: References: <4890d1b0-2f12-f8b6-684d-c98ca2b71ab7@docum.org> Message-ID: <17781503-26B3-4448-B7B9-1EE27ABE6D1F@ulmer.org> Not to be too pedantic, but I believe the the subblock size is 1/32 of the block size (which strengthens Luis?s arguments below). I thought the the original question was NOT about inode size, but about metadata block size. You can specify that the system pool have a different block size from the rest of the filesystem, providing that it ONLY holds metadata (?metadata-block-size option to mmcrfs). 
So with 4K inodes (which should be used for all new filesystems without some counter-indication), I would think that we?d want to use a metadata block size of 4K*32=128K. This is independent of the regular block size, which you can calculate based on the workload if you?re lucky. There could be a great reason NOT to use 128K metadata block size, but I don?t know what it is. I?d be happy to be corrected about this if it?s out of whack. -- Stephen > On Sep 22, 2016, at 3:37 PM, Luis Bolinches wrote: > > Hi > > My 2 cents. > > Leave at least 4K inodes, then you get massive improvement on small files (less 3.5K minus whatever you use on xattr) > > About blocksize for data, unless you have actual data that suggest that you will actually benefit from smaller than 1MB block, leave there. GPFS uses sublocks where 1/16th of the BS can be allocated to different files, so the "waste" is much less than you think on 1MB and you get the throughput and less structures of much more data blocks. > > No warranty at all but I try to do this when the BS talk comes in: (might need some clean up it could not be last note but you get the idea) > > POSIX > find . -type f -name '*' -exec ls -l {} \; > find_ls_files.out > GPFS > cd /usr/lpp/mmfs/samples/ilm > gcc mmfindUtil_processOutputFile.c -o mmfindUtil_processOutputFile > ./mmfind /gpfs/shared -ls -type f > find_ls_files.out > CONVERT to CSV > > POSIX > cat find_ls_files.out | awk '{print $5","}' > find_ls_files.out.csv > GPFS > cat find_ls_files.out | awk '{print $7","}' > find_ls_files.out.csv > LOAD in octave > > FILESIZE = int32 (dlmread ("find_ls_files.out.csv", ",")); > Clean the second column (OPTIONAL as the next clean up will do the same) > > FILESIZE(:,[2]) = []; > If we are on 4K aligment we need to clean the files that go to inodes (WELL not exactly ... extended attributes! so maybe use a lower number!) > > FILESIZE(FILESIZE<=3584) =[]; > If we are not we need to clean the 0 size files > > FILESIZE(FILESIZE==0) =[]; > Median > > FILESIZEMEDIAN = int32 (median (FILESIZE)) > Mean > > FILESIZEMEAN = int32 (mean (FILESIZE)) > Variance > > int32 (var (FILESIZE)) > iqr interquartile range, i.e., the difference between the upper and lower quartile, of the input data. > > int32 (iqr (FILESIZE)) > Standard deviation > > > For some FS with lots of files you might need a rather powerful machine to run the calculations on octave, I never hit anything could not manage on a 64GB RAM Power box. Most of the times it is enough with my laptop. > > > > -- > Yst?v?llisin terveisin / Kind regards / Saludos cordiales / Salutations > > Luis Bolinches > Lab Services > http://www-03.ibm.com/systems/services/labservices/ > > IBM Laajalahdentie 23 (main Entrance) Helsinki, 00330 Finland > Phone: +358 503112585 > > "If you continually give you will continually have." Anonymous > > > ----- Original message ----- > From: Stef Coene > Sent by: gpfsug-discuss-bounces at spectrumscale.org > To: gpfsug main discussion list > Cc: > Subject: Re: [gpfsug-discuss] Blocksize > Date: Thu, Sep 22, 2016 10:30 PM > > On 09/22/2016 09:07 PM, J. Eric Wonderley wrote: > > It defaults to 4k: > > mmlsfs testbs8M -i > > flag value description > > ------------------- ------------------------ > > ----------------------------------- > > -i 4096 Inode size in bytes > > > > I think you can make as small as 512b. Gpfs will store very small > > files in the inode. > > > > Typically you want your average file size to be your blocksize and your > > filesystem has one blocksize and one inodesize. 
> > The files are not small, but around 20 MB on average. > So I calculated with IBM that a 1 MB or 2 MB block size is best. > > But I'm not sure if it's better to use a smaller block size for the > metadata. > > The file system is not that large (400 TB) and will hold backup data > from CommVault. > > > Stef > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > > > Ellei edell? ole toisin mainittu: / Unless stated otherwise above: > Oy IBM Finland Ab > PL 265, 00101 Helsinki, Finland > Business ID, Y-tunnus: 0195876-3 > Registered in Finland > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss -------------- next part -------------- An HTML attachment was scrubbed... URL: From oehmes at us.ibm.com Fri Sep 23 22:35:12 2016 From: oehmes at us.ibm.com (Sven Oehme) Date: Fri, 23 Sep 2016 14:35:12 -0700 Subject: [gpfsug-discuss] Blocksize In-Reply-To: <17781503-26B3-4448-B7B9-1EE27ABE6D1F@ulmer.org> References: <4890d1b0-2f12-f8b6-684d-c98ca2b71ab7@docum.org> <17781503-26B3-4448-B7B9-1EE27ABE6D1F@ulmer.org> Message-ID: your metadata block size these days should be 1 MB and there are only very few workloads for which you should run with a filesystem blocksize below 1 MB. so if you don't know exactly what to pick, 1 MB is a good starting point. the general rule still applies that your filesystem blocksize (metadata or data pool) should match your raid controller (or GNR vdisk) stripe size of the particular pool. so if you use a 128k strip size(defaut in many midrange storage controllers) in a 8+2p raid array, your stripe or track size is 1 MB and therefore the blocksize of this pool should be 1 MB. i see many customers in the field using 1MB or even smaller blocksize on RAID stripes of 2 MB or above and your performance will be significant impacted by that. Sven ------------------------------------------ Sven Oehme Scalable Storage Research email: oehmes at us.ibm.com Phone: +1 (408) 824-8904 IBM Almaden Research Lab ------------------------------------------ From: Stephen Ulmer To: gpfsug main discussion list Date: 09/23/2016 12:16 PM Subject: Re: [gpfsug-discuss] Blocksize Sent by: gpfsug-discuss-bounces at spectrumscale.org Not to be too pedantic, but I believe the the subblock size is 1/32 of the block size (which strengthens Luis?s arguments below). I thought the the original question was NOT about inode size, but about metadata block size. You can specify that the system pool have a different block size from the rest of the filesystem, providing that it ONLY holds metadata (?metadata-block-size option to mmcrfs). So with 4K inodes (which should be used for all new filesystems without some counter-indication), I would think that we?d want to use a metadata block size of 4K*32=128K. This is independent of the regular block size, which you can calculate based on the workload if you?re lucky. There could be a great reason NOT to use 128K metadata block size, but I don?t know what it is. I?d be happy to be corrected about this if it?s out of whack. -- Stephen On Sep 22, 2016, at 3:37 PM, Luis Bolinches < luis.bolinches at fi.ibm.com> wrote: Hi My 2 cents. 
Leave at least 4K inodes, then you get massive improvement on small files (less 3.5K minus whatever you use on xattr) About blocksize for data, unless you have actual data that suggest that you will actually benefit from smaller than 1MB block, leave there. GPFS uses sublocks where 1/16th of the BS can be allocated to different files, so the "waste" is much less than you think on 1MB and you get the throughput and less structures of much more data blocks. No warranty at all but I try to do this when the BS talk comes in: (might need some clean up it could not be last note but you get the idea) POSIX find . -type f -name '*' -exec ls -l {} \; > find_ls_files.out GPFS cd /usr/lpp/mmfs/samples/ilm gcc mmfindUtil_processOutputFile.c -o mmfindUtil_processOutputFile ./mmfind /gpfs/shared -ls -type f > find_ls_files.out CONVERT to CSV POSIX cat find_ls_files.out | awk '{print $5","}' > find_ls_files.out.csv GPFS cat find_ls_files.out | awk '{print $7","}' > find_ls_files.out.csv LOAD in octave FILESIZE = int32 (dlmread ("find_ls_files.out.csv", ",")); Clean the second column (OPTIONAL as the next clean up will do the same) FILESIZE(:,[2]) = []; If we are on 4K aligment we need to clean the files that go to inodes (WELL not exactly ... extended attributes! so maybe use a lower number!) FILESIZE(FILESIZE<=3584) =[]; If we are not we need to clean the 0 size files FILESIZE(FILESIZE==0) =[]; Median FILESIZEMEDIAN = int32 (median (FILESIZE)) Mean FILESIZEMEAN = int32 (mean (FILESIZE)) Variance int32 (var (FILESIZE)) iqr interquartile range, i.e., the difference between the upper and lower quartile, of the input data. int32 (iqr (FILESIZE)) Standard deviation For some FS with lots of files you might need a rather powerful machine to run the calculations on octave, I never hit anything could not manage on a 64GB RAM Power box. Most of the times it is enough with my laptop. -- Yst?v?llisin terveisin / Kind regards / Saludos cordiales / Salutations Luis Bolinches Lab Services http://www-03.ibm.com/systems/services/labservices/ IBM Laajalahdentie 23 (main Entrance) Helsinki, 00330 Finland Phone: +358 503112585 "If you continually give you will continually have." Anonymous ----- Original message ----- From: Stef Coene Sent by: gpfsug-discuss-bounces at spectrumscale.org To: gpfsug main discussion list Cc: Subject: Re: [gpfsug-discuss] Blocksize Date: Thu, Sep 22, 2016 10:30 PM On 09/22/2016 09:07 PM, J. Eric Wonderley wrote: > It defaults to 4k: > mmlsfs testbs8M -i > flag value description > ------------------- ------------------------ > ----------------------------------- > -i 4096 Inode size in bytes > > I think you can make as small as 512b. Gpfs will store very small > files in the inode. > > Typically you want your average file size to be your blocksize and your > filesystem has one blocksize and one inodesize. The files are not small, but around 20 MB on average. So I calculated with IBM that a 1 MB or 2 MB block size is best. But I'm not sure if it's better to use a smaller block size for the metadata. The file system is not that large (400 TB) and will hold backup data from CommVault. Stef _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss Ellei edell? 
ole toisin mainittu: / Unless stated otherwise above: Oy IBM Finland Ab PL 265, 00101 Helsinki, Finland Business ID, Y-tunnus: 0195876-3 Registered in Finland _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: graycol.gif Type: image/gif Size: 105 bytes Desc: not available URL: From stef.coene at docum.org Fri Sep 23 23:34:49 2016 From: stef.coene at docum.org (Stef Coene) Date: Sat, 24 Sep 2016 00:34:49 +0200 Subject: [gpfsug-discuss] Blocksize In-Reply-To: <4890d1b0-2f12-f8b6-684d-c98ca2b71ab7@docum.org> References: <4890d1b0-2f12-f8b6-684d-c98ca2b71ab7@docum.org> Message-ID: <1d7481bd-c08f-df14-4708-1c2e2a4ac1c0@docum.org> On 09/22/2016 08:36 PM, Stef Coene wrote: > Hi, > > Is it needed to specify a different blocksize for the system pool that > holds the metadata? > > IBM recommends a 1 MB blocksize for the file system. > But I wonder a smaller blocksize (256 KB or so) for metadata is a good > idea or not... I have read the replies and at the end, this is what we will do: Since the back-end storage will be V5000 with a default stripe size of 256KB and we use 8 data disk in an array, this means that 256KB * 8 = 2M is the best choice for block size. So 2 MB block size for data is the best choice. Since the block size for metadata is not that important in the latest releases, we will also go for 2 MB block size for metadata. Inode size will be left at the default: 4 KB. Stef From mimarsh2 at vt.edu Sat Sep 24 02:21:30 2016 From: mimarsh2 at vt.edu (Brian Marshall) Date: Fri, 23 Sep 2016 21:21:30 -0400 Subject: [gpfsug-discuss] Blocksize In-Reply-To: <1d7481bd-c08f-df14-4708-1c2e2a4ac1c0@docum.org> References: <4890d1b0-2f12-f8b6-684d-c98ca2b71ab7@docum.org> <1d7481bd-c08f-df14-4708-1c2e2a4ac1c0@docum.org> Message-ID: To keep this great chain going: If my metadata is on FLASH, would having a smaller blocksize for the system pool (metadata only) be helpful. My filesystem blocksize is 8MB On Fri, Sep 23, 2016 at 6:34 PM, Stef Coene wrote: > On 09/22/2016 08:36 PM, Stef Coene wrote: > >> Hi, >> >> Is it needed to specify a different blocksize for the system pool that >> holds the metadata? >> >> IBM recommends a 1 MB blocksize for the file system. >> But I wonder a smaller blocksize (256 KB or so) for metadata is a good >> idea or not... >> > I have read the replies and at the end, this is what we will do: > Since the back-end storage will be V5000 with a default stripe size of > 256KB and we use 8 data disk in an array, this means that 256KB * 8 = 2M is > the best choice for block size. > So 2 MB block size for data is the best choice. > > Since the block size for metadata is not that important in the latest > releases, we will also go for 2 MB block size for metadata. > > Inode size will be left at the default: 4 KB. > > > > Stef > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > -------------- next part -------------- An HTML attachment was scrubbed... 
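To spell out the arithmetic behind that choice: with the V5000 numbers above, one full stripe is 8 data disks x 256 KB strip = 2048 KB = 2 MB, so a 2 MB file system block size writes exactly one full RAID stripe per block. With the 128 KB strip size mentioned earlier as a common mid-range default, 8 x 128 KB = 1 MB, and 1 MB would be the matching block size.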
URL: From luis.bolinches at fi.ibm.com Sat Sep 24 05:07:02 2016 From: luis.bolinches at fi.ibm.com (Luis Bolinches) Date: Sat, 24 Sep 2016 04:07:02 +0000 Subject: [gpfsug-discuss] Blocksize In-Reply-To: <17781503-26B3-4448-B7B9-1EE27ABE6D1F@ulmer.org> Message-ID: Not pendant but correct I flip there it is 1/32 -- Cheers > On 23 Sep 2016, at 22.16, Stephen Ulmer wrote: > > Not to be too pedantic, but I believe the the subblock size is 1/32 of the block size (which strengthens Luis?s arguments below). > > I thought the the original question was NOT about inode size, but about metadata block size. You can specify that the system pool have a different block size from the rest of the filesystem, providing that it ONLY holds metadata (?metadata-block-size option to mmcrfs). > > So with 4K inodes (which should be used for all new filesystems without some counter-indication), I would think that we?d want to use a metadata block size of 4K*32=128K. This is independent of the regular block size, which you can calculate based on the workload if you?re lucky. > > There could be a great reason NOT to use 128K metadata block size, but I don?t know what it is. I?d be happy to be corrected about this if it?s out of whack. > > -- > Stephen > > > >> On Sep 22, 2016, at 3:37 PM, Luis Bolinches wrote: >> >> Hi >> >> My 2 cents. >> >> Leave at least 4K inodes, then you get massive improvement on small files (less 3.5K minus whatever you use on xattr) >> >> About blocksize for data, unless you have actual data that suggest that you will actually benefit from smaller than 1MB block, leave there. GPFS uses sublocks where 1/16th of the BS can be allocated to different files, so the "waste" is much less than you think on 1MB and you get the throughput and less structures of much more data blocks. >> >> No warranty at all but I try to do this when the BS talk comes in: (might need some clean up it could not be last note but you get the idea) >> >> POSIX >> find . -type f -name '*' -exec ls -l {} \; > find_ls_files.out >> GPFS >> cd /usr/lpp/mmfs/samples/ilm >> gcc mmfindUtil_processOutputFile.c -o mmfindUtil_processOutputFile >> ./mmfind /gpfs/shared -ls -type f > find_ls_files.out >> CONVERT to CSV >> >> POSIX >> cat find_ls_files.out | awk '{print $5","}' > find_ls_files.out.csv >> GPFS >> cat find_ls_files.out | awk '{print $7","}' > find_ls_files.out.csv >> LOAD in octave >> >> FILESIZE = int32 (dlmread ("find_ls_files.out.csv", ",")); >> Clean the second column (OPTIONAL as the next clean up will do the same) >> >> FILESIZE(:,[2]) = []; >> If we are on 4K aligment we need to clean the files that go to inodes (WELL not exactly ... extended attributes! so maybe use a lower number!) >> >> FILESIZE(FILESIZE<=3584) =[]; >> If we are not we need to clean the 0 size files >> >> FILESIZE(FILESIZE==0) =[]; >> Median >> >> FILESIZEMEDIAN = int32 (median (FILESIZE)) >> Mean >> >> FILESIZEMEAN = int32 (mean (FILESIZE)) >> Variance >> >> int32 (var (FILESIZE)) >> iqr interquartile range, i.e., the difference between the upper and lower quartile, of the input data. >> >> int32 (iqr (FILESIZE)) >> Standard deviation >> >> >> For some FS with lots of files you might need a rather powerful machine to run the calculations on octave, I never hit anything could not manage on a 64GB RAM Power box. Most of the times it is enough with my laptop. 
>> >> >> >> -- >> Yst?v?llisin terveisin / Kind regards / Saludos cordiales / Salutations >> >> Luis Bolinches >> Lab Services >> http://www-03.ibm.com/systems/services/labservices/ >> >> IBM Laajalahdentie 23 (main Entrance) Helsinki, 00330 Finland >> Phone: +358 503112585 >> >> "If you continually give you will continually have." Anonymous >> >> >> ----- Original message ----- >> From: Stef Coene >> Sent by: gpfsug-discuss-bounces at spectrumscale.org >> To: gpfsug main discussion list >> Cc: >> Subject: Re: [gpfsug-discuss] Blocksize >> Date: Thu, Sep 22, 2016 10:30 PM >> >> On 09/22/2016 09:07 PM, J. Eric Wonderley wrote: >> > It defaults to 4k: >> > mmlsfs testbs8M -i >> > flag value description >> > ------------------- ------------------------ >> > ----------------------------------- >> > -i 4096 Inode size in bytes >> > >> > I think you can make as small as 512b. Gpfs will store very small >> > files in the inode. >> > >> > Typically you want your average file size to be your blocksize and your >> > filesystem has one blocksize and one inodesize. >> >> The files are not small, but around 20 MB on average. >> So I calculated with IBM that a 1 MB or 2 MB block size is best. >> >> But I'm not sure if it's better to use a smaller block size for the >> metadata. >> >> The file system is not that large (400 TB) and will hold backup data >> from CommVault. >> >> >> Stef >> _______________________________________________ >> gpfsug-discuss mailing list >> gpfsug-discuss at spectrumscale.org >> http://gpfsug.org/mailman/listinfo/gpfsug-discuss >> >> >> >> Ellei edell? ole toisin mainittu: / Unless stated otherwise above: >> Oy IBM Finland Ab >> PL 265, 00101 Helsinki, Finland >> Business ID, Y-tunnus: 0195876-3 >> Registered in Finland >> >> _______________________________________________ >> gpfsug-discuss mailing list >> gpfsug-discuss at spectrumscale.org >> http://gpfsug.org/mailman/listinfo/gpfsug-discuss > Ellei edell? ole toisin mainittu: / Unless stated otherwise above: Oy IBM Finland Ab PL 265, 00101 Helsinki, Finland Business ID, Y-tunnus: 0195876-3 Registered in Finland -------------- next part -------------- An HTML attachment was scrubbed... URL: From Kevin.Buterbaugh at Vanderbilt.Edu Sat Sep 24 15:18:38 2016 From: Kevin.Buterbaugh at Vanderbilt.Edu (Buterbaugh, Kevin L) Date: Sat, 24 Sep 2016 14:18:38 +0000 Subject: [gpfsug-discuss] Blocksize In-Reply-To: References: <4890d1b0-2f12-f8b6-684d-c98ca2b71ab7@docum.org> <17781503-26B3-4448-B7B9-1EE27ABE6D1F@ulmer.org> Message-ID: Hi Sven, I am confused by your statement that the metadata block size should be 1 MB and am very interested in learning the rationale behind this as I am currently looking at all aspects of our current GPFS configuration and the possibility of making major changes. If you have a filesystem with only metadataOnly disks in the system pool and the default size of an inode is 4K (which we would do, since we have recently discovered that even on our scratch filesystem we have a bazillion files that are 4K or smaller and could therefore have their data stored in the inode, right?), then why would you set the metadata block size to anything larger than 128K when a sub-block is 1/32nd of a block? I.e., with a 1 MB block size for metadata wouldn?t you be wasting a massive amount of space? What am I missing / confused about there? Oh, and here?s a related question ? let?s just say I have the above configuration ? my system pool is metadata only and is on SSD?s. 
Then I have two other dataOnly pools that are spinning disk. One is for ?regular? access and the other is the ?capacity? pool ? i.e. a pool of slower storage where we move files with large access times. I have a policy that says something like ?move all files with an access time > 6 months to the capacity pool.? Of those bazillion files less than 4K in size that are fitting in the inode currently, probably half a bazillion () of them would be subject to that rule. Will they get moved to the spinning disk capacity pool or will they stay in the inode?? Thanks! This is a very timely and interesting discussion for me as well... Kevin On Sep 23, 2016, at 4:35 PM, Sven Oehme > wrote: your metadata block size these days should be 1 MB and there are only very few workloads for which you should run with a filesystem blocksize below 1 MB. so if you don't know exactly what to pick, 1 MB is a good starting point. the general rule still applies that your filesystem blocksize (metadata or data pool) should match your raid controller (or GNR vdisk) stripe size of the particular pool. so if you use a 128k strip size(defaut in many midrange storage controllers) in a 8+2p raid array, your stripe or track size is 1 MB and therefore the blocksize of this pool should be 1 MB. i see many customers in the field using 1MB or even smaller blocksize on RAID stripes of 2 MB or above and your performance will be significant impacted by that. Sven ------------------------------------------ Sven Oehme Scalable Storage Research email: oehmes at us.ibm.com Phone: +1 (408) 824-8904 IBM Almaden Research Lab ------------------------------------------ Stephen Ulmer ---09/23/2016 12:16:34 PM---Not to be too pedantic, but I believe the the subblock size is 1/32 of the block size (which strengt From: Stephen Ulmer > To: gpfsug main discussion list > Date: 09/23/2016 12:16 PM Subject: Re: [gpfsug-discuss] Blocksize Sent by: gpfsug-discuss-bounces at spectrumscale.org ________________________________ Not to be too pedantic, but I believe the the subblock size is 1/32 of the block size (which strengthens Luis?s arguments below). I thought the the original question was NOT about inode size, but about metadata block size. You can specify that the system pool have a different block size from the rest of the filesystem, providing that it ONLY holds metadata (?metadata-block-size option to mmcrfs). So with 4K inodes (which should be used for all new filesystems without some counter-indication), I would think that we?d want to use a metadata block size of 4K*32=128K. This is independent of the regular block size, which you can calculate based on the workload if you?re lucky. There could be a great reason NOT to use 128K metadata block size, but I don?t know what it is. I?d be happy to be corrected about this if it?s out of whack. -- Stephen On Sep 22, 2016, at 3:37 PM, Luis Bolinches > wrote: Hi My 2 cents. Leave at least 4K inodes, then you get massive improvement on small files (less 3.5K minus whatever you use on xattr) About blocksize for data, unless you have actual data that suggest that you will actually benefit from smaller than 1MB block, leave there. GPFS uses sublocks where 1/16th of the BS can be allocated to different files, so the "waste" is much less than you think on 1MB and you get the throughput and less structures of much more data blocks. No warranty at all but I try to do this when the BS talk comes in: (might need some clean up it could not be last note but you get the idea) POSIX find . 
-type f -name '*' -exec ls -l {} \; > find_ls_files.out GPFS cd /usr/lpp/mmfs/samples/ilm gcc mmfindUtil_processOutputFile.c -o mmfindUtil_processOutputFile ./mmfind /gpfs/shared -ls -type f > find_ls_files.out CONVERT to CSV POSIX cat find_ls_files.out | awk '{print $5","}' > find_ls_files.out.csv GPFS cat find_ls_files.out | awk '{print $7","}' > find_ls_files.out.csv LOAD in octave FILESIZE = int32 (dlmread ("find_ls_files.out.csv", ",")); Clean the second column (OPTIONAL as the next clean up will do the same) FILESIZE(:,[2]) = []; If we are on 4K aligment we need to clean the files that go to inodes (WELL not exactly ... extended attributes! so maybe use a lower number!) FILESIZE(FILESIZE<=3584) =[]; If we are not we need to clean the 0 size files FILESIZE(FILESIZE==0) =[]; Median FILESIZEMEDIAN = int32 (median (FILESIZE)) Mean FILESIZEMEAN = int32 (mean (FILESIZE)) Variance int32 (var (FILESIZE)) iqr interquartile range, i.e., the difference between the upper and lower quartile, of the input data. int32 (iqr (FILESIZE)) Standard deviation For some FS with lots of files you might need a rather powerful machine to run the calculations on octave, I never hit anything could not manage on a 64GB RAM Power box. Most of the times it is enough with my laptop. -- Yst?v?llisin terveisin / Kind regards / Saludos cordiales / Salutations Luis Bolinches Lab Services http://www-03.ibm.com/systems/services/labservices/ IBM Laajalahdentie 23 (main Entrance) Helsinki, 00330 Finland Phone: +358 503112585 "If you continually give you will continually have." Anonymous ----- Original message ----- From: Stef Coene > Sent by: gpfsug-discuss-bounces at spectrumscale.org To: gpfsug main discussion list > Cc: Subject: Re: [gpfsug-discuss] Blocksize Date: Thu, Sep 22, 2016 10:30 PM On 09/22/2016 09:07 PM, J. Eric Wonderley wrote: > It defaults to 4k: > mmlsfs testbs8M -i > flag value description > ------------------- ------------------------ > ----------------------------------- > -i 4096 Inode size in bytes > > I think you can make as small as 512b. Gpfs will store very small > files in the inode. > > Typically you want your average file size to be your blocksize and your > filesystem has one blocksize and one inodesize. The files are not small, but around 20 MB on average. So I calculated with IBM that a 1 MB or 2 MB block size is best. But I'm not sure if it's better to use a smaller block size for the metadata. The file system is not that large (400 TB) and will hold backup data from CommVault. Stef _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss Ellei edell? ole toisin mainittu: / Unless stated otherwise above: Oy IBM Finland Ab PL 265, 00101 Helsinki, Finland Business ID, Y-tunnus: 0195876-3 Registered in Finland _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From makaplan at us.ibm.com Sat Sep 24 17:18:11 2016 From: makaplan at us.ibm.com (Marc A Kaplan) Date: Sat, 24 Sep 2016 12:18:11 -0400 Subject: [gpfsug-discuss] Blocksize and MetaData Blocksizes - FORGET the old advice In-Reply-To: References: <17781503-26B3-4448-B7B9-1EE27ABE6D1F@ulmer.org> Message-ID: Metadata is inodes, directories, indirect blocks (indices). Spectrum Scale (GPFS) Version 4.1 introduced significant improvements to the data structures used to represent directories. Larger inodes supporting data and extended attributes in the inode are other significant relatively recent improvements. Now small directories are stored in the inode, while for large directories blocks can be bigger than 32MB, and any and all directory blocks that are smaller than the metadata-blocksize, are allocated just like "fragments" - so directories are now space efficient. SO MUCH SO, that THE OLD ADVICE, about using smallish blocksizes for metadata, GOES "OUT THE WINDOW". Period. FORGET most of what you thought you knew about "best" or "optimal" metadata-blocksize. The new advice is, as Sven wrote: Use a blocksize that optimizes IO transfer efficiency and speed. This is true for BOTH data and metadata. Now, IF you have system pool set up as metadata only AND system pool is on devices that have a different "optimal" block size than your other pools, THEN, it may make sense to use two different blocksizes, one for data and another for metadata. For example, maybe you have massively striped RAID or RAID-LIKE (GSS or ESS)) storage for huge files - so maybe 8MB is a good blocksize for that. But maybe you have your metadata on SSD devices and maybe 1MB is the "best" blocksize for that. -------------- next part -------------- An HTML attachment was scrubbed... URL: From makaplan at us.ibm.com Sat Sep 24 18:31:37 2016 From: makaplan at us.ibm.com (Marc A Kaplan) Date: Sat, 24 Sep 2016 13:31:37 -0400 Subject: [gpfsug-discuss] Blocksize - consider IO transfer efficiency above your other prejudices In-Reply-To: References: <4890d1b0-2f12-f8b6-684d-c98ca2b71ab7@docum.org><17781503-26B3-4448-B7B9-1EE27ABE6D1F@ulmer.org> Message-ID: (I can answer your basic questions, Sven has more experience with tuning very large file systems, so perhaps he will have more to say...) 1. Inodes are packed into the file of inodes. (There is one file of all the inodes in a filesystem). If you have metadata-blocksize 1MB you will have 256 of 4KB inodes per block. Forget about sub-blocks when it comes to the file of inodes. 2. IF a file's data fits in its inode, then migrating that file from one pool to another just changes the preferred pool name in the inode. No data movement. Should the file later "grow" to require a data block, that data block will be allocated from whatever pool is named in the inode at that time. See the email I posted earlier today. Basically: FORGET what you thought you knew about optimal metadata blocksize (perhaps based on how you thought metadata was laid out on disk) and just stick to optimal IO transfer blocksizes. Yes, there may be contrived scenarios or even a few real live special cases, but those would be few and far between. Try following the newer general, easier, rule and see how well it works. 
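For reference, the kind of rule Kevin describes could be sketched roughly as follows (the pool names data and capacity and the file system name gpfs0 are placeholders; -I test does a dry run first):

cat > migrate-cold.pol <<'EOF'
/* move files not accessed for roughly 6 months to the capacity pool */
RULE 'cold' MIGRATE FROM POOL 'data' TO POOL 'capacity'
  WHERE (DAYS(CURRENT_TIMESTAMP) - DAYS(ACCESS_TIME)) > 180
EOF

mmapplypolicy gpfs0 -P migrate-cold.pol -I test    # show what would be selected, move nothing
# mmapplypolicy gpfs0 -P migrate-cold.pol -I yes   # actually perform the migration

Per point 2 above, small files whose data lives in the inode are "moved" only in the sense that the preferred pool name recorded in the inode changes; no data blocks are relocated unless the file later grows.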
From: "Buterbaugh, Kevin L" To: gpfsug main discussion list Date: 09/24/2016 10:19 AM Subject: Re: [gpfsug-discuss] Blocksize Sent by: gpfsug-discuss-bounces at spectrumscale.org Hi Sven, I am confused by your statement that the metadata block size should be 1 MB and am very interested in learning the rationale behind this as I am currently looking at all aspects of our current GPFS configuration and the possibility of making major changes. If you have a filesystem with only metadataOnly disks in the system pool and the default size of an inode is 4K (which we would do, since we have recently discovered that even on our scratch filesystem we have a bazillion files that are 4K or smaller and could therefore have their data stored in the inode, right?), then why would you set the metadata block size to anything larger than 128K when a sub-block is 1/32nd of a block? I.e., with a 1 MB block size for metadata wouldn?t you be wasting a massive amount of space? What am I missing / confused about there? Oh, and here?s a related question ? let?s just say I have the above configuration ? my system pool is metadata only and is on SSD?s. Then I have two other dataOnly pools that are spinning disk. One is for ?regular? access and the other is the ?capacity? pool ? i.e. a pool of slower storage where we move files with large access times. I have a policy that says something like ?move all files with an access time > 6 months to the capacity pool.? Of those bazillion files less than 4K in size that are fitting in the inode currently, probably half a bazillion () of them would be subject to that rule. Will they get moved to the spinning disk capacity pool or will they stay in the inode?? Thanks! This is a very timely and interesting discussion for me as well... Kevin On Sep 23, 2016, at 4:35 PM, Sven Oehme wrote: your metadata block size these days should be 1 MB and there are only very few workloads for which you should run with a filesystem blocksize below 1 MB. so if you don't know exactly what to pick, 1 MB is a good starting point. the general rule still applies that your filesystem blocksize (metadata or data pool) should match your raid controller (or GNR vdisk) stripe size of the particular pool. so if you use a 128k strip size(defaut in many midrange storage controllers) in a 8+2p raid array, your stripe or track size is 1 MB and therefore the blocksize of this pool should be 1 MB. i see many customers in the field using 1MB or even smaller blocksize on RAID stripes of 2 MB or above and your performance will be significant impacted by that. Sven ------------------------------------------ Sven Oehme Scalable Storage Research email: oehmes at us.ibm.com Phone: +1 (408) 824-8904 IBM Almaden Research Lab ------------------------------------------ Stephen Ulmer ---09/23/2016 12:16:34 PM---Not to be too pedantic, but I believe the the subblock size is 1/32 of the block size (which strengt From: Stephen Ulmer To: gpfsug main discussion list Date: 09/23/2016 12:16 PM Subject: Re: [gpfsug-discuss] Blocksize Sent by: gpfsug-discuss-bounces at spectrumscale.org Not to be too pedantic, but I believe the the subblock size is 1/32 of the block size (which strengthens Luis?s arguments below). I thought the the original question was NOT about inode size, but about metadata block size. You can specify that the system pool have a different block size from the rest of the filesystem, providing that it ONLY holds metadata (?metadata-block-size option to mmcrfs). 
So with 4K inodes (which should be used for all new filesystems without some counter-indication), I would think that we?d want to use a metadata block size of 4K*32=128K. This is independent of the regular block size, which you can calculate based on the workload if you?re lucky. There could be a great reason NOT to use 128K metadata block size, but I don?t know what it is. I?d be happy to be corrected about this if it?s out of whack. -- Stephen On Sep 22, 2016, at 3:37 PM, Luis Bolinches wrote: Hi My 2 cents. Leave at least 4K inodes, then you get massive improvement on small files (less 3.5K minus whatever you use on xattr) About blocksize for data, unless you have actual data that suggest that you will actually benefit from smaller than 1MB block, leave there. GPFS uses sublocks where 1/16th of the BS can be allocated to different files, so the "waste" is much less than you think on 1MB and you get the throughput and less structures of much more data blocks. No warranty at all but I try to do this when the BS talk comes in: (might need some clean up it could not be last note but you get the idea) POSIX find . -type f -name '*' -exec ls -l {} \; > find_ls_files.out GPFS cd /usr/lpp/mmfs/samples/ilm gcc mmfindUtil_processOutputFile.c -o mmfindUtil_processOutputFile ./mmfind /gpfs/shared -ls -type f > find_ls_files.out CONVERT to CSV POSIX cat find_ls_files.out | awk '{print $5","}' > find_ls_files.out.csv GPFS cat find_ls_files.out | awk '{print $7","}' > find_ls_files.out.csv LOAD in octave FILESIZE = int32 (dlmread ("find_ls_files.out.csv", ",")); Clean the second column (OPTIONAL as the next clean up will do the same) FILESIZE(:,[2]) = []; If we are on 4K aligment we need to clean the files that go to inodes (WELL not exactly ... extended attributes! so maybe use a lower number!) FILESIZE(FILESIZE<=3584) =[]; If we are not we need to clean the 0 size files FILESIZE(FILESIZE==0) =[]; Median FILESIZEMEDIAN = int32 (median (FILESIZE)) Mean FILESIZEMEAN = int32 (mean (FILESIZE)) Variance int32 (var (FILESIZE)) iqr interquartile range, i.e., the difference between the upper and lower quartile, of the input data. int32 (iqr (FILESIZE)) Standard deviation For some FS with lots of files you might need a rather powerful machine to run the calculations on octave, I never hit anything could not manage on a 64GB RAM Power box. Most of the times it is enough with my laptop. -- Yst?v?llisin terveisin / Kind regards / Saludos cordiales / Salutations Luis Bolinches Lab Services http://www-03.ibm.com/systems/services/labservices/ IBM Laajalahdentie 23 (main Entrance) Helsinki, 00330 Finland Phone: +358 503112585 "If you continually give you will continually have." Anonymous ----- Original message ----- From: Stef Coene Sent by: gpfsug-discuss-bounces at spectrumscale.org To: gpfsug main discussion list Cc: Subject: Re: [gpfsug-discuss] Blocksize Date: Thu, Sep 22, 2016 10:30 PM On 09/22/2016 09:07 PM, J. Eric Wonderley wrote: > It defaults to 4k: > mmlsfs testbs8M -i > flag value description > ------------------- ------------------------ > ----------------------------------- > -i 4096 Inode size in bytes > > I think you can make as small as 512b. Gpfs will store very small > files in the inode. > > Typically you want your average file size to be your blocksize and your > filesystem has one blocksize and one inodesize. The files are not small, but around 20 MB on average. So I calculated with IBM that a 1 MB or 2 MB block size is best. 
But I'm not sure if it's better to use a smaller block size for the metadata. The file system is not that large (400 TB) and will hold backup data from CommVault. Stef _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss Ellei edell? ole toisin mainittu: / Unless stated otherwise above: Oy IBM Finland Ab PL 265, 00101 Helsinki, Finland Business ID, Y-tunnus: 0195876-3 Registered in Finland _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss -------------- next part -------------- An HTML attachment was scrubbed... URL: From stef.coene at docum.org Sat Sep 24 19:16:49 2016 From: stef.coene at docum.org (Stef Coene) Date: Sat, 24 Sep 2016 20:16:49 +0200 Subject: [gpfsug-discuss] Maximum NSD size Message-ID: <239fd428-3544-917d-5439-d40ea36f0668@docum.org> Hi, When formatting the NDS for a new file system, I noticed a warning about a maximum size: Formatting file system ... Disks up to size 8.8 TB can be added to storage pool system. Disks up to size 9.0 TB can be added to storage pool V5000. I searched the docs, but I couldn't find any reference regarding the maximum size of NSDs? Stef From oehmes at gmail.com Sun Sep 25 17:25:40 2016 From: oehmes at gmail.com (Sven Oehme) Date: Sun, 25 Sep 2016 16:25:40 +0000 Subject: [gpfsug-discuss] Maximum NSD size In-Reply-To: <239fd428-3544-917d-5439-d40ea36f0668@docum.org> References: <239fd428-3544-917d-5439-d40ea36f0668@docum.org> Message-ID: the limit you see above is NOT the max NSD limit for Scale/GPFS, its rather the limit of the NSD size you can add to this Filesystems pool. depending on which version of code you are running, we limit the maximum size of a NSD that can be added to a pool so you don't have mixtures of lets say 1 TB and 100 TB disks in one pool as this will negatively affect performance. in older versions we where more restrictive than in newer versions. Sven On Sat, Sep 24, 2016 at 11:16 AM Stef Coene wrote: > Hi, > > When formatting the NDS for a new file system, I noticed a warning about > a maximum size: > > Formatting file system ... > Disks up to size 8.8 TB can be added to storage pool system. > Disks up to size 9.0 TB can be added to storage pool V5000. > > I searched the docs, but I couldn't find any reference regarding the > maximum size of NSDs? > > > Stef > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From oehmes at gmail.com Sun Sep 25 18:11:12 2016 From: oehmes at gmail.com (Sven Oehme) Date: Sun, 25 Sep 2016 17:11:12 +0000 Subject: [gpfsug-discuss] Blocksize In-Reply-To: References: <4890d1b0-2f12-f8b6-684d-c98ca2b71ab7@docum.org> <17781503-26B3-4448-B7B9-1EE27ABE6D1F@ulmer.org> Message-ID: well, its not that easy and there is no perfect answer here. so lets start with some data points that might help decide: inodes, directory blocks, allocation maps for data as well as metadata don't follow the same restrictions as data 'fragments' or subblocks, means they are not bond to the 1/32 of the blocksize. they rather get organized on calculated sized blocks which can be very small (significant smaller than 1/32th) or close to the max of the blocksize for a single object. therefore the space waste concern doesn't really apply here. policy scans loves larger blocks as the blocks will be randomly scattered across the NSD's and therefore larger contiguous blocks for inode scan will perform significantly faster on larger metadata blocksizes than on smaller (assuming this is disk, with SSD's this doesn't matter that much) so for disk based systems it is advantageous to use larger blocks , for SSD based its less of an issue. you shouldn't choose on the other hand too large blocks even for disk drive based systems as there is one catch to all this. small updates on metadata typically end up writing the whole metadata block e.g. 256k for a directory block which now need to be destaged and read back from another node changing the same block. hope this helps. Sven On Sat, Sep 24, 2016 at 7:18 AM Buterbaugh, Kevin L < Kevin.Buterbaugh at vanderbilt.edu> wrote: > Hi Sven, > > I am confused by your statement that the metadata block size should be 1 > MB and am very interested in learning the rationale behind this as I am > currently looking at all aspects of our current GPFS configuration and the > possibility of making major changes. > > If you have a filesystem with only metadataOnly disks in the system pool > and the default size of an inode is 4K (which we would do, since we have > recently discovered that even on our scratch filesystem we have a bazillion > files that are 4K or smaller and could therefore have their data stored in > the inode, right?), then why would you set the metadata block size to > anything larger than 128K when a sub-block is 1/32nd of a block? I.e., > with a 1 MB block size for metadata wouldn?t you be wasting a massive > amount of space? > > What am I missing / confused about there? > > Oh, and here?s a related question ? let?s just say I have the above > configuration ? my system pool is metadata only and is on SSD?s. Then I > have two other dataOnly pools that are spinning disk. One is for ?regular? > access and the other is the ?capacity? pool ? i.e. a pool of slower storage > where we move files with large access times. I have a policy that says > something like ?move all files with an access time > 6 months to the > capacity pool.? Of those bazillion files less than 4K in size that are > fitting in the inode currently, probably half a bazillion () of them > would be subject to that rule. Will they get moved to the spinning disk > capacity pool or will they stay in the inode?? > > Thanks! This is a very timely and interesting discussion for me as well... 
> > Kevin > > On Sep 23, 2016, at 4:35 PM, Sven Oehme wrote: > > your metadata block size these days should be 1 MB and there are only very > few workloads for which you should run with a filesystem blocksize below 1 > MB. so if you don't know exactly what to pick, 1 MB is a good starting > point. > the general rule still applies that your filesystem blocksize (metadata or > data pool) should match your raid controller (or GNR vdisk) stripe size of > the particular pool. > > so if you use a 128k strip size(defaut in many midrange storage > controllers) in a 8+2p raid array, your stripe or track size is 1 MB and > therefore the blocksize of this pool should be 1 MB. i see many customers > in the field using 1MB or even smaller blocksize on RAID stripes of 2 MB or > above and your performance will be significant impacted by that. > > Sven > > ------------------------------------------ > Sven Oehme > Scalable Storage Research > email: oehmes at us.ibm.com > Phone: +1 (408) 824-8904 > IBM Almaden Research Lab > ------------------------------------------ > > Stephen Ulmer ---09/23/2016 12:16:34 PM---Not to be too > pedantic, but I believe the the subblock size is 1/32 of the block size > (which strengt > > > > From: Stephen Ulmer > To: gpfsug main discussion list > Date: 09/23/2016 12:16 PM > Subject: Re: [gpfsug-discuss] Blocksize > Sent by: gpfsug-discuss-bounces at spectrumscale.org > > ------------------------------ > > > > Not to be too pedantic, but I believe the the subblock size is 1/32 of the > block size (which strengthens Luis?s arguments below). > > I thought the the original question was NOT about inode size, but about > metadata block size. You can specify that the system pool have a different > block size from the rest of the filesystem, providing that it ONLY holds > metadata (?metadata-block-size option to mmcrfs). > > So with 4K inodes (which should be used for all new filesystems without > some counter-indication), I would think that we?d want to use a metadata > block size of 4K*32=128K. This is independent of the regular block size, > which you can calculate based on the workload if you?re lucky. > > There could be a great reason NOT to use 128K metadata block size, but I > don?t know what it is. I?d be happy to be corrected about this if it?s out > of whack. > > -- > Stephen > > > > On Sep 22, 2016, at 3:37 PM, Luis Bolinches < > *luis.bolinches at fi.ibm.com* > wrote: > > Hi > > My 2 cents. > > Leave at least 4K inodes, then you get massive improvement on small > files (less 3.5K minus whatever you use on xattr) > > About blocksize for data, unless you have actual data that suggest > that you will actually benefit from smaller than 1MB block, leave there. > GPFS uses sublocks where 1/16th of the BS can be allocated to different > files, so the "waste" is much less than you think on 1MB and you get the > throughput and less structures of much more data blocks. > > No* warranty at all* but I try to do this when the BS talk comes > in: (might need some clean up it could not be last note but you get the > idea) > > POSIX > find . 
-type f -name '*' -exec ls -l {} \; > find_ls_files.out > GPFS > cd /usr/lpp/mmfs/samples/ilm > gcc mmfindUtil_processOutputFile.c -o mmfindUtil_processOutputFile > ./mmfind /gpfs/shared -ls -type f > find_ls_files.out > CONVERT to CSV > > POSIX > cat find_ls_files.out | awk '{print $5","}' > find_ls_files.out.csv > GPFS > cat find_ls_files.out | awk '{print $7","}' > find_ls_files.out.csv > LOAD in octave > > FILESIZE = int32 (dlmread ("find_ls_files.out.csv", ",")); > Clean the second column (OPTIONAL as the next clean up will do the > same) > > FILESIZE(:,[2]) = []; > If we are on 4K aligment we need to clean the files that go to > inodes (WELL not exactly ... extended attributes! so maybe use a lower > number!) > > FILESIZE(FILESIZE<=3584) =[]; > If we are not we need to clean the 0 size files > > FILESIZE(FILESIZE==0) =[]; > Median > > FILESIZEMEDIAN = int32 (median (FILESIZE)) > Mean > > FILESIZEMEAN = int32 (mean (FILESIZE)) > Variance > > int32 (var (FILESIZE)) > iqr interquartile range, i.e., the difference between the upper and > lower quartile, of the input data. > > int32 (iqr (FILESIZE)) > Standard deviation > > > For some FS with lots of files you might need a rather powerful > machine to run the calculations on octave, I never hit anything could not > manage on a 64GB RAM Power box. Most of the times it is enough with my > laptop. > > > > -- > Yst?v?llisin terveisin / Kind regards / Saludos cordiales / > Salutations > > Luis Bolinches > Lab Services > *http://www-03.ibm.com/systems/services/labservices/* > > > IBM Laajalahdentie 23 (main Entrance) Helsinki, 00330 Finland > Phone: +358 503112585 > > "If you continually give you will continually have." Anonymous > > > ----- Original message ----- > From: Stef Coene <*stef.coene at docum.org* > > Sent by: *gpfsug-discuss-bounces at spectrumscale.org* > > To: gpfsug main discussion list <*gpfsug-discuss at spectrumscale.org* > > > Cc: > Subject: Re: [gpfsug-discuss] Blocksize > Date: Thu, Sep 22, 2016 10:30 PM > > On 09/22/2016 09:07 PM, J. Eric Wonderley wrote: > > It defaults to 4k: > > mmlsfs testbs8M -i > > flag value description > > ------------------- ------------------------ > > ----------------------------------- > > -i 4096 Inode size in bytes > > > > I think you can make as small as 512b. Gpfs will store very > small > > files in the inode. > > > > Typically you want your average file size to be your blocksize > and your > > filesystem has one blocksize and one inodesize. > > The files are not small, but around 20 MB on average. > So I calculated with IBM that a 1 MB or 2 MB block size is best. > > But I'm not sure if it's better to use a smaller block size for the > metadata. > > The file system is not that large (400 TB) and will hold backup data > from CommVault. > > > Stef > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at *spectrumscale.org* > *http://gpfsug.org/mailman/listinfo/gpfsug-discuss* > > > > > Ellei edell? 
ole toisin mainittu: / Unless stated otherwise above: > Oy IBM Finland Ab > PL 265, 00101 Helsinki, Finland > Business ID, Y-tunnus: 0195876-3 > Registered in Finland > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > -------------- next part -------------- An HTML attachment was scrubbed... URL:

From alandhae at gmx.de Mon Sep 26 08:53:48 2016 From: alandhae at gmx.de (=?ISO-8859-15?Q?Andreas_Landh=E4u=DFer?=) Date: Mon, 26 Sep 2016 09:53:48 +0200 (CEST) Subject: [gpfsug-discuss] File-Access Reporting Message-ID:

Hello all GPFS, ehmm, Spectrum Scale experts out there,

we are using GPFS as the filesystem for a new data application. The project has defined a need for reports on transfer volume [or file access]: by user, ..., by service, by product type ... at least on a daily basis. They need a report covering: file open, file close, or requestEndTime, requestDuration, fileProductName [path and filename], dataSize, userId.

I could think of using sysstat (sar) to get some of the numbers, but I am not sure whether the numbers we would receive that way are correct.

Andreas -- Andreas Landhäußer +49 151 12133027 (mobile) alandhae at gmx.de

From alandhae at gmx.de Mon Sep 26 13:12:18 2016 From: alandhae at gmx.de (=?ISO-8859-15?Q?Andreas_Landh=E4u=DFer?=) Date: Mon, 26 Sep 2016 14:12:18 +0200 (CEST) Subject: [gpfsug-discuss] File_heat for GPFS File Systems Questions over Questions ... Message-ID:

Hello GPFS experts,

a customer wants a usage report, including file heat, for a large filesystem. The report should be produced every month.

mmchconfig fileHeatLossPercent=10,fileHeatPeriodMinutes=30240 -i

fileHeatPeriodMinutes=30240 equals 21 days. I'm wondering about the behavior of fileHeatLossPercent: - If it is set to 10, will file heat decrease from 1 to 0 in 10 steps? - Or does file heat decay asymptotically, so that a heat of 0 is never reached? Either way the results will be similar ;-) the latter just takes longer.

We want to produce the following file lists: - File_Heat > 50% -> rather hot data - 20% < File_Heat <= 50% -> lukewarm data - 0% <= File_Heat <= 20% -> ice cold data

We will have to tune the limits between the File_Heat classes according to the customer's wishes. Are there better parameter settings for achieving this? Do any scripts/programs exist for analyzing the file heat data?

We have also observed that policy runs on a large GPFS file system significantly reduce metadata performance until the job is finished; a run took about 15 minutes on an 880 TB GPFS file system with 150 million entries. How does the system behave when file heat is first switched on? Do all files in the GPFS then have the same temperature?
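Just to illustrate what we have in mind for the monthly snapshot (a rough, untested sketch only; the rule name, list name, file names and SHOW() fields are placeholders we made up): a policy list rule such as

RULE 'heatreport' LIST 'allheat' SHOW('heat=' || varchar(FILE_HEAT) || ' uid=' || varchar(USER_ID) || ' kb=' || varchar(KB_ALLOCATED))

driven by something along the lines of

mmapplypolicy gpfs0 -P heatreport.pol -I defer -f /some/scratch/dir/heatreport -L 0

and then post-processing the generated list file(s) with sort/awk to bin the entries into the heat classes above. Whether that is the sensible way to do it is exactly what we would like to know.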
Thanks for your help Ciao Andreas -- Andreas Landh?u?er +49 151 12133027 (mobile) alandhae at gmx.de From makaplan at us.ibm.com Mon Sep 26 16:11:52 2016 From: makaplan at us.ibm.com (Marc A Kaplan) Date: Mon, 26 Sep 2016 11:11:52 -0400 Subject: [gpfsug-discuss] File_heat for GPFS File Systems In-Reply-To: References: Message-ID: fileHeatLossPercent=10, fileHeatPeriodMinutes=1440 means any file that has not been accessed for 1440 minutes (24 hours = 1 day) will lose 10% of its Heat. So if it's heat was X at noon today, tomorrow 0.90 X, the next day 0.81X, on the k'th day (.90)**k * X. After 63 fileHeatPeriods, we always round down and compute file heat as 0.0. The computation (in floating point with some approximations) is done "on demand" based on a heat value stored in the Inode the last time the unix access "atime" and the current time. So the cost of maintaining FILE_HEAT for a file is some bit twiddling, but only when the file is accessed and the atime would be updated in the inode anyway. File heat increases by approximately 1.0 each time the entire file is read from disk. This is done proportionately so if you read in half of the blocks the increase is 0.5. If you read all the blocks twice FROM DISK the file heat is increased by 2. And so on. But only IOPs are charged. If you repeatedly do posix read()s but the data is in cache, no heat is added. The easiest way to observe FILE_HEAT is with the mmapplypolicy directory -I test -L 2 -P fileheatrule.policy RULE 'fileheatrule' LIST 'hot' SHOW('Heat=' || varchar(FILE_HEAT)) /* in file fileheatfule.policy */ Because policy reads metadata from inodes as stored on disk, when experimenting/testing you may need to mmfsctl fs suspend-write; mmfsctl fs resume to see results immediately. From: Andreas Landh?u?er To: gpfsug-discuss at spectrumscale.org Date: 09/26/2016 08:12 AM Subject: [gpfsug-discuss] File_heat for GPFS File Systems Questions over Questions ... Sent by: gpfsug-discuss-bounces at spectrumscale.org Hello GPFS experts, customer wanting a report about the usage of the usage including file_heat in a large Filesystem. The report should be taken every month. mmchconfig fileHeatLossPercent=10,fileHeatPeriodMinutes=30240 -i fileHeatPeriodMinutes=30240 equals to 21 days. I#m wondering about the behavior of fileHeatLossPercent. - If it is set to 10, will file_heat decrease from 1 to 0 in 10 steps? - Or does file_heat have an asymptotic behavior, and heat 0 will never be reached? Anyways the results will be similar ;-) latter taking longer. We want to achieve following file lists: - File_Heat > 50% -> rather hot data - File_Heat 50% < x < 20 -> lukewarm data - File_Heat 20% <= x <= 0% -> ice cold data We will have to work on the limits between the File_Heat classes, depending on customers wishes. Are there better parameter settings for achieving this? Do any scripts/programs exist for analyzing the file_heat data? We have observed when taking policy runs on a large GPFS file system, the meta data performance significantly dropped, until job was finished. It took about 15 minutes on a 880 TB with 150 Mio entries GPFS file system. How is the behavior, when file_heat is being switched on? Do all files in the GPFS have the same temperature? 
Thanks for your help Ciao Andreas -- Andreas Landh?u?er +49 151 12133027 (mobile) alandhae at gmx.de _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss -------------- next part -------------- An HTML attachment was scrubbed... URL: From volobuev at us.ibm.com Mon Sep 26 19:18:15 2016 From: volobuev at us.ibm.com (Yuri L Volobuev) Date: Mon, 26 Sep 2016 11:18:15 -0700 Subject: [gpfsug-discuss] Blocksize In-Reply-To: References: <4890d1b0-2f12-f8b6-684d-c98ca2b71ab7@docum.org><17781503-26B3-4448-B7B9-1EE27ABE6D1F@ulmer.org> Message-ID: It's important to understand the differences between different metadata types, in particular where it comes to space allocation. System metadata files (inode file, inode and block allocation maps, ACL file, fileset metadata file, EA file in older versions) are allocated at well-defined moments (file system format, new storage pool creation in the case of block allocation map, etc), and those contain multiple records packed into a single block. From the block allocator point of view, the individual metadata record size is invisible, only larger blocks get actually allocated, and space usage efficiency generally isn't an issue. For user metadata (indirect blocks, directory blocks, EA overflow blocks) the situation is different. Those get allocated as the need arises, generally one at a time. So the size of an individual metadata structure matters, a lot. The smallest unit of allocation in GPFS is a subblock (1/32nd of a block). If an IB or a directory block is smaller than a subblock, the unused space in the subblock is wasted. So if one chooses to use, say, 16 MiB block size for metadata, the smallest unit of space that can be allocated is 512 KiB. If one chooses 1 MiB block size, the smallest allocation unit is 32 KiB. IBs are generally 16 KiB or 32 KiB in size (32 KiB with any reasonable data block size); directory blocks used to be limited to 32 KiB, but in the current code can be as large as 256 KiB. As one can observe, using 16 MiB metadata block size would lead to a considerable amount of wasted space for IBs and large directories (small directories can live in inodes). On the other hand, with 1 MiB block size, there'll be no wasted metadata space. Does any of this actually make a practical difference? That depends on the file system composition, namely the number of IBs (which is a function of the number of large files) and larger directories. Calculating this scientifically can be pretty involved, and really should be the job of a tool that ought to exist, but doesn't (yet). A more practical approach is doing a ballpark estimate using local file counts and typical fractions of large files and directories, using statistics available from published papers. The performance implications of a given metadata block size choice is a subject of nearly infinite depth, and this question ultimately can only be answered by doing experiments with a specific workload on specific hardware. The metadata space utilization efficiency is something that can be answered conclusively though. 
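As a purely illustrative back-of-envelope example (the file counts are made up, not taken from any measured system): suppose a file system holds 50 million files each large enough to need one 32 KiB indirect block, plus a few hundred thousand directories too large to fit in their inodes. With a 1 MiB metadata block size the subblock is 32 KiB, each IB fits exactly, and the IBs occupy roughly 50M x 32 KiB, about 1.5 TiB. With a 16 MiB metadata block size the subblock is 512 KiB, so the same IBs occupy roughly 50M x 512 KiB, about 24 TiB, of which about 22.5 TiB is pure padding. The large directories add to the difference in the same proportion. Whether a real installation lands anywhere near these numbers depends entirely on its file and directory size distribution, which is why the estimate has to be redone per system.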
yuri From: "Buterbaugh, Kevin L" To: gpfsug main discussion list , Date: 09/24/2016 07:19 AM Subject: Re: [gpfsug-discuss] Blocksize Sent by: gpfsug-discuss-bounces at spectrumscale.org Hi Sven, I am confused by your statement that the metadata block size should be 1 MB and am very interested in learning the rationale behind this as I am currently looking at all aspects of our current GPFS configuration and the possibility of making major changes. If you have a filesystem with only metadataOnly disks in the system pool and the default size of an inode is 4K (which we would do, since we have recently discovered that even on our scratch filesystem we have a bazillion files that are 4K or smaller and could therefore have their data stored in the inode, right?), then why would you set the metadata block size to anything larger than 128K when a sub-block is 1/32nd of a block? I.e., with a 1 MB block size for metadata wouldn?t you be wasting a massive amount of space? What am I missing / confused about there? Oh, and here?s a related question ? let?s just say I have the above configuration ? my system pool is metadata only and is on SSD?s. Then I have two other dataOnly pools that are spinning disk. One is for ?regular? access and the other is the ?capacity? pool ? i.e. a pool of slower storage where we move files with large access times. I have a policy that says something like ?move all files with an access time > 6 months to the capacity pool.? Of those bazillion files less than 4K in size that are fitting in the inode currently, probably half a bazillion () of them would be subject to that rule. Will they get moved to the spinning disk capacity pool or will they stay in the inode?? Thanks! This is a very timely and interesting discussion for me as well... Kevin On Sep 23, 2016, at 4:35 PM, Sven Oehme wrote: your metadata block size these days should be 1 MB and there are only very few workloads for which you should run with a filesystem blocksize below 1 MB. so if you don't know exactly what to pick, 1 MB is a good starting point. the general rule still applies that your filesystem blocksize (metadata or data pool) should match your raid controller (or GNR vdisk) stripe size of the particular pool. so if you use a 128k strip size(defaut in many midrange storage controllers) in a 8+2p raid array, your stripe or track size is 1 MB and therefore the blocksize of this pool should be 1 MB. i see many customers in the field using 1MB or even smaller blocksize on RAID stripes of 2 MB or above and your performance will be significant impacted by that. Sven ------------------------------------------ Sven Oehme Scalable Storage Research email: oehmes at us.ibm.com Phone: +1 (408) 824-8904 IBM Almaden Research Lab ------------------------------------------ Stephen Ulmer ---09/23/2016 12:16:34 PM---Not to be too pedantic, but I believe the the subblock size is 1/32 of the block size (which strengt From: Stephen Ulmer To: gpfsug main discussion list Date: 09/23/2016 12:16 PM Subject: Re: [gpfsug-discuss] Blocksize Sent by: gpfsug-discuss-bounces at spectrumscale.org Not to be too pedantic, but I believe the the subblock size is 1/32 of the block size (which strengthens Luis?s arguments below). I thought the the original question was NOT about inode size, but about metadata block size. You can specify that the system pool have a different block size from the rest of the filesystem, providing that it ONLY holds metadata (?metadata-block-size option to mmcrfs). 
So with 4K inodes (which should be used for all new filesystems without some counter-indication), I would think that we?d want to use a metadata block size of 4K*32=128K. This is independent of the regular block size, which you can calculate based on the workload if you?re lucky. There could be a great reason NOT to use 128K metadata block size, but I don?t know what it is. I?d be happy to be corrected about this if it?s out of whack. -- Stephen On Sep 22, 2016, at 3:37 PM, Luis Bolinches < luis.bolinches at fi.ibm.com> wrote: Hi My 2 cents. Leave at least 4K inodes, then you get massive improvement on small files (less 3.5K minus whatever you use on xattr) About blocksize for data, unless you have actual data that suggest that you will actually benefit from smaller than 1MB block, leave there. GPFS uses sublocks where 1/16th of the BS can be allocated to different files, so the "waste" is much less than you think on 1MB and you get the throughput and less structures of much more data blocks. No warranty at all but I try to do this when the BS talk comes in: (might need some clean up it could not be last note but you get the idea) POSIX find . -type f -name '*' -exec ls -l {} \; > find_ls_files.out GPFS cd /usr/lpp/mmfs/samples/ilm gcc mmfindUtil_processOutputFile.c -o mmfindUtil_processOutputFile ./mmfind /gpfs/shared -ls -type f > find_ls_files.out CONVERT to CSV POSIX cat find_ls_files.out | awk '{print $5","}' > find_ls_files.out.csv GPFS cat find_ls_files.out | awk '{print $7","}' > find_ls_files.out.csv LOAD in octave FILESIZE = int32 (dlmread ("find_ls_files.out.csv", ",")); Clean the second column (OPTIONAL as the next clean up will do the same) FILESIZE(:,[2]) = []; If we are on 4K aligment we need to clean the files that go to inodes (WELL not exactly ... extended attributes! so maybe use a lower number!) FILESIZE(FILESIZE<=3584) =[]; If we are not we need to clean the 0 size files FILESIZE(FILESIZE==0) =[]; Median FILESIZEMEDIAN = int32 (median (FILESIZE)) Mean FILESIZEMEAN = int32 (mean (FILESIZE)) Variance int32 (var (FILESIZE)) iqr interquartile range, i.e., the difference between the upper and lower quartile, of the input data. int32 (iqr (FILESIZE)) Standard deviation For some FS with lots of files you might need a rather powerful machine to run the calculations on octave, I never hit anything could not manage on a 64GB RAM Power box. Most of the times it is enough with my laptop. -- Yst?v?llisin terveisin / Kind regards / Saludos cordiales / Salutations Luis Bolinches Lab Services http://www-03.ibm.com/systems/services/labservices/ IBM Laajalahdentie 23 (main Entrance) Helsinki, 00330 Finland Phone: +358 503112585 "If you continually give you will continually have." Anonymous ----- Original message ----- From: Stef Coene Sent by: gpfsug-discuss-bounces at spectrumscale.org To: gpfsug main discussion list < gpfsug-discuss at spectrumscale.org> Cc: Subject: Re: [gpfsug-discuss] Blocksize Date: Thu, Sep 22, 2016 10:30 PM On 09/22/2016 09:07 PM, J. Eric Wonderley wrote: > It defaults to 4k: > mmlsfs testbs8M -i > flag value description > ------------------- ------------------------ > ----------------------------------- > -i 4096 Inode size in bytes > > I think you can make as small as 512b. Gpfs will store very small > files in the inode. > > Typically you want your average file size to be your blocksize and your > filesystem has one blocksize and one inodesize. The files are not small, but around 20 MB on average. 
So I calculated with IBM that a 1 MB or 2 MB block size is best. But I'm not sure if it's better to use a smaller block size for the metadata. The file system is not that large (400 TB) and will hold backup data from CommVault. Stef _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss Ellei edell? ole toisin mainittu: / Unless stated otherwise above: Oy IBM Finland Ab PL 265, 00101 Helsinki, Finland Business ID, Y-tunnus: 0195876-3 Registered in Finland _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: graycol.gif Type: image/gif Size: 105 bytes Desc: not available URL: From ulmer at ulmer.org Mon Sep 26 20:01:56 2016 From: ulmer at ulmer.org (Stephen Ulmer) Date: Mon, 26 Sep 2016 15:01:56 -0400 Subject: [gpfsug-discuss] Blocksize In-Reply-To: References: <4890d1b0-2f12-f8b6-684d-c98ca2b71ab7@docum.org> <17781503-26B3-4448-B7B9-1EE27ABE6D1F@ulmer.org> Message-ID: <13272A1B-425D-4DDD-A931-490604F92D61@ulmer.org> Now I?ve got anther question? which I?ll let bake for a while. Okay, to (poorly) summarize: There are items OTHER THAN INODES stored as metadata in GPFS. These items have a VARIETY OF SIZES, but are packed in such a way that we should just not worry about wasted space unless we pick a LARGE metadata block size ? or if we don?t pick a ?reasonable? metadata block size after picking a ?large? file system block size that applies to both. Performance is hard, and the gain from calculating exactly the best metadata block size is much smaller than performance gains attained through code optimization. If we were to try and calculate the appropriate metadata block size we would likely be wrong anyway, since none of us get our data at the idealized physics shop that sells massless rulers and frictionless pulleys. We should probably all use a metadata block size around 1MB. Nobody has said this outright, but it?s been the example as the ?good? size at least three times in this thread. Under no circumstances should we do what many of us would have done and pick 128K, which made sense based on all of our previous education that is no longer applicable. Did I miss anything? :) Liberty, -- Stephen > On Sep 26, 2016, at 2:18 PM, Yuri L Volobuev wrote: > > It's important to understand the differences between different metadata types, in particular where it comes to space allocation. > > System metadata files (inode file, inode and block allocation maps, ACL file, fileset metadata file, EA file in older versions) are allocated at well-defined moments (file system format, new storage pool creation in the case of block allocation map, etc), and those contain multiple records packed into a single block. 
From the block allocator point of view, the individual metadata record size is invisible, only larger blocks get actually allocated, and space usage efficiency generally isn't an issue. > > For user metadata (indirect blocks, directory blocks, EA overflow blocks) the situation is different. Those get allocated as the need arises, generally one at a time. So the size of an individual metadata structure matters, a lot. The smallest unit of allocation in GPFS is a subblock (1/32nd of a block). If an IB or a directory block is smaller than a subblock, the unused space in the subblock is wasted. So if one chooses to use, say, 16 MiB block size for metadata, the smallest unit of space that can be allocated is 512 KiB. If one chooses 1 MiB block size, the smallest allocation unit is 32 KiB. IBs are generally 16 KiB or 32 KiB in size (32 KiB with any reasonable data block size); directory blocks used to be limited to 32 KiB, but in the current code can be as large as 256 KiB. As one can observe, using 16 MiB metadata block size would lead to a considerable amount of wasted space for IBs and large directories (small directories can live in inodes). On the other hand, with 1 MiB block size, there'll be no wasted metadata space. Does any of this actually make a practical difference? That depends on the file system composition, namely the number of IBs (which is a function of the number of large files) and larger directories. Calculating this scientifically can be pretty involved, and really should be the job of a tool that ought to exist, but doesn't (yet). A more practical approach is doing a ballpark estimate using local file counts and typical fractions of large files and directories, using statistics available from published papers. > > The performance implications of a given metadata block size choice is a subject of nearly infinite depth, and this question ultimately can only be answered by doing experiments with a specific workload on specific hardware. The metadata space utilization efficiency is something that can be answered conclusively though. > > yuri > > "Buterbaugh, Kevin L" ---09/24/2016 07:19:09 AM---Hi Sven, I am confused by your statement that the metadata block size should be 1 MB and am very int > > From: "Buterbaugh, Kevin L" > To: gpfsug main discussion list , > Date: 09/24/2016 07:19 AM > Subject: Re: [gpfsug-discuss] Blocksize > Sent by: gpfsug-discuss-bounces at spectrumscale.org > > > > > Hi Sven, > > I am confused by your statement that the metadata block size should be 1 MB and am very interested in learning the rationale behind this as I am currently looking at all aspects of our current GPFS configuration and the possibility of making major changes. > > If you have a filesystem with only metadataOnly disks in the system pool and the default size of an inode is 4K (which we would do, since we have recently discovered that even on our scratch filesystem we have a bazillion files that are 4K or smaller and could therefore have their data stored in the inode, right?), then why would you set the metadata block size to anything larger than 128K when a sub-block is 1/32nd of a block? I.e., with a 1 MB block size for metadata wouldn?t you be wasting a massive amount of space? > > What am I missing / confused about there? > > Oh, and here?s a related question ? let?s just say I have the above configuration ? my system pool is metadata only and is on SSD?s. Then I have two other dataOnly pools that are spinning disk. One is for ?regular? access and the other is the ?capacity? 
pool ? i.e. a pool of slower storage where we move files with large access times. I have a policy that says something like ?move all files with an access time > 6 months to the capacity pool.? Of those bazillion files less than 4K in size that are fitting in the inode currently, probably half a bazillion () of them would be subject to that rule. Will they get moved to the spinning disk capacity pool or will they stay in the inode?? > > Thanks! This is a very timely and interesting discussion for me as well... > > Kevin > On Sep 23, 2016, at 4:35 PM, Sven Oehme > wrote: > your metadata block size these days should be 1 MB and there are only very few workloads for which you should run with a filesystem blocksize below 1 MB. so if you don't know exactly what to pick, 1 MB is a good starting point. > the general rule still applies that your filesystem blocksize (metadata or data pool) should match your raid controller (or GNR vdisk) stripe size of the particular pool. > > so if you use a 128k strip size(defaut in many midrange storage controllers) in a 8+2p raid array, your stripe or track size is 1 MB and therefore the blocksize of this pool should be 1 MB. i see many customers in the field using 1MB or even smaller blocksize on RAID stripes of 2 MB or above and your performance will be significant impacted by that. > > Sven > > ------------------------------------------ > Sven Oehme > Scalable Storage Research > email: oehmes at us.ibm.com > Phone: +1 (408) 824-8904 > IBM Almaden Research Lab > ------------------------------------------ > > Stephen Ulmer ---09/23/2016 12:16:34 PM---Not to be too pedantic, but I believe the the subblock size is 1/32 of the block size (which strengt > > From: Stephen Ulmer > > To: gpfsug main discussion list > > Date: 09/23/2016 12:16 PM > Subject: Re: [gpfsug-discuss] Blocksize > Sent by: gpfsug-discuss-bounces at spectrumscale.org > > > > > Not to be too pedantic, but I believe the the subblock size is 1/32 of the block size (which strengthens Luis?s arguments below). > > I thought the the original question was NOT about inode size, but about metadata block size. You can specify that the system pool have a different block size from the rest of the filesystem, providing that it ONLY holds metadata (?metadata-block-size option to mmcrfs). > > So with 4K inodes (which should be used for all new filesystems without some counter-indication), I would think that we?d want to use a metadata block size of 4K*32=128K. This is independent of the regular block size, which you can calculate based on the workload if you?re lucky. > > There could be a great reason NOT to use 128K metadata block size, but I don?t know what it is. I?d be happy to be corrected about this if it?s out of whack. > > -- > Stephen > > On Sep 22, 2016, at 3:37 PM, Luis Bolinches > wrote: > > Hi > > My 2 cents. > > Leave at least 4K inodes, then you get massive improvement on small files (less 3.5K minus whatever you use on xattr) > > About blocksize for data, unless you have actual data that suggest that you will actually benefit from smaller than 1MB block, leave there. GPFS uses sublocks where 1/16th of the BS can be allocated to different files, so the "waste" is much less than you think on 1MB and you get the throughput and less structures of much more data blocks. > > No warranty at all but I try to do this when the BS talk comes in: (might need some clean up it could not be last note but you get the idea) > > POSIX > find . 
-type f -name '*' -exec ls -l {} \; > find_ls_files.out > GPFS > cd /usr/lpp/mmfs/samples/ilm > gcc mmfindUtil_processOutputFile.c -o mmfindUtil_processOutputFile > ./mmfind /gpfs/shared -ls -type f > find_ls_files.out > CONVERT to CSV > > POSIX > cat find_ls_files.out | awk '{print $5","}' > find_ls_files.out.csv > GPFS > cat find_ls_files.out | awk '{print $7","}' > find_ls_files.out.csv > LOAD in octave > > FILESIZE = int32 (dlmread ("find_ls_files.out.csv", ",")); > Clean the second column (OPTIONAL as the next clean up will do the same) > > FILESIZE(:,[2]) = []; > If we are on 4K aligment we need to clean the files that go to inodes (WELL not exactly ... extended attributes! so maybe use a lower number!) > > FILESIZE(FILESIZE<=3584) =[]; > If we are not we need to clean the 0 size files > > FILESIZE(FILESIZE==0) =[]; > Median > > FILESIZEMEDIAN = int32 (median (FILESIZE)) > Mean > > FILESIZEMEAN = int32 (mean (FILESIZE)) > Variance > > int32 (var (FILESIZE)) > iqr interquartile range, i.e., the difference between the upper and lower quartile, of the input data. > > int32 (iqr (FILESIZE)) > Standard deviation > > > For some FS with lots of files you might need a rather powerful machine to run the calculations on octave, I never hit anything could not manage on a 64GB RAM Power box. Most of the times it is enough with my laptop. > > > > -- > Yst?v?llisin terveisin / Kind regards / Saludos cordiales / Salutations > > Luis Bolinches > Lab Services > http://www-03.ibm.com/systems/services/labservices/ > > IBM Laajalahdentie 23 (main Entrance) Helsinki, 00330 Finland > Phone: +358 503112585 > > "If you continually give you will continually have." Anonymous > > > ----- Original message ----- > From: Stef Coene > > Sent by: gpfsug-discuss-bounces at spectrumscale.org > To: gpfsug main discussion list > > Cc: > Subject: Re: [gpfsug-discuss] Blocksize > Date: Thu, Sep 22, 2016 10:30 PM > > On 09/22/2016 09:07 PM, J. Eric Wonderley wrote: > > It defaults to 4k: > > mmlsfs testbs8M -i > > flag value description > > ------------------- ------------------------ > > ----------------------------------- > > -i 4096 Inode size in bytes > > > > I think you can make as small as 512b. Gpfs will store very small > > files in the inode. > > > > Typically you want your average file size to be your blocksize and your > > filesystem has one blocksize and one inodesize. > > The files are not small, but around 20 MB on average. > So I calculated with IBM that a 1 MB or 2 MB block size is best. > > But I'm not sure if it's better to use a smaller block size for the > metadata. > > The file system is not that large (400 TB) and will hold backup data > from CommVault. > > > Stef > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > > > Ellei edell? 
ole toisin mainittu: / Unless stated otherwise above: > Oy IBM Finland Ab > PL 265, 00101 Helsinki, Finland > Business ID, Y-tunnus: 0195876-3 > Registered in Finland > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss -------------- next part -------------- An HTML attachment was scrubbed... URL: From volobuev at us.ibm.com Mon Sep 26 20:29:18 2016 From: volobuev at us.ibm.com (Yuri L Volobuev) Date: Mon, 26 Sep 2016 12:29:18 -0700 Subject: [gpfsug-discuss] Blocksize In-Reply-To: <13272A1B-425D-4DDD-A931-490604F92D61@ulmer.org> References: <4890d1b0-2f12-f8b6-684d-c98ca2b71ab7@docum.org><17781503-26B3-4448-B7B9-1EE27ABE6D1F@ulmer.org> <13272A1B-425D-4DDD-A931-490604F92D61@ulmer.org> Message-ID: I would put the net summary this way: in GPFS, the "Goldilocks zone" for metadata block size is 256K - 1M. If one plans to create a new file system using GPFS V4.2+, 1M is a sound choice. In an ideal world, block size choice shouldn't really be a choice. It's a low-level implementation detail that one day should go the way of the manual ignition timing adjustment -- something that used to be necessary in the olden days, and something that select enthusiasts like to tweak to this day, but something that's irrelevant for the overwhelming majority of the folks who just want the engine to run. There's work being done in that general direction in GPFS, but we aren't there yet. yuri From: Stephen Ulmer To: gpfsug main discussion list , Date: 09/26/2016 12:02 PM Subject: Re: [gpfsug-discuss] Blocksize Sent by: gpfsug-discuss-bounces at spectrumscale.org Now I?ve got anther question? which I?ll let bake for a while. Okay, to (poorly) summarize: There are items OTHER THAN INODES stored as metadata in GPFS. These items have a VARIETY OF SIZES, but are packed in such a way that we should just not worry about wasted space unless we pick a LARGE metadata block size ? or if we don?t pick a ?reasonable? metadata block size after picking a ?large? file system block size that applies to both. Performance is hard, and the gain from calculating exactly the best metadata block size is much smaller than performance gains attained through code optimization. If we were to try and calculate the appropriate metadata block size we would likely be wrong anyway, since none of us get our data at the idealized physics shop that sells massless rulers and frictionless pulleys. We should probably all use a metadata block size around 1MB. Nobody has said this outright, but it?s been the example as the ?good? size at least three times in this thread. Under no circumstances should we do what many of us would have done and pick 128K, which made sense based on all of our previous education that is no longer applicable. Did I miss anything? 
:) Liberty, -- Stephen On Sep 26, 2016, at 2:18 PM, Yuri L Volobuev wrote: It's important to understand the differences between different metadata types, in particular where it comes to space allocation. System metadata files (inode file, inode and block allocation maps, ACL file, fileset metadata file, EA file in older versions) are allocated at well-defined moments (file system format, new storage pool creation in the case of block allocation map, etc), and those contain multiple records packed into a single block. From the block allocator point of view, the individual metadata record size is invisible, only larger blocks get actually allocated, and space usage efficiency generally isn't an issue. For user metadata (indirect blocks, directory blocks, EA overflow blocks) the situation is different. Those get allocated as the need arises, generally one at a time. So the size of an individual metadata structure matters, a lot. The smallest unit of allocation in GPFS is a subblock (1/32nd of a block). If an IB or a directory block is smaller than a subblock, the unused space in the subblock is wasted. So if one chooses to use, say, 16 MiB block size for metadata, the smallest unit of space that can be allocated is 512 KiB. If one chooses 1 MiB block size, the smallest allocation unit is 32 KiB. IBs are generally 16 KiB or 32 KiB in size (32 KiB with any reasonable data block size); directory blocks used to be limited to 32 KiB, but in the current code can be as large as 256 KiB. As one can observe, using 16 MiB metadata block size would lead to a considerable amount of wasted space for IBs and large directories (small directories can live in inodes). On the other hand, with 1 MiB block size, there'll be no wasted metadata space. Does any of this actually make a practical difference? That depends on the file system composition, namely the number of IBs (which is a function of the number of large files) and larger directories. Calculating this scientifically can be pretty involved, and really should be the job of a tool that ought to exist, but doesn't (yet). A more practical approach is doing a ballpark estimate using local file counts and typical fractions of large files and directories, using statistics available from published papers. The performance implications of a given metadata block size choice is a subject of nearly infinite depth, and this question ultimately can only be answered by doing experiments with a specific workload on specific hardware. The metadata space utilization efficiency is something that can be answered conclusively though. yuri "Buterbaugh, Kevin L" ---09/24/2016 07:19:09 AM---Hi Sven, I am confused by your statement that the metadata block size should be 1 MB and am very int From: "Buterbaugh, Kevin L" To: gpfsug main discussion list , Date: 09/24/2016 07:19 AM Subject: Re: [gpfsug-discuss] Blocksize Sent by: gpfsug-discuss-bounces at spectrumscale.org Hi Sven, I am confused by your statement that the metadata block size should be 1 MB and am very interested in learning the rationale behind this as I am currently looking at all aspects of our current GPFS configuration and the possibility of making major changes. 
If you have a filesystem with only metadataOnly disks in the system pool and the default size of an inode is 4K (which we would do, since we have recently discovered that even on our scratch filesystem we have a bazillion files that are 4K or smaller and could therefore have their data stored in the inode, right?), then why would you set the metadata block size to anything larger than 128K when a sub-block is 1/32nd of a block? I.e., with a 1 MB block size for metadata wouldn?t you be wasting a massive amount of space? What am I missing / confused about there? Oh, and here?s a related question ? let?s just say I have the above configuration ? my system pool is metadata only and is on SSD?s. Then I have two other dataOnly pools that are spinning disk. One is for ?regular? access and the other is the ?capacity? pool ? i.e. a pool of slower storage where we move files with large access times. I have a policy that says something like ?move all files with an access time > 6 months to the capacity pool.? Of those bazillion files less than 4K in size that are fitting in the inode currently, probably half a bazillion () of them would be subject to that rule. Will they get moved to the spinning disk capacity pool or will they stay in the inode?? Thanks! This is a very timely and interesting discussion for me as well... Kevin On Sep 23, 2016, at 4:35 PM, Sven Oehme < oehmes at us.ibm.com> wrote: your metadata block size these days should be 1 MB and there are only very few workloads for which you should run with a filesystem blocksize below 1 MB. so if you don't know exactly what to pick, 1 MB is a good starting point. the general rule still applies that your filesystem blocksize (metadata or data pool) should match your raid controller (or GNR vdisk) stripe size of the particular pool. so if you use a 128k strip size(defaut in many midrange storage controllers) in a 8+2p raid array, your stripe or track size is 1 MB and therefore the blocksize of this pool should be 1 MB. i see many customers in the field using 1MB or even smaller blocksize on RAID stripes of 2 MB or above and your performance will be significant impacted by that. Sven ------------------------------------------ Sven Oehme Scalable Storage Research email: oehmes at us.ibm.com Phone: +1 (408) 824-8904 IBM Almaden Research Lab ------------------------------------------ Stephen Ulmer ---09/23/2016 12:16:34 PM---Not to be too pedantic, but I believe the the subblock size is 1/32 of the block size (which strengt From: Stephen Ulmer To: gpfsug main discussion list < gpfsug-discuss at spectrumscale.org> Date: 09/23/2016 12:16 PM Subject: Re: [gpfsug-discuss] Blocksize Sent by: gpfsug-discuss-bounces at spectrumscale.org Not to be too pedantic, but I believe the the subblock size is 1/32 of the block size (which strengthens Luis?s arguments below). I thought the the original question was NOT about inode size, but about metadata block size. You can specify that the system pool have a different block size from the rest of the filesystem, providing that it ONLY holds metadata (?metadata-block-size option to mmcrfs). So with 4K inodes (which should be used for all new filesystems without some counter-indication), I would think that we?d want to use a metadata block size of 4K*32=128K. This is independent of the regular block size, which you can calculate based on the workload if you?re lucky. There could be a great reason NOT to use 128K metadata block size, but I don?t know what it is. 
I?d be happy to be corrected about this if it?s out of whack. -- Stephen On Sep 22, 2016, at 3:37 PM, Luis Bolinches < luis.bolinches at fi.ibm.com> wrote: Hi My 2 cents. Leave at least 4K inodes, then you get massive improvement on small files (less 3.5K minus whatever you use on xattr) About blocksize for data, unless you have actual data that suggest that you will actually benefit from smaller than 1MB block, leave there. GPFS uses sublocks where 1/16th of the BS can be allocated to different files, so the "waste" is much less than you think on 1MB and you get the throughput and less structures of much more data blocks. No warranty at all but I try to do this when the BS talk comes in: (might need some clean up it could not be last note but you get the idea) POSIX find . -type f -name '*' -exec ls -l {} \; > find_ls_files.out GPFS cd /usr/lpp/mmfs/samples/ilm gcc mmfindUtil_processOutputFile.c -o mmfindUtil_processOutputFile ./mmfind /gpfs/shared -ls -type f > find_ls_files.out CONVERT to CSV POSIX cat find_ls_files.out | awk '{print $5","}' > find_ls_files.out.csv GPFS cat find_ls_files.out | awk '{print $7","}' > find_ls_files.out.csv LOAD in octave FILESIZE = int32 (dlmread ("find_ls_files.out.csv", ",")); Clean the second column (OPTIONAL as the next clean up will do the same) FILESIZE(:,[2]) = []; If we are on 4K aligment we need to clean the files that go to inodes (WELL not exactly ... extended attributes! so maybe use a lower number!) FILESIZE(FILESIZE<=3584) =[]; If we are not we need to clean the 0 size files FILESIZE(FILESIZE==0) =[]; Median FILESIZEMEDIAN = int32 (median (FILESIZE)) Mean FILESIZEMEAN = int32 (mean (FILESIZE)) Variance int32 (var (FILESIZE)) iqr interquartile range, i.e., the difference between the upper and lower quartile, of the input data. int32 (iqr (FILESIZE)) Standard deviation For some FS with lots of files you might need a rather powerful machine to run the calculations on octave, I never hit anything could not manage on a 64GB RAM Power box. Most of the times it is enough with my laptop. -- Yst?v?llisin terveisin / Kind regards / Saludos cordiales / Salutations Luis Bolinches Lab Services http://www-03.ibm.com/systems/services/labservices/ IBM Laajalahdentie 23 (main Entrance) Helsinki, 00330 Finland Phone: +358 503112585 "If you continually give you will continually have." Anonymous ----- Original message ----- From: Stef Coene < stef.coene at docum.org> Sent by: gpfsug-discuss-bounces at spectrumscale.org To: gpfsug main discussion list < gpfsug-discuss at spectrumscale.org> Cc: Subject: Re: [gpfsug-discuss] Blocksize Date: Thu, Sep 22, 2016 10:30 PM On 09/22/2016 09:07 PM, J. Eric Wonderley wrote: > It defaults to 4k: > mmlsfs testbs8M -i > flag value description > ------------------- ------------------------ > ----------------------------------- > -i 4096 Inode size in bytes > > I think you can make as small as 512b. Gpfs will store very small > files in the inode. > > Typically you want your average file size to be your blocksize and your > filesystem has one blocksize and one inodesize. The files are not small, but around 20 MB on average. So I calculated with IBM that a 1 MB or 2 MB block size is best. But I'm not sure if it's better to use a smaller block size for the metadata. The file system is not that large (400 TB) and will hold backup data from CommVault. Stef _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss Ellei edell? 
ole toisin mainittu: / Unless stated otherwise above: Oy IBM Finland Ab PL 265, 00101 Helsinki, Finland Business ID, Y-tunnus: 0195876-3 Registered in Finland _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: graycol.gif Type: image/gif Size: 105 bytes Desc: not available URL: From alandhae at gmx.de Tue Sep 27 10:04:02 2016 From: alandhae at gmx.de (=?ISO-8859-15?Q?Andreas_Landh=E4u=DFer?=) Date: Tue, 27 Sep 2016 11:04:02 +0200 (CEST) Subject: [gpfsug-discuss] File_heat for GPFS File Systems In-Reply-To: References: Message-ID: On Mon, 26 Sep 2016, Marc A Kaplan wrote: Marc, thanks for your explanation, > fileHeatLossPercent=10, fileHeatPeriodMinutes=1440 > > means any file that has not been accessed for 1440 minutes (24 hours = 1 > day) will lose 10% of its Heat. > > So if it's heat was X at noon today, tomorrow 0.90 X, the next day 0.81X, > on the k'th day (.90)**k * X. > After 63 fileHeatPeriods, we always round down and compute file heat as > 0.0. > > The computation (in floating point with some approximations) is done "on > demand" based on a heat value stored in the Inode the last time the unix > access "atime" and the current time. So the cost of maintaining > FILE_HEAT for a file is some bit twiddling, but only when the file is > accessed and the atime would be updated in the inode anyway. > > File heat increases by approximately 1.0 each time the entire file is read > from disk. This is done proportionately so if you read in half of the > blocks the increase is 0.5. > If you read all the blocks twice FROM DISK the file heat is increased by > 2. And so on. But only IOPs are charged. If you repeatedly do posix > read()s but the data is in cache, no heat is added. with the above definition file heat >= 0.0 e.g. any positive floating point value is valid. I need to categorize the files into categories hot, warm, lukewarm and cold. How do I achieve this, since the maximum heat is varying and need to be defined every time when requesting the report. 
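One approach we are considering (only an untested sketch; the file names and the 50%/20% cut-offs are placeholders, and the exact naming and format of the mmapplypolicy list output would have to be checked first) is a two-pass report: first dump FILE_HEAT for every file with a LIST rule as sketched earlier, then scale against the maximum observed in that run, e.g.:

# extract the heat values from the generated list file (assumes SHOW() wrote a "heat=" token)
sed -n 's/.*heat=\([0-9.eE+-]*\).*/\1/p' /tmp/heatreport.list.allheat > /tmp/heat.values
# current maximum heat in this run
MAXHEAT=$(sort -g /tmp/heat.values | tail -1)
# bin files relative to the maximum: >50% hot, 20-50% lukewarm, the rest cold
awk -v max="$MAXHEAT" '{ r = (max > 0) ? $1 / max : 0;
  if (r > 0.5) hot++; else if (r > 0.2) warm++; else cold++ }
  END { printf "hot %d lukewarm %d cold %d\n", hot, warm, cold }' /tmp/heat.values

That would at least give class boundaries that follow the changing maximum without a second pass over the filesystem, but if there is a cleaner way we would prefer that.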
We are wishing to migrate data according to the heat onto different storage categories (expensive --> cheap devices) > The easiest way to observe FILE_HEAT is with the mmapplypolicy directory > -I test -L 2 -P fileheatrule.policy > > RULE 'fileheatrule' LIST 'hot' SHOW('Heat=' || varchar(FILE_HEAT)) /* in > file fileheatfule.policy */ > > Because policy reads metadata from inodes as stored on disk, when > experimenting/testing you may need to > > mmfsctl fs suspend-write; mmfsctl fs resume Doing this on a production file system, a valid change request need to be filed, and description of the risks for customers data and so on have to be defined (ITIL) ... Any help and ideas will be appreciated Andreas -- Andreas Landh?u?er +49 151 12133027 (mobile) alandhae at gmx.de From makaplan at us.ibm.com Tue Sep 27 15:25:04 2016 From: makaplan at us.ibm.com (Marc A Kaplan) Date: Tue, 27 Sep 2016 10:25:04 -0400 Subject: [gpfsug-discuss] File_heat for GPFS File Systems In-Reply-To: References: Message-ID: You asked ... "We are wishing to migrate data according to the heat onto different storage categories (expensive --> cheap devices)" We suggest a policy rule like this: Rule 'm' Migrate From Pool 'Expensive' To Pool 'Thrifty' Threshold(90,75) Weight(-FILE_HEAT) /* minus sign! */ Which you can interpret as: When The 'Expensive' pool is 90% or more full, Migrate the lowest heat (coldest!) files to pool 'Thrifty', until the occupancy of 'Expensive' has been reduced to 75%. The concepts of Threshold and Weight have been in the produce since the MIGRATE rule was introduced. Another concept we introduced at the same time as FILE_HEAT was GROUP POOL. We've had little feedback and very few questions about this, so either it works great or is not being used much. (Maybe both are true ;-) ) GROUP POOL migration is documented in the Information Lifecycle Management chapter along with the other elements of the policy rules. In the 4.2.1 doc we suggest you can "repack" several pools with one GROUP POOL rule and one MIGRATE rule like this: You can ?repack? a group pool by WEIGHT. Migrate files of higher weight to preferred disk pools by specifying a group pool as both the source and the target of a MIGRATE rule. rule ?grpdef? GROUP POOL ?gpool? IS ?ssd? LIMIT(90) THEN ?fast? LIMIT(85) THEN ?sata? rule ?repack? MIGRATE FROM POOL ?gpool? TO POOL ?gpool? WEIGHT(FILE_HEAT) This should rank all the files in the three pools from hottest to coldest, and migrate them as necessary (if feasible) so that 'ssd' is up to 90% full of the hottest, 'fast' is up to 85% full of the next most hot, and the coolest files will be migrated to 'sata'. -------------- next part -------------- An HTML attachment was scrubbed... URL: From Kevin.Buterbaugh at Vanderbilt.Edu Tue Sep 27 18:02:45 2016 From: Kevin.Buterbaugh at Vanderbilt.Edu (Buterbaugh, Kevin L) Date: Tue, 27 Sep 2016 17:02:45 +0000 Subject: [gpfsug-discuss] Blocksize In-Reply-To: References: <4890d1b0-2f12-f8b6-684d-c98ca2b71ab7@docum.org> <17781503-26B3-4448-B7B9-1EE27ABE6D1F@ulmer.org> <13272A1B-425D-4DDD-A931-490604F92D61@ulmer.org> Message-ID: <6C51B04A-9097-4598-8B4A-C484A3D98EE2@vanderbilt.edu> Yuri / Sven / anyone else who wants to jump in, First off, thank you very much for your answers. I?d like to follow up with a couple of more questions. 1) Let?s assume that our overarching goal in configuring the block size for metadata is performance from the user perspective ? i.e. how fast is an ?ls -l? on my directory? 
Space savings aren?t important, and how long policy scans or other ?administrative? type tasks take is not nearly as important as that directory listing. Does that change the recommended metadata block size? 2) Let?s assume we have 3 filesystems, /home, /scratch (traditional HPC use for those two) and /data (project space). Our storage arrays are 24-bay units with two 8+2P RAID 6 LUNs, one RAID 1 mirror, and two hot spare drives. The RAID 1 mirrors are for /home, the RAID 6 LUNs are for /scratch or /data. /home has tons of small files - so small that a 64K block size is currently used. /scratch and /data have a mixture, but a 1 MB block size is the ?sweet spot? there. If you could ?start all over? with the same hardware being the only restriction, would you: a) merge /scratch and /data into one filesystem but keep /home separate since the LUN sizes are so very different, or b) merge all three into one filesystem and use storage pools so that /home is just a separate pool within the one filesystem? And if you chose this option would you assign different block sizes to the pools? Again, I?m asking these questions because I may have the opportunity to effectively ?start all over? and want to make sure I?m doing things as optimally as possible. Thanks? Kevin On Sep 26, 2016, at 2:29 PM, Yuri L Volobuev > wrote: I would put the net summary this way: in GPFS, the "Goldilocks zone" for metadata block size is 256K - 1M. If one plans to create a new file system using GPFS V4.2+, 1M is a sound choice. In an ideal world, block size choice shouldn't really be a choice. It's a low-level implementation detail that one day should go the way of the manual ignition timing adjustment -- something that used to be necessary in the olden days, and something that select enthusiasts like to tweak to this day, but something that's irrelevant for the overwhelming majority of the folks who just want the engine to run. There's work being done in that general direction in GPFS, but we aren't there yet. yuri Stephen Ulmer ---09/26/2016 12:02:25 PM---Now I?ve got anther question? which I?ll let bake for a while. Okay, to (poorly) summarize: From: Stephen Ulmer > To: gpfsug main discussion list >, Date: 09/26/2016 12:02 PM Subject: Re: [gpfsug-discuss] Blocksize Sent by: gpfsug-discuss-bounces at spectrumscale.org ________________________________ Now I?ve got anther question? which I?ll let bake for a while. Okay, to (poorly) summarize: * There are items OTHER THAN INODES stored as metadata in GPFS. * These items have a VARIETY OF SIZES, but are packed in such a way that we should just not worry about wasted space unless we pick a LARGE metadata block size ? or if we don?t pick a ?reasonable? metadata block size after picking a ?large? file system block size that applies to both. * Performance is hard, and the gain from calculating exactly the best metadata block size is much smaller than performance gains attained through code optimization. * If we were to try and calculate the appropriate metadata block size we would likely be wrong anyway, since none of us get our data at the idealized physics shop that sells massless rulers and frictionless pulleys. * We should probably all use a metadata block size around 1MB. Nobody has said this outright, but it?s been the example as the ?good? size at least three times in this thread. * Under no circumstances should we do what many of us would have done and pick 128K, which made sense based on all of our previous education that is no longer applicable. Did I miss anything? 
:) Liberty, -- Stephen On Sep 26, 2016, at 2:18 PM, Yuri L Volobuev > wrote: It's important to understand the differences between different metadata types, in particular where it comes to space allocation. System metadata files (inode file, inode and block allocation maps, ACL file, fileset metadata file, EA file in older versions) are allocated at well-defined moments (file system format, new storage pool creation in the case of block allocation map, etc), and those contain multiple records packed into a single block. From the block allocator point of view, the individual metadata record size is invisible, only larger blocks get actually allocated, and space usage efficiency generally isn't an issue. For user metadata (indirect blocks, directory blocks, EA overflow blocks) the situation is different. Those get allocated as the need arises, generally one at a time. So the size of an individual metadata structure matters, a lot. The smallest unit of allocation in GPFS is a subblock (1/32nd of a block). If an IB or a directory block is smaller than a subblock, the unused space in the subblock is wasted. So if one chooses to use, say, 16 MiB block size for metadata, the smallest unit of space that can be allocated is 512 KiB. If one chooses 1 MiB block size, the smallest allocation unit is 32 KiB. IBs are generally 16 KiB or 32 KiB in size (32 KiB with any reasonable data block size); directory blocks used to be limited to 32 KiB, but in the current code can be as large as 256 KiB. As one can observe, using 16 MiB metadata block size would lead to a considerable amount of wasted space for IBs and large directories (small directories can live in inodes). On the other hand, with 1 MiB block size, there'll be no wasted metadata space. Does any of this actually make a practical difference? That depends on the file system composition, namely the number of IBs (which is a function of the number of large files) and larger directories. Calculating this scientifically can be pretty involved, and really should be the job of a tool that ought to exist, but doesn't (yet). A more practical approach is doing a ballpark estimate using local file counts and typical fractions of large files and directories, using statistics available from published papers. The performance implications of a given metadata block size choice is a subject of nearly infinite depth, and this question ultimately can only be answered by doing experiments with a specific workload on specific hardware. The metadata space utilization efficiency is something that can be answered conclusively though. yuri "Buterbaugh, Kevin L" ---09/24/2016 07:19:09 AM---Hi Sven, I am confused by your statement that the metadata block size should be 1 MB and am very int From: "Buterbaugh, Kevin L" > To: gpfsug main discussion list >, Date: 09/24/2016 07:19 AM Subject: Re: [gpfsug-discuss] Blocksize Sent by: gpfsug-discuss-bounces at spectrumscale.org ________________________________ Hi Sven, I am confused by your statement that the metadata block size should be 1 MB and am very interested in learning the rationale behind this as I am currently looking at all aspects of our current GPFS configuration and the possibility of making major changes. 
If you have a filesystem with only metadataOnly disks in the system pool and the default size of an inode is 4K (which we would do, since we have recently discovered that even on our scratch filesystem we have a bazillion files that are 4K or smaller and could therefore have their data stored in the inode, right?), then why would you set the metadata block size to anything larger than 128K when a sub-block is 1/32nd of a block? I.e., with a 1 MB block size for metadata wouldn?t you be wasting a massive amount of space? What am I missing / confused about there? Oh, and here?s a related question ? let?s just say I have the above configuration ? my system pool is metadata only and is on SSD?s. Then I have two other dataOnly pools that are spinning disk. One is for ?regular? access and the other is the ?capacity? pool ? i.e. a pool of slower storage where we move files with large access times. I have a policy that says something like ?move all files with an access time > 6 months to the capacity pool.? Of those bazillion files less than 4K in size that are fitting in the inode currently, probably half a bazillion () of them would be subject to that rule. Will they get moved to the spinning disk capacity pool or will they stay in the inode?? Thanks! This is a very timely and interesting discussion for me as well... Kevin On Sep 23, 2016, at 4:35 PM, Sven Oehme > wrote: your metadata block size these days should be 1 MB and there are only very few workloads for which you should run with a filesystem blocksize below 1 MB. so if you don't know exactly what to pick, 1 MB is a good starting point. the general rule still applies that your filesystem blocksize (metadata or data pool) should match your raid controller (or GNR vdisk) stripe size of the particular pool. so if you use a 128k strip size(defaut in many midrange storage controllers) in a 8+2p raid array, your stripe or track size is 1 MB and therefore the blocksize of this pool should be 1 MB. i see many customers in the field using 1MB or even smaller blocksize on RAID stripes of 2 MB or above and your performance will be significant impacted by that. Sven ------------------------------------------ Sven Oehme Scalable Storage Research email: oehmes at us.ibm.com Phone: +1 (408) 824-8904 IBM Almaden Research Lab ------------------------------------------ Stephen Ulmer ---09/23/2016 12:16:34 PM---Not to be too pedantic, but I believe the the subblock size is 1/32 of the block size (which strengt From: Stephen Ulmer > To: gpfsug main discussion list > Date: 09/23/2016 12:16 PM Subject: Re: [gpfsug-discuss] Blocksize Sent by: gpfsug-discuss-bounces at spectrumscale.org ________________________________ Not to be too pedantic, but I believe the the subblock size is 1/32 of the block size (which strengthens Luis?s arguments below). I thought the the original question was NOT about inode size, but about metadata block size. You can specify that the system pool have a different block size from the rest of the filesystem, providing that it ONLY holds metadata (?metadata-block-size option to mmcrfs). So with 4K inodes (which should be used for all new filesystems without some counter-indication), I would think that we?d want to use a metadata block size of 4K*32=128K. This is independent of the regular block size, which you can calculate based on the workload if you?re lucky. There could be a great reason NOT to use 128K metadata block size, but I don?t know what it is. I?d be happy to be corrected about this if it?s out of whack. 
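For concreteness, all of the knobs being debated here (data block size, metadata block size, inode size) are fixed when the file system is created; a rough sketch of where they get set, with purely illustrative names and values that just echo the numbers discussed above, would be:

mmcrfs gpfs1 -F nsd_stanzas.txt -B 1M --metadata-block-size 256K -i 4096 -m 2 -M 2 -r 1 -R 2

None of -B, --metadata-block-size or -i can be changed afterwards, which is why this thread keeps coming back to "if I could start all over".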
-- Stephen On Sep 22, 2016, at 3:37 PM, Luis Bolinches > wrote: Hi My 2 cents. Leave at least 4K inodes, then you get massive improvement on small files (less 3.5K minus whatever you use on xattr) About blocksize for data, unless you have actual data that suggest that you will actually benefit from smaller than 1MB block, leave there. GPFS uses sublocks where 1/16th of the BS can be allocated to different files, so the "waste" is much less than you think on 1MB and you get the throughput and less structures of much more data blocks. No warranty at all but I try to do this when the BS talk comes in: (might need some clean up it could not be last note but you get the idea) POSIX find . -type f -name '*' -exec ls -l {} \; > find_ls_files.out GPFS cd /usr/lpp/mmfs/samples/ilm gcc mmfindUtil_processOutputFile.c -o mmfindUtil_processOutputFile ./mmfind /gpfs/shared -ls -type f > find_ls_files.out CONVERT to CSV POSIX cat find_ls_files.out | awk '{print $5","}' > find_ls_files.out.csv GPFS cat find_ls_files.out | awk '{print $7","}' > find_ls_files.out.csv LOAD in octave FILESIZE = int32 (dlmread ("find_ls_files.out.csv", ",")); Clean the second column (OPTIONAL as the next clean up will do the same) FILESIZE(:,[2]) = []; If we are on 4K aligment we need to clean the files that go to inodes (WELL not exactly ... extended attributes! so maybe use a lower number!) FILESIZE(FILESIZE<=3584) =[]; If we are not we need to clean the 0 size files FILESIZE(FILESIZE==0) =[]; Median FILESIZEMEDIAN = int32 (median (FILESIZE)) Mean FILESIZEMEAN = int32 (mean (FILESIZE)) Variance int32 (var (FILESIZE)) iqr interquartile range, i.e., the difference between the upper and lower quartile, of the input data. int32 (iqr (FILESIZE)) Standard deviation For some FS with lots of files you might need a rather powerful machine to run the calculations on octave, I never hit anything could not manage on a 64GB RAM Power box. Most of the times it is enough with my laptop. -- Yst?v?llisin terveisin / Kind regards / Saludos cordiales / Salutations Luis Bolinches Lab Services http://www-03.ibm.com/systems/services/labservices/ IBM Laajalahdentie 23 (main Entrance) Helsinki, 00330 Finland Phone: +358 503112585 "If you continually give you will continually have." Anonymous ----- Original message ----- From: Stef Coene > Sent by: gpfsug-discuss-bounces at spectrumscale.org To: gpfsug main discussion list > Cc: Subject: Re: [gpfsug-discuss] Blocksize Date: Thu, Sep 22, 2016 10:30 PM On 09/22/2016 09:07 PM, J. Eric Wonderley wrote: > It defaults to 4k: > mmlsfs testbs8M -i > flag value description > ------------------- ------------------------ > ----------------------------------- > -i 4096 Inode size in bytes > > I think you can make as small as 512b. Gpfs will store very small > files in the inode. > > Typically you want your average file size to be your blocksize and your > filesystem has one blocksize and one inodesize. The files are not small, but around 20 MB on average. So I calculated with IBM that a 1 MB or 2 MB block size is best. But I'm not sure if it's better to use a smaller block size for the metadata. The file system is not that large (400 TB) and will hold backup data from CommVault. Stef _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss Ellei edell? 
ole toisin mainittu: / Unless stated otherwise above: Oy IBM Finland Ab PL 265, 00101 Helsinki, Finland Business ID, Y-tunnus: 0195876-3 Registered in Finland

_______________________________________________
gpfsug-discuss mailing list
gpfsug-discuss at spectrumscale.org
http://gpfsug.org/mailman/listinfo/gpfsug-discuss

-------------- next part -------------- An HTML attachment was scrubbed... URL:

From makaplan at us.ibm.com Tue Sep 27 18:16:52 2016 From: makaplan at us.ibm.com (Marc A Kaplan) Date: Tue, 27 Sep 2016 13:16:52 -0400 Subject: [gpfsug-discuss] Blocksize, yea, inode size! In-Reply-To: <6C51B04A-9097-4598-8B4A-C484A3D98EE2@vanderbilt.edu> References: <4890d1b0-2f12-f8b6-684d-c98ca2b71ab7@docum.org><17781503-26B3-4448-B7B9-1EE27ABE6D1F@ulmer.org><13272A1B-425D-4DDD-A931-490604F92D61@ulmer.org> <6C51B04A-9097-4598-8B4A-C484A3D98EE2@vanderbilt.edu> Message-ID:

Inode size will be a crucial choice in the scenario you describe. Consider the conflict: a large inode can hold a complete file or a complete directory. But the bigger the inode size, the fewer inodes fit in any given block size -- so when you have to read several inodes ... more IO, and it is less likely that the inodes you want are in the same block.

-------------- next part -------------- An HTML attachment was scrubbed... URL:

From chekh at stanford.edu Tue Sep 27 18:23:34 2016 From: chekh at stanford.edu (Alex Chekholko) Date: Tue, 27 Sep 2016 10:23:34 -0700 Subject: [gpfsug-discuss] Blocksize In-Reply-To: <6C51B04A-9097-4598-8B4A-C484A3D98EE2@vanderbilt.edu> References: <4890d1b0-2f12-f8b6-684d-c98ca2b71ab7@docum.org> <17781503-26B3-4448-B7B9-1EE27ABE6D1F@ulmer.org> <13272A1B-425D-4DDD-A931-490604F92D61@ulmer.org> <6C51B04A-9097-4598-8B4A-C484A3D98EE2@vanderbilt.edu> Message-ID:

On 09/27/2016 10:02 AM, Buterbaugh, Kevin L wrote:
> 1) Let's assume that our overarching goal in configuring the block size
> for metadata is performance from the user perspective - i.e. how fast is
> an "ls -l" on my directory? Space savings aren't important, and how
> long policy scans or other "administrative" type tasks take is not
> nearly as important as that directory listing. Does that change the
> recommended metadata block size?

You need to put your metadata on SSDs. Make your SSDs the only members in your 'system' pool and put your other devices into another pool, and make that pool 'dataOnly'.
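In NSD-stanza terms that split looks roughly like the sketch below (NSD, device and server names are invented; the usage= and pool= columns are the point):

%nsd: nsd=md_ssd_01 device=/dev/sdb servers=nsd01,nsd02 usage=metadataOnly failureGroup=1 pool=system
%nsd: nsd=md_ssd_02 device=/dev/sdc servers=nsd02,nsd01 usage=metadataOnly failureGroup=2 pool=system
%nsd: nsd=data_sas_01 device=/dev/sdd servers=nsd01,nsd02 usage=dataOnly failureGroup=1 pool=data
%nsd: nsd=data_sas_02 device=/dev/sde servers=nsd02,nsd01 usage=dataOnly failureGroup=2 pool=data

Feed that file to mmcrnsd -F and then to mmcrfs -F. Once a data pool other than 'system' exists you also need a default placement rule installed with mmchpolicy, e.g. RULE 'default' SET POOL 'data', otherwise new file data will try to land in the (metadata-only) system pool.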
If your SSDs are large enough to also hold some data, that's great; I typically do a migration policy to copy files smaller than filesystem block size (or definitely smaller than sub-block size) to the SSDs. Also, files smaller than 4k will usually fit into the inode (if you are using the 4k inode size). I have a system where the SSDs are regularly doing 6-7k IOPS for metadata stuff. If those same 7k IOPS were spread out over the slow data LUNs... which only have like 100 IOPS per 8+2P LUN... I'd be consuming 700 disks just for metadata IOPS. -- Alex Chekholko chekh at stanford.edu From kevindjo at us.ibm.com Tue Sep 27 18:33:29 2016 From: kevindjo at us.ibm.com (Kevin D Johnson) Date: Tue, 27 Sep 2016 17:33:29 +0000 Subject: [gpfsug-discuss] Blocksize In-Reply-To: <6C51B04A-9097-4598-8B4A-C484A3D98EE2@vanderbilt.edu> Message-ID: An HTML attachment was scrubbed... URL: From alandhae at gmx.de Tue Sep 27 19:04:06 2016 From: alandhae at gmx.de (=?UTF-8?Q?Andreas_Landh=c3=a4u=c3=9fer?=) Date: Tue, 27 Sep 2016 20:04:06 +0200 Subject: [gpfsug-discuss] File_heat for GPFS File Systems In-Reply-To: References: Message-ID: as far as I understand, if a file gets hot again, there is no rule for putting the file back into a faster storage device? We would like having something like a storage elevator depending on the fileheat. In our setup, customer likes to migrate/move data even when the the threshold is not hit, just because it's cold and the price of the storage is less. On 27.09.2016 16:25, Marc A Kaplan wrote: > > You asked ... "We are wishing to migrate data according to the heat > onto different > storage categories (expensive --> cheap devices)" > > > We suggest a policy rule like this: > > Rule 'm' Migrate From Pool 'Expensive' To Pool 'Thrifty' > Threshold(90,75) Weight(-FILE_HEAT) /* minus sign! */ > > > Which you can interpret as: > > When The 'Expensive' pool is 90% or more full, Migrate the lowest heat > (coldest!) files to pool 'Thrifty', until > the occupancy of 'Expensive' has been reduced to 75%. > > The concepts of Threshold and Weight have been in the produce since > the MIGRATE rule was introduced. > > Another concept we introduced at the same time as FILE_HEAT was GROUP > POOL. We've had little feedback and very > few questions about this, so either it works great or is not being > used much. (Maybe both are true ;-) ) > > GROUP POOL migration is documented in the Information Lifecycle > Management chapter along with the other elements of the policy rules. > > In the 4.2.1 doc we suggest you can "repack" several pools with one > GROUP POOL rule and one MIGRATE rule like this: > > You can ?repack? a group pool by *WEIGHT*. Migrate files of higher > weight to preferred disk pools > by specifying a group pool as both the source and the target of a > *MIGRATE *rule. > > rule ?grpdef? GROUP POOL ?gpool? IS ?ssd? LIMIT(90) THEN ?fast? > LIMIT(85) THEN ?sata? > rule ?repack? MIGRATE FROM POOL ?gpool? TO POOL ?gpool? WEIGHT(FILE_HEAT) > > > This should rank all the files in the three pools from hottest to > coldest, and migrate them > as necessary (if feasible) so that 'ssd' is up to 90% full of the > hottest, 'fast' is up to 85% full of the next > most hot, and the coolest files will be migrated to 'sata'. > > > > -- Andreas Landh?u?er +49 151 12133027 (mobile) alandhae at gmx.de -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From Robert.Oesterlin at nuance.com Tue Sep 27 19:12:19 2016 From: Robert.Oesterlin at nuance.com (Oesterlin, Robert) Date: Tue, 27 Sep 2016 18:12:19 +0000 Subject: [gpfsug-discuss] File_heat for GPFS File Systems Message-ID: <0217AC60-11F0-4CEB-AE91-22D25E4649DC@nuance.com> Sure, if you use a policy to migrate between two tiers, it will move files up or down based on heat. Something like this (flas and disk pools): rule grpdef GROUP POOL gpool IS flash LIMIT(75) THEN Disk rule repack MIGRATE FROM POOL gpool TO POOL gpool WEIGHT(FILE_HEAT) Bob Oesterlin Sr Storage Engineer, Nuance HPC Grid 507-269-0413 From: on behalf of Andreas Landh?u?er Reply-To: gpfsug main discussion list Date: Tuesday, September 27, 2016 at 1:04 PM To: Marc A Kaplan , gpfsug main discussion list Subject: [EXTERNAL] Re: [gpfsug-discuss] File_heat for GPFS File Systems as far as I understand, if a file gets hot again, there is no rule for putting the file back into a faster storage device? -------------- next part -------------- An HTML attachment was scrubbed... URL: From volobuev at us.ibm.com Tue Sep 27 19:26:46 2016 From: volobuev at us.ibm.com (Yuri L Volobuev) Date: Tue, 27 Sep 2016 11:26:46 -0700 Subject: [gpfsug-discuss] Blocksize In-Reply-To: <6C51B04A-9097-4598-8B4A-C484A3D98EE2@vanderbilt.edu> References: <4890d1b0-2f12-f8b6-684d-c98ca2b71ab7@docum.org><17781503-26B3-4448-B7B9-1EE27ABE6D1F@ulmer.org><13272A1B-425D-4DDD-A931-490604F92D61@ulmer.org> <6C51B04A-9097-4598-8B4A-C484A3D98EE2@vanderbilt.edu> Message-ID: > 1) Let?s assume that our overarching goal in configuring the block > size for metadata is performance from the user perspective ? i.e. > how fast is an ?ls -l? on my directory? Space savings aren?t > important, and how long policy scans or other ?administrative? type > tasks take is not nearly as important as that directory listing. > Does that change the recommended metadata block size? The performance challenges for the "ls -l" scenario are quite different from the policy scan scenario, so the same rules do not necessarily apply. During "ls -l" the code has to read inodes one by one (there's some prefetching going on, to take the edge off for the actual 'ls' thread, but prefetching is still one inode at a time). Metadata block size doesn't really come into the picture in this case, but inode size can be important -- depending on the storage performance characteristics. Does the storage you use for metadata exhibit a meaningfully different latency for 4K random reads vs 512 byte random reads? In my personal experience, on any modern storage device the difference is non-existent; in fact many devices (like all flash-based storage) use 4K native physical block size, and merely emulate 512 byte "sectors", so there's no way to read less than 4K. So from the inode read latency point of view 4K vs 512B is most likely a wash, but then 4K inodes can help improve performance of other operations, e.g. readdir of a small directory which fits entirely into the inode. If you use xattrs (e.g. as a side effect of using HSM), 4K inodes definitely help, but allowing xattrs to be stored in the inode. Policy scans reads inodes in full blocks, and there both metadata block size and inode size matter. Larger blocks could improve the inode read performance, while larger inodes mean that the same number of blocks hold fewer inodes and thus more blocks need to be read. So the policy inode scan performance is benefited by larger metadata block size and smaller inodes. 
However, policy scans also have to perform a directory traversal step, and that step tends to dominate the runtime of the overall run, and using larger inodes actually helps to speed up traversal of smaller directories. So whether larger inodes help or hurt the policy scan performance depends, yet again, on your file system composition. Overall, I believe that with all angles considered, larger inodes help with performance, and that was one of the considerations for making 4K inodes the default in V4.2+ versions. > 2) Let?s assume we have 3 filesystems, /home, /scratch (traditional > HPC use for those two) and /data (project space). Our storage > arrays are 24-bay units with two 8+2P RAID 6 LUNs, one RAID 1 > mirror, and two hot spare drives. The RAID 1 mirrors are for /home, > the RAID 6 LUNs are for /scratch or /data. /home has tons of small > files - so small that a 64K block size is currently used. /scratch > and /data have a mixture, but a 1 MB block size is the ?sweet spot? there. > > If you could ?start all over? with the same hardware being the only > restriction, would you: > > a) merge /scratch and /data into one filesystem but keep /home > separate since the LUN sizes are so very different, or > b) merge all three into one filesystem and use storage pools so that > /home is just a separate pool within the one filesystem? And if you > chose this option would you assign different block sizes to the pools? It's not possible to have different block sizes for different data pools. We are very aware that many people would like to be able to do just that, but this is counter to where the product is going. Supporting different block sizes for different pools is actually pretty hard: it's tricky to describe a large file that has some blocks in poolA and some in poolB where poolB has a different block size (perhaps during a migration) with the existing inode/indirect block format where each disk address pointer addresses a block of fixed size. With some effort, and some changes to how block addressing works, it would be possible to implement the support for this. However, as I mentioned in another post in this thread, we don't really want to glorify manual block size selection any further, we want to move away from it, by addressing the reasons that drive different block size selection today (like disk space utilization and performance). I'd recommend calculating a file size distribution histogram for your file systems. You may, for example, discover that 80% of the small files you have in /home would fit into 4K inodes, and then the storage space efficiency gains for the remaining 20% don't justify the complexity of managing an extra file system with a small block size. We don't recommend using block sizes smaller than 256K, because smaller block size is not good for disk space allocation code efficiency. It's a quadratic dependency: with a smaller block size, one block worth of the block allocation map covers that much less disk space, because each bit in the map covers fewer disk sectors, and fewer bits fit into a block. This means having to create a lot more block allocation map segments than what is needed for an ample level of parallelism. This hurts performance of many block allocation-related operations. I don't see a reason for /scratch and /data to be separate file systems, aside from perhaps failure domain considerations. 
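On the file-size-histogram suggestion a few paragraphs up: a quick-and-dirty way to get one is to build on the mmfind recipe Luis posted earlier in this thread, after building the sample tools as described there (paths here are examples, and "column 7 is the file size" assumes the mmfind -ls output format shown in that post):

cd /usr/lpp/mmfs/samples/ilm
./mmfind /gpfs/home -type f -ls > /tmp/home_ls.out
# bucket sizes (column 7) into powers of two and count files per bucket
awk '{ s=$7+0; b=4096; while (s>b) b*=2; h[b]++ } END { for (b in h) printf "%15d %d\n", b, h[b] }' /tmp/home_ls.out | sort -n

The first column of the result is the bucket ceiling in bytes, the second the number of files at or below it (and above the previous bucket); comparing that against your planned inode size and subblock size shows quickly how much data would live in inodes versus partially-filled subblocks.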
yuri > On Sep 26, 2016, at 2:29 PM, Yuri L Volobuev wrote: > > I would put the net summary this way: in GPFS, the "Goldilocks zone" > for metadata block size is 256K - 1M. If one plans to create a new > file system using GPFS V4.2+, 1M is a sound choice. > > In an ideal world, block size choice shouldn't really be a choice. > It's a low-level implementation detail that one day should go the > way of the manual ignition timing adjustment -- something that used > to be necessary in the olden days, and something that select > enthusiasts like to tweak to this day, but something that's > irrelevant for the overwhelming majority of the folks who just want > the engine to run. There's work being done in that general direction > in GPFS, but we aren't there yet. > > yuri > > Stephen Ulmer ---09/26/2016 12:02:25 PM---Now I?ve got > anther question? which I?ll let bake for a while. Okay, to (poorly) summarize: > > From: Stephen Ulmer > To: gpfsug main discussion list , > Date: 09/26/2016 12:02 PM > Subject: Re: [gpfsug-discuss] Blocksize > Sent by: gpfsug-discuss-bounces at spectrumscale.org > > > > Now I?ve got anther question? which I?ll let bake for a while. > > Okay, to (poorly) summarize: > There are items OTHER THAN INODES stored as metadata in GPFS. > These items have a VARIETY OF SIZES, but are packed in such a way > that we should just not worry about wasted space unless we pick a > LARGE metadata block size ? or if we don?t pick a ?reasonable? > metadata block size after picking a ?large? file system block size > that applies to both. > Performance is hard, and the gain from calculating exactly the best > metadata block size is much smaller than performance gains attained > through code optimization. > If we were to try and calculate the appropriate metadata block size > we would likely be wrong anyway, since none of us get our data at > the idealized physics shop that sells massless rulers and > frictionless pulleys. > We should probably all use a metadata block size around 1MB. Nobody > has said this outright, but it?s been the example as the ?good? size > at least three times in this thread. > Under no circumstances should we do what many of us would have done > and pick 128K, which made sense based on all of our previous > education that is no longer applicable. > > Did I miss anything? :) > > Liberty, > > -- > Stephen > > On Sep 26, 2016, at 2:18 PM, Yuri L Volobuev wrote: > It's important to understand the differences between different > metadata types, in particular where it comes to space allocation. > > System metadata files (inode file, inode and block allocation maps, > ACL file, fileset metadata file, EA file in older versions) are > allocated at well-defined moments (file system format, new storage > pool creation in the case of block allocation map, etc), and those > contain multiple records packed into a single block. From the block > allocator point of view, the individual metadata record size is > invisible, only larger blocks get actually allocated, and space > usage efficiency generally isn't an issue. > > For user metadata (indirect blocks, directory blocks, EA overflow > blocks) the situation is different. Those get allocated as the need > arises, generally one at a time. So the size of an individual > metadata structure matters, a lot. The smallest unit of allocation > in GPFS is a subblock (1/32nd of a block). If an IB or a directory > block is smaller than a subblock, the unused space in the subblock > is wasted. 
So if one chooses to use, say, 16 MiB block size for > metadata, the smallest unit of space that can be allocated is 512 > KiB. If one chooses 1 MiB block size, the smallest allocation unit > is 32 KiB. IBs are generally 16 KiB or 32 KiB in size (32 KiB with > any reasonable data block size); directory blocks used to be limited > to 32 KiB, but in the current code can be as large as 256 KiB. As > one can observe, using 16 MiB metadata block size would lead to a > considerable amount of wasted space for IBs and large directories > (small directories can live in inodes). On the other hand, with 1 > MiB block size, there'll be no wasted metadata space. Does any of > this actually make a practical difference? That depends on the file > system composition, namely the number of IBs (which is a function of > the number of large files) and larger directories. Calculating this > scientifically can be pretty involved, and really should be the job > of a tool that ought to exist, but doesn't (yet). A more practical > approach is doing a ballpark estimate using local file counts and > typical fractions of large files and directories, using statistics > available from published papers. > > The performance implications of a given metadata block size choice > is a subject of nearly infinite depth, and this question ultimately > can only be answered by doing experiments with a specific workload > on specific hardware. The metadata space utilization efficiency is > something that can be answered conclusively though. > > yuri > > "Buterbaugh, Kevin L" ---09/24/2016 07:19:09 AM---Hi > Sven, I am confused by your statement that the metadata block size > should be 1 MB and am very int > > From: "Buterbaugh, Kevin L" > To: gpfsug main discussion list , > Date: 09/24/2016 07:19 AM > Subject: Re: [gpfsug-discuss] Blocksize > Sent by: gpfsug-discuss-bounces at spectrumscale.org > > > > > Hi Sven, > > I am confused by your statement that the metadata block size should > be 1 MB and am very interested in learning the rationale behind this > as I am currently looking at all aspects of our current GPFS > configuration and the possibility of making major changes. > > If you have a filesystem with only metadataOnly disks in the system > pool and the default size of an inode is 4K (which we would do, > since we have recently discovered that even on our scratch > filesystem we have a bazillion files that are 4K or smaller and > could therefore have their data stored in the inode, right?), then > why would you set the metadata block size to anything larger than > 128K when a sub-block is 1/32nd of a block? I.e., with a 1 MB block > size for metadata wouldn?t you be wasting a massive amount of space? > > What am I missing / confused about there? > > Oh, and here?s a related question ? let?s just say I have the above > configuration ? my system pool is metadata only and is on SSD?s. > Then I have two other dataOnly pools that are spinning disk. One is > for ?regular? access and the other is the ?capacity? pool ? i.e. a > pool of slower storage where we move files with large access times. > I have a policy that says something like ?move all files with an > access time > 6 months to the capacity pool.? Of those bazillion > files less than 4K in size that are fitting in the inode currently, > probably half a bazillion () of them would be subject to that > rule. Will they get moved to the spinning disk capacity pool or will > they stay in the inode?? > > Thanks! This is a very timely and interesting discussion for me as well... 
> > Kevin > On Sep 23, 2016, at 4:35 PM, Sven Oehme wrote: > your metadata block size these days should be 1 MB and there are > only very few workloads for which you should run with a filesystem > blocksize below 1 MB. so if you don't know exactly what to pick, 1 > MB is a good starting point. > the general rule still applies that your filesystem blocksize > (metadata or data pool) should match your raid controller (or GNR > vdisk) stripe size of the particular pool. > > so if you use a 128k strip size(defaut in many midrange storage > controllers) in a 8+2p raid array, your stripe or track size is 1 MB > and therefore the blocksize of this pool should be 1 MB. i see many > customers in the field using 1MB or even smaller blocksize on RAID > stripes of 2 MB or above and your performance will be significant > impacted by that. > > Sven > > ------------------------------------------ > Sven Oehme > Scalable Storage Research > email: oehmes at us.ibm.com > Phone: +1 (408) 824-8904 > IBM Almaden Research Lab > ------------------------------------------ > > Stephen Ulmer ---09/23/2016 12:16:34 PM---Not to be too > pedantic, but I believe the the subblock size is 1/32 of the block > size (which strengt > > From: Stephen Ulmer > To: gpfsug main discussion list > Date: 09/23/2016 12:16 PM > Subject: Re: [gpfsug-discuss] Blocksize > Sent by: gpfsug-discuss-bounces at spectrumscale.org > > > > > Not to be too pedantic, but I believe the the subblock size is 1/32 > of the block size (which strengthens Luis?s arguments below). > > I thought the the original question was NOT about inode size, but > about metadata block size. You can specify that the system pool have > a different block size from the rest of the filesystem, providing > that it ONLY holds metadata (?metadata-block-size option to mmcrfs). > > So with 4K inodes (which should be used for all new filesystems > without some counter-indication), I would think that we?d want to > use a metadata block size of 4K*32=128K. This is independent of the > regular block size, which you can calculate based on the workload if > you?re lucky. > > There could be a great reason NOT to use 128K metadata block size, > but I don?t know what it is. I?d be happy to be corrected about this > if it?s out of whack. > > -- > Stephen > On Sep 22, 2016, at 3:37 PM, Luis Bolinches wrote: > > Hi > > My 2 cents. > > Leave at least 4K inodes, then you get massive improvement on small > files (less 3.5K minus whatever you use on xattr) > > About blocksize for data, unless you have actual data that suggest > that you will actually benefit from smaller than 1MB block, leave > there. GPFS uses sublocks where 1/16th of the BS can be allocated to > different files, so the "waste" is much less than you think on 1MB > and you get the throughput and less structures of much more data blocks. > > No warranty at all but I try to do this when the BS talk comes in: > (might need some clean up it could not be last note but you get the idea) > > POSIX > find . 
-type f -name '*' -exec ls -l {} \; > find_ls_files.out > GPFS > cd /usr/lpp/mmfs/samples/ilm > gcc mmfindUtil_processOutputFile.c -o mmfindUtil_processOutputFile > ./mmfind /gpfs/shared -ls -type f > find_ls_files.out > CONVERT to CSV > > POSIX > cat find_ls_files.out | awk '{print $5","}' > find_ls_files.out.csv > GPFS > cat find_ls_files.out | awk '{print $7","}' > find_ls_files.out.csv > LOAD in octave > > FILESIZE = int32 (dlmread ("find_ls_files.out.csv", ",")); > Clean the second column (OPTIONAL as the next clean up will do the same) > > FILESIZE(:,[2]) = []; > If we are on 4K aligment we need to clean the files that go to > inodes (WELL not exactly ... extended attributes! so maybe use a > lower number!) > > FILESIZE(FILESIZE<=3584) =[]; > If we are not we need to clean the 0 size files > > FILESIZE(FILESIZE==0) =[]; > Median > > FILESIZEMEDIAN = int32 (median (FILESIZE)) > Mean > > FILESIZEMEAN = int32 (mean (FILESIZE)) > Variance > > int32 (var (FILESIZE)) > iqr interquartile range, i.e., the difference between the upper and > lower quartile, of the input data. > > int32 (iqr (FILESIZE)) > Standard deviation > > > For some FS with lots of files you might need a rather powerful > machine to run the calculations on octave, I never hit anything > could not manage on a 64GB RAM Power box. Most of the times it is > enough with my laptop. > > > > -- > Yst?v?llisin terveisin / Kind regards / Saludos cordiales / Salutations > > Luis Bolinches > Lab Services > http://www-03.ibm.com/systems/services/labservices/ > > IBM Laajalahdentie 23 (main Entrance) Helsinki, 00330 Finland > Phone: +358 503112585 > > "If you continually give you will continually have." Anonymous > > > ----- Original message ----- > From: Stef Coene > Sent by: gpfsug-discuss-bounces at spectrumscale.org > To: gpfsug main discussion list > Cc: > Subject: Re: [gpfsug-discuss] Blocksize > Date: Thu, Sep 22, 2016 10:30 PM > > On 09/22/2016 09:07 PM, J. Eric Wonderley wrote: > > It defaults to 4k: > > mmlsfs testbs8M -i > > flag value description > > ------------------- ------------------------ > > ----------------------------------- > > -i 4096 Inode size in bytes > > > > I think you can make as small as 512b. Gpfs will store very small > > files in the inode. > > > > Typically you want your average file size to be your blocksize and your > > filesystem has one blocksize and one inodesize. > > The files are not small, but around 20 MB on average. > So I calculated with IBM that a 1 MB or 2 MB block size is best. > > But I'm not sure if it's better to use a smaller block size for the > metadata. > > The file system is not that large (400 TB) and will hold backup data > from CommVault. > > > Stef > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > > > Ellei edell? 
ole toisin mainittu: / Unless stated otherwise above: > Oy IBM Finland Ab > PL 265, 00101 Helsinki, Finland > Business ID, Y-tunnus: 0195876-3 > Registered in Finland > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss -------------- next part -------------- An HTML attachment was scrubbed... URL: From chekh at stanford.edu Tue Sep 27 19:51:50 2016 From: chekh at stanford.edu (Alex Chekholko) Date: Tue, 27 Sep 2016 11:51:50 -0700 Subject: [gpfsug-discuss] File_heat for GPFS File Systems In-Reply-To: References: Message-ID: <8c7fd395-1efc-7197-4a98-763ba784cafd@stanford.edu> On 09/27/2016 11:04 AM, Andreas Landh?u?er wrote: > if a file gets hot again, there is no rule for putting the file back > into a faster storage device? The file will get moved when you run the policy again. You can run the policy as often as you like. There is also a way to use a GPFS hook to trigger policy run. Check 'mmaddcallback' But I think you have to be careful and think through the complexity. e.g. load spikes and pool fills up and your callback kicks in and starts a migration which increases the I/O load further, etc... Regards, -- Alex Chekholko chekh at stanford.edu From makaplan at us.ibm.com Tue Sep 27 20:27:47 2016 From: makaplan at us.ibm.com (Marc A Kaplan) Date: Tue, 27 Sep 2016 15:27:47 -0400 Subject: [gpfsug-discuss] File_heat for GPFS File Systems In-Reply-To: References: Message-ID: Read about GROUP POOL - you can call as often as you like to "repack" the files into several pools from hot to cold. Of course, there is a cost to running mmapplypolicy... So maybe you'd just run it once every day or so... -------------- next part -------------- An HTML attachment was scrubbed... URL: From xhejtman at ics.muni.cz Tue Sep 27 20:38:16 2016 From: xhejtman at ics.muni.cz (Lukas Hejtmanek) Date: Tue, 27 Sep 2016 21:38:16 +0200 Subject: [gpfsug-discuss] Samba via CES Message-ID: <20160927193815.kpppiy76wpudg6cj@ics.muni.cz> Hello, does CES offer high availability of SMB? I.e., does used samba server provide cluster wide persistent handles? Or failover from node to node is currently not supported without the client disruption? -- Luk?? 
Hejtm?nek From erich at uw.edu Tue Sep 27 21:56:20 2016 From: erich at uw.edu (Eric Horst) Date: Tue, 27 Sep 2016 13:56:20 -0700 Subject: [gpfsug-discuss] File_heat for GPFS File Systems In-Reply-To: <8c7fd395-1efc-7197-4a98-763ba784cafd@stanford.edu> References: <8c7fd395-1efc-7197-4a98-763ba784cafd@stanford.edu> Message-ID: >> >> if a file gets hot again, there is no rule for putting the file back >> into a faster storage device? > > > The file will get moved when you run the policy again. You can run the > policy as often as you like. I think its worth stating clearly that if a file is in the Thrifty slow pool and a user opens and reads/writes the file there is nothing that moves this file to a different tier. A policy run is the only action that relocates files. So if you apply the policy daily and over the course of the day users access many cold files, the performance accessing those cold files may not be ideal until the next day when they are repacked by heat. A file is not automatically moved to the fast tier on access read or write. I mention this because this aspect of tiering was not immediately clear from the docs when I was a neophyte GPFS admin and I had to learn by observation. It is easy for one to make an assumption that it is a more dynamic tiering system than it is. -Eric -- Eric Horst University of Washington From Kevin.Buterbaugh at Vanderbilt.Edu Tue Sep 27 22:21:23 2016 From: Kevin.Buterbaugh at Vanderbilt.Edu (Buterbaugh, Kevin L) Date: Tue, 27 Sep 2016 21:21:23 +0000 Subject: [gpfsug-discuss] Blocksize In-Reply-To: References: <4890d1b0-2f12-f8b6-684d-c98ca2b71ab7@docum.org> <17781503-26B3-4448-B7B9-1EE27ABE6D1F@ulmer.org> <13272A1B-425D-4DDD-A931-490604F92D61@ulmer.org> <6C51B04A-9097-4598-8B4A-C484A3D98EE2@vanderbilt.edu> Message-ID: <4BDC0E5A-176A-48CC-8DE6-93C7B5A3F138@vanderbilt.edu> Hi All, Again, my thanks to all who responded to my last post. Let me begin by stating something I unintentionally omitted in my last post ? we already use SSDs for our metadata. Which leads me to yet another question ? of my three filesystems, two (/home and /scratch) are much older (created in 2010) and therefore currently have a 512 byte inode size. /data is newer and has a 4K inode size. Now if I combine /scratch and /data into one filesystem with a 4K inode size, the amount of space used by all the inodes coming over from /scratch is going to grow by a factor of eight unless I?m horribly confused. And I would assume I need to count the amount of space taken up by allocated inodes, not just used inodes. Therefore ? how much space my metadata takes up just grew significantly in importance since: 1) metadata is on very expensive enterprise class, vendor certified SSDs, 2) we use RAID 1 mirrors of those SSDs, and 3) we have metadata replication set to two. Some of the information presented by Sven and Yuri seems to contradict each other in regards to how much space inodes take up ? or I?m misunderstanding one or both of them! Leaving aside replication, if I use a 256K block size for my metadata and I use 4K inodes, are those inodes going to take up 4K each or are they going to take up 8K each (1/32nd of a 256K block)? By the way, I do have a file size / file age spreadsheet for each of my filesystems (which I would be willing to share with interested parties) and while I was not surprised to learn that I have over 10 million sub-1K files on /home, I was a bit surprised to find that I have almost as many sub-1K files on /scratch (and a few million more on /data). 
So there?s a huge potential win in having those files in the inode on SSD as opposed to on spinning disk, but there?s also a huge potential $$$ cost. Thanks again ? I hope others are gaining useful information from this thread. I sure am! Kevin On Sep 27, 2016, at 1:26 PM, Yuri L Volobuev > wrote: > 1) Let?s assume that our overarching goal in configuring the block > size for metadata is performance from the user perspective ? i.e. > how fast is an ?ls -l? on my directory? Space savings aren?t > important, and how long policy scans or other ?administrative? type > tasks take is not nearly as important as that directory listing. > Does that change the recommended metadata block size? The performance challenges for the "ls -l" scenario are quite different from the policy scan scenario, so the same rules do not necessarily apply. During "ls -l" the code has to read inodes one by one (there's some prefetching going on, to take the edge off for the actual 'ls' thread, but prefetching is still one inode at a time). Metadata block size doesn't really come into the picture in this case, but inode size can be important -- depending on the storage performance characteristics. Does the storage you use for metadata exhibit a meaningfully different latency for 4K random reads vs 512 byte random reads? In my personal experience, on any modern storage device the difference is non-existent; in fact many devices (like all flash-based storage) use 4K native physical block size, and merely emulate 512 byte "sectors", so there's no way to read less than 4K. So from the inode read latency point of view 4K vs 512B is most likely a wash, but then 4K inodes can help improve performance of other operations, e.g. readdir of a small directory which fits entirely into the inode. If you use xattrs (e.g. as a side effect of using HSM), 4K inodes definitely help, but allowing xattrs to be stored in the inode. Policy scans reads inodes in full blocks, and there both metadata block size and inode size matter. Larger blocks could improve the inode read performance, while larger inodes mean that the same number of blocks hold fewer inodes and thus more blocks need to be read. So the policy inode scan performance is benefited by larger metadata block size and smaller inodes. However, policy scans also have to perform a directory traversal step, and that step tends to dominate the runtime of the overall run, and using larger inodes actually helps to speed up traversal of smaller directories. So whether larger inodes help or hurt the policy scan performance depends, yet again, on your file system composition. Overall, I believe that with all angles considered, larger inodes help with performance, and that was one of the considerations for making 4K inodes the default in V4.2+ versions. > 2) Let?s assume we have 3 filesystems, /home, /scratch (traditional > HPC use for those two) and /data (project space). Our storage > arrays are 24-bay units with two 8+2P RAID 6 LUNs, one RAID 1 > mirror, and two hot spare drives. The RAID 1 mirrors are for /home, > the RAID 6 LUNs are for /scratch or /data. /home has tons of small > files - so small that a 64K block size is currently used. /scratch > and /data have a mixture, but a 1 MB block size is the ?sweet spot? there. > > If you could ?start all over? 
with the same hardware being the only > restriction, would you: > > a) merge /scratch and /data into one filesystem but keep /home > separate since the LUN sizes are so very different, or > b) merge all three into one filesystem and use storage pools so that > /home is just a separate pool within the one filesystem? And if you > chose this option would you assign different block sizes to the pools? It's not possible to have different block sizes for different data pools. We are very aware that many people would like to be able to do just that, but this is counter to where the product is going. Supporting different block sizes for different pools is actually pretty hard: it's tricky to describe a large file that has some blocks in poolA and some in poolB where poolB has a different block size (perhaps during a migration) with the existing inode/indirect block format where each disk address pointer addresses a block of fixed size. With some effort, and some changes to how block addressing works, it would be possible to implement the support for this. However, as I mentioned in another post in this thread, we don't really want to glorify manual block size selection any further, we want to move away from it, by addressing the reasons that drive different block size selection today (like disk space utilization and performance). I'd recommend calculating a file size distribution histogram for your file systems. You may, for example, discover that 80% of the small files you have in /home would fit into 4K inodes, and then the storage space efficiency gains for the remaining 20% don't justify the complexity of managing an extra file system with a small block size. We don't recommend using block sizes smaller than 256K, because smaller block size is not good for disk space allocation code efficiency. It's a quadratic dependency: with a smaller block size, one block worth of the block allocation map covers that much less disk space, because each bit in the map covers fewer disk sectors, and fewer bits fit into a block. This means having to create a lot more block allocation map segments than what is needed for an ample level of parallelism. This hurts performance of many block allocation-related operations. I don't see a reason for /scratch and /data to be separate file systems, aside from perhaps failure domain considerations. yuri > On Sep 26, 2016, at 2:29 PM, Yuri L Volobuev > wrote: > > I would put the net summary this way: in GPFS, the "Goldilocks zone" > for metadata block size is 256K - 1M. If one plans to create a new > file system using GPFS V4.2+, 1M is a sound choice. > > In an ideal world, block size choice shouldn't really be a choice. > It's a low-level implementation detail that one day should go the > way of the manual ignition timing adjustment -- something that used > to be necessary in the olden days, and something that select > enthusiasts like to tweak to this day, but something that's > irrelevant for the overwhelming majority of the folks who just want > the engine to run. There's work being done in that general direction > in GPFS, but we aren't there yet. > > yuri > > Stephen Ulmer ---09/26/2016 12:02:25 PM---Now I?ve got > anther question? which I?ll let bake for a while. Okay, to (poorly) summarize: > > From: Stephen Ulmer > > To: gpfsug main discussion list >, > Date: 09/26/2016 12:02 PM > Subject: Re: [gpfsug-discuss] Blocksize > Sent by: gpfsug-discuss-bounces at spectrumscale.org > > > > Now I?ve got anther question? which I?ll let bake for a while. 
> > Okay, to (poorly) summarize: > There are items OTHER THAN INODES stored as metadata in GPFS. > These items have a VARIETY OF SIZES, but are packed in such a way > that we should just not worry about wasted space unless we pick a > LARGE metadata block size ? or if we don?t pick a ?reasonable? > metadata block size after picking a ?large? file system block size > that applies to both. > Performance is hard, and the gain from calculating exactly the best > metadata block size is much smaller than performance gains attained > through code optimization. > If we were to try and calculate the appropriate metadata block size > we would likely be wrong anyway, since none of us get our data at > the idealized physics shop that sells massless rulers and > frictionless pulleys. > We should probably all use a metadata block size around 1MB. Nobody > has said this outright, but it?s been the example as the ?good? size > at least three times in this thread. > Under no circumstances should we do what many of us would have done > and pick 128K, which made sense based on all of our previous > education that is no longer applicable. > > Did I miss anything? :) > > Liberty, > > -- > Stephen > > On Sep 26, 2016, at 2:18 PM, Yuri L Volobuev > wrote: > It's important to understand the differences between different > metadata types, in particular where it comes to space allocation. > > System metadata files (inode file, inode and block allocation maps, > ACL file, fileset metadata file, EA file in older versions) are > allocated at well-defined moments (file system format, new storage > pool creation in the case of block allocation map, etc), and those > contain multiple records packed into a single block. From the block > allocator point of view, the individual metadata record size is > invisible, only larger blocks get actually allocated, and space > usage efficiency generally isn't an issue. > > For user metadata (indirect blocks, directory blocks, EA overflow > blocks) the situation is different. Those get allocated as the need > arises, generally one at a time. So the size of an individual > metadata structure matters, a lot. The smallest unit of allocation > in GPFS is a subblock (1/32nd of a block). If an IB or a directory > block is smaller than a subblock, the unused space in the subblock > is wasted. So if one chooses to use, say, 16 MiB block size for > metadata, the smallest unit of space that can be allocated is 512 > KiB. If one chooses 1 MiB block size, the smallest allocation unit > is 32 KiB. IBs are generally 16 KiB or 32 KiB in size (32 KiB with > any reasonable data block size); directory blocks used to be limited > to 32 KiB, but in the current code can be as large as 256 KiB. As > one can observe, using 16 MiB metadata block size would lead to a > considerable amount of wasted space for IBs and large directories > (small directories can live in inodes). On the other hand, with 1 > MiB block size, there'll be no wasted metadata space. Does any of > this actually make a practical difference? That depends on the file > system composition, namely the number of IBs (which is a function of > the number of large files) and larger directories. Calculating this > scientifically can be pretty involved, and really should be the job > of a tool that ought to exist, but doesn't (yet). A more practical > approach is doing a ballpark estimate using local file counts and > typical fractions of large files and directories, using statistics > available from published papers. 
> > The performance implications of a given metadata block size choice > is a subject of nearly infinite depth, and this question ultimately > can only be answered by doing experiments with a specific workload > on specific hardware. The metadata space utilization efficiency is > something that can be answered conclusively though. > > yuri > > "Buterbaugh, Kevin L" ---09/24/2016 07:19:09 AM---Hi > Sven, I am confused by your statement that the metadata block size > should be 1 MB and am very int > > From: "Buterbaugh, Kevin L" > > To: gpfsug main discussion list >, > Date: 09/24/2016 07:19 AM > Subject: Re: [gpfsug-discuss] Blocksize > Sent by: gpfsug-discuss-bounces at spectrumscale.org > > > > > Hi Sven, > > I am confused by your statement that the metadata block size should > be 1 MB and am very interested in learning the rationale behind this > as I am currently looking at all aspects of our current GPFS > configuration and the possibility of making major changes. > > If you have a filesystem with only metadataOnly disks in the system > pool and the default size of an inode is 4K (which we would do, > since we have recently discovered that even on our scratch > filesystem we have a bazillion files that are 4K or smaller and > could therefore have their data stored in the inode, right?), then > why would you set the metadata block size to anything larger than > 128K when a sub-block is 1/32nd of a block? I.e., with a 1 MB block > size for metadata wouldn?t you be wasting a massive amount of space? > > What am I missing / confused about there? > > Oh, and here?s a related question ? let?s just say I have the above > configuration ? my system pool is metadata only and is on SSD?s. > Then I have two other dataOnly pools that are spinning disk. One is > for ?regular? access and the other is the ?capacity? pool ? i.e. a > pool of slower storage where we move files with large access times. > I have a policy that says something like ?move all files with an > access time > 6 months to the capacity pool.? Of those bazillion > files less than 4K in size that are fitting in the inode currently, > probably half a bazillion () of them would be subject to that > rule. Will they get moved to the spinning disk capacity pool or will > they stay in the inode?? > > Thanks! This is a very timely and interesting discussion for me as well... > > Kevin > On Sep 23, 2016, at 4:35 PM, Sven Oehme > wrote: > your metadata block size these days should be 1 MB and there are > only very few workloads for which you should run with a filesystem > blocksize below 1 MB. so if you don't know exactly what to pick, 1 > MB is a good starting point. > the general rule still applies that your filesystem blocksize > (metadata or data pool) should match your raid controller (or GNR > vdisk) stripe size of the particular pool. > > so if you use a 128k strip size(defaut in many midrange storage > controllers) in a 8+2p raid array, your stripe or track size is 1 MB > and therefore the blocksize of this pool should be 1 MB. i see many > customers in the field using 1MB or even smaller blocksize on RAID > stripes of 2 MB or above and your performance will be significant > impacted by that. 
> > Sven > > ------------------------------------------ > Sven Oehme > Scalable Storage Research > email: oehmes at us.ibm.com > Phone: +1 (408) 824-8904 > IBM Almaden Research Lab > ------------------------------------------ > > Stephen Ulmer ---09/23/2016 12:16:34 PM---Not to be too > pedantic, but I believe the the subblock size is 1/32 of the block > size (which strengt > > From: Stephen Ulmer > > To: gpfsug main discussion list > > Date: 09/23/2016 12:16 PM > Subject: Re: [gpfsug-discuss] Blocksize > Sent by: gpfsug-discuss-bounces at spectrumscale.org > > > > > Not to be too pedantic, but I believe the the subblock size is 1/32 > of the block size (which strengthens Luis?s arguments below). > > I thought the the original question was NOT about inode size, but > about metadata block size. You can specify that the system pool have > a different block size from the rest of the filesystem, providing > that it ONLY holds metadata (?metadata-block-size option to mmcrfs). > > So with 4K inodes (which should be used for all new filesystems > without some counter-indication), I would think that we?d want to > use a metadata block size of 4K*32=128K. This is independent of the > regular block size, which you can calculate based on the workload if > you?re lucky. > > There could be a great reason NOT to use 128K metadata block size, > but I don?t know what it is. I?d be happy to be corrected about this > if it?s out of whack. > > -- > Stephen > On Sep 22, 2016, at 3:37 PM, Luis Bolinches > wrote: > > Hi > > My 2 cents. > > Leave at least 4K inodes, then you get massive improvement on small > files (less 3.5K minus whatever you use on xattr) > > About blocksize for data, unless you have actual data that suggest > that you will actually benefit from smaller than 1MB block, leave > there. GPFS uses sublocks where 1/16th of the BS can be allocated to > different files, so the "waste" is much less than you think on 1MB > and you get the throughput and less structures of much more data blocks. > > No warranty at all but I try to do this when the BS talk comes in: > (might need some clean up it could not be last note but you get the idea) > > POSIX > find . -type f -name '*' -exec ls -l {} \; > find_ls_files.out > GPFS > cd /usr/lpp/mmfs/samples/ilm > gcc mmfindUtil_processOutputFile.c -o mmfindUtil_processOutputFile > ./mmfind /gpfs/shared -ls -type f > find_ls_files.out > CONVERT to CSV > > POSIX > cat find_ls_files.out | awk '{print $5","}' > find_ls_files.out.csv > GPFS > cat find_ls_files.out | awk '{print $7","}' > find_ls_files.out.csv > LOAD in octave > > FILESIZE = int32 (dlmread ("find_ls_files.out.csv", ",")); > Clean the second column (OPTIONAL as the next clean up will do the same) > > FILESIZE(:,[2]) = []; > If we are on 4K aligment we need to clean the files that go to > inodes (WELL not exactly ... extended attributes! so maybe use a > lower number!) > > FILESIZE(FILESIZE<=3584) =[]; > If we are not we need to clean the 0 size files > > FILESIZE(FILESIZE==0) =[]; > Median > > FILESIZEMEDIAN = int32 (median (FILESIZE)) > Mean > > FILESIZEMEAN = int32 (mean (FILESIZE)) > Variance > > int32 (var (FILESIZE)) > iqr interquartile range, i.e., the difference between the upper and > lower quartile, of the input data. > > int32 (iqr (FILESIZE)) > Standard deviation > > > For some FS with lots of files you might need a rather powerful > machine to run the calculations on octave, I never hit anything > could not manage on a 64GB RAM Power box. Most of the times it is > enough with my laptop. 
> > > > -- > Yst?v?llisin terveisin / Kind regards / Saludos cordiales / Salutations > > Luis Bolinches > Lab Services > http://www-03.ibm.com/systems/services/labservices/ > > IBM Laajalahdentie 23 (main Entrance) Helsinki, 00330 Finland > Phone: +358 503112585 > > "If you continually give you will continually have." Anonymous > > > ----- Original message ----- > From: Stef Coene > > Sent by: gpfsug-discuss-bounces at spectrumscale.org > To: gpfsug main discussion list > > Cc: > Subject: Re: [gpfsug-discuss] Blocksize > Date: Thu, Sep 22, 2016 10:30 PM > > On 09/22/2016 09:07 PM, J. Eric Wonderley wrote: > > It defaults to 4k: > > mmlsfs testbs8M -i > > flag value description > > ------------------- ------------------------ > > ----------------------------------- > > -i 4096 Inode size in bytes > > > > I think you can make as small as 512b. Gpfs will store very small > > files in the inode. > > > > Typically you want your average file size to be your blocksize and your > > filesystem has one blocksize and one inodesize. > > The files are not small, but around 20 MB on average. > So I calculated with IBM that a 1 MB or 2 MB block size is best. > > But I'm not sure if it's better to use a smaller block size for the > metadata. > > The file system is not that large (400 TB) and will hold backup data > from CommVault. > > > Stef > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > > > Ellei edell? ole toisin mainittu: / Unless stated otherwise above: > Oy IBM Finland Ab > PL 265, 00101 Helsinki, Finland > Business ID, Y-tunnus: 0195876-3 > Registered in Finland > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From christof.schmitt at us.ibm.com Tue Sep 27 22:36:37 2016 From: christof.schmitt at us.ibm.com (Christof Schmitt) Date: Tue, 27 Sep 2016 14:36:37 -0700 Subject: [gpfsug-discuss] Samba via CES In-Reply-To: <20160927193815.kpppiy76wpudg6cj@ics.muni.cz> References: <20160927193815.kpppiy76wpudg6cj@ics.muni.cz> Message-ID: When a CES node fails, protocol clients have to reconnect to one of the remaining nodes. Samba in CES does not support persistent handles. This is indicated in the documentation: http://www.ibm.com/support/knowledgecenter/en/STXKQY_4.2.1/com.ibm.spectrum.scale.v4r21.doc/bl1adm_smbexportlimits.htm#bl1adm_smbexportlimits "Only mandatory SMB3 protocol features are supported. " Regards, Christof Schmitt || IBM || Spectrum Scale Development || Tucson, AZ christof.schmitt at us.ibm.com || +1-520-799-2469 (T/L: 321-2469) From: Lukas Hejtmanek To: gpfsug-discuss at spectrumscale.org Date: 09/27/2016 12:38 PM Subject: [gpfsug-discuss] Samba via CES Sent by: gpfsug-discuss-bounces at spectrumscale.org Hello, does CES offer high availability of SMB? I.e., does used samba server provide cluster wide persistent handles? Or failover from node to node is currently not supported without the client disruption? -- Luk?? Hejtm?nek _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss From xhejtman at ics.muni.cz Tue Sep 27 22:42:57 2016 From: xhejtman at ics.muni.cz (Lukas Hejtmanek) Date: Tue, 27 Sep 2016 23:42:57 +0200 Subject: [gpfsug-discuss] Samba via CES In-Reply-To: References: <20160927193815.kpppiy76wpudg6cj@ics.muni.cz> Message-ID: <20160927214257.z4ezmssnpwhmm4rk@ics.muni.cz> On Tue, Sep 27, 2016 at 02:36:37PM -0700, Christof Schmitt wrote: > When a CES node fails, protocol clients have to reconnect to one of the > remaining nodes. > > Samba in CES does not support persistent handles. This is indicated in the > documentation: > > http://www.ibm.com/support/knowledgecenter/en/STXKQY_4.2.1/com.ibm.spectrum.scale.v4r21.doc/bl1adm_smbexportlimits.htm#bl1adm_smbexportlimits > > "Only mandatory SMB3 protocol features are supported. " well, but in this case, HA feature is a bit pointless as node fail results in a client failure as well as reconnect does not seem to be automatic if there is on going traffic.. more precisely reconnect is automatic but without persistent handles, the client receives write protect error immediately. -- Luk?? Hejtm?nek From Greg.Lehmann at csiro.au Wed Sep 28 08:40:35 2016 From: Greg.Lehmann at csiro.au (Greg.Lehmann at csiro.au) Date: Wed, 28 Sep 2016 07:40:35 +0000 Subject: [gpfsug-discuss] Blocksize In-Reply-To: <4BDC0E5A-176A-48CC-8DE6-93C7B5A3F138@vanderbilt.edu> References: <4890d1b0-2f12-f8b6-684d-c98ca2b71ab7@docum.org> <17781503-26B3-4448-B7B9-1EE27ABE6D1F@ulmer.org> <13272A1B-425D-4DDD-A931-490604F92D61@ulmer.org> <6C51B04A-9097-4598-8B4A-C484A3D98EE2@vanderbilt.edu> <4BDC0E5A-176A-48CC-8DE6-93C7B5A3F138@vanderbilt.edu> Message-ID: <428599f3d6cb47ebb74d05178eeba2b8@exch1-cdc.nexus.csiro.au> I am wondering what people use to produce a file size distribution report for their filesystems. Has everyone rolled their own or is there some goto app to use. 
Cheers, Greg From: gpfsug-discuss-bounces at spectrumscale.org [mailto:gpfsug-discuss-bounces at spectrumscale.org] On Behalf Of Buterbaugh, Kevin L Sent: Wednesday, 28 September 2016 7:21 AM To: gpfsug main discussion list Subject: Re: [gpfsug-discuss] Blocksize Hi All, Again, my thanks to all who responded to my last post. Let me begin by stating something I unintentionally omitted in my last post ? we already use SSDs for our metadata. Which leads me to yet another question ? of my three filesystems, two (/home and /scratch) are much older (created in 2010) and therefore currently have a 512 byte inode size. /data is newer and has a 4K inode size. Now if I combine /scratch and /data into one filesystem with a 4K inode size, the amount of space used by all the inodes coming over from /scratch is going to grow by a factor of eight unless I?m horribly confused. And I would assume I need to count the amount of space taken up by allocated inodes, not just used inodes. Therefore ? how much space my metadata takes up just grew significantly in importance since: 1) metadata is on very expensive enterprise class, vendor certified SSDs, 2) we use RAID 1 mirrors of those SSDs, and 3) we have metadata replication set to two. Some of the information presented by Sven and Yuri seems to contradict each other in regards to how much space inodes take up ? or I?m misunderstanding one or both of them! Leaving aside replication, if I use a 256K block size for my metadata and I use 4K inodes, are those inodes going to take up 4K each or are they going to take up 8K each (1/32nd of a 256K block)? By the way, I do have a file size / file age spreadsheet for each of my filesystems (which I would be willing to share with interested parties) and while I was not surprised to learn that I have over 10 million sub-1K files on /home, I was a bit surprised to find that I have almost as many sub-1K files on /scratch (and a few million more on /data). So there?s a huge potential win in having those files in the inode on SSD as opposed to on spinning disk, but there?s also a huge potential $$$ cost. Thanks again ? I hope others are gaining useful information from this thread. I sure am! Kevin On Sep 27, 2016, at 1:26 PM, Yuri L Volobuev > wrote: > 1) Let?s assume that our overarching goal in configuring the block > size for metadata is performance from the user perspective ? i.e. > how fast is an ?ls -l? on my directory? Space savings aren?t > important, and how long policy scans or other ?administrative? type > tasks take is not nearly as important as that directory listing. > Does that change the recommended metadata block size? The performance challenges for the "ls -l" scenario are quite different from the policy scan scenario, so the same rules do not necessarily apply. During "ls -l" the code has to read inodes one by one (there's some prefetching going on, to take the edge off for the actual 'ls' thread, but prefetching is still one inode at a time). Metadata block size doesn't really come into the picture in this case, but inode size can be important -- depending on the storage performance characteristics. Does the storage you use for metadata exhibit a meaningfully different latency for 4K random reads vs 512 byte random reads? In my personal experience, on any modern storage device the difference is non-existent; in fact many devices (like all flash-based storage) use 4K native physical block size, and merely emulate 512 byte "sectors", so there's no way to read less than 4K. 
So from the inode read latency point of view 4K vs 512B is most likely a wash, but then 4K inodes can help improve performance of other operations, e.g. readdir of a small directory which fits entirely into the inode. If you use xattrs (e.g. as a side effect of using HSM), 4K inodes definitely help, but allowing xattrs to be stored in the inode. Policy scans reads inodes in full blocks, and there both metadata block size and inode size matter. Larger blocks could improve the inode read performance, while larger inodes mean that the same number of blocks hold fewer inodes and thus more blocks need to be read. So the policy inode scan performance is benefited by larger metadata block size and smaller inodes. However, policy scans also have to perform a directory traversal step, and that step tends to dominate the runtime of the overall run, and using larger inodes actually helps to speed up traversal of smaller directories. So whether larger inodes help or hurt the policy scan performance depends, yet again, on your file system composition. Overall, I believe that with all angles considered, larger inodes help with performance, and that was one of the considerations for making 4K inodes the default in V4.2+ versions. > 2) Let?s assume we have 3 filesystems, /home, /scratch (traditional > HPC use for those two) and /data (project space). Our storage > arrays are 24-bay units with two 8+2P RAID 6 LUNs, one RAID 1 > mirror, and two hot spare drives. The RAID 1 mirrors are for /home, > the RAID 6 LUNs are for /scratch or /data. /home has tons of small > files - so small that a 64K block size is currently used. /scratch > and /data have a mixture, but a 1 MB block size is the ?sweet spot? there. > > If you could ?start all over? with the same hardware being the only > restriction, would you: > > a) merge /scratch and /data into one filesystem but keep /home > separate since the LUN sizes are so very different, or > b) merge all three into one filesystem and use storage pools so that > /home is just a separate pool within the one filesystem? And if you > chose this option would you assign different block sizes to the pools? It's not possible to have different block sizes for different data pools. We are very aware that many people would like to be able to do just that, but this is counter to where the product is going. Supporting different block sizes for different pools is actually pretty hard: it's tricky to describe a large file that has some blocks in poolA and some in poolB where poolB has a different block size (perhaps during a migration) with the existing inode/indirect block format where each disk address pointer addresses a block of fixed size. With some effort, and some changes to how block addressing works, it would be possible to implement the support for this. However, as I mentioned in another post in this thread, we don't really want to glorify manual block size selection any further, we want to move away from it, by addressing the reasons that drive different block size selection today (like disk space utilization and performance). I'd recommend calculating a file size distribution histogram for your file systems. You may, for example, discover that 80% of the small files you have in /home would fit into 4K inodes, and then the storage space efficiency gains for the remaining 20% don't justify the complexity of managing an extra file system with a small block size. 
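A rough way to act on that suggestion with data most sites already have: count how many files would fit in a 4K inode (roughly 3.5 KiB of data space once the inode header and any extended attributes are accounted for, per the earlier posts), and estimate what the inode file itself costs with two metadata replicas. A sketch only; the size list is the find output from earlier in the thread, column 5 is the ls -l size, and the allocated-inode count is a placeholder to replace with the figure from the Inode Information section of mmdf:

# 1) Fraction of files small enough to live entirely in a 4K inode.
awk '{ n++; if ($5 <= 3584) small++ }
     END { printf "%d of %d files (%.1f%%) would fit in a 4K inode\n", small, n, 100 * small / n }' find_ls_files.out

# 2) Raw space the allocated inodes occupy at 512-byte vs 4K inodes,
#    with metadata replication of 2 (RAID mirroring sits on top of this).
allocated_inodes=100000000    # placeholder: use the real allocated-inode count
for inode_bytes in 512 4096; do
    echo "inode size ${inode_bytes}: $(( allocated_inodes * inode_bytes * 2 / 1024 / 1024 / 1024 )) GiB"
done

With those placeholder numbers the jump from 512-byte to 4K inodes is roughly 95 GiB to 762 GiB of replicated inode space, before any RAID overhead: the factor of eight Kevin mentions.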
We don't recommend using block sizes smaller than 256K, because smaller block size is not good for disk space allocation code efficiency. It's a quadratic dependency: with a smaller block size, one block worth of the block allocation map covers that much less disk space, because each bit in the map covers fewer disk sectors, and fewer bits fit into a block. This means having to create a lot more block allocation map segments than what is needed for an ample level of parallelism. This hurts performance of many block allocation-related operations. I don't see a reason for /scratch and /data to be separate file systems, aside from perhaps failure domain considerations. yuri > On Sep 26, 2016, at 2:29 PM, Yuri L Volobuev > wrote: > > I would put the net summary this way: in GPFS, the "Goldilocks zone" > for metadata block size is 256K - 1M. If one plans to create a new > file system using GPFS V4.2+, 1M is a sound choice. > > In an ideal world, block size choice shouldn't really be a choice. > It's a low-level implementation detail that one day should go the > way of the manual ignition timing adjustment -- something that used > to be necessary in the olden days, and something that select > enthusiasts like to tweak to this day, but something that's > irrelevant for the overwhelming majority of the folks who just want > the engine to run. There's work being done in that general direction > in GPFS, but we aren't there yet. > > yuri > > Stephen Ulmer ---09/26/2016 12:02:25 PM---Now I?ve got > anther question? which I?ll let bake for a while. Okay, to (poorly) summarize: > > From: Stephen Ulmer > > To: gpfsug main discussion list >, > Date: 09/26/2016 12:02 PM > Subject: Re: [gpfsug-discuss] Blocksize > Sent by: gpfsug-discuss-bounces at spectrumscale.org > > > > Now I?ve got anther question? which I?ll let bake for a while. > > Okay, to (poorly) summarize: > There are items OTHER THAN INODES stored as metadata in GPFS. > These items have a VARIETY OF SIZES, but are packed in such a way > that we should just not worry about wasted space unless we pick a > LARGE metadata block size ? or if we don?t pick a ?reasonable? > metadata block size after picking a ?large? file system block size > that applies to both. > Performance is hard, and the gain from calculating exactly the best > metadata block size is much smaller than performance gains attained > through code optimization. > If we were to try and calculate the appropriate metadata block size > we would likely be wrong anyway, since none of us get our data at > the idealized physics shop that sells massless rulers and > frictionless pulleys. > We should probably all use a metadata block size around 1MB. Nobody > has said this outright, but it?s been the example as the ?good? size > at least three times in this thread. > Under no circumstances should we do what many of us would have done > and pick 128K, which made sense based on all of our previous > education that is no longer applicable. > > Did I miss anything? :) > > Liberty, > > -- > Stephen > > On Sep 26, 2016, at 2:18 PM, Yuri L Volobuev > wrote: > It's important to understand the differences between different > metadata types, in particular where it comes to space allocation. 
> > System metadata files (inode file, inode and block allocation maps, > ACL file, fileset metadata file, EA file in older versions) are > allocated at well-defined moments (file system format, new storage > pool creation in the case of block allocation map, etc), and those > contain multiple records packed into a single block. From the block > allocator point of view, the individual metadata record size is > invisible, only larger blocks get actually allocated, and space > usage efficiency generally isn't an issue. > > For user metadata (indirect blocks, directory blocks, EA overflow > blocks) the situation is different. Those get allocated as the need > arises, generally one at a time. So the size of an individual > metadata structure matters, a lot. The smallest unit of allocation > in GPFS is a subblock (1/32nd of a block). If an IB or a directory > block is smaller than a subblock, the unused space in the subblock > is wasted. So if one chooses to use, say, 16 MiB block size for > metadata, the smallest unit of space that can be allocated is 512 > KiB. If one chooses 1 MiB block size, the smallest allocation unit > is 32 KiB. IBs are generally 16 KiB or 32 KiB in size (32 KiB with > any reasonable data block size); directory blocks used to be limited > to 32 KiB, but in the current code can be as large as 256 KiB. As > one can observe, using 16 MiB metadata block size would lead to a > considerable amount of wasted space for IBs and large directories > (small directories can live in inodes). On the other hand, with 1 > MiB block size, there'll be no wasted metadata space. Does any of > this actually make a practical difference? That depends on the file > system composition, namely the number of IBs (which is a function of > the number of large files) and larger directories. Calculating this > scientifically can be pretty involved, and really should be the job > of a tool that ought to exist, but doesn't (yet). A more practical > approach is doing a ballpark estimate using local file counts and > typical fractions of large files and directories, using statistics > available from published papers. > > The performance implications of a given metadata block size choice > is a subject of nearly infinite depth, and this question ultimately > can only be answered by doing experiments with a specific workload > on specific hardware. The metadata space utilization efficiency is > something that can be answered conclusively though. > > yuri > > "Buterbaugh, Kevin L" ---09/24/2016 07:19:09 AM---Hi > Sven, I am confused by your statement that the metadata block size > should be 1 MB and am very int > > From: "Buterbaugh, Kevin L" > > To: gpfsug main discussion list >, > Date: 09/24/2016 07:19 AM > Subject: Re: [gpfsug-discuss] Blocksize > Sent by: gpfsug-discuss-bounces at spectrumscale.org > > > > > Hi Sven, > > I am confused by your statement that the metadata block size should > be 1 MB and am very interested in learning the rationale behind this > as I am currently looking at all aspects of our current GPFS > configuration and the possibility of making major changes. 
> > If you have a filesystem with only metadataOnly disks in the system > pool and the default size of an inode is 4K (which we would do, > since we have recently discovered that even on our scratch > filesystem we have a bazillion files that are 4K or smaller and > could therefore have their data stored in the inode, right?), then > why would you set the metadata block size to anything larger than > 128K when a sub-block is 1/32nd of a block? I.e., with a 1 MB block > size for metadata wouldn?t you be wasting a massive amount of space? > > What am I missing / confused about there? > > Oh, and here?s a related question ? let?s just say I have the above > configuration ? my system pool is metadata only and is on SSD?s. > Then I have two other dataOnly pools that are spinning disk. One is > for ?regular? access and the other is the ?capacity? pool ? i.e. a > pool of slower storage where we move files with large access times. > I have a policy that says something like ?move all files with an > access time > 6 months to the capacity pool.? Of those bazillion > files less than 4K in size that are fitting in the inode currently, > probably half a bazillion () of them would be subject to that > rule. Will they get moved to the spinning disk capacity pool or will > they stay in the inode?? > > Thanks! This is a very timely and interesting discussion for me as well... > > Kevin > On Sep 23, 2016, at 4:35 PM, Sven Oehme > wrote: > your metadata block size these days should be 1 MB and there are > only very few workloads for which you should run with a filesystem > blocksize below 1 MB. so if you don't know exactly what to pick, 1 > MB is a good starting point. > the general rule still applies that your filesystem blocksize > (metadata or data pool) should match your raid controller (or GNR > vdisk) stripe size of the particular pool. > > so if you use a 128k strip size(defaut in many midrange storage > controllers) in a 8+2p raid array, your stripe or track size is 1 MB > and therefore the blocksize of this pool should be 1 MB. i see many > customers in the field using 1MB or even smaller blocksize on RAID > stripes of 2 MB or above and your performance will be significant > impacted by that. > > Sven > > ------------------------------------------ > Sven Oehme > Scalable Storage Research > email: oehmes at us.ibm.com > Phone: +1 (408) 824-8904 > IBM Almaden Research Lab > ------------------------------------------ > > Stephen Ulmer ---09/23/2016 12:16:34 PM---Not to be too > pedantic, but I believe the the subblock size is 1/32 of the block > size (which strengt > > From: Stephen Ulmer > > To: gpfsug main discussion list > > Date: 09/23/2016 12:16 PM > Subject: Re: [gpfsug-discuss] Blocksize > Sent by: gpfsug-discuss-bounces at spectrumscale.org > > > > > Not to be too pedantic, but I believe the the subblock size is 1/32 > of the block size (which strengthens Luis?s arguments below). > > I thought the the original question was NOT about inode size, but > about metadata block size. You can specify that the system pool have > a different block size from the rest of the filesystem, providing > that it ONLY holds metadata (?metadata-block-size option to mmcrfs). > > So with 4K inodes (which should be used for all new filesystems > without some counter-indication), I would think that we?d want to > use a metadata block size of 4K*32=128K. This is independent of the > regular block size, which you can calculate based on the workload if > you?re lucky. 
> > There could be a great reason NOT to use 128K metadata block size, > but I don?t know what it is. I?d be happy to be corrected about this > if it?s out of whack. > > -- > Stephen > On Sep 22, 2016, at 3:37 PM, Luis Bolinches > wrote: > > Hi > > My 2 cents. > > Leave at least 4K inodes, then you get massive improvement on small > files (less 3.5K minus whatever you use on xattr) > > About blocksize for data, unless you have actual data that suggest > that you will actually benefit from smaller than 1MB block, leave > there. GPFS uses sublocks where 1/16th of the BS can be allocated to > different files, so the "waste" is much less than you think on 1MB > and you get the throughput and less structures of much more data blocks. > > No warranty at all but I try to do this when the BS talk comes in: > (might need some clean up it could not be last note but you get the idea) > > POSIX > find . -type f -name '*' -exec ls -l {} \; > find_ls_files.out > GPFS > cd /usr/lpp/mmfs/samples/ilm > gcc mmfindUtil_processOutputFile.c -o mmfindUtil_processOutputFile > ./mmfind /gpfs/shared -ls -type f > find_ls_files.out > CONVERT to CSV > > POSIX > cat find_ls_files.out | awk '{print $5","}' > find_ls_files.out.csv > GPFS > cat find_ls_files.out | awk '{print $7","}' > find_ls_files.out.csv > LOAD in octave > > FILESIZE = int32 (dlmread ("find_ls_files.out.csv", ",")); > Clean the second column (OPTIONAL as the next clean up will do the same) > > FILESIZE(:,[2]) = []; > If we are on 4K aligment we need to clean the files that go to > inodes (WELL not exactly ... extended attributes! so maybe use a > lower number!) > > FILESIZE(FILESIZE<=3584) =[]; > If we are not we need to clean the 0 size files > > FILESIZE(FILESIZE==0) =[]; > Median > > FILESIZEMEDIAN = int32 (median (FILESIZE)) > Mean > > FILESIZEMEAN = int32 (mean (FILESIZE)) > Variance > > int32 (var (FILESIZE)) > iqr interquartile range, i.e., the difference between the upper and > lower quartile, of the input data. > > int32 (iqr (FILESIZE)) > Standard deviation > > > For some FS with lots of files you might need a rather powerful > machine to run the calculations on octave, I never hit anything > could not manage on a 64GB RAM Power box. Most of the times it is > enough with my laptop. > > > > -- > Yst?v?llisin terveisin / Kind regards / Saludos cordiales / Salutations > > Luis Bolinches > Lab Services > http://www-03.ibm.com/systems/services/labservices/ > > IBM Laajalahdentie 23 (main Entrance) Helsinki, 00330 Finland > Phone: +358 503112585 > > "If you continually give you will continually have." Anonymous > > > ----- Original message ----- > From: Stef Coene > > Sent by: gpfsug-discuss-bounces at spectrumscale.org > To: gpfsug main discussion list > > Cc: > Subject: Re: [gpfsug-discuss] Blocksize > Date: Thu, Sep 22, 2016 10:30 PM > > On 09/22/2016 09:07 PM, J. Eric Wonderley wrote: > > It defaults to 4k: > > mmlsfs testbs8M -i > > flag value description > > ------------------- ------------------------ > > ----------------------------------- > > -i 4096 Inode size in bytes > > > > I think you can make as small as 512b. Gpfs will store very small > > files in the inode. > > > > Typically you want your average file size to be your blocksize and your > > filesystem has one blocksize and one inodesize. > > The files are not small, but around 20 MB on average. > So I calculated with IBM that a 1 MB or 2 MB block size is best. > > But I'm not sure if it's better to use a smaller block size for the > metadata. 
> > The file system is not that large (400 TB) and will hold backup data > from CommVault. > > > Stef > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > > > Ellei edell? ole toisin mainittu: / Unless stated otherwise above: > Oy IBM Finland Ab > PL 265, 00101 Helsinki, Finland > Business ID, Y-tunnus: 0195876-3 > Registered in Finland > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss -------------- next part -------------- An HTML attachment was scrubbed... URL: From alandhae at gmx.de Wed Sep 28 10:13:55 2016 From: alandhae at gmx.de (=?ISO-8859-15?Q?Andreas_Landh=E4u=DFer?=) Date: Wed, 28 Sep 2016 11:13:55 +0200 (CEST) Subject: [gpfsug-discuss] Proposal for dynamic heat assisted file tiering Message-ID: On Tue, 27 Sep 2016, Eric Horst wrote: Thanks Eric for the hint, shouldn't we as the users define a requirement for such a dynamic heat assisted file tiering option (DHAFTO). Keeping track which files have increased heat and triggering a transparent move to a faster tier. Since I haven't tested it on a GPFS FS, I would like to know about the performance penalties being observed, when frequently running the policies, just a rough estimate. Of course its depending on the speed of the Metadata disks (yes, we use different devices for Metadata) we are also running GPFS on various GSS Systems. IBM might also want bundling this option together with GSS/ESS hardware for better performance. Just my 2? Andreas >>> >>> if a file gets hot again, there is no rule for putting the file back >>> into a faster storage device? >> >> >> The file will get moved when you run the policy again. You can run the >> policy as often as you like. > > I think its worth stating clearly that if a file is in the Thrifty > slow pool and a user opens and reads/writes the file there is nothing > that moves this file to a different tier. A policy run is the only > action that relocates files. 
So if you apply the policy daily and over > the course of the day users access many cold files, the performance > accessing those cold files may not be ideal until the next day when > they are repacked by heat. A file is not automatically moved to the > fast tier on access read or write. I mention this because this aspect > of tiering was not immediately clear from the docs when I was a > neophyte GPFS admin and I had to learn by observation. It is easy for > one to make an assumption that it is a more dynamic tiering system > than it is. -- Andreas Landh?u?er +49 151 12133027 (mobile) alandhae at gmx.de From Robert.Oesterlin at nuance.com Wed Sep 28 11:56:51 2016 From: Robert.Oesterlin at nuance.com (Oesterlin, Robert) Date: Wed, 28 Sep 2016 10:56:51 +0000 Subject: [gpfsug-discuss] Blocksize - file size distribution Message-ID: /usr/lpp/mmfs/samples/debugtools/filehist Look at the README in that directory. Bob Oesterlin Sr Storage Engineer, Nuance HPC Grid From: on behalf of "Greg.Lehmann at csiro.au" Reply-To: gpfsug main discussion list Date: Wednesday, September 28, 2016 at 2:40 AM To: "gpfsug-discuss at spectrumscale.org" Subject: [EXTERNAL] Re: [gpfsug-discuss] Blocksize I am wondering what people use to produce a file size distribution report for their filesystems. Has everyone rolled their own or is there some goto app to use. Cheers, Greg From: gpfsug-discuss-bounces at spectrumscale.org [mailto:gpfsug-discuss-bounces at spectrumscale.org] On Behalf Of Buterbaugh, Kevin L Sent: Wednesday, 28 September 2016 7:21 AM To: gpfsug main discussion list Subject: Re: [gpfsug-discuss] Blocksize Hi All, Again, my thanks to all who responded to my last post. Let me begin by stating something I unintentionally omitted in my last post ? we already use SSDs for our metadata. Which leads me to yet another question ? of my three filesystems, two (/home and /scratch) are much older (created in 2010) and therefore currently have a 512 byte inode size. /data is newer and has a 4K inode size. Now if I combine /scratch and /data into one filesystem with a 4K inode size, the amount of space used by all the inodes coming over from /scratch is going to grow by a factor of eight unless I?m horribly confused. And I would assume I need to count the amount of space taken up by allocated inodes, not just used inodes. Therefore ? how much space my metadata takes up just grew significantly in importance since: 1) metadata is on very expensive enterprise class, vendor certified SSDs, 2) we use RAID 1 mirrors of those SSDs, and 3) we have metadata replication set to two. Some of the information presented by Sven and Yuri seems to contradict each other in regards to how much space inodes take up ? or I?m misunderstanding one or both of them! Leaving aside replication, if I use a 256K block size for my metadata and I use 4K inodes, are those inodes going to take up 4K each or are they going to take up 8K each (1/32nd of a 256K block)? By the way, I do have a file size / file age spreadsheet for each of my filesystems (which I would be willing to share with interested parties) and while I was not surprised to learn that I have over 10 million sub-1K files on /home, I was a bit surprised to find that I have almost as many sub-1K files on /scratch (and a few million more on /data). So there?s a huge potential win in having those files in the inode on SSD as opposed to on spinning disk, but there?s also a huge potential $$$ cost. Thanks again ? I hope others are gaining useful information from this thread. I sure am! 
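On the heat-based repacking question Andreas raises earlier in this digest: the pieces that exist today are file heat tracking plus a scheduled policy run, as Eric describes. A minimal sketch of how that could be wired up, with heavy caveats: the pool names, thresholds, file system name and decay period are all placeholders, and this is untested illustration rather than a recommendation:

# Turn on file heat tracking (values are examples; heat decays 10% per 24h here).
mmchconfig fileHeatPeriodMinutes=1440,fileHeatLossPercent=10

# Policy that pulls the hottest files back up, hottest first, until the fast
# pool reaches about 90% occupancy.
cat > /tmp/repack-by-heat.pol <<'EOF'
RULE 'tier_up' MIGRATE FROM POOL 'capacity' TO POOL 'fast'
     WEIGHT(FILE_HEAT) LIMIT(90)
EOF

# Run it as often as the metadata scan cost allows, e.g. nightly from cron.
mmapplypolicy fsname -P /tmp/repack-by-heat.pol -I yes

How often this is affordable is exactly the performance question Andreas asks: the candidate-selection scan scales with file count and metadata disk speed, while the data movement on top of it depends on how much actually changes tier.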
Kevin On Sep 27, 2016, at 1:26 PM, Yuri L Volobuev > wrote: > 1) Let?s assume that our overarching goal in configuring the block > size for metadata is performance from the user perspective ? i.e. > how fast is an ?ls -l? on my directory? Space savings aren?t > important, and how long policy scans or other ?administrative? type > tasks take is not nearly as important as that directory listing. > Does that change the recommended metadata block size? The performance challenges for the "ls -l" scenario are quite different from the policy scan scenario, so the same rules do not necessarily apply. During "ls -l" the code has to read inodes one by one (there's some prefetching going on, to take the edge off for the actual 'ls' thread, but prefetching is still one inode at a time). Metadata block size doesn't really come into the picture in this case, but inode size can be important -- depending on the storage performance characteristics. Does the storage you use for metadata exhibit a meaningfully different latency for 4K random reads vs 512 byte random reads? In my personal experience, on any modern storage device the difference is non-existent; in fact many devices (like all flash-based storage) use 4K native physical block size, and merely emulate 512 byte "sectors", so there's no way to read less than 4K. So from the inode read latency point of view 4K vs 512B is most likely a wash, but then 4K inodes can help improve performance of other operations, e.g. readdir of a small directory which fits entirely into the inode. If you use xattrs (e.g. as a side effect of using HSM), 4K inodes definitely help, but allowing xattrs to be stored in the inode. Policy scans reads inodes in full blocks, and there both metadata block size and inode size matter. Larger blocks could improve the inode read performance, while larger inodes mean that the same number of blocks hold fewer inodes and thus more blocks need to be read. So the policy inode scan performance is benefited by larger metadata block size and smaller inodes. However, policy scans also have to perform a directory traversal step, and that step tends to dominate the runtime of the overall run, and using larger inodes actually helps to speed up traversal of smaller directories. So whether larger inodes help or hurt the policy scan performance depends, yet again, on your file system composition. Overall, I believe that with all angles considered, larger inodes help with performance, and that was one of the considerations for making 4K inodes the default in V4.2+ versions. > 2) Let?s assume we have 3 filesystems, /home, /scratch (traditional > HPC use for those two) and /data (project space). Our storage > arrays are 24-bay units with two 8+2P RAID 6 LUNs, one RAID 1 > mirror, and two hot spare drives. The RAID 1 mirrors are for /home, > the RAID 6 LUNs are for /scratch or /data. /home has tons of small > files - so small that a 64K block size is currently used. /scratch > and /data have a mixture, but a 1 MB block size is the ?sweet spot? there. > > If you could ?start all over? with the same hardware being the only > restriction, would you: > > a) merge /scratch and /data into one filesystem but keep /home > separate since the LUN sizes are so very different, or > b) merge all three into one filesystem and use storage pools so that > /home is just a separate pool within the one filesystem? And if you > chose this option would you assign different block sizes to the pools? It's not possible to have different block sizes for different data pools. 
URL:

From Kevin.Buterbaugh at Vanderbilt.Edu Wed Sep 28 14:45:14 2016
From: Kevin.Buterbaugh at Vanderbilt.Edu (Buterbaugh, Kevin L)
Date: Wed, 28 Sep 2016 13:45:14 +0000
Subject: [gpfsug-discuss] Blocksize - file size distribution
In-Reply-To: References: Message-ID:

Greg,

Not saying this is the right way to go, but I rolled my own. I wrote a very simple Perl script that essentially does the Perl equivalent of a find on my GPFS filesystems, then stat's files and directories and writes the output to a text file. I run that one overnight or on the weekends. Takes about 6 hours to run across our 3 GPFS filesystems with metadata on SSDs.

Then I've written a couple of different Perl scripts to analyze that data in different ways (the idea being that collecting the data is "expensive" - but once you've got it it's cheap to analyze it in different ways). But the one I've been using for this project just breaks down the number of files and directories by size and age and produces a table. Rather than try to describe this, here's sample output:

For input file: gpfsFileInfo_20160915.txt

          <1 day | <1 wk  | <1 mo  | <2 mo  | <3 mo  | <4 mo  | <5 mo   | <6 mo  | <1 yr  | >1 year | Total Files
<1 KB     29538    111364   458260   634398   150199   305715   4388443   93733    966618   3499535   10637803
<2 KB     9875     20580    119414   295167   35961    67761    80462     33688    269595   851641    1784144
<4 KB     9212     45282    168678   496796   27771    23887    105135    23161    259242   1163327   2322491
<8 KB     4992     29284    105836   303349   28341    20346    246430    28394    216061   1148459   2131492
<16 KB    3987     18391    92492    218639   20513    19698    675097    30976    190691   851533    2122017
<32 KB    4844     12479    50235    265222   24830    18789    1058433   18030    196729   1066287   2715878
<64 KB    6358     24259    29474    222134   17493    10744    1381445   11358    240528   1123540   3067333
<128 KB   6531     59107    206269   186213   71823    114235   1008724   36722    186357   845921    2721902
<256 KB   1995     17638    19355    436611   8505     7554     3582738   7519     249510   744885    5076310
<512 KB   20645    12401    24700    111463   5659     22132    1121269   10774    273010   725155    2327208
<1 MB     2681     6482     37447    58459    6998     14945    305108    5857     160360   386152    984489
<4 MB     4554     84551    23320    100407   6818     32833    129758    22774    210935   528458    1144408
<1 GB     56652    33538    99667    87778    24313    68372    118928    42554    251528   916493    1699823
<10 GB    1245     2482     4524     3184     1043     1794     2733      1694     8731     20462     47892
<100 GB   47       230      470      265      92       198      172       122      1276     2061      4933
>100 GB   2        3        12       1        14       4        5         1        37       165       244

Total TB:  6.49    13.22    30.56    18.00    10.22    15.69    19.87     12.48    73.47    187.44
Grand Total: 387.46 TB

Everything other than the total space lines at the bottom are counts of number of files meeting that criteria. I've got another variation on the same script that we used when we were trying to determine how many old files we have and therefore how much data was an excellent candidate for moving to a slower, less expensive "capacity" pool.

I'm not sure how useful my tools would be to others - I'm certainly not a professional programmer by any stretch of the imagination (and yes, I do hear those of you who are saying, "Yeah, he's barely a professional SysAdmin!" ). But others of you have been so helpful to me - I'd like to try in some small way to help someone else.

Kevin

On Sep 28, 2016, at 5:56 AM, Oesterlin, Robert > wrote:

/usr/lpp/mmfs/samples/debugtools/filehist

Look at the README in that directory.
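(As an aside, not from the original posts: the same raw size/age data can also be gathered in a single policy scan, along the lines Marc and Ed describe further down in this thread. The rule below is only a sketch - the rule name and list name are arbitrary, and FILE_SIZE and ACCESS_TIME are standard policy-language attributes - and the resulting list still has to be binned into a table like the one above by a small post-processing step:)

RULE 'sizeage' LIST 'sizeage'
  SHOW (varchar(FILE_SIZE) || ' ' ||
        varchar(DAYS(CURRENT_TIMESTAMP) - DAYS(ACCESS_TIME)))

Run it through mmapplypolicy with -I defer (see Ed's note further down) so the list files are written out for post-processing.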
Bob Oesterlin Sr Storage Engineer, Nuance HPC Grid From: > on behalf of "Greg.Lehmann at csiro.au" > Reply-To: gpfsug main discussion list > Date: Wednesday, September 28, 2016 at 2:40 AM To: "gpfsug-discuss at spectrumscale.org" > Subject: [EXTERNAL] Re: [gpfsug-discuss] Blocksize I am wondering what people use to produce a file size distribution report for their filesystems. Has everyone rolled their own or is there some goto app to use. Cheers, Greg From: gpfsug-discuss-bounces at spectrumscale.org [mailto:gpfsug-discuss-bounces at spectrumscale.org] On Behalf Of Buterbaugh, Kevin L Sent: Wednesday, 28 September 2016 7:21 AM To: gpfsug main discussion list > Subject: Re: [gpfsug-discuss] Blocksize Hi All, Again, my thanks to all who responded to my last post. Let me begin by stating something I unintentionally omitted in my last post ? we already use SSDs for our metadata. Which leads me to yet another question ? of my three filesystems, two (/home and /scratch) are much older (created in 2010) and therefore currently have a 512 byte inode size. /data is newer and has a 4K inode size. Now if I combine /scratch and /data into one filesystem with a 4K inode size, the amount of space used by all the inodes coming over from /scratch is going to grow by a factor of eight unless I?m horribly confused. And I would assume I need to count the amount of space taken up by allocated inodes, not just used inodes. Therefore ? how much space my metadata takes up just grew significantly in importance since: 1) metadata is on very expensive enterprise class, vendor certified SSDs, 2) we use RAID 1 mirrors of those SSDs, and 3) we have metadata replication set to two. Some of the information presented by Sven and Yuri seems to contradict each other in regards to how much space inodes take up ? or I?m misunderstanding one or both of them! Leaving aside replication, if I use a 256K block size for my metadata and I use 4K inodes, are those inodes going to take up 4K each or are they going to take up 8K each (1/32nd of a 256K block)? By the way, I do have a file size / file age spreadsheet for each of my filesystems (which I would be willing to share with interested parties) and while I was not surprised to learn that I have over 10 million sub-1K files on /home, I was a bit surprised to find that I have almost as many sub-1K files on /scratch (and a few million more on /data). So there?s a huge potential win in having those files in the inode on SSD as opposed to on spinning disk, but there?s also a huge potential $$$ cost. Thanks again ? I hope others are gaining useful information from this thread. I sure am! Kevin On Sep 27, 2016, at 1:26 PM, Yuri L Volobuev > wrote: > 1) Let?s assume that our overarching goal in configuring the block > size for metadata is performance from the user perspective ? i.e. > how fast is an ?ls -l? on my directory? Space savings aren?t > important, and how long policy scans or other ?administrative? type > tasks take is not nearly as important as that directory listing. > Does that change the recommended metadata block size? The performance challenges for the "ls -l" scenario are quite different from the policy scan scenario, so the same rules do not necessarily apply. During "ls -l" the code has to read inodes one by one (there's some prefetching going on, to take the edge off for the actual 'ls' thread, but prefetching is still one inode at a time). 
Metadata block size doesn't really come into the picture in this case, but inode size can be important -- depending on the storage performance characteristics. Does the storage you use for metadata exhibit a meaningfully different latency for 4K random reads vs 512 byte random reads? In my personal experience, on any modern storage device the difference is non-existent; in fact many devices (like all flash-based storage) use 4K native physical block size, and merely emulate 512 byte "sectors", so there's no way to read less than 4K. So from the inode read latency point of view 4K vs 512B is most likely a wash, but then 4K inodes can help improve performance of other operations, e.g. readdir of a small directory which fits entirely into the inode. If you use xattrs (e.g. as a side effect of using HSM), 4K inodes definitely help, but allowing xattrs to be stored in the inode. Policy scans reads inodes in full blocks, and there both metadata block size and inode size matter. Larger blocks could improve the inode read performance, while larger inodes mean that the same number of blocks hold fewer inodes and thus more blocks need to be read. So the policy inode scan performance is benefited by larger metadata block size and smaller inodes. However, policy scans also have to perform a directory traversal step, and that step tends to dominate the runtime of the overall run, and using larger inodes actually helps to speed up traversal of smaller directories. So whether larger inodes help or hurt the policy scan performance depends, yet again, on your file system composition. Overall, I believe that with all angles considered, larger inodes help with performance, and that was one of the considerations for making 4K inodes the default in V4.2+ versions. > 2) Let?s assume we have 3 filesystems, /home, /scratch (traditional > HPC use for those two) and /data (project space). Our storage > arrays are 24-bay units with two 8+2P RAID 6 LUNs, one RAID 1 > mirror, and two hot spare drives. The RAID 1 mirrors are for /home, > the RAID 6 LUNs are for /scratch or /data. /home has tons of small > files - so small that a 64K block size is currently used. /scratch > and /data have a mixture, but a 1 MB block size is the ?sweet spot? there. > > If you could ?start all over? with the same hardware being the only > restriction, would you: > > a) merge /scratch and /data into one filesystem but keep /home > separate since the LUN sizes are so very different, or > b) merge all three into one filesystem and use storage pools so that > /home is just a separate pool within the one filesystem? And if you > chose this option would you assign different block sizes to the pools? It's not possible to have different block sizes for different data pools. We are very aware that many people would like to be able to do just that, but this is counter to where the product is going. Supporting different block sizes for different pools is actually pretty hard: it's tricky to describe a large file that has some blocks in poolA and some in poolB where poolB has a different block size (perhaps during a migration) with the existing inode/indirect block format where each disk address pointer addresses a block of fixed size. With some effort, and some changes to how block addressing works, it would be possible to implement the support for this. 
However, as I mentioned in another post in this thread, we don't really want to glorify manual block size selection any further, we want to move away from it, by addressing the reasons that drive different block size selection today (like disk space utilization and performance). I'd recommend calculating a file size distribution histogram for your file systems. You may, for example, discover that 80% of the small files you have in /home would fit into 4K inodes, and then the storage space efficiency gains for the remaining 20% don't justify the complexity of managing an extra file system with a small block size. We don't recommend using block sizes smaller than 256K, because smaller block size is not good for disk space allocation code efficiency. It's a quadratic dependency: with a smaller block size, one block worth of the block allocation map covers that much less disk space, because each bit in the map covers fewer disk sectors, and fewer bits fit into a block. This means having to create a lot more block allocation map segments than what is needed for an ample level of parallelism. This hurts performance of many block allocation-related operations. I don't see a reason for /scratch and /data to be separate file systems, aside from perhaps failure domain considerations. yuri > On Sep 26, 2016, at 2:29 PM, Yuri L Volobuev > wrote: > > I would put the net summary this way: in GPFS, the "Goldilocks zone" > for metadata block size is 256K - 1M. If one plans to create a new > file system using GPFS V4.2+, 1M is a sound choice. > > In an ideal world, block size choice shouldn't really be a choice. > It's a low-level implementation detail that one day should go the > way of the manual ignition timing adjustment -- something that used > to be necessary in the olden days, and something that select > enthusiasts like to tweak to this day, but something that's > irrelevant for the overwhelming majority of the folks who just want > the engine to run. There's work being done in that general direction > in GPFS, but we aren't there yet. > > yuri > > Stephen Ulmer ---09/26/2016 12:02:25 PM---Now I?ve got > anther question? which I?ll let bake for a while. Okay, to (poorly) summarize: > > From: Stephen Ulmer > > To: gpfsug main discussion list >, > Date: 09/26/2016 12:02 PM > Subject: Re: [gpfsug-discuss] Blocksize > Sent by: gpfsug-discuss-bounces at spectrumscale.org > > > > Now I?ve got anther question? which I?ll let bake for a while. > > Okay, to (poorly) summarize: > There are items OTHER THAN INODES stored as metadata in GPFS. > These items have a VARIETY OF SIZES, but are packed in such a way > that we should just not worry about wasted space unless we pick a > LARGE metadata block size ? or if we don?t pick a ?reasonable? > metadata block size after picking a ?large? file system block size > that applies to both. > Performance is hard, and the gain from calculating exactly the best > metadata block size is much smaller than performance gains attained > through code optimization. > If we were to try and calculate the appropriate metadata block size > we would likely be wrong anyway, since none of us get our data at > the idealized physics shop that sells massless rulers and > frictionless pulleys. > We should probably all use a metadata block size around 1MB. Nobody > has said this outright, but it?s been the example as the ?good? size > at least three times in this thread. 
> Under no circumstances should we do what many of us would have done > and pick 128K, which made sense based on all of our previous > education that is no longer applicable. > > Did I miss anything? :) > > Liberty, > > -- > Stephen > > On Sep 26, 2016, at 2:18 PM, Yuri L Volobuev > wrote: > It's important to understand the differences between different > metadata types, in particular where it comes to space allocation. > > System metadata files (inode file, inode and block allocation maps, > ACL file, fileset metadata file, EA file in older versions) are > allocated at well-defined moments (file system format, new storage > pool creation in the case of block allocation map, etc), and those > contain multiple records packed into a single block. From the block > allocator point of view, the individual metadata record size is > invisible, only larger blocks get actually allocated, and space > usage efficiency generally isn't an issue. > > For user metadata (indirect blocks, directory blocks, EA overflow > blocks) the situation is different. Those get allocated as the need > arises, generally one at a time. So the size of an individual > metadata structure matters, a lot. The smallest unit of allocation > in GPFS is a subblock (1/32nd of a block). If an IB or a directory > block is smaller than a subblock, the unused space in the subblock > is wasted. So if one chooses to use, say, 16 MiB block size for > metadata, the smallest unit of space that can be allocated is 512 > KiB. If one chooses 1 MiB block size, the smallest allocation unit > is 32 KiB. IBs are generally 16 KiB or 32 KiB in size (32 KiB with > any reasonable data block size); directory blocks used to be limited > to 32 KiB, but in the current code can be as large as 256 KiB. As > one can observe, using 16 MiB metadata block size would lead to a > considerable amount of wasted space for IBs and large directories > (small directories can live in inodes). On the other hand, with 1 > MiB block size, there'll be no wasted metadata space. Does any of > this actually make a practical difference? That depends on the file > system composition, namely the number of IBs (which is a function of > the number of large files) and larger directories. Calculating this > scientifically can be pretty involved, and really should be the job > of a tool that ought to exist, but doesn't (yet). A more practical > approach is doing a ballpark estimate using local file counts and > typical fractions of large files and directories, using statistics > available from published papers. > > The performance implications of a given metadata block size choice > is a subject of nearly infinite depth, and this question ultimately > can only be answered by doing experiments with a specific workload > on specific hardware. The metadata space utilization efficiency is > something that can be answered conclusively though. > > yuri > > "Buterbaugh, Kevin L" ---09/24/2016 07:19:09 AM---Hi > Sven, I am confused by your statement that the metadata block size > should be 1 MB and am very int > > From: "Buterbaugh, Kevin L" > > To: gpfsug main discussion list >, > Date: 09/24/2016 07:19 AM > Subject: Re: [gpfsug-discuss] Blocksize > Sent by: gpfsug-discuss-bounces at spectrumscale.org > > > > > Hi Sven, > > I am confused by your statement that the metadata block size should > be 1 MB and am very interested in learning the rationale behind this > as I am currently looking at all aspects of our current GPFS > configuration and the possibility of making major changes. 
> > If you have a filesystem with only metadataOnly disks in the system > pool and the default size of an inode is 4K (which we would do, > since we have recently discovered that even on our scratch > filesystem we have a bazillion files that are 4K or smaller and > could therefore have their data stored in the inode, right?), then > why would you set the metadata block size to anything larger than > 128K when a sub-block is 1/32nd of a block? I.e., with a 1 MB block > size for metadata wouldn?t you be wasting a massive amount of space? > > What am I missing / confused about there? > > Oh, and here?s a related question ? let?s just say I have the above > configuration ? my system pool is metadata only and is on SSD?s. > Then I have two other dataOnly pools that are spinning disk. One is > for ?regular? access and the other is the ?capacity? pool ? i.e. a > pool of slower storage where we move files with large access times. > I have a policy that says something like ?move all files with an > access time > 6 months to the capacity pool.? Of those bazillion > files less than 4K in size that are fitting in the inode currently, > probably half a bazillion () of them would be subject to that > rule. Will they get moved to the spinning disk capacity pool or will > they stay in the inode?? > > Thanks! This is a very timely and interesting discussion for me as well... > > Kevin > On Sep 23, 2016, at 4:35 PM, Sven Oehme > wrote: > your metadata block size these days should be 1 MB and there are > only very few workloads for which you should run with a filesystem > blocksize below 1 MB. so if you don't know exactly what to pick, 1 > MB is a good starting point. > the general rule still applies that your filesystem blocksize > (metadata or data pool) should match your raid controller (or GNR > vdisk) stripe size of the particular pool. > > so if you use a 128k strip size(defaut in many midrange storage > controllers) in a 8+2p raid array, your stripe or track size is 1 MB > and therefore the blocksize of this pool should be 1 MB. i see many > customers in the field using 1MB or even smaller blocksize on RAID > stripes of 2 MB or above and your performance will be significant > impacted by that. > > Sven > > ------------------------------------------ > Sven Oehme > Scalable Storage Research > email: oehmes at us.ibm.com > Phone: +1 (408) 824-8904 > IBM Almaden Research Lab > ------------------------------------------ > > Stephen Ulmer ---09/23/2016 12:16:34 PM---Not to be too > pedantic, but I believe the the subblock size is 1/32 of the block > size (which strengt > > From: Stephen Ulmer > > To: gpfsug main discussion list > > Date: 09/23/2016 12:16 PM > Subject: Re: [gpfsug-discuss] Blocksize > Sent by: gpfsug-discuss-bounces at spectrumscale.org > > > > > Not to be too pedantic, but I believe the the subblock size is 1/32 > of the block size (which strengthens Luis?s arguments below). > > I thought the the original question was NOT about inode size, but > about metadata block size. You can specify that the system pool have > a different block size from the rest of the filesystem, providing > that it ONLY holds metadata (?metadata-block-size option to mmcrfs). > > So with 4K inodes (which should be used for all new filesystems > without some counter-indication), I would think that we?d want to > use a metadata block size of 4K*32=128K. This is independent of the > regular block size, which you can calculate based on the workload if > you?re lucky. 
> > There could be a great reason NOT to use 128K metadata block size, > but I don?t know what it is. I?d be happy to be corrected about this > if it?s out of whack. > > -- > Stephen > On Sep 22, 2016, at 3:37 PM, Luis Bolinches > wrote: > > Hi > > My 2 cents. > > Leave at least 4K inodes, then you get massive improvement on small > files (less 3.5K minus whatever you use on xattr) > > About blocksize for data, unless you have actual data that suggest > that you will actually benefit from smaller than 1MB block, leave > there. GPFS uses sublocks where 1/16th of the BS can be allocated to > different files, so the "waste" is much less than you think on 1MB > and you get the throughput and less structures of much more data blocks. > > No warranty at all but I try to do this when the BS talk comes in: > (might need some clean up it could not be last note but you get the idea) > > POSIX > find . -type f -name '*' -exec ls -l {} \; > find_ls_files.out > GPFS > cd /usr/lpp/mmfs/samples/ilm > gcc mmfindUtil_processOutputFile.c -o mmfindUtil_processOutputFile > ./mmfind /gpfs/shared -ls -type f > find_ls_files.out > CONVERT to CSV > > POSIX > cat find_ls_files.out | awk '{print $5","}' > find_ls_files.out.csv > GPFS > cat find_ls_files.out | awk '{print $7","}' > find_ls_files.out.csv > LOAD in octave > > FILESIZE = int32 (dlmread ("find_ls_files.out.csv", ",")); > Clean the second column (OPTIONAL as the next clean up will do the same) > > FILESIZE(:,[2]) = []; > If we are on 4K aligment we need to clean the files that go to > inodes (WELL not exactly ... extended attributes! so maybe use a > lower number!) > > FILESIZE(FILESIZE<=3584) =[]; > If we are not we need to clean the 0 size files > > FILESIZE(FILESIZE==0) =[]; > Median > > FILESIZEMEDIAN = int32 (median (FILESIZE)) > Mean > > FILESIZEMEAN = int32 (mean (FILESIZE)) > Variance > > int32 (var (FILESIZE)) > iqr interquartile range, i.e., the difference between the upper and > lower quartile, of the input data. > > int32 (iqr (FILESIZE)) > Standard deviation > > > For some FS with lots of files you might need a rather powerful > machine to run the calculations on octave, I never hit anything > could not manage on a 64GB RAM Power box. Most of the times it is > enough with my laptop. > > > > -- > Yst?v?llisin terveisin / Kind regards / Saludos cordiales / Salutations > > Luis Bolinches > Lab Services > http://www-03.ibm.com/systems/services/labservices/ > > IBM Laajalahdentie 23 (main Entrance) Helsinki, 00330 Finland > Phone: +358 503112585 > > "If you continually give you will continually have." Anonymous > > > ----- Original message ----- > From: Stef Coene > > Sent by: gpfsug-discuss-bounces at spectrumscale.org > To: gpfsug main discussion list > > Cc: > Subject: Re: [gpfsug-discuss] Blocksize > Date: Thu, Sep 22, 2016 10:30 PM > > On 09/22/2016 09:07 PM, J. Eric Wonderley wrote: > > It defaults to 4k: > > mmlsfs testbs8M -i > > flag value description > > ------------------- ------------------------ > > ----------------------------------- > > -i 4096 Inode size in bytes > > > > I think you can make as small as 512b. Gpfs will store very small > > files in the inode. > > > > Typically you want your average file size to be your blocksize and your > > filesystem has one blocksize and one inodesize. > > The files are not small, but around 20 MB on average. > So I calculated with IBM that a 1 MB or 2 MB block size is best. > > But I'm not sure if it's better to use a smaller block size for the > metadata. 
> > The file system is not that large (400 TB) and will hold backup data > from CommVault. > > > Stef > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > > > Ellei edell? ole toisin mainittu: / Unless stated otherwise above: > Oy IBM Finland Ab > PL 265, 00101 Helsinki, Finland > Business ID, Y-tunnus: 0195876-3 > Registered in Finland > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss -------------- next part -------------- An HTML attachment was scrubbed... URL: From makaplan at us.ibm.com Wed Sep 28 15:34:05 2016 From: makaplan at us.ibm.com (Marc A Kaplan) Date: Wed, 28 Sep 2016 10:34:05 -0400 Subject: [gpfsug-discuss] Blocksize - file size distribution In-Reply-To: References: Message-ID: Consider using samples/ilm/mmfind (or mmapplypolicy with a LIST ... SHOW rule) to gather the stats much faster. Should be minutes, not hours. -------------- next part -------------- An HTML attachment was scrubbed... URL: From makaplan at us.ibm.com Wed Sep 28 16:23:12 2016 From: makaplan at us.ibm.com (Marc A Kaplan) Date: Wed, 28 Sep 2016 11:23:12 -0400 Subject: [gpfsug-discuss] Blocksize In-Reply-To: <4BDC0E5A-176A-48CC-8DE6-93C7B5A3F138@vanderbilt.edu> References: <17781503-26B3-4448-B7B9-1EE27ABE6D1F@ulmer.org><13272A1B-425D-4DDD-A931-490604F92D61@ulmer.org><6C51B04A-9097-4598-8B4A-C484A3D98EE2@vanderbilt.edu> <4BDC0E5A-176A-48CC-8DE6-93C7B5A3F138@vanderbilt.edu> Message-ID: OKAY, I'll say it again. inodes are PACKED into a single inode file. So a 4KB inode takes 4KB, REGARDLESS of metadata blocksize. There is no wasted space. (Of course if you have metadata replication = 2, then yes, double that. And yes, there overhead for indirect blocks (indices), allocation maps, etc, etc.) And your choice is not just 512 or 4096. 
Maybe 1KB or 2KB is a good choice for your data distribution, to optimize packing of data and/or directories into inodes...

Hmmm... I don't know why the doc leaves out 2048, perhaps a typo...

mmcrfs x2K -i 2048

[root at n2 charts]# mmlsfs x2K -i
flag                value                    description
------------------- ------------------------ -----------------------------------
 -i                 2048                     Inode size in bytes

Works for me!

-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From makaplan at us.ibm.com Wed Sep 28 16:33:29 2016
From: makaplan at us.ibm.com (Marc A Kaplan)
Date: Wed, 28 Sep 2016 11:33:29 -0400
Subject: [gpfsug-discuss] Proposal for dynamic heat assisted file tiering
In-Reply-To: References: Message-ID:

Suppose we could "dynamically" change the pool assignment of a file. How/when would you have us do that? When will that generate unnecessary, "wasteful" IOPs? How do we know if/when/how often you will access a file in the future?

This is similar to other classical caching policies, but there the choice is usually just which pages to flush from the cache when we need space ... The usual compromise is "LRU", but maybe some systems allow hints. When there are multiple pools, it seems more complicated, more degrees of freedom ...

Would you be willing and able to write some new policy rules to provide directions to Spectrum Scale for dynamic tiering? What would that look like? Would it be worth the time and effort over what we have now?

-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From Robert.Oesterlin at nuance.com Wed Sep 28 19:13:35 2016
From: Robert.Oesterlin at nuance.com (Oesterlin, Robert)
Date: Wed, 28 Sep 2016 18:13:35 +0000
Subject: [gpfsug-discuss] Biggest file that will fit inside an inode?
Message-ID: <0D55AB1D-DB9D-45CF-AB27-157CDA1172D9@nuance.com>

What's the largest file that will fit inside a 1K, 2K, or 4K inode?

Bob Oesterlin
Sr Storage Engineer, Nuance HPC Grid

-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From ewahl at osc.edu Wed Sep 28 21:18:55 2016
From: ewahl at osc.edu (Edward Wahl)
Date: Wed, 28 Sep 2016 16:18:55 -0400
Subject: [gpfsug-discuss] Blocksize - file size distribution
In-Reply-To: References: Message-ID: <20160928161855.1df32434@osc.edu>

On Wed, 28 Sep 2016 10:34:05 -0400 Marc A Kaplan wrote:
> Consider using samples/ilm/mmfind (or mmapplypolicy with a LIST ...
> SHOW rule) to gather the stats much faster. Should be minutes, not
> hours.
>

I'll agree with the policy engine. Runs like a beast if you tune it a little for nodes and threads. Only takes a couple of minutes to collect info on over a hundred million files.

Show where the data is now by pool and sort it by age with queries? Quick hack-up example; you could sort the mess on the front end fairly quickly. (Use fileset or pool, etc. as your storage needs dictate.)

RULE '2yrold_files' LIST '2yrold_filelist.txt'
  SHOW (varchar(file_size) || ' ' || varchar(USER_ID) || ' ' || varchar(POOL_NAME))
  WHERE DAYS(CURRENT_TIMESTAMP) - DAYS(ACCESS_TIME) >= 730
    AND DAYS(CURRENT_TIMESTAMP) - DAYS(ACCESS_TIME) < 1095

Don't forget to run the engine with -I defer for this kind of list/show policy.
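(Again only an illustrative sketch, not from the original post: a typical deferred run of a rule like the one above might look like the following, where gpfs0 is a placeholder file system name, policy.rules holds the rule, and the -f prefix, -N helper nodes and -g global work directory are all placeholders to adapt to your own cluster:)

mmapplypolicy gpfs0 -P policy.rules -I defer -f /tmp/filesize -N node1,node2 -g /gpfs/gpfs0/tmp

With -I defer the candidate list files are generated under the -f prefix without any further action being taken, so they can be sorted and binned on the front end however you like.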
Ed

--
Ed Wahl
Ohio Supercomputer Center
614-292-9302

From christof.schmitt at us.ibm.com Wed Sep 28 21:33:45 2016
From: christof.schmitt at us.ibm.com (Christof Schmitt)
Date: Wed, 28 Sep 2016 13:33:45 -0700
Subject: [gpfsug-discuss] Samba via CES
In-Reply-To: <20160927214257.z4ezmssnpwhmm4rk@ics.muni.cz>
References: <20160927193815.kpppiy76wpudg6cj@ics.muni.cz> <20160927214257.z4ezmssnpwhmm4rk@ics.muni.cz>
Message-ID:

The client has to reconnect, open the file again and reissue requests that have not been completed. Without persistent handles, the main risk is that another client can step in and access the same file in the meantime. With persistent handles, access from other clients would be prevented for a defined amount of time.

Christof Schmitt || IBM || Spectrum Scale Development || Tucson, AZ
christof.schmitt at us.ibm.com || +1-520-799-2469 (T/L: 321-2469)

From: Lukas Hejtmanek
To: gpfsug main discussion list
Date: 09/27/2016 02:43 PM
Subject: Re: [gpfsug-discuss] Samba via CES
Sent by: gpfsug-discuss-bounces at spectrumscale.org

On Tue, Sep 27, 2016 at 02:36:37PM -0700, Christof Schmitt wrote:
> When a CES node fails, protocol clients have to reconnect to one of the
> remaining nodes.
>
> Samba in CES does not support persistent handles. This is indicated in the
> documentation:
>
> http://www.ibm.com/support/knowledgecenter/en/STXKQY_4.2.1/com.ibm.spectrum.scale.v4r21.doc/bl1adm_smbexportlimits.htm#bl1adm_smbexportlimits
>
> "Only mandatory SMB3 protocol features are supported. "

Well, but in this case the HA feature is a bit pointless, as a node failure results in a client failure as well, since reconnect does not seem to be automatic if there is ongoing traffic.. more precisely, reconnect is automatic, but without persistent handles the client receives a write-protect error immediately.

--
Lukáš Hejtmánek
_______________________________________________
gpfsug-discuss mailing list
gpfsug-discuss at spectrumscale.org
http://gpfsug.org/mailman/listinfo/gpfsug-discuss

From bbanister at jumptrading.com Wed Sep 28 21:56:47 2016
From: bbanister at jumptrading.com (Bryan Banister)
Date: Wed, 28 Sep 2016 20:56:47 +0000
Subject: [gpfsug-discuss] Biggest file that will fit inside an inode?
In-Reply-To: <0D55AB1D-DB9D-45CF-AB27-157CDA1172D9@nuance.com>
References: <0D55AB1D-DB9D-45CF-AB27-157CDA1172D9@nuance.com>
Message-ID: <21BC488F0AEA2245B2C3E83FC0B33DBB0633CA80@CHI-EXCHANGEW1.w2k.jumptrading.com>

I think the guideline for 4K inodes is roughly 3.5 KB, depending on the use of extended attributes,
-Bryan

From: gpfsug-discuss-bounces at spectrumscale.org [mailto:gpfsug-discuss-bounces at spectrumscale.org] On Behalf Of Oesterlin, Robert
Sent: Wednesday, September 28, 2016 1:14 PM
To: gpfsug main discussion list
Subject: [gpfsug-discuss] Biggest file that will fit inside an inode?

What's the largest file that will fit inside a 1K, 2K, or 4K inode?

Bob Oesterlin
Sr Storage Engineer, Nuance HPC Grid

________________________________
Note: This email is for the confidential use of the named addressee(s) only and may contain proprietary, confidential or privileged information. If you are not the intended recipient, you are hereby notified that any review, dissemination or copying of this email is strictly prohibited, and to please notify the sender immediately and destroy this email and any attachments. Email transmission cannot be guaranteed to be secure or error-free. The Company, therefore, does not make any guarantees as to the completeness or accuracy of this email or any attachments.
This email is for informational purposes only and does not constitute a recommendation, offer, request or solicitation of any kind to buy, sell, subscribe, redeem or perform any type of transaction of a financial product. -------------- next part -------------- An HTML attachment was scrubbed... URL: From xhejtman at ics.muni.cz Wed Sep 28 23:03:36 2016 From: xhejtman at ics.muni.cz (Lukas Hejtmanek) Date: Thu, 29 Sep 2016 00:03:36 +0200 Subject: [gpfsug-discuss] Samba via CES In-Reply-To: References: <20160927193815.kpppiy76wpudg6cj@ics.muni.cz> <20160927214257.z4ezmssnpwhmm4rk@ics.muni.cz> Message-ID: <20160928220336.d5vdwp7fejbj2bzf@ics.muni.cz> On Wed, Sep 28, 2016 at 01:33:45PM -0700, Christof Schmitt wrote: > The client has to reconnect, open the file again and reissue request that > have not been completed. Without persistent handles, the main risk is that > another client can step in and access the same file in the meantime. With > persistent handles, access from other clients would be prevented for a > defined amount of time. well, I guess I cannot reconfigure the client so that reissuing request is done by OS and not rised up to the user? E.g., if user runs video encoding directly to Samba share and encoding runs for several hours, reissuing request, i.e., restart encoding, is not exactly what user accepts. -- Luk?? Hejtm?nek From abeattie at au1.ibm.com Wed Sep 28 23:25:01 2016 From: abeattie at au1.ibm.com (Andrew Beattie) Date: Wed, 28 Sep 2016 22:25:01 +0000 Subject: [gpfsug-discuss] Samba via CES In-Reply-To: <20160928220336.d5vdwp7fejbj2bzf@ics.muni.cz> References: <20160928220336.d5vdwp7fejbj2bzf@ics.muni.cz>, <20160927193815.kpppiy76wpudg6cj@ics.muni.cz><20160927214257.z4ezmssnpwhmm4rk@ics.muni.cz> Message-ID: An HTML attachment was scrubbed... URL: From Greg.Lehmann at csiro.au Wed Sep 28 23:49:31 2016 From: Greg.Lehmann at csiro.au (Greg.Lehmann at csiro.au) Date: Wed, 28 Sep 2016 22:49:31 +0000 Subject: [gpfsug-discuss] Blocksize - file size distribution In-Reply-To: References: Message-ID: <2ed56fe8c9c34eb5a1da25800b2951e0@exch1-cdc.nexus.csiro.au> Kevin, Thanks for the offer of help. I am capable of writing my own, but it looks like the best approach is to use mmapplypolicy, something I had not thought of. This is precisely the reason I asked what looks like a silly question. You don?t know what you don?t know! The quality of content on this list has been exceptional of late! Cheers, Greg From: gpfsug-discuss-bounces at spectrumscale.org [mailto:gpfsug-discuss-bounces at spectrumscale.org] On Behalf Of Buterbaugh, Kevin L Sent: Wednesday, 28 September 2016 11:45 PM To: gpfsug main discussion list Subject: Re: [gpfsug-discuss] Blocksize - file size distribution Greg, Not saying this is the right way to go, but I rolled my own. I wrote a very simple Perl script that essentially does the Perl equivalent of a find on my GPFS filesystems, then stat?s files and directories and writes the output to a text file. I run that one overnight or on the weekends. Takes about 6 hours to run across our 3 GPFS filesystems with metadata on SSDs. Then I?ve written a couple of different Perl scripts to analyze that data in different ways (the idea being that collecting the data is ?expensive? ? but once you?ve got it it?s cheap to analyze it in different ways). But the one I?ve been using for this project just breaks down the number of files and directories by size and age and produces a table. 
Rather than try to describe this, here?s sample output: For input file: gpfsFileInfo_20160915.txt <1 day | <1 wk | <1 mo | <2 mo | <3 mo | <4 mo | <5 mo | <6 mo | <1 yr | >1 year | Total Files <1 KB 29538 111364 458260 634398 150199 305715 4388443 93733 966618 3499535 10637803 <2 KB 9875 20580 119414 295167 35961 67761 80462 33688 269595 851641 1784144 <4 KB 9212 45282 168678 496796 27771 23887 105135 23161 259242 1163327 2322491 <8 KB 4992 29284 105836 303349 28341 20346 246430 28394 216061 1148459 2131492 <16 KB 3987 18391 92492 218639 20513 19698 675097 30976 190691 851533 2122017 <32 KB 4844 12479 50235 265222 24830 18789 1058433 18030 196729 1066287 2715878 <64 KB 6358 24259 29474 222134 17493 10744 1381445 11358 240528 1123540 3067333 <128 KB 6531 59107 206269 186213 71823 114235 1008724 36722 186357 845921 2721902 <256 KB 1995 17638 19355 436611 8505 7554 3582738 7519 249510 744885 5076310 <512 KB 20645 12401 24700 111463 5659 22132 1121269 10774 273010 725155 2327208 <1 MB 2681 6482 37447 58459 6998 14945 305108 5857 160360 386152 984489 <4 MB 4554 84551 23320 100407 6818 32833 129758 22774 210935 528458 1144408 <1 GB 56652 33538 99667 87778 24313 68372 118928 42554 251528 916493 1699823 <10 GB 1245 2482 4524 3184 1043 1794 2733 1694 8731 20462 47892 <100 GB 47 230 470 265 92 198 172 122 1276 2061 4933 >100 GB 2 3 12 1 14 4 5 1 37 165 244 Total TB: 6.49 13.22 30.56 18.00 10.22 15.69 19.87 12.48 73.47 187.44 Grand Total: 387.46 TB Everything other than the total space lines at the bottom are counts of number of files meeting that criteria. I?ve got another variation on the same script that we used when we were trying to determine how many old files we have and therefore how much data was an excellent candidate for moving to a slower, less expensive ?capacity? pool. I?m not sure how useful my tools would be to others ? I?m certainly not a professional programmer by any stretch of the imagination (and yes, I do hear those of you who are saying, ?Yeah, he?s barely a professional SysAdmin!? ). But others of you have been so helpful to me ? I?d like to try in some small way to help someone else. Kevin On Sep 28, 2016, at 5:56 AM, Oesterlin, Robert > wrote: /usr/lpp/mmfs/samples/debugtools/filehist Look at the README in that directory. Bob Oesterlin Sr Storage Engineer, Nuance HPC Grid From: > on behalf of "Greg.Lehmann at csiro.au" > Reply-To: gpfsug main discussion list > Date: Wednesday, September 28, 2016 at 2:40 AM To: "gpfsug-discuss at spectrumscale.org" > Subject: [EXTERNAL] Re: [gpfsug-discuss] Blocksize I am wondering what people use to produce a file size distribution report for their filesystems. Has everyone rolled their own or is there some goto app to use. Cheers, Greg From: gpfsug-discuss-bounces at spectrumscale.org [mailto:gpfsug-discuss-bounces at spectrumscale.org] On Behalf Of Buterbaugh, Kevin L Sent: Wednesday, 28 September 2016 7:21 AM To: gpfsug main discussion list > Subject: Re: [gpfsug-discuss] Blocksize Hi All, Again, my thanks to all who responded to my last post. Let me begin by stating something I unintentionally omitted in my last post ? we already use SSDs for our metadata. Which leads me to yet another question ? of my three filesystems, two (/home and /scratch) are much older (created in 2010) and therefore currently have a 512 byte inode size. /data is newer and has a 4K inode size. 
Now if I combine /scratch and /data into one filesystem with a 4K inode size, the amount of space used by all the inodes coming over from /scratch is going to grow by a factor of eight unless I?m horribly confused. And I would assume I need to count the amount of space taken up by allocated inodes, not just used inodes. Therefore ? how much space my metadata takes up just grew significantly in importance since: 1) metadata is on very expensive enterprise class, vendor certified SSDs, 2) we use RAID 1 mirrors of those SSDs, and 3) we have metadata replication set to two. Some of the information presented by Sven and Yuri seems to contradict each other in regards to how much space inodes take up ? or I?m misunderstanding one or both of them! Leaving aside replication, if I use a 256K block size for my metadata and I use 4K inodes, are those inodes going to take up 4K each or are they going to take up 8K each (1/32nd of a 256K block)? By the way, I do have a file size / file age spreadsheet for each of my filesystems (which I would be willing to share with interested parties) and while I was not surprised to learn that I have over 10 million sub-1K files on /home, I was a bit surprised to find that I have almost as many sub-1K files on /scratch (and a few million more on /data). So there?s a huge potential win in having those files in the inode on SSD as opposed to on spinning disk, but there?s also a huge potential $$$ cost. Thanks again ? I hope others are gaining useful information from this thread. I sure am! Kevin On Sep 27, 2016, at 1:26 PM, Yuri L Volobuev > wrote: > 1) Let?s assume that our overarching goal in configuring the block > size for metadata is performance from the user perspective ? i.e. > how fast is an ?ls -l? on my directory? Space savings aren?t > important, and how long policy scans or other ?administrative? type > tasks take is not nearly as important as that directory listing. > Does that change the recommended metadata block size? The performance challenges for the "ls -l" scenario are quite different from the policy scan scenario, so the same rules do not necessarily apply. During "ls -l" the code has to read inodes one by one (there's some prefetching going on, to take the edge off for the actual 'ls' thread, but prefetching is still one inode at a time). Metadata block size doesn't really come into the picture in this case, but inode size can be important -- depending on the storage performance characteristics. Does the storage you use for metadata exhibit a meaningfully different latency for 4K random reads vs 512 byte random reads? In my personal experience, on any modern storage device the difference is non-existent; in fact many devices (like all flash-based storage) use 4K native physical block size, and merely emulate 512 byte "sectors", so there's no way to read less than 4K. So from the inode read latency point of view 4K vs 512B is most likely a wash, but then 4K inodes can help improve performance of other operations, e.g. readdir of a small directory which fits entirely into the inode. If you use xattrs (e.g. as a side effect of using HSM), 4K inodes definitely help, but allowing xattrs to be stored in the inode. Policy scans reads inodes in full blocks, and there both metadata block size and inode size matter. Larger blocks could improve the inode read performance, while larger inodes mean that the same number of blocks hold fewer inodes and thus more blocks need to be read. 
So the policy inode scan performance is benefited by larger metadata block size and smaller inodes. However, policy scans also have to perform a directory traversal step, and that step tends to dominate the runtime of the overall run, and using larger inodes actually helps to speed up traversal of smaller directories. So whether larger inodes help or hurt the policy scan performance depends, yet again, on your file system composition. Overall, I believe that with all angles considered, larger inodes help with performance, and that was one of the considerations for making 4K inodes the default in V4.2+ versions. > 2) Let?s assume we have 3 filesystems, /home, /scratch (traditional > HPC use for those two) and /data (project space). Our storage > arrays are 24-bay units with two 8+2P RAID 6 LUNs, one RAID 1 > mirror, and two hot spare drives. The RAID 1 mirrors are for /home, > the RAID 6 LUNs are for /scratch or /data. /home has tons of small > files - so small that a 64K block size is currently used. /scratch > and /data have a mixture, but a 1 MB block size is the ?sweet spot? there. > > If you could ?start all over? with the same hardware being the only > restriction, would you: > > a) merge /scratch and /data into one filesystem but keep /home > separate since the LUN sizes are so very different, or > b) merge all three into one filesystem and use storage pools so that > /home is just a separate pool within the one filesystem? And if you > chose this option would you assign different block sizes to the pools? It's not possible to have different block sizes for different data pools. We are very aware that many people would like to be able to do just that, but this is counter to where the product is going. Supporting different block sizes for different pools is actually pretty hard: it's tricky to describe a large file that has some blocks in poolA and some in poolB where poolB has a different block size (perhaps during a migration) with the existing inode/indirect block format where each disk address pointer addresses a block of fixed size. With some effort, and some changes to how block addressing works, it would be possible to implement the support for this. However, as I mentioned in another post in this thread, we don't really want to glorify manual block size selection any further, we want to move away from it, by addressing the reasons that drive different block size selection today (like disk space utilization and performance). I'd recommend calculating a file size distribution histogram for your file systems. You may, for example, discover that 80% of the small files you have in /home would fit into 4K inodes, and then the storage space efficiency gains for the remaining 20% don't justify the complexity of managing an extra file system with a small block size. We don't recommend using block sizes smaller than 256K, because smaller block size is not good for disk space allocation code efficiency. It's a quadratic dependency: with a smaller block size, one block worth of the block allocation map covers that much less disk space, because each bit in the map covers fewer disk sectors, and fewer bits fit into a block. This means having to create a lot more block allocation map segments than what is needed for an ample level of parallelism. This hurts performance of many block allocation-related operations. I don't see a reason for /scratch and /data to be separate file systems, aside from perhaps failure domain considerations. 
yuri > On Sep 26, 2016, at 2:29 PM, Yuri L Volobuev > wrote: > > I would put the net summary this way: in GPFS, the "Goldilocks zone" > for metadata block size is 256K - 1M. If one plans to create a new > file system using GPFS V4.2+, 1M is a sound choice. > > In an ideal world, block size choice shouldn't really be a choice. > It's a low-level implementation detail that one day should go the > way of the manual ignition timing adjustment -- something that used > to be necessary in the olden days, and something that select > enthusiasts like to tweak to this day, but something that's > irrelevant for the overwhelming majority of the folks who just want > the engine to run. There's work being done in that general direction > in GPFS, but we aren't there yet. > > yuri > > Stephen Ulmer ---09/26/2016 12:02:25 PM---Now I?ve got > anther question? which I?ll let bake for a while. Okay, to (poorly) summarize: > > From: Stephen Ulmer > > To: gpfsug main discussion list >, > Date: 09/26/2016 12:02 PM > Subject: Re: [gpfsug-discuss] Blocksize > Sent by: gpfsug-discuss-bounces at spectrumscale.org > > > > Now I?ve got anther question? which I?ll let bake for a while. > > Okay, to (poorly) summarize: > There are items OTHER THAN INODES stored as metadata in GPFS. > These items have a VARIETY OF SIZES, but are packed in such a way > that we should just not worry about wasted space unless we pick a > LARGE metadata block size ? or if we don?t pick a ?reasonable? > metadata block size after picking a ?large? file system block size > that applies to both. > Performance is hard, and the gain from calculating exactly the best > metadata block size is much smaller than performance gains attained > through code optimization. > If we were to try and calculate the appropriate metadata block size > we would likely be wrong anyway, since none of us get our data at > the idealized physics shop that sells massless rulers and > frictionless pulleys. > We should probably all use a metadata block size around 1MB. Nobody > has said this outright, but it?s been the example as the ?good? size > at least three times in this thread. > Under no circumstances should we do what many of us would have done > and pick 128K, which made sense based on all of our previous > education that is no longer applicable. > > Did I miss anything? :) > > Liberty, > > -- > Stephen > > On Sep 26, 2016, at 2:18 PM, Yuri L Volobuev > wrote: > It's important to understand the differences between different > metadata types, in particular where it comes to space allocation. > > System metadata files (inode file, inode and block allocation maps, > ACL file, fileset metadata file, EA file in older versions) are > allocated at well-defined moments (file system format, new storage > pool creation in the case of block allocation map, etc), and those > contain multiple records packed into a single block. From the block > allocator point of view, the individual metadata record size is > invisible, only larger blocks get actually allocated, and space > usage efficiency generally isn't an issue. > > For user metadata (indirect blocks, directory blocks, EA overflow > blocks) the situation is different. Those get allocated as the need > arises, generally one at a time. So the size of an individual > metadata structure matters, a lot. The smallest unit of allocation > in GPFS is a subblock (1/32nd of a block). If an IB or a directory > block is smaller than a subblock, the unused space in the subblock > is wasted. 
So if one chooses to use, say, 16 MiB block size for > metadata, the smallest unit of space that can be allocated is 512 > KiB. If one chooses 1 MiB block size, the smallest allocation unit > is 32 KiB. IBs are generally 16 KiB or 32 KiB in size (32 KiB with > any reasonable data block size); directory blocks used to be limited > to 32 KiB, but in the current code can be as large as 256 KiB. As > one can observe, using 16 MiB metadata block size would lead to a > considerable amount of wasted space for IBs and large directories > (small directories can live in inodes). On the other hand, with 1 > MiB block size, there'll be no wasted metadata space. Does any of > this actually make a practical difference? That depends on the file > system composition, namely the number of IBs (which is a function of > the number of large files) and larger directories. Calculating this > scientifically can be pretty involved, and really should be the job > of a tool that ought to exist, but doesn't (yet). A more practical > approach is doing a ballpark estimate using local file counts and > typical fractions of large files and directories, using statistics > available from published papers. > > The performance implications of a given metadata block size choice > is a subject of nearly infinite depth, and this question ultimately > can only be answered by doing experiments with a specific workload > on specific hardware. The metadata space utilization efficiency is > something that can be answered conclusively though. > > yuri > > "Buterbaugh, Kevin L" ---09/24/2016 07:19:09 AM---Hi > Sven, I am confused by your statement that the metadata block size > should be 1 MB and am very int > > From: "Buterbaugh, Kevin L" > > To: gpfsug main discussion list >, > Date: 09/24/2016 07:19 AM > Subject: Re: [gpfsug-discuss] Blocksize > Sent by: gpfsug-discuss-bounces at spectrumscale.org > > > > > Hi Sven, > > I am confused by your statement that the metadata block size should > be 1 MB and am very interested in learning the rationale behind this > as I am currently looking at all aspects of our current GPFS > configuration and the possibility of making major changes. > > If you have a filesystem with only metadataOnly disks in the system > pool and the default size of an inode is 4K (which we would do, > since we have recently discovered that even on our scratch > filesystem we have a bazillion files that are 4K or smaller and > could therefore have their data stored in the inode, right?), then > why would you set the metadata block size to anything larger than > 128K when a sub-block is 1/32nd of a block? I.e., with a 1 MB block > size for metadata wouldn?t you be wasting a massive amount of space? > > What am I missing / confused about there? > > Oh, and here?s a related question ? let?s just say I have the above > configuration ? my system pool is metadata only and is on SSD?s. > Then I have two other dataOnly pools that are spinning disk. One is > for ?regular? access and the other is the ?capacity? pool ? i.e. a > pool of slower storage where we move files with large access times. > I have a policy that says something like ?move all files with an > access time > 6 months to the capacity pool.? Of those bazillion > files less than 4K in size that are fitting in the inode currently, > probably half a bazillion () of them would be subject to that > rule. Will they get moved to the spinning disk capacity pool or will > they stay in the inode?? > > Thanks! 
This is a very timely and interesting discussion for me as well... > > Kevin > On Sep 23, 2016, at 4:35 PM, Sven Oehme > wrote: > your metadata block size these days should be 1 MB and there are > only very few workloads for which you should run with a filesystem > blocksize below 1 MB. so if you don't know exactly what to pick, 1 > MB is a good starting point. > the general rule still applies that your filesystem blocksize > (metadata or data pool) should match your raid controller (or GNR > vdisk) stripe size of the particular pool. > > so if you use a 128k strip size(defaut in many midrange storage > controllers) in a 8+2p raid array, your stripe or track size is 1 MB > and therefore the blocksize of this pool should be 1 MB. i see many > customers in the field using 1MB or even smaller blocksize on RAID > stripes of 2 MB or above and your performance will be significant > impacted by that. > > Sven > > ------------------------------------------ > Sven Oehme > Scalable Storage Research > email: oehmes at us.ibm.com > Phone: +1 (408) 824-8904 > IBM Almaden Research Lab > ------------------------------------------ > > Stephen Ulmer ---09/23/2016 12:16:34 PM---Not to be too > pedantic, but I believe the the subblock size is 1/32 of the block > size (which strengt > > From: Stephen Ulmer > > To: gpfsug main discussion list > > Date: 09/23/2016 12:16 PM > Subject: Re: [gpfsug-discuss] Blocksize > Sent by: gpfsug-discuss-bounces at spectrumscale.org > > > > > Not to be too pedantic, but I believe the the subblock size is 1/32 > of the block size (which strengthens Luis?s arguments below). > > I thought the the original question was NOT about inode size, but > about metadata block size. You can specify that the system pool have > a different block size from the rest of the filesystem, providing > that it ONLY holds metadata (?metadata-block-size option to mmcrfs). > > So with 4K inodes (which should be used for all new filesystems > without some counter-indication), I would think that we?d want to > use a metadata block size of 4K*32=128K. This is independent of the > regular block size, which you can calculate based on the workload if > you?re lucky. > > There could be a great reason NOT to use 128K metadata block size, > but I don?t know what it is. I?d be happy to be corrected about this > if it?s out of whack. > > -- > Stephen > On Sep 22, 2016, at 3:37 PM, Luis Bolinches > wrote: > > Hi > > My 2 cents. > > Leave at least 4K inodes, then you get massive improvement on small > files (less 3.5K minus whatever you use on xattr) > > About blocksize for data, unless you have actual data that suggest > that you will actually benefit from smaller than 1MB block, leave > there. GPFS uses sublocks where 1/16th of the BS can be allocated to > different files, so the "waste" is much less than you think on 1MB > and you get the throughput and less structures of much more data blocks. > > No warranty at all but I try to do this when the BS talk comes in: > (might need some clean up it could not be last note but you get the idea) > > POSIX > find . 
-type f -name '*' -exec ls -l {} \; > find_ls_files.out > GPFS > cd /usr/lpp/mmfs/samples/ilm > gcc mmfindUtil_processOutputFile.c -o mmfindUtil_processOutputFile > ./mmfind /gpfs/shared -ls -type f > find_ls_files.out > CONVERT to CSV > > POSIX > cat find_ls_files.out | awk '{print $5","}' > find_ls_files.out.csv > GPFS > cat find_ls_files.out | awk '{print $7","}' > find_ls_files.out.csv > LOAD in octave > > FILESIZE = int32 (dlmread ("find_ls_files.out.csv", ",")); > Clean the second column (OPTIONAL as the next clean up will do the same) > > FILESIZE(:,[2]) = []; > If we are on 4K aligment we need to clean the files that go to > inodes (WELL not exactly ... extended attributes! so maybe use a > lower number!) > > FILESIZE(FILESIZE<=3584) =[]; > If we are not we need to clean the 0 size files > > FILESIZE(FILESIZE==0) =[]; > Median > > FILESIZEMEDIAN = int32 (median (FILESIZE)) > Mean > > FILESIZEMEAN = int32 (mean (FILESIZE)) > Variance > > int32 (var (FILESIZE)) > iqr interquartile range, i.e., the difference between the upper and > lower quartile, of the input data. > > int32 (iqr (FILESIZE)) > Standard deviation > > > For some FS with lots of files you might need a rather powerful > machine to run the calculations on octave, I never hit anything > could not manage on a 64GB RAM Power box. Most of the times it is > enough with my laptop. > > > > -- > Yst?v?llisin terveisin / Kind regards / Saludos cordiales / Salutations > > Luis Bolinches > Lab Services > http://www-03.ibm.com/systems/services/labservices/ > > IBM Laajalahdentie 23 (main Entrance) Helsinki, 00330 Finland > Phone: +358 503112585 > > "If you continually give you will continually have." Anonymous > > > ----- Original message ----- > From: Stef Coene > > Sent by: gpfsug-discuss-bounces at spectrumscale.org > To: gpfsug main discussion list > > Cc: > Subject: Re: [gpfsug-discuss] Blocksize > Date: Thu, Sep 22, 2016 10:30 PM > > On 09/22/2016 09:07 PM, J. Eric Wonderley wrote: > > It defaults to 4k: > > mmlsfs testbs8M -i > > flag value description > > ------------------- ------------------------ > > ----------------------------------- > > -i 4096 Inode size in bytes > > > > I think you can make as small as 512b. Gpfs will store very small > > files in the inode. > > > > Typically you want your average file size to be your blocksize and your > > filesystem has one blocksize and one inodesize. > > The files are not small, but around 20 MB on average. > So I calculated with IBM that a 1 MB or 2 MB block size is best. > > But I'm not sure if it's better to use a smaller block size for the > metadata. > > The file system is not that large (400 TB) and will hold backup data > from CommVault. > > > Stef > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > > > Ellei edell? 
ole toisin mainittu: / Unless stated otherwise above: > Oy IBM Finland Ab > PL 265, 00101 Helsinki, Finland > Business ID, Y-tunnus: 0195876-3 > Registered in Finland > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss -------------- next part -------------- An HTML attachment was scrubbed... URL: From Greg.Lehmann at csiro.au Wed Sep 28 23:54:36 2016 From: Greg.Lehmann at csiro.au (Greg.Lehmann at csiro.au) Date: Wed, 28 Sep 2016 22:54:36 +0000 Subject: [gpfsug-discuss] Blocksize In-Reply-To: References: <17781503-26B3-4448-B7B9-1EE27ABE6D1F@ulmer.org><13272A1B-425D-4DDD-A931-490604F92D61@ulmer.org><6C51B04A-9097-4598-8B4A-C484A3D98EE2@vanderbilt.edu> <4BDC0E5A-176A-48CC-8DE6-93C7B5A3F138@vanderbilt.edu> Message-ID: Are there any presentation available online that provide diagrams of the directory/file creation process and modifications in terms of how the blocks/inodes and indirect blocks etc are used. I would guess there are a few different cases that would need to be shown. This is the sort of thing that would great in a decent text book on GPFS (doesn't exist as far as I am aware.) Cheers, Greg From: gpfsug-discuss-bounces at spectrumscale.org [mailto:gpfsug-discuss-bounces at spectrumscale.org] On Behalf Of Marc A Kaplan Sent: Thursday, 29 September 2016 1:23 AM To: gpfsug main discussion list Subject: Re: [gpfsug-discuss] Blocksize OKAY, I'll say it again. inodes are PACKED into a single inode file. So a 4KB inode takes 4KB, REGARDLESS of metadata blocksize. There is no wasted space. (Of course if you have metadata replication = 2, then yes, double that. And yes, there overhead for indirect blocks (indices), allocation maps, etc, etc.) And your choice is not just 512 or 4096. Maybe 1KB or 2KB is a good choice for your data distribution, to optimize packing of data and/or directories into inodes... Hmmm... I don't know why the doc leaves out 2048, perhaps a typo... 
mmcrfs x2K -i 2048 [root at n2 charts]# mmlsfs x2K -i flag value description ------------------- ------------------------ ----------------------------------- -i 2048 Inode size in bytes Works for me! -------------- next part -------------- An HTML attachment was scrubbed... URL: From xhejtman at ics.muni.cz Wed Sep 28 23:58:15 2016 From: xhejtman at ics.muni.cz (Lukas Hejtmanek) Date: Thu, 29 Sep 2016 00:58:15 +0200 Subject: [gpfsug-discuss] Samba via CES In-Reply-To: References: <20160928220336.d5vdwp7fejbj2bzf@ics.muni.cz> <20160927193815.kpppiy76wpudg6cj@ics.muni.cz> <20160927214257.z4ezmssnpwhmm4rk@ics.muni.cz> Message-ID: <20160928225815.rpzdzjevro37ur7b@ics.muni.cz> On Wed, Sep 28, 2016 at 10:25:01PM +0000, Andrew Beattie wrote: > In that scenario, would you not be better off using a native Spectrum > Scale client installed on the workstation that the video editor is using > with a local mapped drive, rather than a SMB share? > ? > This would prevent this the scenario you have proposed occurring. indeed, it would be better, but why one would have CES at all? I would like to use CES but it seems that it is not quite ready yet for such a scenario. -- Luk?? Hejtm?nek From christof.schmitt at us.ibm.com Thu Sep 29 00:06:59 2016 From: christof.schmitt at us.ibm.com (Christof Schmitt) Date: Wed, 28 Sep 2016 16:06:59 -0700 Subject: [gpfsug-discuss] Samba via CES In-Reply-To: <20160928220336.d5vdwp7fejbj2bzf@ics.muni.cz> References: <20160927193815.kpppiy76wpudg6cj@ics.muni.cz><20160927214257.z4ezmssnpwhmm4rk@ics.muni.cz> <20160928220336.d5vdwp7fejbj2bzf@ics.muni.cz> Message-ID: The exact behavior depends on the client and the application. I would suggest explicit testing of the protocol failover if that is a concern. Samba does not support persistent handles, so that would be a completely new feature. There is some support available for durable handles which have weaker guarantees, and which are also disabled in CES Samba due to known issues in large deployments. In cases where SMB protocol failover becomes an issue and durable handles might help, that might be an approach to improve the failover behavior. Christof Schmitt || IBM || Spectrum Scale Development || Tucson, AZ christof.schmitt at us.ibm.com || +1-520-799-2469 (T/L: 321-2469) From: Lukas Hejtmanek To: gpfsug main discussion list Date: 09/28/2016 03:04 PM Subject: Re: [gpfsug-discuss] Samba via CES Sent by: gpfsug-discuss-bounces at spectrumscale.org On Wed, Sep 28, 2016 at 01:33:45PM -0700, Christof Schmitt wrote: > The client has to reconnect, open the file again and reissue request that > have not been completed. Without persistent handles, the main risk is that > another client can step in and access the same file in the meantime. With > persistent handles, access from other clients would be prevented for a > defined amount of time. well, I guess I cannot reconfigure the client so that reissuing request is done by OS and not rised up to the user? E.g., if user runs video encoding directly to Samba share and encoding runs for several hours, reissuing request, i.e., restart encoding, is not exactly what user accepts. -- Luk?? 
Hejtm?nek _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss From abeattie at au1.ibm.com Thu Sep 29 00:37:25 2016 From: abeattie at au1.ibm.com (Andrew Beattie) Date: Wed, 28 Sep 2016 23:37:25 +0000 Subject: [gpfsug-discuss] Samba via CES In-Reply-To: <20160928225815.rpzdzjevro37ur7b@ics.muni.cz> References: <20160928225815.rpzdzjevro37ur7b@ics.muni.cz>, <20160928220336.d5vdwp7fejbj2bzf@ics.muni.cz><20160927193815.kpppiy76wpudg6cj@ics.muni.cz><20160927214257.z4ezmssnpwhmm4rk@ics.muni.cz> Message-ID: An HTML attachment was scrubbed... URL: From aaron.knister at gmail.com Thu Sep 29 02:43:52 2016 From: aaron.knister at gmail.com (Aaron Knister) Date: Wed, 28 Sep 2016 21:43:52 -0400 Subject: [gpfsug-discuss] gpfs native raid In-Reply-To: References: <96282850-6bfa-73ae-8502-9e8df3a56390@nasa.gov> Message-ID: Thanks Everyone for your replies! (Quick disclaimer, these opinions are my own, and not those of my employer or NASA). Not knowing what's coming at the NDA session, it seems to boil down to "it ain't gonna happen" because of: - Perceived difficulty in supporting whatever creative hardware solutions customers may throw at it. I understand the support concerns, but I naively thought that assuming the hardware meets a basic set of requirements (e.g. redundant sas paths, x type of drives) it would be fairly supportable with GNR. The DS3700 shelves are re-branded NetApp E-series shelves and pretty vanilla I thought. - IBM would like to monetize the product and compete with the likes of DDN/Seagate This is admittedly a little disappointing. GPFS as long as I've known it has been largely hardware vendor agnostic. To see even a slight shift towards hardware vendor lockin and certain features only being supported and available on IBM hardware is concerning. It's not like the software itself is free. Perhaps GNR could be a paid add-on license for non-IBM hardware? Just thinking out-loud. The big things I was looking to GNR for are: - end-to-end checksums - implementing a software RAID layer on (in my case enterprise class) JBODs I can find a way to do the second thing, but the former I cannot. Requiring IBM hardware to get end-to-end checksums is a huge red flag for me. That's something Lustre will do today with ZFS on any hardware ZFS will run on (and for free, I might add). I would think GNR being openly available to customers would be important for GPFS to compete with Lustre. Furthermore, I had opened an RFE (#84523) a while back to implement checksumming of data for non-GNR environments. The RFE was declined because essentially it would be too hard and it already exists for GNR. Well, considering I don't have a GNR environment, and hardware vendor lock in is something many sites are not interested in, that's somewhat of a problem. I really hope IBM reconsiders their stance on opening up GNR. The current direction, while somewhat understandable, leaves a really bad taste in my mouth and is one of the (very few, in my opinion) features Lustre has over GPFS. -Aaron On 9/1/16 9:59 AM, Marc A Kaplan wrote: > I've been told that it is a big leap to go from supporting GSS and ESS > to allowing and supporting native raid for customers who may throw > together "any" combination of hardware they might choose. > > In particular the GNR "disk hospital" functions... 
> https://www.ibm.com/support/knowledgecenter/SSFKCN_3.5.0/com.ibm.cluster.gpfs.v3r5.gpfs200.doc/bl1adv_introdiskhospital.htm > will be tricky to support on umpteen different vendor boxes -- and keep > in mind, those will be from IBM competitors! > > That said, ESS and GSS show that IBM has some good tech in this area and > IBM has shown with the Spectrum Scale product (sans GNR) it can support > just about any semi-reasonable hardware configuration and a good slew of > OS versions and architectures... Heck I have a demo/test version of GPFS > running on a 5 year old Thinkpad laptop.... And we have some GSSs in the > lab... Not to mention Power hardware and mainframe System Z (think 360, > 370, 290, Z) > > > > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > From oehmes at us.ibm.com Thu Sep 29 03:28:03 2016 From: oehmes at us.ibm.com (Sven Oehme) Date: Wed, 28 Sep 2016 19:28:03 -0700 Subject: [gpfsug-discuss] gpfs native raid In-Reply-To: References: <96282850-6bfa-73ae-8502-9e8df3a56390@nasa.gov> Message-ID: Hi Aaron, the best way to express this 'need' is to vote and leave comments in the RFE's : this is an RFE for GNR as SW : http://www.ibm.com/developerworks/rfe/execute?use_case=viewRfe&CR_ID=95090 everybody who wants this to be one should vote for it and leave comments on what they expect. Sven From: Aaron Knister To: gpfsug-discuss at spectrumscale.org Date: 09/28/2016 06:44 PM Subject: Re: [gpfsug-discuss] gpfs native raid Sent by: gpfsug-discuss-bounces at spectrumscale.org Thanks Everyone for your replies! (Quick disclaimer, these opinions are my own, and not those of my employer or NASA). Not knowing what's coming at the NDA session, it seems to boil down to "it ain't gonna happen" because of: - Perceived difficulty in supporting whatever creative hardware solutions customers may throw at it. I understand the support concerns, but I naively thought that assuming the hardware meets a basic set of requirements (e.g. redundant sas paths, x type of drives) it would be fairly supportable with GNR. The DS3700 shelves are re-branded NetApp E-series shelves and pretty vanilla I thought. - IBM would like to monetize the product and compete with the likes of DDN/Seagate This is admittedly a little disappointing. GPFS as long as I've known it has been largely hardware vendor agnostic. To see even a slight shift towards hardware vendor lockin and certain features only being supported and available on IBM hardware is concerning. It's not like the software itself is free. Perhaps GNR could be a paid add-on license for non-IBM hardware? Just thinking out-loud. The big things I was looking to GNR for are: - end-to-end checksums - implementing a software RAID layer on (in my case enterprise class) JBODs I can find a way to do the second thing, but the former I cannot. Requiring IBM hardware to get end-to-end checksums is a huge red flag for me. That's something Lustre will do today with ZFS on any hardware ZFS will run on (and for free, I might add). I would think GNR being openly available to customers would be important for GPFS to compete with Lustre. Furthermore, I had opened an RFE (#84523) a while back to implement checksumming of data for non-GNR environments. The RFE was declined because essentially it would be too hard and it already exists for GNR. 
Well, considering I don't have a GNR environment, and hardware vendor lock in is something many sites are not interested in, that's somewhat of a problem. I really hope IBM reconsiders their stance on opening up GNR. The current direction, while somewhat understandable, leaves a really bad taste in my mouth and is one of the (very few, in my opinion) features Lustre has over GPFS. -Aaron On 9/1/16 9:59 AM, Marc A Kaplan wrote: > I've been told that it is a big leap to go from supporting GSS and ESS > to allowing and supporting native raid for customers who may throw > together "any" combination of hardware they might choose. > > In particular the GNR "disk hospital" functions... > https://www.ibm.com/support/knowledgecenter/SSFKCN_3.5.0/com.ibm.cluster.gpfs.v3r5.gpfs200.doc/bl1adv_introdiskhospital.htm > will be tricky to support on umpteen different vendor boxes -- and keep > in mind, those will be from IBM competitors! > > That said, ESS and GSS show that IBM has some good tech in this area and > IBM has shown with the Spectrum Scale product (sans GNR) it can support > just about any semi-reasonable hardware configuration and a good slew of > OS versions and architectures... Heck I have a demo/test version of GPFS > running on a 5 year old Thinkpad laptop.... And we have some GSSs in the > lab... Not to mention Power hardware and mainframe System Z (think 360, > 370, 290, Z) > > > > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: graycol.gif Type: image/gif Size: 105 bytes Desc: not available URL: From daniel.kidger at uk.ibm.com Thu Sep 29 10:04:03 2016 From: daniel.kidger at uk.ibm.com (Daniel Kidger) Date: Thu, 29 Sep 2016 09:04:03 +0000 Subject: [gpfsug-discuss] gpfs native raid In-Reply-To: Message-ID: An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: ATT1-graycol.gif Type: image/gif Size: 105 bytes Desc: not available URL: From daniel.kidger at uk.ibm.com Thu Sep 29 10:25:59 2016 From: daniel.kidger at uk.ibm.com (Daniel Kidger) Date: Thu, 29 Sep 2016 09:25:59 +0000 Subject: [gpfsug-discuss] AFM cacheset mounting from the same GPFS cluster ? Message-ID: An HTML attachment was scrubbed... URL: From Kevin.Buterbaugh at Vanderbilt.Edu Thu Sep 29 16:03:08 2016 From: Kevin.Buterbaugh at Vanderbilt.Edu (Buterbaugh, Kevin L) Date: Thu, 29 Sep 2016 15:03:08 +0000 Subject: [gpfsug-discuss] Fwd: Blocksize References: Message-ID: <423B687F-0B03-4C6F-9F16-E05F68491D67@vanderbilt.edu> Resending from the right e-mail address... Begin forwarded message: From: gpfsug-discuss-owner at spectrumscale.org Subject: Re: [gpfsug-discuss] Blocksize Date: September 29, 2016 at 10:00:36 AM CDT To: klb at accre.vanderbilt.edu You are not allowed to post to this mailing list, and your message has been automatically rejected. If you think that your messages are being rejected in error, contact the mailing list owner at gpfsug-discuss-owner at spectrumscale.org. From: "Kevin L. 
Buterbaugh" > Subject: Re: [gpfsug-discuss] Blocksize Date: September 29, 2016 at 10:00:29 AM CDT To: gpfsug main discussion list > Hi Marc and others, I understand ? I guess I did a poor job of wording my question, so I?ll try again. The IBM recommendation for metadata block size seems to be somewhere between 256K - 1 MB, depending on who responds to the question. If I were to hypothetically use a 256K metadata block size, does the ?1/32nd of a block? come into play like it does for ?not metadata?? I.e. 256 / 32 = 8K, so am I reading / writing *2* inodes (assuming 4K inode size) minimum? And here?s a really off the wall question ? yesterday we were discussing the fact that there is now a single inode file. Historically, we have always used RAID 1 mirrors (first with spinning disk, as of last fall now on SSD) for metadata and then use GPFS replication on top of that. But given that there is a single inode file is that ?old way? of doing things still the right way? In other words, could we potentially be better off by using a couple of 8+2P RAID 6 LUNs? One potential downside of that would be that we would then only have two NSD servers serving up metadata, so we discussed the idea of taking each RAID 6 LUN and splitting it up into multiple logical volumes (all that done on the storage array, of course) and then presenting those to GPFS as NSDs??? Or have I gone from merely asking stupid questions to Trump-level craziness???? ;-) Kevin On Sep 28, 2016, at 10:23 AM, Marc A Kaplan > wrote: OKAY, I'll say it again. inodes are PACKED into a single inode file. So a 4KB inode takes 4KB, REGARDLESS of metadata blocksize. There is no wasted space. (Of course if you have metadata replication = 2, then yes, double that. And yes, there overhead for indirect blocks (indices), allocation maps, etc, etc.) And your choice is not just 512 or 4096. Maybe 1KB or 2KB is a good choice for your data distribution, to optimize packing of data and/or directories into inodes... Hmmm... I don't know why the doc leaves out 2048, perhaps a typo... mmcrfs x2K -i 2048 [root at n2 charts]# mmlsfs x2K -i flag value description ------------------- ------------------------ ----------------------------------- -i 2048 Inode size in bytes Works for me! _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss -------------- next part -------------- An HTML attachment was scrubbed... URL: From makaplan at us.ibm.com Thu Sep 29 16:32:47 2016 From: makaplan at us.ibm.com (Marc A Kaplan) Date: Thu, 29 Sep 2016 11:32:47 -0400 Subject: [gpfsug-discuss] Fwd: Blocksize In-Reply-To: <423B687F-0B03-4C6F-9F16-E05F68491D67@vanderbilt.edu> References: <423B687F-0B03-4C6F-9F16-E05F68491D67@vanderbilt.edu> Message-ID: Frankly, I just don't "get" what it is you seem not to be "getting" - perhaps someone else who does "get" it can rephrase: FORGET about Subblocks when thinking about inodes being packed into the file of all inodes. Additional facts that may address some of the other concerns: I started working on GPFS at version 3.1 or so. AFAIK GPFS always had and has one file of inodes, "packed", with no wasted space between inodes. Period. Full Stop. RAID! Now we come to a mistake that I've seen made by more than a handful of customers! It is generally a mistake to use RAID with parity (such as classic RAID5) to store metadata. Why? 
Because metadata is often updated with "small writes" - for example suppose we have to update some fields in an inode, or an indirect block, or append a log record... For RAID with parity and large stripe sizes -- this means that updating just one disk sector can cost a full stripe read + writing the changed data and parity sectors. SO, if you want protection against storage failures for your metadata, use either RAID mirroring/replication and/or GPFS metadata replication. (belt and/or suspenders) (Arguments against relying solely on RAID mirroring: single enclosure/box failure (fire!), single hardware design (bugs or defects), single firmware/microcode(bugs.)) Yes, GPFS is part of "the cyber." We're making it stronger everyday. But it already is great. --marc From: "Buterbaugh, Kevin L" To: gpfsug main discussion list Date: 09/29/2016 11:03 AM Subject: [gpfsug-discuss] Fwd: Blocksize Sent by: gpfsug-discuss-bounces at spectrumscale.org Resending from the right e-mail address... Begin forwarded message: From: gpfsug-discuss-owner at spectrumscale.org Subject: Re: [gpfsug-discuss] Blocksize Date: September 29, 2016 at 10:00:36 AM CDT To: klb at accre.vanderbilt.edu You are not allowed to post to this mailing list, and your message has been automatically rejected. If you think that your messages are being rejected in error, contact the mailing list owner at gpfsug-discuss-owner at spectrumscale.org. From: "Kevin L. Buterbaugh" Subject: Re: [gpfsug-discuss] Blocksize Date: September 29, 2016 at 10:00:29 AM CDT To: gpfsug main discussion list Hi Marc and others, I understand ? I guess I did a poor job of wording my question, so I?ll try again. The IBM recommendation for metadata block size seems to be somewhere between 256K - 1 MB, depending on who responds to the question. If I were to hypothetically use a 256K metadata block size, does the ?1/32nd of a block? come into play like it does for ?not metadata?? I.e. 256 / 32 = 8K, so am I reading / writing *2* inodes (assuming 4K inode size) minimum? And here?s a really off the wall question ? yesterday we were discussing the fact that there is now a single inode file. Historically, we have always used RAID 1 mirrors (first with spinning disk, as of last fall now on SSD) for metadata and then use GPFS replication on top of that. But given that there is a single inode file is that ?old way? of doing things still the right way? In other words, could we potentially be better off by using a couple of 8+2P RAID 6 LUNs? One potential downside of that would be that we would then only have two NSD servers serving up metadata, so we discussed the idea of taking each RAID 6 LUN and splitting it up into multiple logical volumes (all that done on the storage array, of course) and then presenting those to GPFS as NSDs??? Or have I gone from merely asking stupid questions to Trump-level craziness???? ;-) Kevin On Sep 28, 2016, at 10:23 AM, Marc A Kaplan wrote: OKAY, I'll say it again. inodes are PACKED into a single inode file. So a 4KB inode takes 4KB, REGARDLESS of metadata blocksize. There is no wasted space. (Of course if you have metadata replication = 2, then yes, double that. And yes, there overhead for indirect blocks (indices), allocation maps, etc, etc.) And your choice is not just 512 or 4096. Maybe 1KB or 2KB is a good choice for your data distribution, to optimize packing of data and/or directories into inodes... Hmmm... I don't know why the doc leaves out 2048, perhaps a typo... 
mmcrfs x2K -i 2048 [root at n2 charts]# mmlsfs x2K -i flag value description ------------------- ------------------------ ----------------------------------- -i 2048 Inode size in bytes Works for me! _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss -------------- next part -------------- An HTML attachment was scrubbed... URL: From olaf.weiser at de.ibm.com Thu Sep 29 16:38:56 2016 From: olaf.weiser at de.ibm.com (Olaf Weiser) Date: Thu, 29 Sep 2016 17:38:56 +0200 Subject: [gpfsug-discuss] Fwd: Blocksize In-Reply-To: <423B687F-0B03-4C6F-9F16-E05F68491D67@vanderbilt.edu> References: <423B687F-0B03-4C6F-9F16-E05F68491D67@vanderbilt.edu> Message-ID: An HTML attachment was scrubbed... URL: From volobuev at us.ibm.com Thu Sep 29 19:00:40 2016 From: volobuev at us.ibm.com (Yuri L Volobuev) Date: Thu, 29 Sep 2016 11:00:40 -0700 Subject: [gpfsug-discuss] Fwd: Blocksize In-Reply-To: <423B687F-0B03-4C6F-9F16-E05F68491D67@vanderbilt.edu> References: <423B687F-0B03-4C6F-9F16-E05F68491D67@vanderbilt.edu> Message-ID: > to the question. If I were to hypothetically use a 256K metadata > block size, does the ?1/32nd of a block? come into play like it does > for ?not metadata?? I.e. 256 / 32 = 8K, so am I reading / writing > *2* inodes (assuming 4K inode size) minimum? I think the point of confusion here is minimum allocation size vs minimum IO size -- those two are not one and the same. In fact in GPFS those are largely unrelated values. For low-level metadata files where multiple records are packed into the same block, it is possible to read/write either an individual record (such as an inode), or an entire block of records (which is what happens, for example, during inode copy-on-write). The minimum IO size in GPFS is 512 bytes. On a "4K-aligned" file system, GPFS vows to only do IOs in multiples of 4KiB. For data, GPFS tracks what portion of a given block is valid/dirty using an in-memory bitmap, and if 4K in the middle of a 16M block are modified, only 4K get written, not 16M (although this is more complicated for sparse file writes and appends, when some areas need to be zeroed out). For metadata writes, entire metadata objects are written, using the actual object size, rounded up to the nearest 512B or 4K boundary, as needed. So a single modified inode results in a single inode write, regardless of the metadata block size. If you have snapshots, and the inode being modified needs to be copied to the previous snapshot, and happens to be the first inode in the block that needs a COW, an entire block of inodes is copied to the latest snapshot, as an optimization. > And here?s a really off the wall question ? yesterday we were > discussing the fact that there is now a single inode file. > Historically, we have always used RAID 1 mirrors (first with > spinning disk, as of last fall now on SSD) for metadata and then use > GPFS replication on top of that. But given that there is a single > inode file is that ?old way? of doing things still the right way? > In other words, could we potentially be better off by using a couple > of 8+2P RAID 6 LUNs? The old way is also the modern way in this case. Using RAID1 LUNs for GPFS metadata is still the right approach. 
You don't want to use RAID erasure codes that trigger read-modify-write for small IOs, which are typical for metadata (unless your RAID array has so much cache as to make RMW a moot point). > One potential downside of that would be that we would then only have > two NSD servers serving up metadata, so we discussed the idea of > taking each RAID 6 LUN and splitting it up into multiple logical > volumes (all that done on the storage array, of course) and then > presenting those to GPFS as NSDs??? Like most performance questions, this one can ultimately only be answered definitively by running tests, but offhand I would suspect that the performance impact of RAID6, combined with extra contention for physical disks, is going to more than offset the benefits of using more NSD servers. Keep in mind that you aren't limited to 2 NSD servers per LUN. If you actually have the connectivity for more than 2 nodes on your RAID controller, GPFS allows up to 8 simultaneously active NSD servers per NSD. yuri > On Sep 28, 2016, at 10:23 AM, Marc A Kaplan wrote: > > OKAY, I'll say it again. inodes are PACKED into a single inode > file. So a 4KB inode takes 4KB, REGARDLESS of metadata blocksize. > There is no wasted space. > > (Of course if you have metadata replication = 2, then yes, double > that. And yes, there overhead for indirect blocks (indices), > allocation maps, etc, etc.) > > And your choice is not just 512 or 4096. Maybe 1KB or 2KB is a good > choice for your data distribution, to optimize packing of data and/ > or directories into inodes... > > Hmmm... I don't know why the doc leaves out 2048, perhaps a typo... > > mmcrfs x2K -i 2048 > > [root at n2 charts]# mmlsfs x2K -i > flag value description > ------------------- ------------------------ > ----------------------------------- > -i 2048 Inode size in bytes > > Works for me! > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss -------------- next part -------------- An HTML attachment was scrubbed... URL: From volobuev at us.ibm.com Fri Sep 30 06:43:53 2016 From: volobuev at us.ibm.com (Yuri L Volobuev) Date: Thu, 29 Sep 2016 22:43:53 -0700 Subject: [gpfsug-discuss] gpfs native raid In-Reply-To: References: <96282850-6bfa-73ae-8502-9e8df3a56390@nasa.gov> Message-ID: The issue of "GNR as software" is a pretty convoluted mixture of technical, business, and resource constraints issues. While some of the technical issues can be discussed here, obviously the other considerations cannot be discussed in a public forum. So you won't be able to get a complete understanding of the situation by discussing it here. > I understand the support concerns, but I naively thought that assuming > the hardware meets a basic set of requirements (e.g. redundant sas > paths, x type of drives) it would be fairly supportable with GNR. The > DS3700 shelves are re-branded NetApp E-series shelves and pretty vanilla > I thought. Setting business issues aside, this is more complicated on the technical level than one may think. At present, GNR requires a set of twin-tailed external disk enclosures. This is not a particularly exotic kind of hardware, but it turns out that this corner of the storage world is quite insular. 
GNR has a very close relationship with physical disk devices, much more so than regular GPFS. In an ideal world, SCSI and SES standards are supposed to provide a framework which would allow software like GNR to operate on an arbitrary disk enclosure. In the real world, the actual SES implementations on various enclosures that we've been dealing with are, well, peculiar. Apparently SES is one of those standards where vendors feel a lot of freedom in "re-interpreting" the standard, and since typically enclosures talk to a small set of RAID controllers, there aren't bad enough consequences to force vendors to be religious about SES standard compliance. Furthermore, the SAS fabric topology in configurations with an external disk enclosures is surprisingly complex, and that complexity predictably leads to complex failures which don't exist in simpler configurations. Thus far, every single one of the five enclosures we've had a chance to run GNR on required some adjustments, workarounds, hacks, etc. And the consequences of a misbehaving SAS fabric can be quite dire. There are various approaches to dealing with those complications, from running a massive 3rd party hardware qualification program to basically declaring any complications from an unknown enclosure to be someone else's problem (how would ZFS deal with a SCSI reset storm due to a bad SAS expander?), but there's much debate on what is the right path to take. Customer input/feedback is obviously very valuable in tilting such discussions in the right direction. yuri From: Aaron Knister To: gpfsug-discuss at spectrumscale.org, Date: 09/28/2016 06:44 PM Subject: Re: [gpfsug-discuss] gpfs native raid Sent by: gpfsug-discuss-bounces at spectrumscale.org Thanks Everyone for your replies! (Quick disclaimer, these opinions are my own, and not those of my employer or NASA). Not knowing what's coming at the NDA session, it seems to boil down to "it ain't gonna happen" because of: - Perceived difficulty in supporting whatever creative hardware solutions customers may throw at it. I understand the support concerns, but I naively thought that assuming the hardware meets a basic set of requirements (e.g. redundant sas paths, x type of drives) it would be fairly supportable with GNR. The DS3700 shelves are re-branded NetApp E-series shelves and pretty vanilla I thought. - IBM would like to monetize the product and compete with the likes of DDN/Seagate This is admittedly a little disappointing. GPFS as long as I've known it has been largely hardware vendor agnostic. To see even a slight shift towards hardware vendor lockin and certain features only being supported and available on IBM hardware is concerning. It's not like the software itself is free. Perhaps GNR could be a paid add-on license for non-IBM hardware? Just thinking out-loud. The big things I was looking to GNR for are: - end-to-end checksums - implementing a software RAID layer on (in my case enterprise class) JBODs I can find a way to do the second thing, but the former I cannot. Requiring IBM hardware to get end-to-end checksums is a huge red flag for me. That's something Lustre will do today with ZFS on any hardware ZFS will run on (and for free, I might add). I would think GNR being openly available to customers would be important for GPFS to compete with Lustre. Furthermore, I had opened an RFE (#84523) a while back to implement checksumming of data for non-GNR environments. The RFE was declined because essentially it would be too hard and it already exists for GNR. 
Well, considering I don't have a GNR environment, and hardware vendor lock in is something many sites are not interested in, that's somewhat of a problem. I really hope IBM reconsiders their stance on opening up GNR. The current direction, while somewhat understandable, leaves a really bad taste in my mouth and is one of the (very few, in my opinion) features Lustre has over GPFS. -Aaron On 9/1/16 9:59 AM, Marc A Kaplan wrote: > I've been told that it is a big leap to go from supporting GSS and ESS > to allowing and supporting native raid for customers who may throw > together "any" combination of hardware they might choose. > > In particular the GNR "disk hospital" functions... > https://www.ibm.com/support/knowledgecenter/SSFKCN_3.5.0/com.ibm.cluster.gpfs.v3r5.gpfs200.doc/bl1adv_introdiskhospital.htm > will be tricky to support on umpteen different vendor boxes -- and keep > in mind, those will be from IBM competitors! > > That said, ESS and GSS show that IBM has some good tech in this area and > IBM has shown with the Spectrum Scale product (sans GNR) it can support > just about any semi-reasonable hardware configuration and a good slew of > OS versions and architectures... Heck I have a demo/test version of GPFS > running on a 5 year old Thinkpad laptop.... And we have some GSSs in the > lab... Not to mention Power hardware and mainframe System Z (think 360, > 370, 290, Z) > > > > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: graycol.gif Type: image/gif Size: 105 bytes Desc: not available URL: From stef.coene at docum.org Fri Sep 30 14:03:01 2016 From: stef.coene at docum.org (Stef Coene) Date: Fri, 30 Sep 2016 15:03:01 +0200 Subject: [gpfsug-discuss] Toolkit Message-ID: Hi, When using the toolkit, all config data is stored in clusterdefinition.txt When you modify the cluster with mm* commands, the toolkit is unaware of these changes. Is it possible to recreate the clusterdefinition.txt based on the current configuration? Stef From matthew at ellexus.com Fri Sep 30 16:30:11 2016 From: matthew at ellexus.com (Matthew Harris) Date: Fri, 30 Sep 2016 16:30:11 +0100 Subject: [gpfsug-discuss] Introduction from Ellexus Message-ID: Hello everyone, Ellexus is the IO profiling company with tools for load balancing shared storage, solving IO performance issues and detecting rogue jobs that have bad IO patterns. We have a good number of customers who use Spectrum Scale so we do a lot of work to support it. We have clients and partners working across the HPC space including semiconductor, life sciences, oil and gas, automotive and finance. We're based in Cambridge, England. Some of you will have already met our CEO, Rosemary Francis. Looking forward to meeting you at SC16. Matthew Harris Account Manager & Business Development - Ellexus Ltd *www.ellexus.com * *Ellexus Ltd is a limited company registered in England & Wales* *Company registration no. 
07166034* *Registered address: 198 High Street, Tonbridge, Kent TN9 1BE, UK* *Operating address: St John's Innovation Centre, Cowley Road, Cambridge CB4 0WS, UK* -------------- next part -------------- An HTML attachment was scrubbed... URL: From aaron.s.knister at nasa.gov Fri Sep 30 21:56:29 2016 From: aaron.s.knister at nasa.gov (Aaron Knister) Date: Fri, 30 Sep 2016 16:56:29 -0400 Subject: [gpfsug-discuss] gpfs native raid In-Reply-To: References: <96282850-6bfa-73ae-8502-9e8df3a56390@nasa.gov> Message-ID: <2f59d32a-fc0f-3f03-dd95-3465611dc841@nasa.gov> Thanks, Yuri. Your replies are always quite enjoyable to read. I didn't realize SES was such a loosely interpreted standard, I just assumed it was fairly straightforward. We've got a number of JBODs here we manage via SES using the linux enclosure module (e.g. /sys/class/enclosure) and they seem to "just work" but we're not doing anything terribly advanced, mostly just turning on/off various status LEDs. I should clarify, the newer SAS enclosures I've encountered seem quite good, some of the older enclosures (e.g. in particular the Xyratex enclosure used by DDN in it's S2A units) were a real treat to interact with and didn't seem to follow the SES standard in spirit. I can certainly accept the complexity argument here. I think for our purposes a "reasonable level" of support would be all we're after. I'm not sure how ZFS would deal with a SCSI reset storm, I suspect the pool would just offline itself if enough paths seemed to disappear or timeout. If I could make GPFS work well with ZFS as the underlying storage target I would be quite happy. So far I have struggled to make it performant. GPFS seems to assume once a block device accepts the write that it's committed to stable storage. With ZFS ZVOL's this isn't the case by default. Making it the case (setting the sync=always paremter) causes a *massive* degradation in performance. If GPFS were to issue sync commands at appropriate intervals then I think we could make this work well. I'm not sure how to go about this, though, and given frequent enough scsi sync commands to a given lun its performance would likely degrade to the current state of zfs with sync=always. At any rate, we'll see how things go. Thanks again. -Aaron On 9/30/16 1:43 AM, Yuri L Volobuev wrote: > The issue of "GNR as software" is a pretty convoluted mixture of > technical, business, and resource constraints issues. While some of the > technical issues can be discussed here, obviously the other > considerations cannot be discussed in a public forum. So you won't be > able to get a complete understanding of the situation by discussing it here. > >> I understand the support concerns, but I naively thought that assuming >> the hardware meets a basic set of requirements (e.g. redundant sas >> paths, x type of drives) it would be fairly supportable with GNR. The >> DS3700 shelves are re-branded NetApp E-series shelves and pretty vanilla >> I thought. > > Setting business issues aside, this is more complicated on the technical > level than one may think. At present, GNR requires a set of twin-tailed > external disk enclosures. This is not a particularly exotic kind of > hardware, but it turns out that this corner of the storage world is > quite insular. GNR has a very close relationship with physical disk > devices, much more so than regular GPFS. In an ideal world, SCSI and > SES standards are supposed to provide a framework which would allow > software like GNR to operate on an arbitrary disk enclosure. 
In the > real world, the actual SES implementations on various enclosures that > we've been dealing with are, well, peculiar. Apparently SES is one of > those standards where vendors feel a lot of freedom in "re-interpreting" > the standard, and since typically enclosures talk to a small set of RAID > controllers, there aren't bad enough consequences to force vendors to be > religious about SES standard compliance. Furthermore, the SAS fabric > topology in configurations with an external disk enclosures is > surprisingly complex, and that complexity predictably leads to complex > failures which don't exist in simpler configurations. Thus far, every > single one of the five enclosures we've had a chance to run GNR on > required some adjustments, workarounds, hacks, etc. And the > consequences of a misbehaving SAS fabric can be quite dire. There are > various approaches to dealing with those complications, from running a > massive 3rd party hardware qualification program to basically declaring > any complications from an unknown enclosure to be someone else's problem > (how would ZFS deal with a SCSI reset storm due to a bad SAS expander?), > but there's much debate on what is the right path to take. Customer > input/feedback is obviously very valuable in tilting such discussions in > the right direction. > > yuri > > Inactive hide details for Aaron Knister ---09/28/2016 06:44:23 > PM---Thanks Everyone for your replies! (Quick disclaimer, these Aaron > Knister ---09/28/2016 06:44:23 PM---Thanks Everyone for your replies! > (Quick disclaimer, these opinions are my own, and not those of my > > From: Aaron Knister > To: gpfsug-discuss at spectrumscale.org, > Date: 09/28/2016 06:44 PM > Subject: Re: [gpfsug-discuss] gpfs native raid > Sent by: gpfsug-discuss-bounces at spectrumscale.org > > ------------------------------------------------------------------------ > > > > Thanks Everyone for your replies! (Quick disclaimer, these opinions are > my own, and not those of my employer or NASA). > > Not knowing what's coming at the NDA session, it seems to boil down to > "it ain't gonna happen" because of: > > - Perceived difficulty in supporting whatever creative hardware > solutions customers may throw at it. > > I understand the support concerns, but I naively thought that assuming > the hardware meets a basic set of requirements (e.g. redundant sas > paths, x type of drives) it would be fairly supportable with GNR. The > DS3700 shelves are re-branded NetApp E-series shelves and pretty vanilla > I thought. > > - IBM would like to monetize the product and compete with the likes of > DDN/Seagate > > This is admittedly a little disappointing. GPFS as long as I've known it > has been largely hardware vendor agnostic. To see even a slight shift > towards hardware vendor lockin and certain features only being supported > and available on IBM hardware is concerning. It's not like the software > itself is free. Perhaps GNR could be a paid add-on license for non-IBM > hardware? Just thinking out-loud. > > The big things I was looking to GNR for are: > > - end-to-end checksums > - implementing a software RAID layer on (in my case enterprise class) JBODs > > I can find a way to do the second thing, but the former I cannot. > Requiring IBM hardware to get end-to-end checksums is a huge red flag > for me. That's something Lustre will do today with ZFS on any hardware > ZFS will run on (and for free, I might add). 
I would think GNR being > openly available to customers would be important for GPFS to compete > with Lustre. Furthermore, I had opened an RFE (#84523) a while back to > implement checksumming of data for non-GNR environments. The RFE was > declined because essentially it would be too hard and it already exists > for GNR. Well, considering I don't have a GNR environment, and hardware > vendor lock in is something many sites are not interested in, that's > somewhat of a problem. > > I really hope IBM reconsiders their stance on opening up GNR. The > current direction, while somewhat understandable, leaves a really bad > taste in my mouth and is one of the (very few, in my opinion) features > Lustre has over GPFS. > > -Aaron > > > On 9/1/16 9:59 AM, Marc A Kaplan wrote: >> I've been told that it is a big leap to go from supporting GSS and ESS >> to allowing and supporting native raid for customers who may throw >> together "any" combination of hardware they might choose. >> >> In particular the GNR "disk hospital" functions... >> https://www.ibm.com/support/knowledgecenter/SSFKCN_3.5.0/com.ibm.cluster.gpfs.v3r5.gpfs200.doc/bl1adv_introdiskhospital.htm >> will be tricky to support on umpteen different vendor boxes -- and keep >> in mind, those will be from IBM competitors! >> >> That said, ESS and GSS show that IBM has some good tech in this area and >> IBM has shown with the Spectrum Scale product (sans GNR) it can support >> just about any semi-reasonable hardware configuration and a good slew of >> OS versions and architectures... Heck I have a demo/test version of GPFS >> running on a 5 year old Thinkpad laptop.... And we have some GSSs in the >> lab... Not to mention Power hardware and mainframe System Z (think 360, >> 370, 290, Z) >> >> >> >> >> _______________________________________________ >> gpfsug-discuss mailing list >> gpfsug-discuss at spectrumscale.org >> http://gpfsug.org/mailman/listinfo/gpfsug-discuss >> > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > > > > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > -- Aaron Knister NASA Center for Climate Simulation (Code 606.2) Goddard Space Flight Center (301) 286-2776 From makaplan at us.ibm.com Thu Sep 1 00:40:13 2016 From: makaplan at us.ibm.com (Marc A Kaplan) Date: Wed, 31 Aug 2016 19:40:13 -0400 Subject: [gpfsug-discuss] Data Replication In-Reply-To: References: Message-ID: You can leave out the WHERE ... AND POOL_NAME LIKE 'deep' - that is redundant with the FROM POOL 'deep' clause. In fact at a slight additional overhead in mmapplypolicy processing due to begin checked a little later in the game, you can leave out MISC_ATTRIBUTES NOT LIKE '%2%' since the code is smart enough to not operate on files already marked as replicate(2). I believe mmapplypolicy .... -I yes means do any necessary data movement and/or replication "now" Alternatively you can say -I defer, which will leave the files "ill-replicated" and then fix them up with mmrestripefs later. The -I yes vs -I defer choice is the same as for mmchattr. Think of mmapplypolicy as a fast, parallel way to do find ... | xargs mmchattr ... 
I've tried GPFS on zvols a couple times and the write throughput I get is terrible because of the required sync=always parameter. Perhaps a couple of SSD's could help get the number up, though. -Aaron On 8/30/16 12:47 PM, Christopher Maestas wrote: > Interestingly enough, Spectrum Scale can run on zvols. Check out: > > http://files.gpfsug.org/presentations/2016/anl-june/LANL_GPFS_ZFS.pdf > > -cdm > > ------------------------------------------------------------------------ > On Aug 30, 2016, 9:17:05 AM, aaron.s.knister at nasa.gov wrote: > > From: aaron.s.knister at nasa.gov > To: gpfsug-discuss at spectrumscale.org > Cc: > Date: Aug 30, 2016 9:17:05 AM > Subject: [gpfsug-discuss] gpfs native raid > > Does anyone know if/when we might see gpfs native raid opened up for the > masses on non-IBM hardware? It's hard to answer the question of "why > can't GPFS do this? Lustre can" in regards to Lustre's integration with > ZFS and support for RAID on commodity hardware. > -Aaron > -- > Aaron Knister > NASA Center for Climate Simulation (Code 606.2) > Goddard Space Flight Center > (301) 286-2776 > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > -- Aaron Knister NASA Center for Climate Simulation (Code 606.2) Goddard Space Flight Center (301) 286-2776 _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discussUnless stated otherwise above: IBM United Kingdom Limited - Registered in England and Wales with number 741598. Registered office: PO Box 41, North Harbour, Portsmouth, Hampshire PO6 3AU -------------- next part -------------- An HTML attachment was scrubbed... URL: From daniel.kidger at uk.ibm.com Thu Sep 1 12:22:47 2016 From: daniel.kidger at uk.ibm.com (Daniel Kidger) Date: Thu, 1 Sep 2016 11:22:47 +0000 Subject: [gpfsug-discuss] Maximum value for data replication? In-Reply-To: References: , Message-ID: An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: Image.1__=0ABB0AB3DFD67DBA8f9e8a93df938 at us.ibm.com.gif Type: image/gif Size: 105 bytes Desc: not available URL: From bauer at cesnet.cz Thu Sep 1 14:30:23 2016 From: bauer at cesnet.cz (Miroslav Bauer) Date: Thu, 1 Sep 2016 15:30:23 +0200 Subject: [gpfsug-discuss] Migration to separate metadata and data disks Message-ID: <7927f34a-28e5-6fc2-a55d-62b2066a08da@cesnet.cz> Hello, I have a GPFS 3.5 filesystem (fs1) and I'm trying to migrate the filesystem metadata from state: -m = 2 (default metadata replicas) - SATA disks (dataAndMetadata, failGroup=1) - SSDs (metadataOnly, failGroup=3) to the desired state: -m = 1 - SATA disks (dataOnly, failGroup=1) - SSDs (metadataOnly, failGroup=3) I have done the following steps in the following order: 1) change SATA disks to dataOnly (stanza file modifies the 'usage' attribute only): # mmchdisk fs1 change -F dataOnly_disks.stanza Attention: Disk parameters were changed. Use the mmrestripefs command with the -r option to relocate data and metadata. Verifying file system configuration information ... mmchdisk: Propagating the cluster configuration data to all affected nodes. This is an asynchronous process. 
2) change default metadata replicas number 2->1 # mmchfs fs1 -m 1 3) run mmrestripefs as suggested by output of 1) # mmrestripefs fs1 -r Scanning file system metadata, phase 1 ... Error processing inodes. No space left on device mmrestripefs: Command failed. Examine previous error messages to determine cause. It is, however, still possible to create new files on the filesystem. When I return one of the SATA disks as a dataAndMetadata disk, the mmrestripefs command stops complaining about No space left on device. Both df and mmdf say that there is enough space both for data (SATA) and metadata (SSDs). Does anyone have an idea why is it complaining? Thanks, -- Miroslav Bauer -------------- next part -------------- A non-text attachment was scrubbed... Name: smime.p7s Type: application/pkcs7-signature Size: 3716 bytes Desc: S/MIME Cryptographic Signature URL: From aaron.s.knister at nasa.gov Thu Sep 1 14:36:32 2016 From: aaron.s.knister at nasa.gov (Aaron Knister) Date: Thu, 1 Sep 2016 09:36:32 -0400 Subject: [gpfsug-discuss] Migration to separate metadata and data disks In-Reply-To: <7927f34a-28e5-6fc2-a55d-62b2066a08da@cesnet.cz> References: <7927f34a-28e5-6fc2-a55d-62b2066a08da@cesnet.cz> Message-ID: I must admit, I'm curious as to the reason you're dropping the replication factor from 2 down to 1. There are some serious advantages we've seen to having multiple metadata replicas, as far as error recovery is concerned. Could you paste an output of mmlsdisk for the filesystem? -Aaron On 9/1/16 9:30 AM, Miroslav Bauer wrote: > Hello, > > I have a GPFS 3.5 filesystem (fs1) and I'm trying to migrate the > filesystem metadata from state: > -m = 2 (default metadata replicas) > - SATA disks (dataAndMetadata, failGroup=1) > - SSDs (metadataOnly, failGroup=3) > to the desired state: > -m = 1 > - SATA disks (dataOnly, failGroup=1) > - SSDs (metadataOnly, failGroup=3) > > I have done the following steps in the following order: > 1) change SATA disks to dataOnly (stanza file modifies the 'usage' > attribute only): > # mmchdisk fs1 change -F dataOnly_disks.stanza > Attention: Disk parameters were changed. > Use the mmrestripefs command with the -r option to relocate data and > metadata. > Verifying file system configuration information ... > mmchdisk: Propagating the cluster configuration data to all > affected nodes. This is an asynchronous process. > > 2) change default metadata replicas number 2->1 > # mmchfs fs1 -m 1 > > 3) run mmrestripefs as suggested by output of 1) > # mmrestripefs fs1 -r > Scanning file system metadata, phase 1 ... > Error processing inodes. > No space left on device > mmrestripefs: Command failed. Examine previous error messages to > determine cause. > > It is, however, still possible to create new files on the filesystem. > When I return one of the SATA disks as a dataAndMetadata disk, the > mmrestripefs > command stops complaining about No space left on device. Both df and mmdf > say that there is enough space both for data (SATA) and metadata (SSDs). > Does anyone have an idea why is it complaining? 
> > Thanks, > > -- > Miroslav Bauer > > > > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > -- Aaron Knister NASA Center for Climate Simulation (Code 606.2) Goddard Space Flight Center (301) 286-2776 From aaron.s.knister at nasa.gov Thu Sep 1 14:39:17 2016 From: aaron.s.knister at nasa.gov (Aaron Knister) Date: Thu, 1 Sep 2016 09:39:17 -0400 Subject: [gpfsug-discuss] Migration to separate metadata and data disks In-Reply-To: References: <7927f34a-28e5-6fc2-a55d-62b2066a08da@cesnet.cz> Message-ID: By the way, I suspect the no space on device errors are because GPFS believes for some reason that it is unable to maintain the metadata replication factor of 2 that's likely set on all previously created inodes. On 9/1/16 9:36 AM, Aaron Knister wrote: > I must admit, I'm curious as to the reason you're dropping the > replication factor from 2 down to 1. There are some serious advantages > we've seen to having multiple metadata replicas, as far as error > recovery is concerned. > > Could you paste an output of mmlsdisk for the filesystem? > > -Aaron > > On 9/1/16 9:30 AM, Miroslav Bauer wrote: >> Hello, >> >> I have a GPFS 3.5 filesystem (fs1) and I'm trying to migrate the >> filesystem metadata from state: >> -m = 2 (default metadata replicas) >> - SATA disks (dataAndMetadata, failGroup=1) >> - SSDs (metadataOnly, failGroup=3) >> to the desired state: >> -m = 1 >> - SATA disks (dataOnly, failGroup=1) >> - SSDs (metadataOnly, failGroup=3) >> >> I have done the following steps in the following order: >> 1) change SATA disks to dataOnly (stanza file modifies the 'usage' >> attribute only): >> # mmchdisk fs1 change -F dataOnly_disks.stanza >> Attention: Disk parameters were changed. >> Use the mmrestripefs command with the -r option to relocate data and >> metadata. >> Verifying file system configuration information ... >> mmchdisk: Propagating the cluster configuration data to all >> affected nodes. This is an asynchronous process. >> >> 2) change default metadata replicas number 2->1 >> # mmchfs fs1 -m 1 >> >> 3) run mmrestripefs as suggested by output of 1) >> # mmrestripefs fs1 -r >> Scanning file system metadata, phase 1 ... >> Error processing inodes. >> No space left on device >> mmrestripefs: Command failed. Examine previous error messages to >> determine cause. >> >> It is, however, still possible to create new files on the filesystem. >> When I return one of the SATA disks as a dataAndMetadata disk, the >> mmrestripefs >> command stops complaining about No space left on device. Both df and mmdf >> say that there is enough space both for data (SATA) and metadata (SSDs). >> Does anyone have an idea why is it complaining? 
>> >> Thanks, >> >> -- >> Miroslav Bauer >> >> >> >> >> _______________________________________________ >> gpfsug-discuss mailing list >> gpfsug-discuss at spectrumscale.org >> http://gpfsug.org/mailman/listinfo/gpfsug-discuss >> > -- Aaron Knister NASA Center for Climate Simulation (Code 606.2) Goddard Space Flight Center (301) 286-2776 From jonathan at buzzard.me.uk Thu Sep 1 14:49:11 2016 From: jonathan at buzzard.me.uk (Jonathan Buzzard) Date: Thu, 01 Sep 2016 14:49:11 +0100 Subject: [gpfsug-discuss] Migration to separate metadata and data disks In-Reply-To: References: <7927f34a-28e5-6fc2-a55d-62b2066a08da@cesnet.cz> Message-ID: <1472737751.25479.22.camel@buzzard.phy.strath.ac.uk> On Thu, 2016-09-01 at 09:39 -0400, Aaron Knister wrote: > By the way, I suspect the no space on device errors are because GPFS > believes for some reason that it is unable to maintain the metadata > replication factor of 2 that's likely set on all previously created inodes. > Hazarding a guess, but there is only one SSD NSD, so if all the metadata is going to go on SSD there is no point in replicating. It would also explain why it would believe it can't maintain the metadata replication factor. Though it could just be a simple metadata size is larger than the available SSD size. JAB. -- Jonathan A. Buzzard Email: jonathan (at) buzzard.me.uk Fife, United Kingdom. From makaplan at us.ibm.com Thu Sep 1 14:59:28 2016 From: makaplan at us.ibm.com (Marc A Kaplan) Date: Thu, 1 Sep 2016 09:59:28 -0400 Subject: [gpfsug-discuss] gpfs native raid In-Reply-To: References: <96282850-6bfa-73ae-8502-9e8df3a56390@nasa.gov> Message-ID: I've been told that it is a big leap to go from supporting GSS and ESS to allowing and supporting native raid for customers who may throw together "any" combination of hardware they might choose. In particular the GNR "disk hospital" functions... https://www.ibm.com/support/knowledgecenter/SSFKCN_3.5.0/com.ibm.cluster.gpfs.v3r5.gpfs200.doc/bl1adv_introdiskhospital.htm will be tricky to support on umpteen different vendor boxes -- and keep in mind, those will be from IBM competitors! That said, ESS and GSS show that IBM has some good tech in this area and IBM has shown with the Spectrum Scale product (sans GNR) it can support just about any semi-reasonable hardware configuration and a good slew of OS versions and architectures... Heck I have a demo/test version of GPFS running on a 5 year old Thinkpad laptop.... And we have some GSSs in the lab... Not to mention Power hardware and mainframe System Z (think 360, 370, 290, Z) -------------- next part -------------- An HTML attachment was scrubbed... URL: From aaron.s.knister at nasa.gov Thu Sep 1 15:02:50 2016 From: aaron.s.knister at nasa.gov (Aaron Knister) Date: Thu, 1 Sep 2016 10:02:50 -0400 Subject: [gpfsug-discuss] Migration to separate metadata and data disks In-Reply-To: References: <7927f34a-28e5-6fc2-a55d-62b2066a08da@cesnet.cz> Message-ID: <2ce7334b-28c1-7a14-814a-fbcf99d8049e@nasa.gov> Oh! I think you've already provided the info I was looking for :) I thought that failGroup=3 meant there were 3 failure groups within the SSDs. I suspect that's not at all what you meant and that actually is the failure group of all of those disks. That I think explains what's going on-- there's only one failure group's worth of metadata-capable disks available and as such GPFS can't place the 2nd replica for existing files. 
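For illustration only (the NSD names below are invented and the right split depends on the actual SSD hardware), the suggestions that follow would translate into a stanza-file change of roughly this shape, reusing the same mmchdisk change -F mechanism already used above to flip the usage attribute:

%nsd: nsd=ssd_nsd_01 usage=metadataOnly failureGroup=3
%nsd: nsd=ssd_nsd_02 usage=metadataOnly failureGroup=4

# mmchdisk fs1 change -F ssd_failuregroups.stanza
# mmchfs fs1 -m 2
# mmrestripefs fs1 -R

Half of the SSD NSDs keep failure group 3 and the other half move to failure group 4, which gives the metadata placement two independent failure groups to put the second replica in.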
Here's what I would suggest: - Create at least 2 failure groups within the SSDs - Put the default metadata replication factor back to 2 - Run a restripefs -R to shuffle files around and restore the metadata replication factor of 2 to any files created while it was set to 1 If you're not interested in replication for metadata then perhaps all you need to do is the mmrestripefs -R. I think that should un-replicate the file from the SATA disks leaving the copy on the SSDs. Hope that helps. -Aaron On 9/1/16 9:39 AM, Aaron Knister wrote: > By the way, I suspect the no space on device errors are because GPFS > believes for some reason that it is unable to maintain the metadata > replication factor of 2 that's likely set on all previously created inodes. > > On 9/1/16 9:36 AM, Aaron Knister wrote: >> I must admit, I'm curious as to the reason you're dropping the >> replication factor from 2 down to 1. There are some serious advantages >> we've seen to having multiple metadata replicas, as far as error >> recovery is concerned. >> >> Could you paste an output of mmlsdisk for the filesystem? >> >> -Aaron >> >> On 9/1/16 9:30 AM, Miroslav Bauer wrote: >>> Hello, >>> >>> I have a GPFS 3.5 filesystem (fs1) and I'm trying to migrate the >>> filesystem metadata from state: >>> -m = 2 (default metadata replicas) >>> - SATA disks (dataAndMetadata, failGroup=1) >>> - SSDs (metadataOnly, failGroup=3) >>> to the desired state: >>> -m = 1 >>> - SATA disks (dataOnly, failGroup=1) >>> - SSDs (metadataOnly, failGroup=3) >>> >>> I have done the following steps in the following order: >>> 1) change SATA disks to dataOnly (stanza file modifies the 'usage' >>> attribute only): >>> # mmchdisk fs1 change -F dataOnly_disks.stanza >>> Attention: Disk parameters were changed. >>> Use the mmrestripefs command with the -r option to relocate data and >>> metadata. >>> Verifying file system configuration information ... >>> mmchdisk: Propagating the cluster configuration data to all >>> affected nodes. This is an asynchronous process. >>> >>> 2) change default metadata replicas number 2->1 >>> # mmchfs fs1 -m 1 >>> >>> 3) run mmrestripefs as suggested by output of 1) >>> # mmrestripefs fs1 -r >>> Scanning file system metadata, phase 1 ... >>> Error processing inodes. >>> No space left on device >>> mmrestripefs: Command failed. Examine previous error messages to >>> determine cause. >>> >>> It is, however, still possible to create new files on the filesystem. >>> When I return one of the SATA disks as a dataAndMetadata disk, the >>> mmrestripefs >>> command stops complaining about No space left on device. Both df and >>> mmdf >>> say that there is enough space both for data (SATA) and metadata (SSDs). >>> Does anyone have an idea why is it complaining? >>> >>> Thanks, >>> >>> -- >>> Miroslav Bauer >>> >>> >>> >>> >>> _______________________________________________ >>> gpfsug-discuss mailing list >>> gpfsug-discuss at spectrumscale.org >>> http://gpfsug.org/mailman/listinfo/gpfsug-discuss >>> >> > -- Aaron Knister NASA Center for Climate Simulation (Code 606.2) Goddard Space Flight Center (301) 286-2776 From makaplan at us.ibm.com Thu Sep 1 15:14:18 2016 From: makaplan at us.ibm.com (Marc A Kaplan) Date: Thu, 1 Sep 2016 10:14:18 -0400 Subject: [gpfsug-discuss] Migration to separate metadata and data disks In-Reply-To: <2ce7334b-28c1-7a14-814a-fbcf99d8049e@nasa.gov> References: <7927f34a-28e5-6fc2-a55d-62b2066a08da@cesnet.cz> <2ce7334b-28c1-7a14-814a-fbcf99d8049e@nasa.gov> Message-ID: I believe the OP left out a step. 
I am not saying this is a good idea, but ... One must change the replication factors marked in each inode for each file... This could be done using an mmapplypolicy rule: RULE 'one' MIGRATE FROM POOL 'yourdatapool' TO POOL 'yourdatapool' REPLICATE(1,1) (repeat rule for each POOL you have) Put that (those) rules in a file and do a "one time" run like mmapplypolicy yourfilesystem -P /path/to/rule -N nodelist-to-do-this-work -g /filesystem/bigtemp -I defer Then try your restripe again. -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: not available Type: image/gif Size: 21994 bytes Desc: not available URL: From bauer at cesnet.cz Thu Sep 1 15:28:36 2016 From: bauer at cesnet.cz (Miroslav Bauer) Date: Thu, 1 Sep 2016 16:28:36 +0200 Subject: [gpfsug-discuss] Migration to separate metadata and data disks In-Reply-To: <2ce7334b-28c1-7a14-814a-fbcf99d8049e@nasa.gov> References: <7927f34a-28e5-6fc2-a55d-62b2066a08da@cesnet.cz> <2ce7334b-28c1-7a14-814a-fbcf99d8049e@nasa.gov> Message-ID: <505ff859-d49a-04cc-bd9d-50f7b2a8df0b@cesnet.cz> Yes, failure group id is exactly what I meant :). Unfortunately, mmrestripefs with -R behaves the same as with -r. I also believed that mmrestripefs -R is the correct tool for fixing the replication settings on inodes (according to manpages), but I will try possible solutions you and Marc suggested and let you know how it went. Thank you, -- Miroslav Bauer On 09/01/2016 04:02 PM, Aaron Knister wrote: > Oh! I think you've already provided the info I was looking for :) I > thought that failGroup=3 meant there were 3 failure groups within the > SSDs. I suspect that's not at all what you meant and that actually is > the failure group of all of those disks. That I think explains what's > going on-- there's only one failure group's worth of metadata-capable > disks available and as such GPFS can't place the 2nd replica for > existing files. > > Here's what I would suggest: > > - Create at least 2 failure groups within the SSDs > - Put the default metadata replication factor back to 2 > - Run a restripefs -R to shuffle files around and restore the metadata > replication factor of 2 to any files created while it was set to 1 > > If you're not interested in replication for metadata then perhaps all > you need to do is the mmrestripefs -R. I think that should > un-replicate the file from the SATA disks leaving the copy on the SSDs. > > Hope that helps. > > -Aaron > > On 9/1/16 9:39 AM, Aaron Knister wrote: >> By the way, I suspect the no space on device errors are because GPFS >> believes for some reason that it is unable to maintain the metadata >> replication factor of 2 that's likely set on all previously created >> inodes. >> >> On 9/1/16 9:36 AM, Aaron Knister wrote: >>> I must admit, I'm curious as to the reason you're dropping the >>> replication factor from 2 down to 1. There are some serious advantages >>> we've seen to having multiple metadata replicas, as far as error >>> recovery is concerned. >>> >>> Could you paste an output of mmlsdisk for the filesystem? 
>>> >>> -Aaron >>> On 9/1/16 9:30 AM, Miroslav Bauer wrote: >>>> Hello, >>>> >>>> I have a GPFS 3.5 filesystem (fs1) and I'm trying to migrate the >>>> filesystem metadata from state: >>>> -m = 2 (default metadata replicas) >>>> - SATA disks (dataAndMetadata, failGroup=1) >>>> - SSDs (metadataOnly, failGroup=3) >>>> to the desired state: >>>> -m = 1 >>>> - SATA disks (dataOnly, failGroup=1) >>>> - SSDs (metadataOnly, failGroup=3) >>>> >>>> I have done the following steps in the following order: >>>> 1) change SATA disks to dataOnly (stanza file modifies the 'usage' >>>> attribute only): >>>> # mmchdisk fs1 change -F dataOnly_disks.stanza >>>> Attention: Disk parameters were changed. >>>> Use the mmrestripefs command with the -r option to relocate data and >>>> metadata. >>>> Verifying file system configuration information ... >>>> mmchdisk: Propagating the cluster configuration data to all >>>> affected nodes. This is an asynchronous process. >>>> >>>> 2) change default metadata replicas number 2->1 >>>> # mmchfs fs1 -m 1 >>>> >>>> 3) run mmrestripefs as suggested by output of 1) >>>> # mmrestripefs fs1 -r >>>> Scanning file system metadata, phase 1 ... >>>> Error processing inodes. >>>> No space left on device >>>> mmrestripefs: Command failed. Examine previous error messages to >>>> determine cause. >>>> >>>> It is, however, still possible to create new files on the filesystem. >>>> When I return one of the SATA disks as a dataAndMetadata disk, the >>>> mmrestripefs >>>> command stops complaining about No space left on device. Both df and >>>> mmdf >>>> say that there is enough space both for data (SATA) and metadata >>>> (SSDs). >>>> Does anyone have an idea why is it complaining? >>>> >>>> Thanks, >>>> >>>> -- >>>> Miroslav Bauer >>>> >>>> >>>> >>>> >>>> _______________________________________________ >>>> gpfsug-discuss mailing list >>>> gpfsug-discuss at spectrumscale.org >>>> http://gpfsug.org/mailman/listinfo/gpfsug-discuss >>>> >>> >> > -------------- next part -------------- A non-text attachment was scrubbed... Name: smime.p7s Type: application/pkcs7-signature Size: 3716 bytes Desc: S/MIME Cryptographic Signature URL: From S.J.Thompson at bham.ac.uk Thu Sep 1 22:06:44 2016 From: S.J.Thompson at bham.ac.uk (Simon Thompson (Research Computing - IT Services)) Date: Thu, 1 Sep 2016 21:06:44 +0000 Subject: [gpfsug-discuss] Maximum value for data replication? In-Reply-To: References: , , Message-ID: I have two protocol nodes in each of two data centres. So four protocol nodes in the cluster. Plus I also have a quorum vm which is lockstep/ha so guaranteed to survive in one of the data centres should we lose power. The protocol servers being protocol servers don't have access to the fibre channel storage. And we've seen ces do bad things when the storage cluster it is remotely mounting (and the ces root is on) fails/is under load etc. So the four full copies is about guaranteeing there are two full copies in both data centres. And remember this is only for the cesroot, so lock data for the ces ips, the smb registry I think as well. I was hoping that by making the cesroot in the protocol node cluster rather than a fileset on a remote mounted filesystem, that it would fix the ces weirdness we see as it would become a local gpfs file system. I guess three copies would maybe work. But also in another cluster, we have been thinking about adding NVMe into NSD servers for metadata and system.log and so I can see there are cases there where having higher numbers of copies would be useful.
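Purely as a sketch of the three-copy variant of that cesroot idea (the device, NSD and server names and the paths below are invented, and the exact procedure for changing cesSharedRoot should be checked against the documentation for the release in use), it might look roughly like:

%nsd: device=/dev/sdb nsd=ces_dc1_a servers=proto-dc1-a usage=dataAndMetadata failureGroup=1 pool=system
%nsd: device=/dev/sdb nsd=ces_dc1_b servers=proto-dc1-b usage=dataAndMetadata failureGroup=2 pool=system
%nsd: device=/dev/sdb nsd=ces_dc2_a servers=proto-dc2-a usage=dataAndMetadata failureGroup=3 pool=system
%nsd: device=/dev/sdb nsd=ces_dc2_b servers=proto-dc2-b usage=dataAndMetadata failureGroup=4 pool=system

# mmcrnsd -F ces_nsds.stanza
# mmcrfs cesfs -F ces_nsds.stanza -m 3 -M 3 -r 3 -R 3 -T /gpfs/cesfs
# mmmount cesfs -a
(stop the protocol services, rsync the existing cesroot across, then)
# mmchconfig cesSharedRoot=/gpfs/cesfs/cesroot
(and start the protocol services again)

With one failure group per protocol node and replication of 3, every block lands in three of the four failure groups, so each data centre always holds at least one copy; guaranteeing two full copies per site is exactly what would need a replication factor of 4.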
Yes I take the point that more copies means more load for the client, but in these cases, we aren't thinking about gpfs as the fastest possible hpc file system, but for other infrastructure purposes, which is one of the ways the product seems to be moving. Simon ________________________________________ From: gpfsug-discuss-bounces at spectrumscale.org [gpfsug-discuss-bounces at spectrumscale.org] on behalf of Daniel Kidger [daniel.kidger at uk.ibm.com] Sent: 01 September 2016 12:22 To: gpfsug-discuss at spectrumscale.org Cc: gpfsug-discuss at spectrumscale.org Subject: Re: [gpfsug-discuss] Maximum value for data replication? Simon, Hi. Can you explain why you would like a full copy of all blocks on all 4 NSD servers ? Is there a particular use case, and hence an interest from product development? Otherwise remember that with 4 NSD servers, with one failure group per (storage rich) NSD server, then all 4 disk arrays will be loaded equally, as new files will get written to any 3 (or 2 or 1) of the 4 failure groups. Also remember that as you add more replication then there is more network load on the gpfs client as it has to perform all the writes itself. Perhaps someone technical can comment on the logic that determines which '3' out of 4 failure groups, a particular block is written to. Daniel [/spectrum_storage-banne] [Spectrum Scale Logo] Dr Daniel Kidger IBM Technical Sales Specialist Software Defined Solution Sales +44-07818 522 266 daniel.kidger at uk.ibm.com ----- Original message ----- From: Steve Duersch Sent by: gpfsug-discuss-bounces at spectrumscale.org To: gpfsug-discuss at spectrumscale.org Cc: Subject: Re: [gpfsug-discuss] Maximum value for data replication? Date: Wed, Aug 31, 2016 1:45 PM >>Is there a maximum value for data replication in Spectrum Scale? The maximum value for replication is 3. Steve Duersch Spectrum Scale RAID 845-433-7902 IBM Poughkeepsie, New York [Inactive hide details for gpfsug-discuss-request---08/30/2016 07:25:24 PM---Send gpfsug-discuss mailing list submissions to gp]gpfsug-discuss-request---08/30/2016 07:25:24 PM---Send gpfsug-discuss mailing list submissions to gpfsug-discuss at spectrumscale.org From: gpfsug-discuss-request at spectrumscale.org To: gpfsug-discuss at spectrumscale.org Date: 08/30/2016 07:25 PM Subject: gpfsug-discuss Digest, Vol 55, Issue 55 Sent by: gpfsug-discuss-bounces at spectrumscale.org ________________________________ Send gpfsug-discuss mailing list submissions to gpfsug-discuss at spectrumscale.org To subscribe or unsubscribe via the World Wide Web, visit http://gpfsug.org/mailman/listinfo/gpfsug-discuss or, via email, send a message with subject or body 'help' to gpfsug-discuss-request at spectrumscale.org You can reach the person managing the list at gpfsug-discuss-owner at spectrumscale.org When replying, please edit your Subject line so it is more specific than "Re: Contents of gpfsug-discuss digest..." Today's Topics: 1. Maximum value for data replication? (Simon Thompson (Research Computing - IT Services)) 2. greetings (Kevin D Johnson) 3. GPFS 3.5.0 on RHEL 6.8 (Lukas Hejtmanek) 4. Re: GPFS 3.5.0 on RHEL 6.8 (Kevin D Johnson) 5. Re: GPFS 3.5.0 on RHEL 6.8 (mark.bergman at uphs.upenn.edu) 6. Re: *New* IBM Spectrum Protect Whitepaper "Petascale Data Protection" (Lukas Hejtmanek) 7. 
Re: *New* IBM Spectrum Protect Whitepaper "Petascale Data Protection" (Sven Oehme) ---------------------------------------------------------------------- Message: 1 Date: Tue, 30 Aug 2016 19:09:05 +0000 From: "Simon Thompson (Research Computing - IT Services)" To: "gpfsug-discuss at spectrumscale.org" Subject: [gpfsug-discuss] Maximum value for data replication? Message-ID: Content-Type: text/plain; charset="us-ascii" Is there a maximum value for data replication in Spectrum Scale? I have a number of nsd servers which have local storage and Id like each node to have a full copy of all the data in the file-system, say this value is 4, can I set replication to 4 for data and metadata and have each server have a full copy? These are protocol nodes and multi cluster mount another file system (yes I know not supported) and the cesroot is in the remote file system. On several occasions where GPFS has wibbled a bit, this has caused issues with ces locks, so I was thinking of moving the cesroot to a local filesysyem which is replicated on the local ssds in the protocol nodes. I.e. Its a generally quiet file system as its only ces cluster config. I assume if I stop protocols, rsync the data and then change to the new ces root, I should be able to get this working? Thanks Simon ------------------------------ Message: 2 Date: Tue, 30 Aug 2016 19:43:39 +0000 From: "Kevin D Johnson" To: gpfsug-discuss at spectrumscale.org Subject: [gpfsug-discuss] greetings Message-ID: Content-Type: text/plain; charset="us-ascii" An HTML attachment was scrubbed... URL: ------------------------------ Message: 3 Date: Tue, 30 Aug 2016 22:39:18 +0200 From: Lukas Hejtmanek To: gpfsug-discuss at spectrumscale.org Subject: [gpfsug-discuss] GPFS 3.5.0 on RHEL 6.8 Message-ID: <20160830203917.qptfgqvlmdbzu6wr at ics.muni.cz> Content-Type: text/plain; charset=iso-8859-2 Hello, does it work for anyone? As of kernel 2.6.32-642, GPFS 3.5.0 (including the latest patch 32) does start but does not mount and file system. The internal mount cmd gets stucked. -- Luk?? Hejtm?nek ------------------------------ Message: 4 Date: Tue, 30 Aug 2016 20:51:39 +0000 From: "Kevin D Johnson" To: gpfsug-discuss at spectrumscale.org Subject: Re: [gpfsug-discuss] GPFS 3.5.0 on RHEL 6.8 Message-ID: Content-Type: text/plain; charset="us-ascii" An HTML attachment was scrubbed... URL: ------------------------------ Message: 5 Date: Tue, 30 Aug 2016 17:07:21 -0400 From: mark.bergman at uphs.upenn.edu To: gpfsug main discussion list Subject: Re: [gpfsug-discuss] GPFS 3.5.0 on RHEL 6.8 Message-ID: <24437-1472591241.445832 at bR6O.TofS.917u> Content-Type: text/plain; charset="UTF-8" In the message dated: Tue, 30 Aug 2016 22:39:18 +0200, The pithy ruminations from Lukas Hejtmanek on <[gpfsug-discuss] GPFS 3.5.0 on RHEL 6.8> were: => Hello, GPFS 3.5.0.[23..3-0] work for me under [CentOS|ScientificLinux] 6.8, but at kernel 2.6.32-573 and lower. I've found kernel bugs in blk_cloned_rq_check_limits() in later kernel revs that caused multipath errors, resulting in GPFS being unable to find all NSDs and mount the filesystem. I am not updating to a newer kernel until I'm certain this is resolved. I opened a bug with CentOS: https://bugs.centos.org/view.php?id=10997 and began an extended discussion with the (RH & SUSE) developers of that chunk of kernel code. I don't know if an upstream bug has been opened by RH, but see: https://patchwork.kernel.org/patch/9140337/ => => does it work for anyone? 
As of kernel 2.6.32-642, GPFS 3.5.0 (including the => latest patch 32) does start but does not mount and file system. The internal => mount cmd gets stucked. => => -- => Luk?? Hejtm?nek -- Mark Bergman voice: 215-746-4061 mark.bergman at uphs.upenn.edu fax: 215-614-0266 http://www.cbica.upenn.edu/ IT Technical Director, Center for Biomedical Image Computing and Analytics Department of Radiology University of Pennsylvania PGP Key: http://www.cbica.upenn.edu/sbia/bergman ------------------------------ Message: 6 Date: Wed, 31 Aug 2016 00:02:50 +0200 From: Lukas Hejtmanek To: gpfsug main discussion list Subject: Re: [gpfsug-discuss] *New* IBM Spectrum Protect Whitepaper "Petascale Data Protection" Message-ID: <20160830220250.yt6r7gvfq7rlvtcs at ics.muni.cz> Content-Type: text/plain; charset=iso-8859-2 Hello, On Mon, Aug 29, 2016 at 09:20:46AM +0200, Frank Kraemer wrote: > Find the paper here: > > https://www.ibm.com/developerworks/community/wikis/home?lang=en#!/wiki/Tivoli%20Storage%20Manager/page/Petascale%20Data%20Protection thank you for the paper, I appreciate it. However, I wonder whether it could be extended a little. As it has the title Petascale Data Protection, I think that in Peta scale, you have to deal with millions (well rather hundreds of millions) of files you store in and this is something where TSM does not scale well. Could you give some hints: On the backup site: mmbackup takes ages for: a) scan (try to scan 500M files even in parallel) b) backup - what if 10 % of files get changed - backup process can be blocked several days as mmbackup cannot run in several instances on the same file system, so you have to wait until one run of mmbackup finishes. How long could it take at petascale? On the restore site: how can I restore e.g. 40 millions of file efficiently? dsmc restore '/path/*' runs into serious troubles after say 20M files (maybe wrong internal structures used), however, scanning 1000 more files takes several minutes resulting the dsmc restore never reaches that 40M files. using filelists the situation is even worse. I run dsmc restore -filelist with a filelist consisting of 2.4M files. Running for *two* days without restoring even a single file. dsmc is consuming 100 % CPU. So any hints addressing these issues with really large number of files would be even more appreciated. -- Luk?? Hejtm?nek ------------------------------ Message: 7 Date: Tue, 30 Aug 2016 16:24:59 -0700 From: Sven Oehme To: gpfsug main discussion list Subject: Re: [gpfsug-discuss] *New* IBM Spectrum Protect Whitepaper "Petascale Data Protection" Message-ID: Content-Type: text/plain; charset="utf-8" so lets start with some simple questions. when you say mmbackup takes ages, what version of gpfs code are you running ? how do you execute the mmbackup command ? exact parameters would be useful . what HW are you using for the metadata disks ? how much capacity (df -h) and how many inodes (df -i) do you have in the filesystem you try to backup ? sven On Tue, Aug 30, 2016 at 3:02 PM, Lukas Hejtmanek wrote: > Hello, > > On Mon, Aug 29, 2016 at 09:20:46AM +0200, Frank Kraemer wrote: > > Find the paper here: > > > > https://www.ibm.com/developerworks/community/wikis/home?lang=en#!/wiki/ > Tivoli%20Storage%20Manager/page/Petascale%20Data%20Protection > > thank you for the paper, I appreciate it. > > However, I wonder whether it could be extended a little. 
As it has the > title > Petascale Data Protection, I think that in Peta scale, you have to deal > with > millions (well rather hundreds of millions) of files you store in and this > is > something where TSM does not scale well. > > Could you give some hints: > > On the backup site: > mmbackup takes ages for: > a) scan (try to scan 500M files even in parallel) > b) backup - what if 10 % of files get changed - backup process can be > blocked > several days as mmbackup cannot run in several instances on the same file > system, so you have to wait until one run of mmbackup finishes. How long > could > it take at petascale? > > On the restore site: > how can I restore e.g. 40 millions of file efficiently? dsmc restore > '/path/*' > runs into serious troubles after say 20M files (maybe wrong internal > structures used), however, scanning 1000 more files takes several minutes > resulting the dsmc restore never reaches that 40M files. > > using filelists the situation is even worse. I run dsmc restore -filelist > with a filelist consisting of 2.4M files. Running for *two* days without > restoring even a single file. dsmc is consuming 100 % CPU. > > So any hints addressing these issues with really large number of files > would > be even more appreciated. > > -- > Luk?? Hejtm?nek > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > -------------- next part -------------- An HTML attachment was scrubbed... URL: ------------------------------ _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss End of gpfsug-discuss Digest, Vol 55, Issue 55 ********************************************** _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss Unless stated otherwise above: IBM United Kingdom Limited - Registered in England and Wales with number 741598. Registered office: PO Box 41, North Harbour, Portsmouth, Hampshire PO6 3AU -------------- next part -------------- A non-text attachment was scrubbed... Name: Image.1__=0ABB0AB3DFD67DBA8f9e8a93df938 at us.ibm.com.gif Type: image/gif Size: 105 bytes Desc: Image.1__=0ABB0AB3DFD67DBA8f9e8a93df938 at us.ibm.com.gif URL: From r.sobey at imperial.ac.uk Fri Sep 2 14:37:26 2016 From: r.sobey at imperial.ac.uk (Sobey, Richard A) Date: Fri, 2 Sep 2016 13:37:26 +0000 Subject: [gpfsug-discuss] CES node responding on system IP address Message-ID: Hi all, *Should* a CES node, 4.2.0 OR 4.2.1, be responding on its system IP address? The nodes in my cluster, seemingly randomly, either give me a list of shares, or prompt me to enter a username and password. For example, Start > Run \\cesnode.fqdn I get a prompt for a username and password. If I add the system IP into my hosts file and call it clustername.fqdn it responds normally i.e. no prompt for username or password. Should I be worried about the inconsistencies here? Richard Sobey Storage Area Network (SAN) Analyst Technical Operations, ICT Imperial College London South Kensington 403, City & Guilds Building London SW7 2AZ Tel: +44 (0)20 7594 6915 Email: r.sobey at imperial.ac.uk http://www.imperial.ac.uk/admin-services/ict/ -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From r.sobey at imperial.ac.uk Fri Sep 2 16:10:59 2016 From: r.sobey at imperial.ac.uk (Sobey, Richard A) Date: Fri, 2 Sep 2016 15:10:59 +0000 Subject: [gpfsug-discuss] Cannot stop SMB... stop NFS first? In-Reply-To: References: Message-ID: I?ve verified the upgrade has fixed this issue so thanks again. However I?ve noticed that stopping SMB doesn?t trigger an IP address failover. I think it should. mmces node suspend (or rebooting, or mmces address move, or etc..) seems to trigger the failover. Richard From: gpfsug-discuss-bounces at spectrumscale.org [mailto:gpfsug-discuss-bounces at spectrumscale.org] On Behalf Of Danny Alexander Calderon Rodriguez Sent: 27 August 2016 13:53 To: gpfsug-discuss at spectrumscale.org Subject: Re: [gpfsug-discuss] Cannot stop SMB... stop NFS first? Hi Richard This is fixed in release 4.2.1, if you cant upgrade now, you can fix this manuallly Just do this. edit file /usr/lpp/mmfs/lib/mmcesmon/SMBService.py Change if authType == 'ad' and not nodeState.nfsStopped: to nfsEnabled = utils.isProtocolEnabled("NFS", self.logger) if authType == 'ad' and not nodeState.nfsStopped and nfsEnabled: You need to stop the gpfs service in each node where you apply the change " after change the lines please use tap key" Enviado desde mi iPhone El 27/08/2016, a las 6:00 a.m., gpfsug-discuss-request at spectrumscale.org escribi?: Send gpfsug-discuss mailing list submissions to gpfsug-discuss at spectrumscale.org To subscribe or unsubscribe via the World Wide Web, visit http://gpfsug.org/mailman/listinfo/gpfsug-discuss or, via email, send a message with subject or body 'help' to gpfsug-discuss-request at spectrumscale.org You can reach the person managing the list at gpfsug-discuss-owner at spectrumscale.org When replying, please edit your Subject line so it is more specific than "Re: Contents of gpfsug-discuss digest..." Today's Topics: 1. Re: Cannot stop SMB... stop NFS first?(Christof Schmitt) 2. Re: CES and mmuserauth command (Christof Schmitt) ---------------------------------------------------------------------- Message: 1 Date: Fri, 26 Aug 2016 12:29:31 -0400 From: "Christof Schmitt" > To: gpfsug main discussion list > Subject: Re: [gpfsug-discuss] Cannot stop SMB... stop NFS first? Message-ID: > Content-Type: text/plain; charset="UTF-8" That would be the case when Active Directory is configured for authentication. In that case the SMB service includes two aspects: One is the actual SMB file server, and the second one is the service for the Active Directory integration. Since NFS depends on authentication and id mapping services, it requires SMB to be running. Regards, Christof Schmitt || IBM || Spectrum Scale Development || Tucson, AZ christof.schmitt at us.ibm.com || +1-520-799-2469 (T/L: 321-2469) From: "Sobey, Richard A" > To: "'gpfsug-discuss at spectrumscale.org'" > Date: 08/26/2016 04:48 AM Subject: [gpfsug-discuss] Cannot stop SMB... stop NFS first? Sent by: gpfsug-discuss-bounces at spectrumscale.org Sorry all, prepare for a deluge of emails like this, hopefully it?ll help other people implementing CES in the future. I?m trying to stop SMB on a node, but getting the following output: [root at cesnode ~]# mmces service stop smb smb: Request denied. Please stop NFS first [root at cesnode ~]# mmces service list Enabled services: SMB SMB is running As you can see there is no way to stop NFS when it?s not running but it seems to be blocking me. It?s happening on all the nodes in the cluster. SS version is 4.2.0 running on a fully up to date RHEL 7.1 server. 
Richard_______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss ------------------------------ Message: 2 Date: Fri, 26 Aug 2016 12:29:31 -0400 From: "Christof Schmitt" > To: gpfsug main discussion list > Subject: Re: [gpfsug-discuss] CES and mmuserauth command Message-ID: > Content-Type: text/plain; charset="ISO-2022-JP" The --user-name option applies to both, AD and LDAP authentication. In the LDAP case, this information is correct. I will try to get some clarification added for the AD case. The same applies to the information shown in "service list". There is a common field that holds the information and the parameter from the initial "service create" is stored there. The meaning is different for AD and LDAP: For LDAP it is the username being used to access the LDAP server, while in the AD case it was only the user initially used until the machine account was created. Regards, Christof Schmitt || IBM || Spectrum Scale Development || Tucson, AZ christof.schmitt at us.ibm.com || +1-520-799-2469 (T/L: 321-2469) From: Jan-Frode Myklebust > To: gpfsug main discussion list > Date: 08/26/2016 05:59 AM Subject: Re: [gpfsug-discuss] CES and mmuserauth command Sent by: gpfsug-discuss-bounces at spectrumscale.org On Fri, Aug 26, 2016 at 1:49 AM, Christof Schmitt < christof.schmitt at us.ibm.com> wrote: When joinging the AD domain, --user-name, --password and --server are only used to initially identify and logon to the AD and to create the machine account for the cluster. Once that is done, that information is no longer used, and e.g. the account from --user-name could be deleted, the password changed or the specified DC could be removed from the domain (as long as other DCs are remaining). That was my initial understanding of the --user-name, but when reading the man-page I get the impression that it's also used to do connect to AD to do user and group lookups: ------------------------------------------------------------------------------------------------------ ??user?name userName Specifies the user name to be used to perform operations against the authentication server. The specified user name must have sufficient permissions to read user and group attributes from the authentication server. ------------------------------------------------------------------------------------------------------- Also it's strange that "mmuserauth service list" would list the USER_NAME if it was only somthing that was used at configuration time..? -jf_______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss ------------------------------ _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss End of gpfsug-discuss Digest, Vol 55, Issue 44 ********************************************** -------------- next part -------------- An HTML attachment was scrubbed... URL: From S.J.Thompson at bham.ac.uk Fri Sep 2 16:15:30 2016 From: S.J.Thompson at bham.ac.uk (Simon Thompson (Research Computing - IT Services)) Date: Fri, 2 Sep 2016 15:15:30 +0000 Subject: [gpfsug-discuss] Cannot stop SMB... stop NFS first? In-Reply-To: References: , Message-ID: Should it? If you were running nfs and smb, would you necessarily want to fail the ip over? 
Simon ________________________________________ From: gpfsug-discuss-bounces at spectrumscale.org [gpfsug-discuss-bounces at spectrumscale.org] on behalf of Sobey, Richard A [r.sobey at imperial.ac.uk] Sent: 02 September 2016 16:10 To: gpfsug main discussion list Subject: Re: [gpfsug-discuss] Cannot stop SMB... stop NFS first? I?ve verified the upgrade has fixed this issue so thanks again. However I?ve noticed that stopping SMB doesn?t trigger an IP address failover. I think it should. mmces node suspend (or rebooting, or mmces address move, or etc..) seems to trigger the failover. Richard From: gpfsug-discuss-bounces at spectrumscale.org [mailto:gpfsug-discuss-bounces at spectrumscale.org] On Behalf Of Danny Alexander Calderon Rodriguez Sent: 27 August 2016 13:53 To: gpfsug-discuss at spectrumscale.org Subject: Re: [gpfsug-discuss] Cannot stop SMB... stop NFS first? Hi Richard This is fixed in release 4.2.1, if you cant upgrade now, you can fix this manuallly Just do this. edit file /usr/lpp/mmfs/lib/mmcesmon/SMBService.py Change if authType == 'ad' and not nodeState.nfsStopped: to nfsEnabled = utils.isProtocolEnabled("NFS", self.logger) if authType == 'ad' and not nodeState.nfsStopped and nfsEnabled: You need to stop the gpfs service in each node where you apply the change " after change the lines please use tap key" Enviado desde mi iPhone El 27/08/2016, a las 6:00 a.m., gpfsug-discuss-request at spectrumscale.org escribi?: Send gpfsug-discuss mailing list submissions to gpfsug-discuss at spectrumscale.org To subscribe or unsubscribe via the World Wide Web, visit http://gpfsug.org/mailman/listinfo/gpfsug-discuss or, via email, send a message with subject or body 'help' to gpfsug-discuss-request at spectrumscale.org You can reach the person managing the list at gpfsug-discuss-owner at spectrumscale.org When replying, please edit your Subject line so it is more specific than "Re: Contents of gpfsug-discuss digest..." Today's Topics: 1. Re: Cannot stop SMB... stop NFS first?(Christof Schmitt) 2. Re: CES and mmuserauth command (Christof Schmitt) ---------------------------------------------------------------------- Message: 1 Date: Fri, 26 Aug 2016 12:29:31 -0400 From: "Christof Schmitt" > To: gpfsug main discussion list > Subject: Re: [gpfsug-discuss] Cannot stop SMB... stop NFS first? Message-ID: > Content-Type: text/plain; charset="UTF-8" That would be the case when Active Directory is configured for authentication. In that case the SMB service includes two aspects: One is the actual SMB file server, and the second one is the service for the Active Directory integration. Since NFS depends on authentication and id mapping services, it requires SMB to be running. Regards, Christof Schmitt || IBM || Spectrum Scale Development || Tucson, AZ christof.schmitt at us.ibm.com || +1-520-799-2469 (T/L: 321-2469) From: "Sobey, Richard A" > To: "'gpfsug-discuss at spectrumscale.org'" > Date: 08/26/2016 04:48 AM Subject: [gpfsug-discuss] Cannot stop SMB... stop NFS first? Sent by: gpfsug-discuss-bounces at spectrumscale.org Sorry all, prepare for a deluge of emails like this, hopefully it?ll help other people implementing CES in the future. I?m trying to stop SMB on a node, but getting the following output: [root at cesnode ~]# mmces service stop smb smb: Request denied. Please stop NFS first [root at cesnode ~]# mmces service list Enabled services: SMB SMB is running As you can see there is no way to stop NFS when it?s not running but it seems to be blocking me. 
It?s happening on all the nodes in the cluster. SS version is 4.2.0 running on a fully up to date RHEL 7.1 server. Richard_______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss ------------------------------ Message: 2 Date: Fri, 26 Aug 2016 12:29:31 -0400 From: "Christof Schmitt" > To: gpfsug main discussion list > Subject: Re: [gpfsug-discuss] CES and mmuserauth command Message-ID: > Content-Type: text/plain; charset="ISO-2022-JP" The --user-name option applies to both, AD and LDAP authentication. In the LDAP case, this information is correct. I will try to get some clarification added for the AD case. The same applies to the information shown in "service list". There is a common field that holds the information and the parameter from the initial "service create" is stored there. The meaning is different for AD and LDAP: For LDAP it is the username being used to access the LDAP server, while in the AD case it was only the user initially used until the machine account was created. Regards, Christof Schmitt || IBM || Spectrum Scale Development || Tucson, AZ christof.schmitt at us.ibm.com || +1-520-799-2469 (T/L: 321-2469) From: Jan-Frode Myklebust > To: gpfsug main discussion list > Date: 08/26/2016 05:59 AM Subject: Re: [gpfsug-discuss] CES and mmuserauth command Sent by: gpfsug-discuss-bounces at spectrumscale.org On Fri, Aug 26, 2016 at 1:49 AM, Christof Schmitt < christof.schmitt at us.ibm.com> wrote: When joinging the AD domain, --user-name, --password and --server are only used to initially identify and logon to the AD and to create the machine account for the cluster. Once that is done, that information is no longer used, and e.g. the account from --user-name could be deleted, the password changed or the specified DC could be removed from the domain (as long as other DCs are remaining). That was my initial understanding of the --user-name, but when reading the man-page I get the impression that it's also used to do connect to AD to do user and group lookups: ------------------------------------------------------------------------------------------------------ ??user?name userName Specifies the user name to be used to perform operations against the authentication server. The specified user name must have sufficient permissions to read user and group attributes from the authentication server. ------------------------------------------------------------------------------------------------------- Also it's strange that "mmuserauth service list" would list the USER_NAME if it was only somthing that was used at configuration time..? -jf_______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss ------------------------------ _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss End of gpfsug-discuss Digest, Vol 55, Issue 44 ********************************************** From r.sobey at imperial.ac.uk Fri Sep 2 16:23:28 2016 From: r.sobey at imperial.ac.uk (Sobey, Richard A) Date: Fri, 2 Sep 2016 15:23:28 +0000 Subject: [gpfsug-discuss] Cannot stop SMB... stop NFS first? 
In-Reply-To: References: , Message-ID: A fair point, but since we're not running NFS, a failure of the only other service [SMB], whether it stops through user input or some other means, should cause the node to go unhealthy (in CTDB parlance) and trigger a failover. That would be my preference. Otoh if you were running NFS and SMB and one of those services crashed, do you still want a node in the cluster that could potentially respond and fails to do so? I guess it's a question for each organisation to answer themselves. -----Original Message----- From: gpfsug-discuss-bounces at spectrumscale.org [mailto:gpfsug-discuss-bounces at spectrumscale.org] On Behalf Of Simon Thompson (Research Computing - IT Services) Sent: 02 September 2016 16:16 To: gpfsug main discussion list Subject: Re: [gpfsug-discuss] Cannot stop SMB... stop NFS first? Should it? If you were running nfs and smb, would you necessarily want to fail the ip over? Simon ________________________________________ From: gpfsug-discuss-bounces at spectrumscale.org [gpfsug-discuss-bounces at spectrumscale.org] on behalf of Sobey, Richard A [r.sobey at imperial.ac.uk] Sent: 02 September 2016 16:10 To: gpfsug main discussion list Subject: Re: [gpfsug-discuss] Cannot stop SMB... stop NFS first? I've verified the upgrade has fixed this issue so thanks again. However I've noticed that stopping SMB doesn't trigger an IP address failover. I think it should. mmces node suspend (or rebooting, or mmces address move, or etc..) seems to trigger the failover. Richard From: gpfsug-discuss-bounces at spectrumscale.org [mailto:gpfsug-discuss-bounces at spectrumscale.org] On Behalf Of Danny Alexander Calderon Rodriguez Sent: 27 August 2016 13:53 To: gpfsug-discuss at spectrumscale.org Subject: Re: [gpfsug-discuss] Cannot stop SMB... stop NFS first? Hi Richard This is fixed in release 4.2.1, if you cant upgrade now, you can fix this manuallly Just do this. edit file /usr/lpp/mmfs/lib/mmcesmon/SMBService.py Change if authType == 'ad' and not nodeState.nfsStopped: to nfsEnabled = utils.isProtocolEnabled("NFS", self.logger) if authType == 'ad' and not nodeState.nfsStopped and nfsEnabled: You need to stop the gpfs service in each node where you apply the change " after change the lines please use tap key" Enviado desde mi iPhone El 27/08/2016, a las 6:00 a.m., gpfsug-discuss-request at spectrumscale.org escribi?: Send gpfsug-discuss mailing list submissions to gpfsug-discuss at spectrumscale.org To subscribe or unsubscribe via the World Wide Web, visit http://gpfsug.org/mailman/listinfo/gpfsug-discuss or, via email, send a message with subject or body 'help' to gpfsug-discuss-request at spectrumscale.org You can reach the person managing the list at gpfsug-discuss-owner at spectrumscale.org When replying, please edit your Subject line so it is more specific than "Re: Contents of gpfsug-discuss digest..." Today's Topics: 1. Re: Cannot stop SMB... stop NFS first?(Christof Schmitt) 2. Re: CES and mmuserauth command (Christof Schmitt) ---------------------------------------------------------------------- Message: 1 Date: Fri, 26 Aug 2016 12:29:31 -0400 From: "Christof Schmitt" > To: gpfsug main discussion list > Subject: Re: [gpfsug-discuss] Cannot stop SMB... stop NFS first? Message-ID: > Content-Type: text/plain; charset="UTF-8" That would be the case when Active Directory is configured for authentication. 
In that case the SMB service includes two aspects: One is the actual SMB file server, and the second one is the service for the Active Directory integration. Since NFS depends on authentication and id mapping services, it requires SMB to be running. Regards, Christof Schmitt || IBM || Spectrum Scale Development || Tucson, AZ christof.schmitt at us.ibm.com || +1-520-799-2469 (T/L: 321-2469) From: "Sobey, Richard A" > To: "'gpfsug-discuss at spectrumscale.org'" > Date: 08/26/2016 04:48 AM Subject: [gpfsug-discuss] Cannot stop SMB... stop NFS first? Sent by: gpfsug-discuss-bounces at spectrumscale.org Sorry all, prepare for a deluge of emails like this, hopefully it?ll help other people implementing CES in the future. I?m trying to stop SMB on a node, but getting the following output: [root at cesnode ~]# mmces service stop smb smb: Request denied. Please stop NFS first [root at cesnode ~]# mmces service list Enabled services: SMB SMB is running As you can see there is no way to stop NFS when it?s not running but it seems to be blocking me. It?s happening on all the nodes in the cluster. SS version is 4.2.0 running on a fully up to date RHEL 7.1 server. Richard_______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss ------------------------------ Message: 2 Date: Fri, 26 Aug 2016 12:29:31 -0400 From: "Christof Schmitt" > To: gpfsug main discussion list > Subject: Re: [gpfsug-discuss] CES and mmuserauth command Message-ID: > Content-Type: text/plain; charset="ISO-2022-JP" The --user-name option applies to both, AD and LDAP authentication. In the LDAP case, this information is correct. I will try to get some clarification added for the AD case. The same applies to the information shown in "service list". There is a common field that holds the information and the parameter from the initial "service create" is stored there. The meaning is different for AD and LDAP: For LDAP it is the username being used to access the LDAP server, while in the AD case it was only the user initially used until the machine account was created. Regards, Christof Schmitt || IBM || Spectrum Scale Development || Tucson, AZ christof.schmitt at us.ibm.com || +1-520-799-2469 (T/L: 321-2469) From: Jan-Frode Myklebust > To: gpfsug main discussion list > Date: 08/26/2016 05:59 AM Subject: Re: [gpfsug-discuss] CES and mmuserauth command Sent by: gpfsug-discuss-bounces at spectrumscale.org On Fri, Aug 26, 2016 at 1:49 AM, Christof Schmitt < christof.schmitt at us.ibm.com> wrote: When joinging the AD domain, --user-name, --password and --server are only used to initially identify and logon to the AD and to create the machine account for the cluster. Once that is done, that information is no longer used, and e.g. the account from --user-name could be deleted, the password changed or the specified DC could be removed from the domain (as long as other DCs are remaining). That was my initial understanding of the --user-name, but when reading the man-page I get the impression that it's also used to do connect to AD to do user and group lookups: ------------------------------------------------------------------------------------------------------ ??user?name userName Specifies the user name to be used to perform operations against the authentication server. The specified user name must have sufficient permissions to read user and group attributes from the authentication server. 
------------------------------------------------------------------------------------------------------- Also it's strange that "mmuserauth service list" would list the USER_NAME if it was only somthing that was used at configuration time..? -jf_______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss ------------------------------ _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss End of gpfsug-discuss Digest, Vol 55, Issue 44 ********************************************** _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss From ulmer at ulmer.org Fri Sep 2 17:02:44 2016 From: ulmer at ulmer.org (Stephen Ulmer) Date: Fri, 2 Sep 2016 12:02:44 -0400 Subject: [gpfsug-discuss] Cannot stop SMB... stop NFS first? In-Reply-To: References: Message-ID: I think that stopping SMB is an explicitly different assertion than suspending the node, et cetera. When you ask the service to stop, it should stop -- not start a game of whack-a-mole. If you wanted to move the service there are other other ways. If it fails, clearly it the IP address should move. Liberty, -- Stephen > On Sep 2, 2016, at 11:23 AM, Sobey, Richard A wrote: > > A fair point, but since we're not running NFS, a failure of the only other service [SMB], whether it stops through user input or some other means, should cause the node to go unhealthy (in CTDB parlance) and trigger a failover. That would be my preference. > > Otoh if you were running NFS and SMB and one of those services crashed, do you still want a node in the cluster that could potentially respond and fails to do so? I guess it's a question for each organisation to answer themselves. > > -----Original Message----- > From: gpfsug-discuss-bounces at spectrumscale.org [mailto:gpfsug-discuss-bounces at spectrumscale.org] On Behalf Of Simon Thompson (Research Computing - IT Services) > Sent: 02 September 2016 16:16 > To: gpfsug main discussion list > Subject: Re: [gpfsug-discuss] Cannot stop SMB... stop NFS first? > > > Should it? > > If you were running nfs and smb, would you necessarily want to fail the ip over? > > Simon > ________________________________________ > From: gpfsug-discuss-bounces at spectrumscale.org [gpfsug-discuss-bounces at spectrumscale.org] on behalf of Sobey, Richard A [r.sobey at imperial.ac.uk] > Sent: 02 September 2016 16:10 > To: gpfsug main discussion list > Subject: Re: [gpfsug-discuss] Cannot stop SMB... stop NFS first? > > I've verified the upgrade has fixed this issue so thanks again. > > However I've noticed that stopping SMB doesn't trigger an IP address failover. I think it should. mmces node suspend (or rebooting, or mmces address move, or etc..) seems to trigger the failover. > > Richard > > From: gpfsug-discuss-bounces at spectrumscale.org [mailto:gpfsug-discuss-bounces at spectrumscale.org] On Behalf Of Danny Alexander Calderon Rodriguez > Sent: 27 August 2016 13:53 > To: gpfsug-discuss at spectrumscale.org > Subject: Re: [gpfsug-discuss] Cannot stop SMB... stop NFS first? > > Hi Richard > > This is fixed in release 4.2.1, if you cant upgrade now, you can fix this manuallly > > > Just do this. 
> > edit file /usr/lpp/mmfs/lib/mmcesmon/SMBService.py > > > > Change > > if authType == 'ad' and not nodeState.nfsStopped: > > to > > > > nfsEnabled = utils.isProtocolEnabled("NFS", self.logger) > if authType == 'ad' and not nodeState.nfsStopped and nfsEnabled: > > > You need to stop the gpfs service in each node where you apply the change > > > " after change the lines please use tap key" > > > > Enviado desde mi iPhone > > El 27/08/2016, a las 6:00 a.m., gpfsug-discuss-request at spectrumscale.org escribi?: > Send gpfsug-discuss mailing list submissions to > gpfsug-discuss at spectrumscale.org > > To subscribe or unsubscribe via the World Wide Web, visit > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > or, via email, send a message with subject or body 'help' to > gpfsug-discuss-request at spectrumscale.org > > You can reach the person managing the list at > gpfsug-discuss-owner at spectrumscale.org > > When replying, please edit your Subject line so it is more specific than "Re: Contents of gpfsug-discuss digest..." > > > Today's Topics: > > 1. Re: Cannot stop SMB... stop NFS first?(Christof Schmitt) > 2. Re: CES and mmuserauth command (Christof Schmitt) > > > ---------------------------------------------------------------------- > > Message: 1 > Date: Fri, 26 Aug 2016 12:29:31 -0400 > From: "Christof Schmitt" > > To: gpfsug main discussion list > > Subject: Re: [gpfsug-discuss] Cannot stop SMB... stop NFS first? > Message-ID: > > > > Content-Type: text/plain; charset="UTF-8" > > That would be the case when Active Directory is configured for authentication. In that case the SMB service includes two aspects: One is the actual SMB file server, and the second one is the service for the Active Directory integration. Since NFS depends on authentication and id mapping services, it requires SMB to be running. > > Regards, > > Christof Schmitt || IBM || Spectrum Scale Development || Tucson, AZ > christof.schmitt at us.ibm.com || +1-520-799-2469 (T/L: 321-2469) > > > > From: "Sobey, Richard A" > > To: "'gpfsug-discuss at spectrumscale.org'" > > > Date: 08/26/2016 04:48 AM > Subject: [gpfsug-discuss] Cannot stop SMB... stop NFS first? > Sent by: gpfsug-discuss-bounces at spectrumscale.org > > > > Sorry all, prepare for a deluge of emails like this, hopefully it?ll help other people implementing CES in the future. > > I?m trying to stop SMB on a node, but getting the following output: > > [root at cesnode ~]# mmces service stop smb > smb: Request denied. Please stop NFS first > > [root at cesnode ~]# mmces service list > Enabled services: SMB > SMB is running > > As you can see there is no way to stop NFS when it?s not running but it seems to be blocking me. It?s happening on all the nodes in the cluster. > > SS version is 4.2.0 running on a fully up to date RHEL 7.1 server. > > Richard_______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > > > > > ------------------------------ > > Message: 2 > Date: Fri, 26 Aug 2016 12:29:31 -0400 > From: "Christof Schmitt" > > To: gpfsug main discussion list > > Subject: Re: [gpfsug-discuss] CES and mmuserauth command > Message-ID: > > > > Content-Type: text/plain; charset="ISO-2022-JP" > > The --user-name option applies to both, AD and LDAP authentication. In the LDAP case, this information is correct. I will try to get some clarification added for the AD case. > > The same applies to the information shown in "service list". 
There is a common field that holds the information and the parameter from the initial "service create" is stored there. The meaning is different for AD and > LDAP: For LDAP it is the username being used to access the LDAP server, while in the AD case it was only the user initially used until the machine account was created. > > Regards, > > Christof Schmitt || IBM || Spectrum Scale Development || Tucson, AZ > christof.schmitt at us.ibm.com || +1-520-799-2469 (T/L: 321-2469) > > > > From: Jan-Frode Myklebust > > To: gpfsug main discussion list > > Date: 08/26/2016 05:59 AM > Subject: Re: [gpfsug-discuss] CES and mmuserauth command > Sent by: gpfsug-discuss-bounces at spectrumscale.org > > > > > On Fri, Aug 26, 2016 at 1:49 AM, Christof Schmitt < christof.schmitt at us.ibm.com> wrote: > > When joinging the AD domain, --user-name, --password and --server are only used to initially identify and logon to the AD and to create the machine account for the cluster. Once that is done, that information is no longer used, and e.g. the account from --user-name could be deleted, the password changed or the specified DC could be removed from the domain (as long as other DCs are remaining). > > > That was my initial understanding of the --user-name, but when reading the man-page I get the impression that it's also used to do connect to AD to do user and group lookups: > > ------------------------------------------------------------------------------------------------------ > ??user?name userName > Specifies the user name to be used to perform operations > against the authentication server. The specified user > name must have sufficient permissions to read user and > group attributes from the authentication server. > ------------------------------------------------------------------------------------------------------- > > Also it's strange that "mmuserauth service list" would list the USER_NAME if it was only somthing that was used at configuration time..? > > > > -jf_______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > > > > > > ------------------------------ > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > > End of gpfsug-discuss Digest, Vol 55, Issue 44 > ********************************************** > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss From laurence at qsplace.co.uk Fri Sep 2 18:54:02 2016 From: laurence at qsplace.co.uk (Laurence Horrors-Barlow) Date: Fri, 2 Sep 2016 19:54:02 +0200 Subject: [gpfsug-discuss] Cannot stop SMB... stop NFS first? In-Reply-To: References: Message-ID: <721250E5-767B-4C44-A9E1-5DD255FD4F7D@qsplace.co.uk> I believe the services auto restart on a crash (or kill), a change I noticed between 4.1.1 and 4.2 hence no IP fail over. Suspending a node to force a fail over is possible the most sensible approach. -- Lauz Sent from my iPad > On 2 Sep 2016, at 18:02, Stephen Ulmer wrote: > > I think that stopping SMB is an explicitly different assertion than suspending the node, et cetera. 
When you ask the service to stop, it should stop -- not start a game of whack-a-mole. > > If you wanted to move the service there are other other ways. If it fails, clearly it the IP address should move. > > Liberty, > > -- > Stephen > > > >> On Sep 2, 2016, at 11:23 AM, Sobey, Richard A wrote: >> >> A fair point, but since we're not running NFS, a failure of the only other service [SMB], whether it stops through user input or some other means, should cause the node to go unhealthy (in CTDB parlance) and trigger a failover. That would be my preference. >> >> Otoh if you were running NFS and SMB and one of those services crashed, do you still want a node in the cluster that could potentially respond and fails to do so? I guess it's a question for each organisation to answer themselves. >> >> -----Original Message----- >> From: gpfsug-discuss-bounces at spectrumscale.org [mailto:gpfsug-discuss-bounces at spectrumscale.org] On Behalf Of Simon Thompson (Research Computing - IT Services) >> Sent: 02 September 2016 16:16 >> To: gpfsug main discussion list >> Subject: Re: [gpfsug-discuss] Cannot stop SMB... stop NFS first? >> >> >> Should it? >> >> If you were running nfs and smb, would you necessarily want to fail the ip over? >> >> Simon >> ________________________________________ >> From: gpfsug-discuss-bounces at spectrumscale.org [gpfsug-discuss-bounces at spectrumscale.org] on behalf of Sobey, Richard A [r.sobey at imperial.ac.uk] >> Sent: 02 September 2016 16:10 >> To: gpfsug main discussion list >> Subject: Re: [gpfsug-discuss] Cannot stop SMB... stop NFS first? >> >> I've verified the upgrade has fixed this issue so thanks again. >> >> However I've noticed that stopping SMB doesn't trigger an IP address failover. I think it should. mmces node suspend (or rebooting, or mmces address move, or etc..) seems to trigger the failover. >> >> Richard >> >> From: gpfsug-discuss-bounces at spectrumscale.org [mailto:gpfsug-discuss-bounces at spectrumscale.org] On Behalf Of Danny Alexander Calderon Rodriguez >> Sent: 27 August 2016 13:53 >> To: gpfsug-discuss at spectrumscale.org >> Subject: Re: [gpfsug-discuss] Cannot stop SMB... stop NFS first? >> >> Hi Richard >> >> This is fixed in release 4.2.1, if you cant upgrade now, you can fix this manuallly >> >> >> Just do this. >> >> edit file /usr/lpp/mmfs/lib/mmcesmon/SMBService.py >> >> >> >> Change >> >> if authType == 'ad' and not nodeState.nfsStopped: >> >> to >> >> >> >> nfsEnabled = utils.isProtocolEnabled("NFS", self.logger) >> if authType == 'ad' and not nodeState.nfsStopped and nfsEnabled: >> >> >> You need to stop the gpfs service in each node where you apply the change >> >> >> " after change the lines please use tap key" >> >> >> >> Enviado desde mi iPhone >> >> El 27/08/2016, a las 6:00 a.m., gpfsug-discuss-request at spectrumscale.org escribi?: >> Send gpfsug-discuss mailing list submissions to >> gpfsug-discuss at spectrumscale.org >> >> To subscribe or unsubscribe via the World Wide Web, visit >> http://gpfsug.org/mailman/listinfo/gpfsug-discuss >> or, via email, send a message with subject or body 'help' to >> gpfsug-discuss-request at spectrumscale.org >> >> You can reach the person managing the list at >> gpfsug-discuss-owner at spectrumscale.org >> >> When replying, please edit your Subject line so it is more specific than "Re: Contents of gpfsug-discuss digest..." >> >> >> Today's Topics: >> >> 1. Re: Cannot stop SMB... stop NFS first?(Christof Schmitt) >> 2. 
Re: CES and mmuserauth command (Christof Schmitt) >> >> >> ---------------------------------------------------------------------- >> >> Message: 1 >> Date: Fri, 26 Aug 2016 12:29:31 -0400 >> From: "Christof Schmitt" > >> To: gpfsug main discussion list > >> Subject: Re: [gpfsug-discuss] Cannot stop SMB... stop NFS first? >> Message-ID: >> > >> >> Content-Type: text/plain; charset="UTF-8" >> >> That would be the case when Active Directory is configured for authentication. In that case the SMB service includes two aspects: One is the actual SMB file server, and the second one is the service for the Active Directory integration. Since NFS depends on authentication and id mapping services, it requires SMB to be running. >> >> Regards, >> >> Christof Schmitt || IBM || Spectrum Scale Development || Tucson, AZ >> christof.schmitt at us.ibm.com || +1-520-799-2469 (T/L: 321-2469) >> >> >> >> From: "Sobey, Richard A" > >> To: "'gpfsug-discuss at spectrumscale.org'" >> > >> Date: 08/26/2016 04:48 AM >> Subject: [gpfsug-discuss] Cannot stop SMB... stop NFS first? >> Sent by: gpfsug-discuss-bounces at spectrumscale.org >> >> >> >> Sorry all, prepare for a deluge of emails like this, hopefully it?ll help other people implementing CES in the future. >> >> I?m trying to stop SMB on a node, but getting the following output: >> >> [root at cesnode ~]# mmces service stop smb >> smb: Request denied. Please stop NFS first >> >> [root at cesnode ~]# mmces service list >> Enabled services: SMB >> SMB is running >> >> As you can see there is no way to stop NFS when it?s not running but it seems to be blocking me. It?s happening on all the nodes in the cluster. >> >> SS version is 4.2.0 running on a fully up to date RHEL 7.1 server. >> >> Richard_______________________________________________ >> gpfsug-discuss mailing list >> gpfsug-discuss at spectrumscale.org >> http://gpfsug.org/mailman/listinfo/gpfsug-discuss >> >> >> >> >> >> ------------------------------ >> >> Message: 2 >> Date: Fri, 26 Aug 2016 12:29:31 -0400 >> From: "Christof Schmitt" > >> To: gpfsug main discussion list > >> Subject: Re: [gpfsug-discuss] CES and mmuserauth command >> Message-ID: >> > >> >> Content-Type: text/plain; charset="ISO-2022-JP" >> >> The --user-name option applies to both, AD and LDAP authentication. In the LDAP case, this information is correct. I will try to get some clarification added for the AD case. >> >> The same applies to the information shown in "service list". There is a common field that holds the information and the parameter from the initial "service create" is stored there. The meaning is different for AD and >> LDAP: For LDAP it is the username being used to access the LDAP server, while in the AD case it was only the user initially used until the machine account was created. >> >> Regards, >> >> Christof Schmitt || IBM || Spectrum Scale Development || Tucson, AZ >> christof.schmitt at us.ibm.com || +1-520-799-2469 (T/L: 321-2469) >> >> >> >> From: Jan-Frode Myklebust > >> To: gpfsug main discussion list > >> Date: 08/26/2016 05:59 AM >> Subject: Re: [gpfsug-discuss] CES and mmuserauth command >> Sent by: gpfsug-discuss-bounces at spectrumscale.org >> >> >> >> >> On Fri, Aug 26, 2016 at 1:49 AM, Christof Schmitt < christof.schmitt at us.ibm.com> wrote: >> >> When joinging the AD domain, --user-name, --password and --server are only used to initially identify and logon to the AD and to create the machine account for the cluster. Once that is done, that information is no longer used, and e.g. 
the account from --user-name could be deleted, the password changed or the specified DC could be removed from the domain (as long as other DCs are remaining). >> >> >> That was my initial understanding of the --user-name, but when reading the man-page I get the impression that it's also used to do connect to AD to do user and group lookups: >> >> ------------------------------------------------------------------------------------------------------ >> ??user?name userName >> Specifies the user name to be used to perform operations >> against the authentication server. The specified user >> name must have sufficient permissions to read user and >> group attributes from the authentication server. >> ------------------------------------------------------------------------------------------------------- >> >> Also it's strange that "mmuserauth service list" would list the USER_NAME if it was only somthing that was used at configuration time..? >> >> >> >> -jf_______________________________________________ >> gpfsug-discuss mailing list >> gpfsug-discuss at spectrumscale.org >> http://gpfsug.org/mailman/listinfo/gpfsug-discuss >> >> >> >> >> >> >> ------------------------------ >> >> _______________________________________________ >> gpfsug-discuss mailing list >> gpfsug-discuss at spectrumscale.org >> http://gpfsug.org/mailman/listinfo/gpfsug-discuss >> >> >> End of gpfsug-discuss Digest, Vol 55, Issue 44 >> ********************************************** >> >> _______________________________________________ >> gpfsug-discuss mailing list >> gpfsug-discuss at spectrumscale.org >> http://gpfsug.org/mailman/listinfo/gpfsug-discuss >> _______________________________________________ >> gpfsug-discuss mailing list >> gpfsug-discuss at spectrumscale.org >> http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > From christof.schmitt at us.ibm.com Fri Sep 2 19:20:45 2016 From: christof.schmitt at us.ibm.com (Christof Schmitt) Date: Fri, 2 Sep 2016 11:20:45 -0700 Subject: [gpfsug-discuss] CES and mmuserauth command In-Reply-To: References: Message-ID: After looking into this again, the source of confusion is probably from the fact that there are three different authentication schemes present here: When configuring a LDAP server for file or object authentication, then the specified server, user and password are used during normal operations for querying user data. The same applies for configuring object authentication with AD; AD is here treated as a LDAP server. Configuring AD for file authentication is different in that during the "mmuserauth service create", the machine account is created, and then that account is used to connect to a DC that is chosen from the DCs discovered through DNS and not necessarily the one used for the initial configuration. I submitted an internal request to explain this better in the mmuserauth manpage. Regards, Christof Schmitt || IBM || Spectrum Scale Development || Tucson, AZ christof.schmitt at us.ibm.com || +1-520-799-2469 (T/L: 321-2469) From: Christof Schmitt/Tucson/IBM at IBMUS To: gpfsug main discussion list Date: 08/26/2016 09:30 AM Subject: Re: [gpfsug-discuss] CES and mmuserauth command Sent by: gpfsug-discuss-bounces at spectrumscale.org The --user-name option applies to both, AD and LDAP authentication. In the LDAP case, this information is correct. 
I will try to get some clarification added for the AD case. The same applies to the information shown in "service list". There is a common field that holds the information and the parameter from the initial "service create" is stored there. The meaning is different for AD and LDAP: For LDAP it is the username being used to access the LDAP server, while in the AD case it was only the user initially used until the machine account was created. Regards, Christof Schmitt || IBM || Spectrum Scale Development || Tucson, AZ christof.schmitt at us.ibm.com || +1-520-799-2469 (T/L: 321-2469) From: Jan-Frode Myklebust To: gpfsug main discussion list Date: 08/26/2016 05:59 AM Subject: Re: [gpfsug-discuss] CES and mmuserauth command Sent by: gpfsug-discuss-bounces at spectrumscale.org On Fri, Aug 26, 2016 at 1:49 AM, Christof Schmitt < christof.schmitt at us.ibm.com> wrote: When joinging the AD domain, --user-name, --password and --server are only used to initially identify and logon to the AD and to create the machine account for the cluster. Once that is done, that information is no longer used, and e.g. the account from --user-name could be deleted, the password changed or the specified DC could be removed from the domain (as long as other DCs are remaining). That was my initial understanding of the --user-name, but when reading the man-page I get the impression that it's also used to do connect to AD to do user and group lookups: ------------------------------------------------------------------------------------------------------ ??user?name userName Specifies the user name to be used to perform operations against the authentication server. The specified user name must have sufficient permissions to read user and group attributes from the authentication server. ------------------------------------------------------------------------------------------------------- Also it's strange that "mmuserauth service list" would list the USER_NAME if it was only somthing that was used at configuration time..? -jf_______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss From r.sobey at imperial.ac.uk Fri Sep 2 22:02:03 2016 From: r.sobey at imperial.ac.uk (Sobey, Richard A) Date: Fri, 2 Sep 2016 21:02:03 +0000 Subject: [gpfsug-discuss] Cannot stop SMB... stop NFS first? In-Reply-To: References: , Message-ID: That makes more sense putting it that way. Cheers Richard Get Outlook for Android On Fri, Sep 2, 2016 at 5:04 PM +0100, "Stephen Ulmer" > wrote: I think that stopping SMB is an explicitly different assertion than suspending the node, et cetera. When you ask the service to stop, it should stop -- not start a game of whack-a-mole. If you wanted to move the service there are other other ways. If it fails, clearly it the IP address should move. Liberty, -- Stephen > On Sep 2, 2016, at 11:23 AM, Sobey, Richard A wrote: > > A fair point, but since we're not running NFS, a failure of the only other service [SMB], whether it stops through user input or some other means, should cause the node to go unhealthy (in CTDB parlance) and trigger a failover. That would be my preference. > > Otoh if you were running NFS and SMB and one of those services crashed, do you still want a node in the cluster that could potentially respond and fails to do so? 
I guess it's a question for each organisation to answer themselves. > > -----Original Message----- > From: gpfsug-discuss-bounces at spectrumscale.org [mailto:gpfsug-discuss-bounces at spectrumscale.org] On Behalf Of Simon Thompson (Research Computing - IT Services) > Sent: 02 September 2016 16:16 > To: gpfsug main discussion list > Subject: Re: [gpfsug-discuss] Cannot stop SMB... stop NFS first? > > > Should it? > > If you were running nfs and smb, would you necessarily want to fail the ip over? > > Simon > ________________________________________ > From: gpfsug-discuss-bounces at spectrumscale.org [gpfsug-discuss-bounces at spectrumscale.org] on behalf of Sobey, Richard A [r.sobey at imperial.ac.uk] > Sent: 02 September 2016 16:10 > To: gpfsug main discussion list > Subject: Re: [gpfsug-discuss] Cannot stop SMB... stop NFS first? > > I've verified the upgrade has fixed this issue so thanks again. > > However I've noticed that stopping SMB doesn't trigger an IP address failover. I think it should. mmces node suspend (or rebooting, or mmces address move, or etc..) seems to trigger the failover. > > Richard > > From: gpfsug-discuss-bounces at spectrumscale.org [mailto:gpfsug-discuss-bounces at spectrumscale.org] On Behalf Of Danny Alexander Calderon Rodriguez > Sent: 27 August 2016 13:53 > To: gpfsug-discuss at spectrumscale.org > Subject: Re: [gpfsug-discuss] Cannot stop SMB... stop NFS first? > > Hi Richard > > This is fixed in release 4.2.1, if you cant upgrade now, you can fix this manuallly > > > Just do this. > > edit file /usr/lpp/mmfs/lib/mmcesmon/SMBService.py > > > > Change > > if authType == 'ad' and not nodeState.nfsStopped: > > to > > > > nfsEnabled = utils.isProtocolEnabled("NFS", self.logger) > if authType == 'ad' and not nodeState.nfsStopped and nfsEnabled: > > > You need to stop the gpfs service in each node where you apply the change > > > " after change the lines please use tap key" > > > > Enviado desde mi iPhone > > El 27/08/2016, a las 6:00 a.m., gpfsug-discuss-request at spectrumscale.org escribi?: > Send gpfsug-discuss mailing list submissions to > gpfsug-discuss at spectrumscale.org > > To subscribe or unsubscribe via the World Wide Web, visit > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > or, via email, send a message with subject or body 'help' to > gpfsug-discuss-request at spectrumscale.org > > You can reach the person managing the list at > gpfsug-discuss-owner at spectrumscale.org > > When replying, please edit your Subject line so it is more specific than "Re: Contents of gpfsug-discuss digest..." > > > Today's Topics: > > 1. Re: Cannot stop SMB... stop NFS first?(Christof Schmitt) > 2. Re: CES and mmuserauth command (Christof Schmitt) > > > ---------------------------------------------------------------------- > > Message: 1 > Date: Fri, 26 Aug 2016 12:29:31 -0400 > From: "Christof Schmitt" > > To: gpfsug main discussion list > > Subject: Re: [gpfsug-discuss] Cannot stop SMB... stop NFS first? > Message-ID: > > > > Content-Type: text/plain; charset="UTF-8" > > That would be the case when Active Directory is configured for authentication. In that case the SMB service includes two aspects: One is the actual SMB file server, and the second one is the service for the Active Directory integration. Since NFS depends on authentication and id mapping services, it requires SMB to be running. 
> > Regards, > > Christof Schmitt || IBM || Spectrum Scale Development || Tucson, AZ > christof.schmitt at us.ibm.com || +1-520-799-2469 (T/L: 321-2469) > > > > From: "Sobey, Richard A" > > To: "'gpfsug-discuss at spectrumscale.org'" > > > Date: 08/26/2016 04:48 AM > Subject: [gpfsug-discuss] Cannot stop SMB... stop NFS first? > Sent by: gpfsug-discuss-bounces at spectrumscale.org > > > > Sorry all, prepare for a deluge of emails like this, hopefully it?ll help other people implementing CES in the future. > > I?m trying to stop SMB on a node, but getting the following output: > > [root at cesnode ~]# mmces service stop smb > smb: Request denied. Please stop NFS first > > [root at cesnode ~]# mmces service list > Enabled services: SMB > SMB is running > > As you can see there is no way to stop NFS when it?s not running but it seems to be blocking me. It?s happening on all the nodes in the cluster. > > SS version is 4.2.0 running on a fully up to date RHEL 7.1 server. > > Richard_______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > > > > > ------------------------------ > > Message: 2 > Date: Fri, 26 Aug 2016 12:29:31 -0400 > From: "Christof Schmitt" > > To: gpfsug main discussion list > > Subject: Re: [gpfsug-discuss] CES and mmuserauth command > Message-ID: > > > > Content-Type: text/plain; charset="ISO-2022-JP" > > The --user-name option applies to both, AD and LDAP authentication. In the LDAP case, this information is correct. I will try to get some clarification added for the AD case. > > The same applies to the information shown in "service list". There is a common field that holds the information and the parameter from the initial "service create" is stored there. The meaning is different for AD and > LDAP: For LDAP it is the username being used to access the LDAP server, while in the AD case it was only the user initially used until the machine account was created. > > Regards, > > Christof Schmitt || IBM || Spectrum Scale Development || Tucson, AZ > christof.schmitt at us.ibm.com || +1-520-799-2469 (T/L: 321-2469) > > > > From: Jan-Frode Myklebust > > To: gpfsug main discussion list > > Date: 08/26/2016 05:59 AM > Subject: Re: [gpfsug-discuss] CES and mmuserauth command > Sent by: gpfsug-discuss-bounces at spectrumscale.org > > > > > On Fri, Aug 26, 2016 at 1:49 AM, Christof Schmitt < christof.schmitt at us.ibm.com> wrote: > > When joinging the AD domain, --user-name, --password and --server are only used to initially identify and logon to the AD and to create the machine account for the cluster. Once that is done, that information is no longer used, and e.g. the account from --user-name could be deleted, the password changed or the specified DC could be removed from the domain (as long as other DCs are remaining). > > > That was my initial understanding of the --user-name, but when reading the man-page I get the impression that it's also used to do connect to AD to do user and group lookups: > > ------------------------------------------------------------------------------------------------------ > ??user?name userName > Specifies the user name to be used to perform operations > against the authentication server. The specified user > name must have sufficient permissions to read user and > group attributes from the authentication server. 
> ------------------------------------------------------------------------------------------------------- > > Also it's strange that "mmuserauth service list" would list the USER_NAME if it was only somthing that was used at configuration time..? > > > > -jf_______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > > > > > > ------------------------------ > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > > End of gpfsug-discuss Digest, Vol 55, Issue 44 > ********************************************** > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss -------------- next part -------------- An HTML attachment was scrubbed... URL: From bauer at cesnet.cz Mon Sep 5 14:30:54 2016 From: bauer at cesnet.cz (Miroslav Bauer) Date: Mon, 5 Sep 2016 15:30:54 +0200 Subject: [gpfsug-discuss] DMAPI - Unmigrate file to Regular state Message-ID: <9bfb2882-10de-5f2c-98c1-35ac2ac958a2@cesnet.cz> Hello, is there any way to recall a migrated file back to a regular state (other than renaming a file)? I would like to free some space on an external pool (TSM), that is being used by migrated files. And it would be desirable to prevent repeated backups of an already backed-up data (due to changed ctime/inode). I guess that you can acheive only premigrated state with dsmrecall tool (two copies of file data - one on GPFS pool and one on external pool). Maybe deleting 'dmapi.IBMPMig' xattr will do the trick but I don't think it's safe, nor clean :). Thank you in advance, -- Miroslav Bauer -------------- next part -------------- A non-text attachment was scrubbed... Name: smime.p7s Type: application/pkcs7-signature Size: 3716 bytes Desc: S/MIME Cryptographic Signature URL: From janfrode at tanso.net Mon Sep 5 14:51:44 2016 From: janfrode at tanso.net (Jan-Frode Myklebust) Date: Mon, 05 Sep 2016 13:51:44 +0000 Subject: [gpfsug-discuss] DMAPI - Unmigrate file to Regular state In-Reply-To: <9bfb2882-10de-5f2c-98c1-35ac2ac958a2@cesnet.cz> References: <9bfb2882-10de-5f2c-98c1-35ac2ac958a2@cesnet.cz> Message-ID: I believe what you're looking for is dsmrecall -RESident. Plus reconcile on tsm-server to free up the space. Ref: http://www.ibm.com/support/knowledgecenter/SSSR2R_7.1.2/com.ibm.itsm.hsmul.doc/r_cmd_dsmrecall.html -jf man. 5. sep. 2016 kl. 15.30 skrev Miroslav Bauer : > Hello, > > is there any way to recall a migrated file back to a regular state > (other than renaming a file)? I would like to free some space > on an external pool (TSM), that is being used by migrated files. > And it would be desirable to prevent repeated backups of an > already backed-up data (due to changed ctime/inode). > > I guess that you can acheive only premigrated state with dsmrecall tool > (two copies of file data - one on GPFS pool and one on external pool). > Maybe deleting 'dmapi.IBMPMig' xattr will do the trick but I don't think > it's safe, nor clean :). 
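For reference, the recall-to-resident flow suggested above looks roughly like this (a sketch with placeholder paths; the file-list option spelling is an assumption on my part, so check the dsmrecall documentation linked above for the exact syntax in your TSM for Space Management release):

# Recall a single migrated file back to resident state, printing per-file detail:
dsmrecall -Resident -Detail /gpfs/fs1/somedir/largefile.dat

# Or drive the recall from a list of paths (one per line); the -filelist option
# name is an assumption here -- confirm it against your dsmrecall version:
dsmrecall -Resident -Detail -filelist=/tmp/recall.list

# Afterwards, reconcile the file system against the TSM server so the space
# held by the now-resident files can be reclaimed:
dsmreconcile /gpfs/fs1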
> > Thank you in advance, > > -- > Miroslav Bauer > > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > -------------- next part -------------- An HTML attachment was scrubbed... URL: From bauer at cesnet.cz Mon Sep 5 15:13:42 2016 From: bauer at cesnet.cz (Miroslav Bauer) Date: Mon, 5 Sep 2016 16:13:42 +0200 Subject: [gpfsug-discuss] DMAPI - Unmigrate file to Regular state In-Reply-To: References: <9bfb2882-10de-5f2c-98c1-35ac2ac958a2@cesnet.cz> Message-ID: <55108548-0515-49c8-0e76-fca9b247d337@cesnet.cz> That's right, I must have totally overlooked that! Many thanks! :) -- Miroslav Bauer On 09/05/2016 03:51 PM, Jan-Frode Myklebust wrote: > I believe what you're looking for is dsmrecall -RESident. Plus > reconcile on tsm-server to free up the space. > > Ref: > > http://www.ibm.com/support/knowledgecenter/SSSR2R_7.1.2/com.ibm.itsm.hsmul.doc/r_cmd_dsmrecall.html > > > -jf > man. 5. sep. 2016 kl. 15.30 skrev Miroslav Bauer >: > > Hello, > > is there any way to recall a migrated file back to a regular state > (other than renaming a file)? I would like to free some space > on an external pool (TSM), that is being used by migrated files. > And it would be desirable to prevent repeated backups of an > already backed-up data (due to changed ctime/inode). > > I guess that you can acheive only premigrated state with dsmrecall > tool > (two copies of file data - one on GPFS pool and one on external pool). > Maybe deleting 'dmapi.IBMPMig' xattr will do the trick but I don't > think > it's safe, nor clean :). > > Thank you in advance, > > -- > Miroslav Bauer > > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: smime.p7s Type: application/pkcs7-signature Size: 3716 bytes Desc: S/MIME Cryptographic Signature URL: From mark.birmingham at stfc.ac.uk Mon Sep 5 15:27:29 2016 From: mark.birmingham at stfc.ac.uk (mark.birmingham at stfc.ac.uk) Date: Mon, 5 Sep 2016 14:27:29 +0000 Subject: [gpfsug-discuss] DMAPI - Unmigrate file to Regular state In-Reply-To: <55108548-0515-49c8-0e76-fca9b247d337@cesnet.cz> References: <9bfb2882-10de-5f2c-98c1-35ac2ac958a2@cesnet.cz> <55108548-0515-49c8-0e76-fca9b247d337@cesnet.cz> Message-ID: <47B8D67E32CC2D44A587CD18636BECC82BB3A610@exchmbx01> Yes, that's fine. Just submit the request through SBS. Mark Mark Birmingham Development Team Leader High Performance Systems Group STFC Daresbury Laboratory Phone: +44 (0)1925 603381 Email: mark.birmingham at stfc.ac.uk From: gpfsug-discuss-bounces at spectrumscale.org [mailto:gpfsug-discuss-bounces at spectrumscale.org] On Behalf Of Miroslav Bauer Sent: 05 September 2016 15:14 To: gpfsug main discussion list Subject: Re: [gpfsug-discuss] DMAPI - Unmigrate file to Regular state That's right, I must have totally overlooked that! Many thanks! :) -- Miroslav Bauer On 09/05/2016 03:51 PM, Jan-Frode Myklebust wrote: I believe what you're looking for is dsmrecall -RESident. Plus reconcile on tsm-server to free up the space. 
Ref: http://www.ibm.com/support/knowledgecenter/SSSR2R_7.1.2/com.ibm.itsm.hsmul.doc/r_cmd_dsmrecall.html -jf man. 5. sep. 2016 kl. 15.30 skrev Miroslav Bauer >: Hello, is there any way to recall a migrated file back to a regular state (other than renaming a file)? I would like to free some space on an external pool (TSM), that is being used by migrated files. And it would be desirable to prevent repeated backups of an already backed-up data (due to changed ctime/inode). I guess that you can acheive only premigrated state with dsmrecall tool (two copies of file data - one on GPFS pool and one on external pool). Maybe deleting 'dmapi.IBMPMig' xattr will do the trick but I don't think it's safe, nor clean :). Thank you in advance, -- Miroslav Bauer _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss -------------- next part -------------- An HTML attachment was scrubbed... URL: From mark.birmingham at stfc.ac.uk Mon Sep 5 15:30:53 2016 From: mark.birmingham at stfc.ac.uk (mark.birmingham at stfc.ac.uk) Date: Mon, 5 Sep 2016 14:30:53 +0000 Subject: [gpfsug-discuss] DMAPI - Unmigrate file to Regular state In-Reply-To: <47B8D67E32CC2D44A587CD18636BECC82BB3A610@exchmbx01> References: <9bfb2882-10de-5f2c-98c1-35ac2ac958a2@cesnet.cz> <55108548-0515-49c8-0e76-fca9b247d337@cesnet.cz> <47B8D67E32CC2D44A587CD18636BECC82BB3A610@exchmbx01> Message-ID: <47B8D67E32CC2D44A587CD18636BECC82BB3A62A@exchmbx01> Sorry All! Noob error - replied to the wrong email!!! Mark Mark Birmingham Development Team Leader High Performance Systems Group STFC Daresbury Laboratory Phone: +44 (0)1925 603381 Email: mark.birmingham at stfc.ac.uk From: gpfsug-discuss-bounces at spectrumscale.org [mailto:gpfsug-discuss-bounces at spectrumscale.org] On Behalf Of mark.birmingham at stfc.ac.uk Sent: 05 September 2016 15:27 To: gpfsug-discuss at spectrumscale.org Subject: Re: [gpfsug-discuss] DMAPI - Unmigrate file to Regular state Yes, that's fine. Just submit the request through SBS. Mark Mark Birmingham Development Team Leader High Performance Systems Group STFC Daresbury Laboratory Phone: +44 (0)1925 603381 Email: mark.birmingham at stfc.ac.uk From: gpfsug-discuss-bounces at spectrumscale.org [mailto:gpfsug-discuss-bounces at spectrumscale.org] On Behalf Of Miroslav Bauer Sent: 05 September 2016 15:14 To: gpfsug main discussion list Subject: Re: [gpfsug-discuss] DMAPI - Unmigrate file to Regular state That's right, I must have totally overlooked that! Many thanks! :) -- Miroslav Bauer On 09/05/2016 03:51 PM, Jan-Frode Myklebust wrote: I believe what you're looking for is dsmrecall -RESident. Plus reconcile on tsm-server to free up the space. Ref: http://www.ibm.com/support/knowledgecenter/SSSR2R_7.1.2/com.ibm.itsm.hsmul.doc/r_cmd_dsmrecall.html -jf man. 5. sep. 2016 kl. 15.30 skrev Miroslav Bauer >: Hello, is there any way to recall a migrated file back to a regular state (other than renaming a file)? I would like to free some space on an external pool (TSM), that is being used by migrated files. And it would be desirable to prevent repeated backups of an already backed-up data (due to changed ctime/inode). I guess that you can acheive only premigrated state with dsmrecall tool (two copies of file data - one on GPFS pool and one on external pool). 
Maybe deleting 'dmapi.IBMPMig' xattr will do the trick but I don't think it's safe, nor clean :). Thank you in advance, -- Miroslav Bauer _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss -------------- next part -------------- An HTML attachment was scrubbed... URL: From dominic.mueller at de.ibm.com Tue Sep 6 13:04:36 2016 From: dominic.mueller at de.ibm.com (Dominic Mueller-Wicke01) Date: Tue, 6 Sep 2016 14:04:36 +0200 Subject: [gpfsug-discuss] DMAPI - Unmigrate file to Regular state In-Reply-To: References: Message-ID: Hi Miroslav, please use the command: > dsmrecall -resident -detail or use it with file lists Greetings, Dominic. From: gpfsug-discuss-request at spectrumscale.org To: gpfsug-discuss at spectrumscale.org Date: 06.09.2016 13:00 Subject: gpfsug-discuss Digest, Vol 56, Issue 10 Sent by: gpfsug-discuss-bounces at spectrumscale.org Send gpfsug-discuss mailing list submissions to gpfsug-discuss at spectrumscale.org To subscribe or unsubscribe via the World Wide Web, visit http://gpfsug.org/mailman/listinfo/gpfsug-discuss or, via email, send a message with subject or body 'help' to gpfsug-discuss-request at spectrumscale.org You can reach the person managing the list at gpfsug-discuss-owner at spectrumscale.org When replying, please edit your Subject line so it is more specific than "Re: Contents of gpfsug-discuss digest..." Today's Topics: 1. Re: DMAPI - Unmigrate file to Regular state (mark.birmingham at stfc.ac.uk) ----- Message from on Mon, 5 Sep 2016 14:30:53 +0000 ----- To: Subject: Re: [gpfsug-discuss] DMAPI - Unmigrate file to Regular state Sorry All! Noob error ? replied to the wrong email!!! Mark Mark Birmingham Development Team Leader High Performance Systems Group STFC Daresbury Laboratory Phone: +44 (0)1925 603381 Email: mark.birmingham at stfc.ac.uk From: gpfsug-discuss-bounces at spectrumscale.org [ mailto:gpfsug-discuss-bounces at spectrumscale.org] On Behalf Of mark.birmingham at stfc.ac.uk Sent: 05 September 2016 15:27 To: gpfsug-discuss at spectrumscale.org Subject: Re: [gpfsug-discuss] DMAPI - Unmigrate file to Regular state Yes, that?s fine. Just submit the request through SBS. Mark Mark Birmingham Development Team Leader High Performance Systems Group STFC Daresbury Laboratory Phone: +44 (0)1925 603381 Email: mark.birmingham at stfc.ac.uk From: gpfsug-discuss-bounces at spectrumscale.org [ mailto:gpfsug-discuss-bounces at spectrumscale.org] On Behalf Of Miroslav Bauer Sent: 05 September 2016 15:14 To: gpfsug main discussion list Subject: Re: [gpfsug-discuss] DMAPI - Unmigrate file to Regular state That's right, I must have totally overlooked that! Many thanks! :) -- Miroslav Bauer On 09/05/2016 03:51 PM, Jan-Frode Myklebust wrote: I believe what you're looking for is dsmrecall -RESident. Plus reconcile on tsm-server to free up the space. Ref: http://www.ibm.com/support/knowledgecenter/SSSR2R_7.1.2/com.ibm.itsm.hsmul.doc/r_cmd_dsmrecall.html -jf man. 5. sep. 2016 kl. 15.30 skrev Miroslav Bauer : Hello, is there any way to recall a migrated file back to a regular state (other than renaming a file)? I would like to free some space on an external pool (TSM), that is being used by migrated files. 
And it would be desirable to prevent repeated backups of an already backed-up data (due to changed ctime/inode). I guess that you can acheive only premigrated state with dsmrecall tool (two copies of file data - one on GPFS pool and one on external pool). Maybe deleting 'dmapi.IBMPMig' xattr will do the trick but I don't think it's safe, nor clean :). Thank you in advance, -- Miroslav Bauer _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: graycol.gif Type: image/gif Size: 105 bytes Desc: not available URL: From volobuev at us.ibm.com Tue Sep 6 20:06:32 2016 From: volobuev at us.ibm.com (Yuri L Volobuev) Date: Tue, 6 Sep 2016 12:06:32 -0700 Subject: [gpfsug-discuss] Migration to separate metadata and data disks In-Reply-To: <505ff859-d49a-04cc-bd9d-50f7b2a8df0b@cesnet.cz> References: <7927f34a-28e5-6fc2-a55d-62b2066a08da@cesnet.cz><2ce7334b-28c1-7a14-814a-fbcf99d8049e@nasa.gov> <505ff859-d49a-04cc-bd9d-50f7b2a8df0b@cesnet.cz> Message-ID: The correct way to accomplish what you're looking for (in particular, changing the fs-wide level of replication) is mmrestripefs -R. This command also takes care of moving data off disks now marked metadataOnly. The restripe job hits an error trying to move blocks of the inode file, i.e. before it gets to actual user data blocks. Note that at this point the metadata replication factor is still 2. This suggests one of two possibilities: (1) there isn't enough actual free space on the remaining metadataOnly disks, (2) there isn't enough space in some failure groups to allocate two replicas. All of this assumes you're operating within a single storage pool. If multiple storage pools are in play, there are other possibilities. 'mmdf' output would be helpful in providing more helpful advice. With the information at hand, I can only suggest trying to accomplish the task in two phases: (a) deallocated extra metadata replicas, by doing mmchfs -m 1 + mmrestripefs -R (b) move metadata off SATA disks. I do want to point out that metadata replication is a highly recommended insurance policy to have for your file system. As with other kinds of insurance, you may or may not need it, but if you do end up needing it, you'll be very glad you have it. The costs, in terms of extra metadata space and performance overhead, are very reasonable. yuri From: Miroslav Bauer To: gpfsug-discuss at spectrumscale.org, Date: 09/01/2016 07:29 AM Subject: Re: [gpfsug-discuss] Migration to separate metadata and data disks Sent by: gpfsug-discuss-bounces at spectrumscale.org Yes, failure group id is exactly what I meant :). Unfortunately, mmrestripefs with -R behaves the same as with -r. I also believed that mmrestripefs -R is the correct tool for fixing the replication settings on inodes (according to manpages), but I will try possible solutions you and Marc suggested and let you know how it went. Thank you, -- Miroslav Bauer On 09/01/2016 04:02 PM, Aaron Knister wrote: > Oh! 
I think you've already provided the info I was looking for :) I > thought that failGroup=3 meant there were 3 failure groups within the > SSDs. I suspect that's not at all what you meant and that actually is > the failure group of all of those disks. That I think explains what's > going on-- there's only one failure group's worth of metadata-capable > disks available and as such GPFS can't place the 2nd replica for > existing files. > > Here's what I would suggest: > > - Create at least 2 failure groups within the SSDs > - Put the default metadata replication factor back to 2 > - Run a restripefs -R to shuffle files around and restore the metadata > replication factor of 2 to any files created while it was set to 1 > > If you're not interested in replication for metadata then perhaps all > you need to do is the mmrestripefs -R. I think that should > un-replicate the file from the SATA disks leaving the copy on the SSDs. > > Hope that helps. > > -Aaron > > On 9/1/16 9:39 AM, Aaron Knister wrote: >> By the way, I suspect the no space on device errors are because GPFS >> believes for some reason that it is unable to maintain the metadata >> replication factor of 2 that's likely set on all previously created >> inodes. >> >> On 9/1/16 9:36 AM, Aaron Knister wrote: >>> I must admit, I'm curious as to the reason you're dropping the >>> replication factor from 2 down to 1. There are some serious advantages >>> we've seen to having multiple metadata replicas, as far as error >>> recovery is concerned. >>> >>> Could you paste an output of mmlsdisk for the filesystem? >>> >>> -Aaron >>> >>> On 9/1/16 9:30 AM, Miroslav Bauer wrote: >>>> Hello, >>>> >>>> I have a GPFS 3.5 filesystem (fs1) and I'm trying to migrate the >>>> filesystem metadata from state: >>>> -m = 2 (default metadata replicas) >>>> - SATA disks (dataAndMetadata, failGroup=1) >>>> - SSDs (metadataOnly, failGroup=3) >>>> to the desired state: >>>> -m = 1 >>>> - SATA disks (dataOnly, failGroup=1) >>>> - SSDs (metadataOnly, failGroup=3) >>>> >>>> I have done the following steps in the following order: >>>> 1) change SATA disks to dataOnly (stanza file modifies the 'usage' >>>> attribute only): >>>> # mmchdisk fs1 change -F dataOnly_disks.stanza >>>> Attention: Disk parameters were changed. >>>> Use the mmrestripefs command with the -r option to relocate data and >>>> metadata. >>>> Verifying file system configuration information ... >>>> mmchdisk: Propagating the cluster configuration data to all >>>> affected nodes. This is an asynchronous process. >>>> >>>> 2) change default metadata replicas number 2->1 >>>> # mmchfs fs1 -m 1 >>>> >>>> 3) run mmrestripefs as suggested by output of 1) >>>> # mmrestripefs fs1 -r >>>> Scanning file system metadata, phase 1 ... >>>> Error processing inodes. >>>> No space left on device >>>> mmrestripefs: Command failed. Examine previous error messages to >>>> determine cause. >>>> >>>> It is, however, still possible to create new files on the filesystem. >>>> When I return one of the SATA disks as a dataAndMetadata disk, the >>>> mmrestripefs >>>> command stops complaining about No space left on device. Both df and >>>> mmdf >>>> say that there is enough space both for data (SATA) and metadata >>>> (SSDs). >>>> Does anyone have an idea why is it complaining? 
>>>> >>>> Thanks, >>>> >>>> -- >>>> Miroslav Bauer >>>> >>>> >>>> >>>> >>>> _______________________________________________ >>>> gpfsug-discuss mailing list >>>> gpfsug-discuss at spectrumscale.org >>>> http://gpfsug.org/mailman/listinfo/gpfsug-discuss >>>> >>> >> > [attachment "smime.p7s" deleted by Yuri L Volobuev/Austin/IBM] _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: graycol.gif Type: image/gif Size: 105 bytes Desc: not available URL: From bauer at cesnet.cz Wed Sep 7 10:40:19 2016 From: bauer at cesnet.cz (Miroslav Bauer) Date: Wed, 7 Sep 2016 11:40:19 +0200 Subject: [gpfsug-discuss] Migration to separate metadata and data disks In-Reply-To: References: <7927f34a-28e5-6fc2-a55d-62b2066a08da@cesnet.cz> <2ce7334b-28c1-7a14-814a-fbcf99d8049e@nasa.gov> <505ff859-d49a-04cc-bd9d-50f7b2a8df0b@cesnet.cz> Message-ID: Hello Yuri, here goes the actual mmdf output of filesystem in question: disk disk size failure holds holds free free name group metadata data in full blocks in fragments --------------- ------------- -------- -------- ----- -------------------- ------------------- Disks in storage pool: system (Maximum disk size allowed is 40 TB) dcsh_10C 5T 1 Yes Yes 1.661T ( 33%) 68.48G ( 1%) dcsh_10D 6.828T 1 Yes Yes 2.809T ( 41%) 83.82G ( 1%) dcsh_11C 5T 1 Yes Yes 1.659T ( 33%) 69.01G ( 1%) dcsh_11D 6.828T 1 Yes Yes 2.81T ( 41%) 83.33G ( 1%) dcsh_12C 5T 1 Yes Yes 1.659T ( 33%) 69.48G ( 1%) dcsh_12D 6.828T 1 Yes Yes 2.807T ( 41%) 83.14G ( 1%) dcsh_13C 5T 1 Yes Yes 1.659T ( 33%) 69.35G ( 1%) dcsh_13D 6.828T 1 Yes Yes 2.81T ( 41%) 82.97G ( 1%) dcsh_14C 5T 1 Yes Yes 1.66T ( 33%) 69.06G ( 1%) dcsh_14D 6.828T 1 Yes Yes 2.811T ( 41%) 83.61G ( 1%) dcsh_15C 5T 1 Yes Yes 1.658T ( 33%) 69.38G ( 1%) dcsh_15D 6.828T 1 Yes Yes 2.814T ( 41%) 83.69G ( 1%) dcsd_15D 6.828T 1 Yes Yes 2.811T ( 41%) 83.98G ( 1%) dcsd_15C 5T 1 Yes Yes 1.66T ( 33%) 68.66G ( 1%) dcsd_14D 6.828T 1 Yes Yes 2.81T ( 41%) 84.18G ( 1%) dcsd_14C 5T 1 Yes Yes 1.659T ( 33%) 69.43G ( 1%) dcsd_13D 6.828T 1 Yes Yes 2.81T ( 41%) 83.27G ( 1%) dcsd_13C 5T 1 Yes Yes 1.66T ( 33%) 69.1G ( 1%) dcsd_12D 6.828T 1 Yes Yes 2.81T ( 41%) 83.61G ( 1%) dcsd_12C 5T 1 Yes Yes 1.66T ( 33%) 69.42G ( 1%) dcsd_11D 6.828T 1 Yes Yes 2.811T ( 41%) 83.59G ( 1%) dcsh_10B 5T 1 Yes Yes 1.633T ( 33%) 76.97G ( 2%) dcsh_11A 5T 1 Yes Yes 1.632T ( 33%) 77.29G ( 2%) dcsh_11B 5T 1 Yes Yes 1.633T ( 33%) 76.73G ( 1%) dcsh_12A 5T 1 Yes Yes 1.634T ( 33%) 76.49G ( 1%) dcsd_11C 5T 1 Yes Yes 1.66T ( 33%) 69.25G ( 1%) dcsd_10D 6.828T 1 Yes Yes 2.811T ( 41%) 83.39G ( 1%) dcsh_10A 5T 1 Yes Yes 1.633T ( 33%) 77.06G ( 2%) dcsd_10C 5T 1 Yes Yes 1.66T ( 33%) 69.83G ( 1%) dcsd_15B 5T 1 Yes Yes 1.635T ( 33%) 76.52G ( 1%) dcsd_15A 5T 1 Yes Yes 1.634T ( 33%) 76.24G ( 1%) dcsd_14B 5T 1 Yes Yes 1.634T ( 33%) 76.31G ( 1%) dcsd_14A 5T 1 Yes Yes 1.634T ( 33%) 76.23G ( 1%) dcsd_13B 5T 1 Yes Yes 1.634T ( 33%) 76.13G ( 1%) dcsd_13A 5T 1 Yes Yes 1.634T ( 33%) 76.22G ( 1%) dcsd_12B 5T 1 Yes Yes 1.635T ( 33%) 77.49G ( 2%) dcsd_12A 5T 1 Yes Yes 1.633T ( 33%) 77.13G ( 2%) dcsd_11B 5T 1 Yes Yes 1.633T ( 33%) 76.86G ( 2%) dcsd_11A 5T 1 Yes Yes 1.632T ( 33%) 76.22G ( 1%) dcsd_10B 5T 1 Yes Yes 1.633T ( 33%) 76.79G ( 1%) dcsd_10A 5T 1 Yes Yes 1.633T ( 33%) 77.21G ( 2%) dcsh_15B 5T 1 Yes Yes 1.635T ( 33%) 76.04G ( 1%) dcsh_15A 5T 1 Yes Yes 
1.634T ( 33%) 76.84G ( 2%) dcsh_14B 5T 1 Yes Yes 1.635T ( 33%) 76.75G ( 1%) dcsh_14A 5T 1 Yes Yes 1.633T ( 33%) 76.05G ( 1%) dcsh_13B 5T 1 Yes Yes 1.634T ( 33%) 76.35G ( 1%) dcsh_13A 5T 1 Yes Yes 1.634T ( 33%) 76.68G ( 1%) dcsh_12B 5T 1 Yes Yes 1.635T ( 33%) 76.74G ( 1%) ssd_5_5 80G 3 Yes No 22.31G ( 28%) 7.155G ( 9%) ssd_4_4 80G 3 Yes No 22.21G ( 28%) 7.196G ( 9%) ssd_3_3 80G 3 Yes No 22.2G ( 28%) 7.239G ( 9%) ssd_2_2 80G 3 Yes No 22.24G ( 28%) 7.146G ( 9%) ssd_1_1 80G 3 Yes No 22.29G ( 28%) 7.134G ( 9%) ------------- -------------------- ------------------- (pool total) 262.3T 92.96T ( 35%) 3.621T ( 1%) Disks in storage pool: maid4 (Maximum disk size allowed is 466 TB) ...... ------------- -------------------- ------------------- (pool total) 291T 126.5T ( 43%) 562.6G ( 0%) Disks in storage pool: maid5 (Maximum disk size allowed is 466 TB) ...... ------------- -------------------- ------------------- (pool total) 436.6T 120.8T ( 28%) 25.23G ( 0%) Disks in storage pool: maid6 (Maximum disk size allowed is 466 TB) ....... ------------- -------------------- ------------------- (pool total) 582.1T 358.7T ( 62%) 9.458G ( 0%) ============= ==================== =================== (data) 1.535P 698.9T ( 44%) 4.17T ( 0%) (metadata) 262.3T 92.96T ( 35%) 3.621T ( 1%) ============= ==================== =================== (total) 1.535P 699T ( 44%) 4.205T ( 0%) Inode Information ----------------- Number of used inodes: 79607225 Number of free inodes: 82340423 Number of allocated inodes: 161947648 Maximum number of inodes: 1342177280 I have a smaller testing FS with the same setup (with plenty of free space), and the actual sequence of commands that worked for me was: mmchfs fs1 -m1 mmrestripefs fs1 -R mmrestripefs fs1 -b mmchdisk fs1 change -F ~/nsd_metadata_test (dataAndMetadata -> dataOnly) mmrestripefs fs1 -r Could you please evaluate more on the performance overhead with having metadata on SSD+SATA? Are the read operations automatically directed to faster disks by GPFS? Is each write operation waiting for write to be finished by SATA disks? Thank you, -- Miroslav Bauer On 09/06/2016 09:06 PM, Yuri L Volobuev wrote: > > The correct way to accomplish what you're looking for (in particular, > changing the fs-wide level of replication) is mmrestripefs -R. This > command also takes care of moving data off disks now marked metadataOnly. > > The restripe job hits an error trying to move blocks of the inode > file, i.e. before it gets to actual user data blocks. Note that at > this point the metadata replication factor is still 2. This suggests > one of two possibilities: (1) there isn't enough actual free space on > the remaining metadataOnly disks, (2) there isn't enough space in some > failure groups to allocate two replicas. > > All of this assumes you're operating within a single storage pool. If > multiple storage pools are in play, there are other possibilities. > > 'mmdf' output would be helpful in providing more helpful advice. With > the information at hand, I can only suggest trying to accomplish the > task in two phases: (a) deallocated extra metadata replicas, by doing > mmchfs -m 1 + mmrestripefs -R (b) move metadata off SATA disks. I do > want to point out that metadata replication is a highly recommended > insurance policy to have for your file system. As with other kinds of > insurance, you may or may not need it, but if you do end up needing > it, you'll be very glad you have it. The costs, in terms of extra > metadata space and performance overhead, are very reasonable. 
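For readers following the thread, here is the same command sequence annotated step by step (a sketch using the fs1 device and the stanza file named in the thread; it reflects what was reported to work on the small test file system, so adapt and test it before touching production, and keep the caveat above about giving up metadata replication in mind):

# 1. Lower the default metadata replication from 2 to 1:
mmchfs fs1 -m 1

# 2. Rewrite the replication settings on existing files so the extra metadata
#    replicas are deallocated (phase "a" of the two-phase suggestion):
mmrestripefs fs1 -R

# 3. Rebalance blocks across the disks (part of the sequence reported to work
#    on the test file system):
mmrestripefs fs1 -b

# 4. Mark the SATA NSDs as dataOnly via a stanza file, then migrate metadata
#    off them onto the metadataOnly SSDs (phase "b"):
mmchdisk fs1 change -F dataOnly_disks.stanza
mmrestripefs fs1 -r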
> > yuri > > > Miroslav Bauer ---09/01/2016 07:29:06 AM---Yes, failure group id is > exactly what I meant :). Unfortunately, mmrestripefs with -R > > From: Miroslav Bauer > To: gpfsug-discuss at spectrumscale.org, > Date: 09/01/2016 07:29 AM > Subject: Re: [gpfsug-discuss] Migration to separate metadata and data > disks > Sent by: gpfsug-discuss-bounces at spectrumscale.org > > ------------------------------------------------------------------------ > > > > Yes, failure group id is exactly what I meant :). Unfortunately, > mmrestripefs with -R > behaves the same as with -r. I also believed that mmrestripefs -R is the > correct tool for > fixing the replication settings on inodes (according to manpages), but I > will try possible > solutions you and Marc suggested and let you know how it went. > > Thank you, > -- > Miroslav Bauer > > On 09/01/2016 04:02 PM, Aaron Knister wrote: > > Oh! I think you've already provided the info I was looking for :) I > > thought that failGroup=3 meant there were 3 failure groups within the > > SSDs. I suspect that's not at all what you meant and that actually is > > the failure group of all of those disks. That I think explains what's > > going on-- there's only one failure group's worth of metadata-capable > > disks available and as such GPFS can't place the 2nd replica for > > existing files. > > > > Here's what I would suggest: > > > > - Create at least 2 failure groups within the SSDs > > - Put the default metadata replication factor back to 2 > > - Run a restripefs -R to shuffle files around and restore the metadata > > replication factor of 2 to any files created while it was set to 1 > > > > If you're not interested in replication for metadata then perhaps all > > you need to do is the mmrestripefs -R. I think that should > > un-replicate the file from the SATA disks leaving the copy on the SSDs. > > > > Hope that helps. > > > > -Aaron > > > > On 9/1/16 9:39 AM, Aaron Knister wrote: > >> By the way, I suspect the no space on device errors are because GPFS > >> believes for some reason that it is unable to maintain the metadata > >> replication factor of 2 that's likely set on all previously created > >> inodes. > >> > >> On 9/1/16 9:36 AM, Aaron Knister wrote: > >>> I must admit, I'm curious as to the reason you're dropping the > >>> replication factor from 2 down to 1. There are some serious advantages > >>> we've seen to having multiple metadata replicas, as far as error > >>> recovery is concerned. > >>> > >>> Could you paste an output of mmlsdisk for the filesystem? > >>> > >>> -Aaron > >>> > >>> On 9/1/16 9:30 AM, Miroslav Bauer wrote: > >>>> Hello, > >>>> > >>>> I have a GPFS 3.5 filesystem (fs1) and I'm trying to migrate the > >>>> filesystem metadata from state: > >>>> -m = 2 (default metadata replicas) > >>>> - SATA disks (dataAndMetadata, failGroup=1) > >>>> - SSDs (metadataOnly, failGroup=3) > >>>> to the desired state: > >>>> -m = 1 > >>>> - SATA disks (dataOnly, failGroup=1) > >>>> - SSDs (metadataOnly, failGroup=3) > >>>> > >>>> I have done the following steps in the following order: > >>>> 1) change SATA disks to dataOnly (stanza file modifies the 'usage' > >>>> attribute only): > >>>> # mmchdisk fs1 change -F dataOnly_disks.stanza > >>>> Attention: Disk parameters were changed. > >>>> Use the mmrestripefs command with the -r option to relocate > data and > >>>> metadata. > >>>> Verifying file system configuration information ... > >>>> mmchdisk: Propagating the cluster configuration data to all > >>>> affected nodes. 
This is an asynchronous process. > >>>> > >>>> 2) change default metadata replicas number 2->1 > >>>> # mmchfs fs1 -m 1 > >>>> > >>>> 3) run mmrestripefs as suggested by output of 1) > >>>> # mmrestripefs fs1 -r > >>>> Scanning file system metadata, phase 1 ... > >>>> Error processing inodes. > >>>> No space left on device > >>>> mmrestripefs: Command failed. Examine previous error messages to > >>>> determine cause. > >>>> > >>>> It is, however, still possible to create new files on the filesystem. > >>>> When I return one of the SATA disks as a dataAndMetadata disk, the > >>>> mmrestripefs > >>>> command stops complaining about No space left on device. Both df and > >>>> mmdf > >>>> say that there is enough space both for data (SATA) and metadata > >>>> (SSDs). > >>>> Does anyone have an idea why is it complaining? > >>>> > >>>> Thanks, > >>>> > >>>> -- > >>>> Miroslav Bauer > >>>> > >>>> > >>>> > >>>> > >>>> _______________________________________________ > >>>> gpfsug-discuss mailing list > >>>> gpfsug-discuss at spectrumscale.org > >>>> http://gpfsug.org/mailman/listinfo/gpfsug-discuss > >>>> > >>> > >> > > > > > [attachment "smime.p7s" deleted by Yuri L Volobuev/Austin/IBM] > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > > > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: smime.p7s Type: application/pkcs7-signature Size: 3716 bytes Desc: S/MIME Cryptographic Signature URL: From S.J.Thompson at bham.ac.uk Wed Sep 7 13:36:48 2016 From: S.J.Thompson at bham.ac.uk (Simon Thompson (Research Computing - IT Services)) Date: Wed, 7 Sep 2016 12:36:48 +0000 Subject: [gpfsug-discuss] Remote cluster mount failing Message-ID: Hi All, I'm trying to get some multi cluster thing working between two of our GPFS clusters. In the "client" cluster, when trying to mount the "remote" cluster, I get: # mmmount gpfs Wed 7 Sep 13:33:06 BST 2016: mmmount: Mounting file systems ... mount: mount /dev/gpfs on /gpfs failed: Connection timed out mmmount: Command failed. Examine previous error messages to determine cause. And in the log file: Wed Sep 7 13:33:07.481 2016: [N] The client side TLS handshake with node 10.0.0.182 was cancelled: connection reset by peer (return code 420). Wed Sep 7 13:33:07.486 2016: [N] The client side TLS handshake with node 10.0.0.181 was cancelled: connection reset by peer (return code 420). Wed Sep 7 13:33:07.487 2016: [E] Failed to join remote cluster GPFS_STORAGE.CLUSTER Wed Sep 7 13:33:07.488 2016: [W] Command: err 78: mount GPFS_STORAGE.CLUSTER:gpfs Wed Sep 7 13:33:07.489 2016: Connection timed out In the remote cluster, I see: Wed Sep 7 13:33:07.487 2016: [W] The TLS handshake with node 10.0.0.222 failed with error 447 (server side). Wed Sep 7 13:33:07.488 2016: [X] Connection from 10.10.0.35 refused, authentication failed Wed Sep 7 13:33:07.489 2016: [E] Killing connection from 10.10.0.35, err 703 Wed Sep 7 13:33:07.490 2016: Operation not permitted Weirdly though on other nodes in the client cluster this succeeds fine and can mount, so I think I got all the bits in the mmauth and mmremotecluster configured correctly. Any suggestions? 
Thanks Simon From volobuev at us.ibm.com Wed Sep 7 17:38:03 2016 From: volobuev at us.ibm.com (Yuri L Volobuev) Date: Wed, 7 Sep 2016 09:38:03 -0700 Subject: [gpfsug-discuss] Migration to separate metadata and data disks In-Reply-To: References: <7927f34a-28e5-6fc2-a55d-62b2066a08da@cesnet.cz><2ce7334b-28c1-7a14-814a-fbcf99d8049e@nasa.gov><505ff859-d49a-04cc-bd9d-50f7b2a8df0b@cesnet.cz> Message-ID: Hi Miroslav, The mmdf output is very helpful. It strongly suggests what the problem is: > ssd_5_5 80G 3 Yes No 22.31G ( 28%) 7.155G ( 9%) > ssd_4_4 80G 3 Yes No 22.21G ( 28%) 7.196G ( 9%) > ssd_3_3 80G 3 Yes No 22.2G ( 28%) 7.239G ( 9%) > ssd_2_2 80G 3 Yes No 22.24G ( 28%) 7.146G ( 9%) > ssd_1_1 80G 3 Yes No 22.29G ( 28%) 7.134G ( 9%) >... > ==================== =================== > (data) 1.535P 698.9T ( 44%) 4.17T ( 0%) > (metadata) 262.3T 92.96T ( 35%) 3.621T ( 1%) >... > Number of allocated inodes: 161947648 > Maximum number of inodes: 1342177280 You have 5 80G SSDs. That's not enough. Even with metadata spread across a couple dozen more SATA disks, the SSDs are over 3/4 full. There's no way to accurately estimate the amount of metadata in this file system with the data at hand, but if we (very conservatively) assume that each SATA disk holds only as much metadata as each SSD, i.e. ~57G, that would greatly exceed the amount of free space available on your SSDs. You need more free metadata space. Another way to look at this: you have 1.5PB of data under management. A reasonable rule-of-thumb estimate for the amount of metadata is 1-2% of the data (this is a typical ratio, but of course every file system is different, and large deviations are possible; a degenerate case is an fs containing nothing but directories, where metadata usage is 100%). So you have to have at least a few TB of metadata storage: at the 1-2% ratio, 1.5PB of data implies roughly 15-30TB of metadata, while your five 80G SSDs add up to only 400G. 5 80G SSDs aren't enough for an fs of this size. > Could you please evaluate more on the performance overhead with > having metadata > on SSD+SATA? Are the read operations automatically directed to > faster disks by GPFS? > Is each write operation waiting for write to be finished by SATA disks? Mixing disks with sharply different performance characteristics within a single storage pool is detrimental to performance. GPFS stripes blocks across all disks in a storage pool, expecting all of them to be equally suitable. If SSDs are mixed with SATA disks, the overall metadata write performance is going to be bottlenecked by the SATA drives. On reads, given a choice of two replicas, GPFS V4.1.1+ picks the replica residing on the fastest disk, but given that SSDs represent only a small fraction of your total metadata usage, this likely doesn't help a whole lot. You're on the right track in trying to shift all metadata to SSDs and away from SATA; the overall file system performance will improve as a result. yuri > > Thank you, > -- > Miroslav Bauer > On 09/06/2016 09:06 PM, Yuri L Volobuev wrote: > The correct way to accomplish what you're looking for (in > particular, changing the fs-wide level of replication) is > mmrestripefs -R. This command also takes care of moving data off > disks now marked metadataOnly.
> > The restripe job hits an error trying to move blocks of the inode > file, i.e. before it gets to actual user data blocks. Note that at > this point the metadata replication factor is still 2. This suggests > one of two possibilities: (1) there isn't enough actual free space > on the remaining metadataOnly disks, (2) there isn't enough space in > some failure groups to allocate two replicas. > > All of this assumes you're operating within a single storage pool. > If multiple storage pools are in play, there are other possibilities. > > 'mmdf' output would be helpful in providing more helpful advice. > With the information at hand, I can only suggest trying to > accomplish the task in two phases: (a) deallocated extra metadata > replicas, by doing mmchfs -m 1 + mmrestripefs -R (b) move metadata > off SATA disks. I do want to point out that metadata replication is > a highly recommended insurance policy to have for your file system. > As with other kinds of insurance, you may or may not need it, but if > you do end up needing it, you'll be very glad you have it. The > costs, in terms of extra metadata space and performance overhead, > are very reasonable. > > yuri > > > Miroslav Bauer ---09/01/2016 07:29:06 AM---Yes, failure group id is > exactly what I meant :). Unfortunately, mmrestripefs with -R > > From: Miroslav Bauer > To: gpfsug-discuss at spectrumscale.org, > Date: 09/01/2016 07:29 AM > Subject: Re: [gpfsug-discuss] Migration to separate metadata and data disks > Sent by: gpfsug-discuss-bounces at spectrumscale.org > > > > Yes, failure group id is exactly what I meant :). Unfortunately, > mmrestripefs with -R > behaves the same as with -r. I also believed that mmrestripefs -R is the > correct tool for > fixing the replication settings on inodes (according to manpages), but I > will try possible > solutions you and Marc suggested and let you know how it went. > > Thank you, > -- > Miroslav Bauer > > On 09/01/2016 04:02 PM, Aaron Knister wrote: > > Oh! I think you've already provided the info I was looking for :) I > > thought that failGroup=3 meant there were 3 failure groups within the > > SSDs. I suspect that's not at all what you meant and that actually is > > the failure group of all of those disks. That I think explains what's > > going on-- there's only one failure group's worth of metadata-capable > > disks available and as such GPFS can't place the 2nd replica for > > existing files. > > > > Here's what I would suggest: > > > > - Create at least 2 failure groups within the SSDs > > - Put the default metadata replication factor back to 2 > > - Run a restripefs -R to shuffle files around and restore the metadata > > replication factor of 2 to any files created while it was set to 1 > > > > If you're not interested in replication for metadata then perhaps all > > you need to do is the mmrestripefs -R. I think that should > > un-replicate the file from the SATA disks leaving the copy on the SSDs. > > > > Hope that helps. > > > > -Aaron > > > > On 9/1/16 9:39 AM, Aaron Knister wrote: > >> By the way, I suspect the no space on device errors are because GPFS > >> believes for some reason that it is unable to maintain the metadata > >> replication factor of 2 that's likely set on all previously created > >> inodes. > >> > >> On 9/1/16 9:36 AM, Aaron Knister wrote: > >>> I must admit, I'm curious as to the reason you're dropping the > >>> replication factor from 2 down to 1. 
There are some serious advantages > >>> we've seen to having multiple metadata replicas, as far as error > >>> recovery is concerned. > >>> > >>> Could you paste an output of mmlsdisk for the filesystem? > >>> > >>> -Aaron > >>> > >>> On 9/1/16 9:30 AM, Miroslav Bauer wrote: > >>>> Hello, > >>>> > >>>> I have a GPFS 3.5 filesystem (fs1) and I'm trying to migrate the > >>>> filesystem metadata from state: > >>>> -m = 2 (default metadata replicas) > >>>> - SATA disks (dataAndMetadata, failGroup=1) > >>>> - SSDs (metadataOnly, failGroup=3) > >>>> to the desired state: > >>>> -m = 1 > >>>> - SATA disks (dataOnly, failGroup=1) > >>>> - SSDs (metadataOnly, failGroup=3) > >>>> > >>>> I have done the following steps in the following order: > >>>> 1) change SATA disks to dataOnly (stanza file modifies the 'usage' > >>>> attribute only): > >>>> # mmchdisk fs1 change -F dataOnly_disks.stanza > >>>> Attention: Disk parameters were changed. > >>>> ? Use the mmrestripefs command with the -r option to relocate data and > >>>> metadata. > >>>> Verifying file system configuration information ... > >>>> mmchdisk: Propagating the cluster configuration data to all > >>>> ? affected nodes. ?This is an asynchronous process. > >>>> > >>>> 2) change default metadata replicas number 2->1 > >>>> # mmchfs fs1 -m 1 > >>>> > >>>> 3) run mmrestripefs as suggested by output of 1) > >>>> # mmrestripefs fs1 -r > >>>> Scanning file system metadata, phase 1 ... > >>>> Error processing inodes. > >>>> No space left on device > >>>> mmrestripefs: Command failed. ?Examine previous error messages to > >>>> determine cause. > >>>> > >>>> It is, however, still possible to create new files on the filesystem. > >>>> When I return one of the SATA disks as a dataAndMetadata disk, the > >>>> mmrestripefs > >>>> command stops complaining about No space left on device. Both df and > >>>> mmdf > >>>> say that there is enough space both for data (SATA) and metadata > >>>> (SSDs). > >>>> Does anyone have an idea why is it complaining? > >>>> > >>>> Thanks, > >>>> > >>>> -- > >>>> Miroslav Bauer > >>>> > >>>> > >>>> > >>>> > >>>> _______________________________________________ > >>>> gpfsug-discuss mailing list > >>>> gpfsug-discuss at spectrumscale.org > >>>> http://gpfsug.org/mailman/listinfo/gpfsug-discuss > >>>> > >>> > >> > > > > > [attachment "smime.p7s" deleted by Yuri L Volobuev/Austin/IBM] > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > [attachment "smime.p7s" deleted by Yuri L Volobuev/Austin/IBM] > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss -------------- next part -------------- An HTML attachment was scrubbed... URL: From volobuev at us.ibm.com Wed Sep 7 17:58:07 2016 From: volobuev at us.ibm.com (Yuri L Volobuev) Date: Wed, 7 Sep 2016 09:58:07 -0700 Subject: [gpfsug-discuss] Remote cluster mount failing In-Reply-To: References: Message-ID: It's unclear what's wrong. I'd have two main suspects: (1) TLS protocol version confusion, due to a difference in GSKit version and/or configuration (e.g. NIST SP800 compliance) on two sides (2) firewall. TLS issues are usually messy and tedious to work though. 
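A few quick sanity checks on both clusters sometimes expose the mismatch before any deep debugging; treat the following as rough suggestions rather than an official procedure, and adjust names to your environment:

# mmauth show all
# mmlsconfig nistCompliance

The cipher list and key SHA digest each cluster reports for the other should line up, and the nistCompliance setting (on releases that have it) should be compatible on both sides. It is also worth confirming that the failing client node can actually reach the storage cluster's contact nodes on the GPFS daemon port, 1191/tcp by default, since a partially open firewall can produce exactly this pattern of some nodes mounting fine while others are refused.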
I'd recommend opening a PMR to facilitate debug data collection and analysis. A lot of gory detail may be needed to figure out what's going on. yuri From: "Simon Thompson (Research Computing - IT Services)" To: "gpfsug-discuss at spectrumscale.org" , Date: 09/07/2016 05:37 AM Subject: [gpfsug-discuss] Remote cluster mount failing Sent by: gpfsug-discuss-bounces at spectrumscale.org Hi All, I'm trying to get some multi cluster thing working between two of our GPFS clusters. In the "client" cluster, when trying to mount the "remote" cluster, I get: # mmmount gpfs Wed 7 Sep 13:33:06 BST 2016: mmmount: Mounting file systems ... mount: mount /dev/gpfs on /gpfs failed: Connection timed out mmmount: Command failed. Examine previous error messages to determine cause. And in the log file: Wed Sep 7 13:33:07.481 2016: [N] The client side TLS handshake with node 10.0.0.182 was cancelled: connection reset by peer (return code 420). Wed Sep 7 13:33:07.486 2016: [N] The client side TLS handshake with node 10.0.0.181 was cancelled: connection reset by peer (return code 420). Wed Sep 7 13:33:07.487 2016: [E] Failed to join remote cluster GPFS_STORAGE.CLUSTER Wed Sep 7 13:33:07.488 2016: [W] Command: err 78: mount GPFS_STORAGE.CLUSTER:gpfs Wed Sep 7 13:33:07.489 2016: Connection timed out In the remote cluster, I see: Wed Sep 7 13:33:07.487 2016: [W] The TLS handshake with node 10.0.0.222 failed with error 447 (server side). Wed Sep 7 13:33:07.488 2016: [X] Connection from 10.10.0.35 refused, authentication failed Wed Sep 7 13:33:07.489 2016: [E] Killing connection from 10.10.0.35, err 703 Wed Sep 7 13:33:07.490 2016: Operation not permitted Weirdly though on other nodes in the client cluster this succeeds fine and can mount, so I think I got all the bits in the mmauth and mmremotecluster configured correctly. Any suggestions? Thanks Simon _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: graycol.gif Type: image/gif Size: 105 bytes Desc: not available URL: From Valdis.Kletnieks at vt.edu Wed Sep 7 19:45:43 2016 From: Valdis.Kletnieks at vt.edu (Valdis Kletnieks) Date: Wed, 07 Sep 2016 14:45:43 -0400 Subject: [gpfsug-discuss] Weirdness with 'mmces address add' Message-ID: <27691.1473273943@turing-police.cc.vt.edu> We're in the middle of deploying Spectrum Archive, and I've hit a snag. We assigned some floating IP addresses, which now need to be changed. So I look at the mmces manpage, and it looks like I need to add the new addresses, and delete the old ones. We're on GPFS 4.2.1.0, if that matters... What 'man mmces' says: 1. To add an address to a specified node, issue this command: mmces address add --ces-node node1 --ces-ip 10.1.2.3 (and at least 6 or 8 more uses of an IP address). What happens when I try it: (And yes, we have an 'isb' ces-group defined with addresses in it already) # mmces address add --ces-group isb --ces-ip 172.28.45.72 Cannot resolve 172.28.45.72; Name or service not known mmces address add: Incorrect value for --ces-ip option Usage: mmces address add [--ces-node Node] [--attribute Attribute] [--ces-group Group] {--ces-ip {IP[,IP...]} Am I missing some special sauce? 
(My first guess is that it's complaining because there's no PTR in the DNS for that address yet - but if it was going to do DNS lookups, it should be valid to give a hostname rather than an IP address (and nowhere in the manpage does it even *hint* that --ces-ip can be anything other than a list of IP addresses). Or is it time for me to file a PMR? From xhejtman at ics.muni.cz Wed Sep 7 21:11:11 2016 From: xhejtman at ics.muni.cz (Lukas Hejtmanek) Date: Wed, 7 Sep 2016 22:11:11 +0200 Subject: [gpfsug-discuss] DMAPI - Unmigrate file to Regular state In-Reply-To: References: Message-ID: <20160907201111.xmksazqjekk2ihsy@ics.muni.cz> On Tue, Sep 06, 2016 at 02:04:36PM +0200, Dominic Mueller-Wicke01 wrote: > Hi Miroslav, > > please use the command: > dsmrecall -resident -detail > or use it with file lists well, it looks like Client Version 7, Release 1, Level 4.4 leaks file descriptors: 09/07/2016 21:03:07 ANS1587W Unable to read extended attributes for object /exports/tape_tape/VO_metacentrum/home/jfeit/atlases/atlases/novo3/atlases/images/.svn/prop-base due to errno: 24, reason: Too many open files after about 15 minutes of run, I can see 88 opened files in /proc/$PID/fd when using: dsmrecall -R -RESid -D /path/* is it something known fixed in newer versions? -- Luk?? Hejtm?nek From taylorm at us.ibm.com Wed Sep 7 21:40:13 2016 From: taylorm at us.ibm.com (Michael L Taylor) Date: Wed, 7 Sep 2016 13:40:13 -0700 Subject: [gpfsug-discuss] Weirdness with 'mmces address add' In-Reply-To: References: Message-ID: Can't be for certain this is what you're hitting but reverse DNS lookup is documented the KC: http://www.ibm.com/support/knowledgecenter/STXKQY_4.2.1/com.ibm.spectrum.scale.v4r21.doc/bl1ins_protocolnodeipfurtherconfig.htm Note: All CES IPs must have an associated hostname and reverse DNS lookup must be configured for each. For more information, see Adding export IPs in Deploying protocols. http://www.ibm.com/support/knowledgecenter/STXKQY_4.2.1/com.ibm.spectrum.scale.v4r21.doc/bl1ins_deployingprotocolstasks.htm Note: Export IPs must have an associated hostname and reverse DNS lookup must be configured for each. Can you make sure the IPs have reverse DNS lookup and try again? Will get the mmces man page updated for address add -------------- next part -------------- An HTML attachment was scrubbed... URL: From Valdis.Kletnieks at vt.edu Wed Sep 7 22:23:30 2016 From: Valdis.Kletnieks at vt.edu (Valdis.Kletnieks at vt.edu) Date: Wed, 07 Sep 2016 17:23:30 -0400 Subject: [gpfsug-discuss] Weirdness with 'mmces address add' In-Reply-To: References: Message-ID: <41089.1473283410@turing-police.cc.vt.edu> On Wed, 07 Sep 2016 13:40:13 -0700, "Michael L Taylor" said: > Can't be for certain this is what you're hitting but reverse DNS lookup is > documented the KC: > Note: All CES IPs must have an associated hostname and reverse DNS lookup > must be configured for each. For more information, see Adding export IPs in > Deploying protocols. Bingo. That was it. Since the DNS will take a while to fix, I fed the appropriate entries to /etc/hosts and it worked fine. I got thrown for a loop because if there is enough code to do that checking, it should be able to accept a hostname as well (RFE time? 
:) From ulmer at ulmer.org Wed Sep 7 22:34:07 2016 From: ulmer at ulmer.org (Stephen Ulmer) Date: Wed, 7 Sep 2016 17:34:07 -0400 Subject: [gpfsug-discuss] Weirdness with 'mmces address add' In-Reply-To: <41089.1473283410@turing-police.cc.vt.edu> References: <41089.1473283410@turing-police.cc.vt.edu> Message-ID: Hostnames can have many A records. IPs *generally* only have one PTR (though it?s not restricted, multiple PTRs is not recommended). Just knowing that you can see why allowing names would create more questions than it answers. So if it did take names instead of IP addresses, it would usually only do what you meant part of the time -- and sometimes none of the time. :) -- Stephen > On Sep 7, 2016, at 5:23 PM, Valdis.Kletnieks at vt.edu wrote: > > On Wed, 07 Sep 2016 13:40:13 -0700, "Michael L Taylor" said: > >> Can't be for certain this is what you're hitting but reverse DNS lookup is >> documented the KC: > >> Note: All CES IPs must have an associated hostname and reverse DNS lookup >> must be configured for each. For more information, see Adding export IPs in >> Deploying protocols. > > Bingo. That was it. Since the DNS will take a while to fix, I fed > the appropriate entries to /etc/hosts and it worked fine. > > I got thrown for a loop because if there is enough code to do that checking, > it should be able to accept a hostname as well (RFE time? :) > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss From Valdis.Kletnieks at vt.edu Wed Sep 7 22:54:05 2016 From: Valdis.Kletnieks at vt.edu (Valdis.Kletnieks at vt.edu) Date: Wed, 07 Sep 2016 17:54:05 -0400 Subject: [gpfsug-discuss] Weirdness with 'mmces address add' In-Reply-To: References: <41089.1473283410@turing-police.cc.vt.edu> Message-ID: <43934.1473285245@turing-police.cc.vt.edu> On Wed, 07 Sep 2016 17:34:07 -0400, Stephen Ulmer said: > Hostnames can have many A records. And quad-A records. :) (Despite our best efforts, we're still one of the 100 biggest IPv6 deployments according to http://www.worldipv6launch.org/measurements/ - were's sitting at 84th in traffic volume and 18th by percent penetration, mostly because we deployed it in production literally last century...) From janfrode at tanso.net Thu Sep 8 06:08:47 2016 From: janfrode at tanso.net (Jan-Frode Myklebust) Date: Thu, 08 Sep 2016 05:08:47 +0000 Subject: [gpfsug-discuss] Weirdness with 'mmces address add' In-Reply-To: <27691.1473273943@turing-police.cc.vt.edu> References: <27691.1473273943@turing-police.cc.vt.edu> Message-ID: I believe your first guess is correct. The ces-ip needs to be resolvable for some reason... Just put a name for it in /etc/hosts, if you can't add it to your dns. -jf ons. 7. sep. 2016 kl. 20.45 skrev Valdis Kletnieks : > We're in the middle of deploying Spectrum Archive, and I've hit a > snag. We assigned some floating IP addresses, which now need to > be changed. So I look at the mmces manpage, and it looks like I need > to add the new addresses, and delete the old ones. > > We're on GPFS 4.2.1.0, if that matters... > > What 'man mmces' says: > > 1. To add an address to a specified node, issue this command: > > mmces address add --ces-node node1 --ces-ip 10.1.2.3 > > (and at least 6 or 8 more uses of an IP address). 
> > What happens when I try it: (And yes, we have an 'isb' ces-group defined > with > addresses in it already) > > # mmces address add --ces-group isb --ces-ip 172.28.45.72 > Cannot resolve 172.28.45.72; Name or service not known > mmces address add: Incorrect value for --ces-ip option > Usage: > mmces address add [--ces-node Node] [--attribute Attribute] [--ces-group > Group] > {--ces-ip {IP[,IP...]} > > Am I missing some special sauce? (My first guess is that it's complaining > because there's no PTR in the DNS for that address yet - but if it was > going > to do DNS lookups, it should be valid to give a hostname rather than an IP > address (and nowhere in the manpage does it even *hint* that --ces-ip can > be anything other than a list of IP addresses). > > Or is it time for me to file a PMR? > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > -------------- next part -------------- An HTML attachment was scrubbed... URL: From dominic.mueller at de.ibm.com Thu Sep 8 06:35:55 2016 From: dominic.mueller at de.ibm.com (Dominic Mueller-Wicke01) Date: Thu, 8 Sep 2016 07:35:55 +0200 Subject: [gpfsug-discuss] DMAPI - Unmigrate file to Regular state In-Reply-To: References: Message-ID: Please open a PMR for the not working "recall to resident". Some investigation is needed here. Thanks. Greetings, Dominic. From: gpfsug-discuss-request at spectrumscale.org To: gpfsug-discuss at spectrumscale.org Date: 07.09.2016 23:23 Subject: gpfsug-discuss Digest, Vol 56, Issue 14 Sent by: gpfsug-discuss-bounces at spectrumscale.org Send gpfsug-discuss mailing list submissions to gpfsug-discuss at spectrumscale.org To subscribe or unsubscribe via the World Wide Web, visit http://gpfsug.org/mailman/listinfo/gpfsug-discuss or, via email, send a message with subject or body 'help' to gpfsug-discuss-request at spectrumscale.org You can reach the person managing the list at gpfsug-discuss-owner at spectrumscale.org When replying, please edit your Subject line so it is more specific than "Re: Contents of gpfsug-discuss digest..." Today's Topics: 1. Re: Remote cluster mount failing (Yuri L Volobuev) 2. Weirdness with 'mmces address add' (Valdis Kletnieks) 3. Re: DMAPI - Unmigrate file to Regular state (Lukas Hejtmanek) 4. Weirdness with 'mmces address add' (Michael L Taylor) 5. Re: Weirdness with 'mmces address add' (Valdis.Kletnieks at vt.edu) ----- Message from "Yuri L Volobuev" on Wed, 7 Sep 2016 09:58:07 -0700 ----- To: gpfsug main discussion list Subject: Re: [gpfsug-discuss] Remote cluster mount failing It's unclear what's wrong. I'd have two main suspects: (1) TLS protocol version confusion, due to a difference in GSKit version and/or configuration (e.g. NIST SP800 compliance) on two sides (2) firewall. TLS issues are usually messy and tedious to work though. I'd recommend opening a PMR to facilitate debug data collection and analysis. A lot of gory detail may be needed to figure out what's going on. 
yuri Inactive hide details for "Simon Thompson (Research Computing - IT Services)" ---09/07/2016 05:37:11 AM---Hi All, I'm trying to"Simon Thompson (Research Computing - IT Services)" ---09/07/2016 05:37:11 AM---Hi All, I'm trying to get some multi cluster thing working between two of our GPFS From: "Simon Thompson (Research Computing - IT Services)" To: "gpfsug-discuss at spectrumscale.org" , Date: 09/07/2016 05:37 AM Subject: [gpfsug-discuss] Remote cluster mount failing Sent by: gpfsug-discuss-bounces at spectrumscale.org Hi All, I'm trying to get some multi cluster thing working between two of our GPFS clusters. In the "client" cluster, when trying to mount the "remote" cluster, I get: # mmmount gpfs Wed 7 Sep 13:33:06 BST 2016: mmmount: Mounting file systems ... mount: mount /dev/gpfs on /gpfs failed: Connection timed out mmmount: Command failed. Examine previous error messages to determine cause. And in the log file: Wed Sep 7 13:33:07.481 2016: [N] The client side TLS handshake with node 10.0.0.182 was cancelled: connection reset by peer (return code 420). Wed Sep 7 13:33:07.486 2016: [N] The client side TLS handshake with node 10.0.0.181 was cancelled: connection reset by peer (return code 420). Wed Sep 7 13:33:07.487 2016: [E] Failed to join remote cluster GPFS_STORAGE.CLUSTER Wed Sep 7 13:33:07.488 2016: [W] Command: err 78: mount GPFS_STORAGE.CLUSTER:gpfs Wed Sep 7 13:33:07.489 2016: Connection timed out In the remote cluster, I see: Wed Sep 7 13:33:07.487 2016: [W] The TLS handshake with node 10.0.0.222 failed with error 447 (server side). Wed Sep 7 13:33:07.488 2016: [X] Connection from 10.10.0.35 refused, authentication failed Wed Sep 7 13:33:07.489 2016: [E] Killing connection from 10.10.0.35, err 703 Wed Sep 7 13:33:07.490 2016: Operation not permitted Weirdly though on other nodes in the client cluster this succeeds fine and can mount, so I think I got all the bits in the mmauth and mmremotecluster configured correctly. Any suggestions? Thanks Simon _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss ----- Message from Valdis Kletnieks on Wed, 07 Sep 2016 14:45:43 -0400 ----- To: gpfsug-discuss at spectrumscale.org Subject: [gpfsug-discuss] Weirdness with 'mmces address add' We're in the middle of deploying Spectrum Archive, and I've hit a snag. We assigned some floating IP addresses, which now need to be changed. So I look at the mmces manpage, and it looks like I need to add the new addresses, and delete the old ones. We're on GPFS 4.2.1.0, if that matters... What 'man mmces' says: 1. To add an address to a specified node, issue this command: mmces address add --ces-node node1 --ces-ip 10.1.2.3 (and at least 6 or 8 more uses of an IP address). What happens when I try it: (And yes, we have an 'isb' ces-group defined with addresses in it already) # mmces address add --ces-group isb --ces-ip 172.28.45.72 Cannot resolve 172.28.45.72; Name or service not known mmces address add: Incorrect value for --ces-ip option Usage: mmces address add [--ces-node Node] [--attribute Attribute] [--ces-group Group] {--ces-ip {IP[,IP...]} Am I missing some special sauce? (My first guess is that it's complaining because there's no PTR in the DNS for that address yet - but if it was going to do DNS lookups, it should be valid to give a hostname rather than an IP address (and nowhere in the manpage does it even *hint* that --ces-ip can be anything other than a list of IP addresses). 
Or is it time for me to file a PMR? ----- Message from Lukas Hejtmanek on Wed, 7 Sep 2016 22:11:11 +0200 ----- To: gpfsug main discussion list Subject: Re: [gpfsug-discuss] DMAPI - Unmigrate file to Regular state On Tue, Sep 06, 2016 at 02:04:36PM +0200, Dominic Mueller-Wicke01 wrote: > Hi Miroslav, > > please use the command: > dsmrecall -resident -detail > or use it with file lists well, it looks like Client Version 7, Release 1, Level 4.4 leaks file descriptors: 09/07/2016 21:03:07 ANS1587W Unable to read extended attributes for object /exports/tape_tape/VO_metacentrum/home/jfeit/atlases/atlases/novo3/atlases/images/.svn/prop-base due to errno: 24, reason: Too many open files after about 15 minutes of run, I can see 88 opened files in /proc/$PID/fd when using: dsmrecall -R -RESid -D /path/* is it something known fixed in newer versions? -- Luk?? Hejtm?nek ----- Message from "Michael L Taylor" on Wed, 7 Sep 2016 13:40:13 -0700 ----- To: gpfsug-discuss at spectrumscale.org Subject: [gpfsug-discuss] Weirdness with 'mmces address add' Can't be for certain this is what you're hitting but reverse DNS lookup is documented the KC: http://www.ibm.com/support/knowledgecenter/STXKQY_4.2.1/com.ibm.spectrum.scale.v4r21.doc/bl1ins_protocolnodeipfurtherconfig.htm Note: All CES IPs must have an associated hostname and reverse DNS lookup must be configured for each. For more information, see Adding export IPs in Deploying protocols. http://www.ibm.com/support/knowledgecenter/STXKQY_4.2.1/com.ibm.spectrum.scale.v4r21.doc/bl1ins_deployingprotocolstasks.htm Note: Export IPs must have an associated hostname and reverse DNS lookup must be configured for each. Can you make sure the IPs have reverse DNS lookup and try again? Will get the mmces man page updated for address add ----- Message from Valdis.Kletnieks at vt.edu on Wed, 07 Sep 2016 17:23:30 -0400 ----- To: gpfsug main discussion list Subject: Re: [gpfsug-discuss] Weirdness with 'mmces address add' On Wed, 07 Sep 2016 13:40:13 -0700, "Michael L Taylor" said: > Can't be for certain this is what you're hitting but reverse DNS lookup is > documented the KC: > Note: All CES IPs must have an associated hostname and reverse DNS lookup > must be configured for each. For more information, see Adding export IPs in > Deploying protocols. Bingo. That was it. Since the DNS will take a while to fix, I fed the appropriate entries to /etc/hosts and it worked fine. I got thrown for a loop because if there is enough code to do that checking, it should be able to accept a hostname as well (RFE time? :) _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: graycol.gif Type: image/gif Size: 105 bytes Desc: not available URL: From S.J.Thompson at bham.ac.uk Fri Sep 9 15:37:28 2016 From: S.J.Thompson at bham.ac.uk (Simon Thompson (Research Computing - IT Services)) Date: Fri, 9 Sep 2016 14:37:28 +0000 Subject: [gpfsug-discuss] Remote cluster mount failing In-Reply-To: References: Message-ID: That?s sorta what I was expecting. Though I was hoping someone might have said 'oh just run mmchconfig ....' or something easy. PMR on its way in. Thanks! 
Simon From: > on behalf of Yuri L Volobuev > Reply-To: "gpfsug-discuss at spectrumscale.org" > Date: Wednesday, 7 September 2016 at 17:58 To: "gpfsug-discuss at spectrumscale.org" > Subject: Re: [gpfsug-discuss] Remote cluster mount failing It's unclear what's wrong. I'd have two main suspects: (1) TLS protocol version confusion, due to a difference in GSKit version and/or configuration (e.g. NIST SP800 compliance) on two sides (2) firewall. TLS issues are usually messy and tedious to work though. I'd recommend opening a PMR to facilitate debug data collection and analysis. A lot of gory detail may be needed to figure out what's going on. yuri [Inactive hide details for "Simon Thompson (Research Computing - IT Services)" ---09/07/2016 05:37:11 AM---Hi All, I'm trying to]"Simon Thompson (Research Computing - IT Services)" ---09/07/2016 05:37:11 AM---Hi All, I'm trying to get some multi cluster thing working between two of our GPFS From: "Simon Thompson (Research Computing - IT Services)" > To: "gpfsug-discuss at spectrumscale.org" >, Date: 09/07/2016 05:37 AM Subject: [gpfsug-discuss] Remote cluster mount failing Sent by: gpfsug-discuss-bounces at spectrumscale.org ________________________________ Hi All, I'm trying to get some multi cluster thing working between two of our GPFS clusters. In the "client" cluster, when trying to mount the "remote" cluster, I get: # mmmount gpfs Wed 7 Sep 13:33:06 BST 2016: mmmount: Mounting file systems ... mount: mount /dev/gpfs on /gpfs failed: Connection timed out mmmount: Command failed. Examine previous error messages to determine cause. And in the log file: Wed Sep 7 13:33:07.481 2016: [N] The client side TLS handshake with node 10.0.0.182 was cancelled: connection reset by peer (return code 420). Wed Sep 7 13:33:07.486 2016: [N] The client side TLS handshake with node 10.0.0.181 was cancelled: connection reset by peer (return code 420). Wed Sep 7 13:33:07.487 2016: [E] Failed to join remote cluster GPFS_STORAGE.CLUSTER Wed Sep 7 13:33:07.488 2016: [W] Command: err 78: mount GPFS_STORAGE.CLUSTER:gpfs Wed Sep 7 13:33:07.489 2016: Connection timed out In the remote cluster, I see: Wed Sep 7 13:33:07.487 2016: [W] The TLS handshake with node 10.0.0.222 failed with error 447 (server side). Wed Sep 7 13:33:07.488 2016: [X] Connection from 10.10.0.35 refused, authentication failed Wed Sep 7 13:33:07.489 2016: [E] Killing connection from 10.10.0.35, err 703 Wed Sep 7 13:33:07.490 2016: Operation not permitted Weirdly though on other nodes in the client cluster this succeeds fine and can mount, so I think I got all the bits in the mmauth and mmremotecluster configured correctly. Any suggestions? Thanks Simon _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: graycol.gif Type: image/gif Size: 105 bytes Desc: graycol.gif URL: From volobuev at us.ibm.com Fri Sep 9 17:29:35 2016 From: volobuev at us.ibm.com (Yuri L Volobuev) Date: Fri, 9 Sep 2016 09:29:35 -0700 Subject: [gpfsug-discuss] Remote cluster mount failing In-Reply-To: References: Message-ID: It could be "easy" in the end, e.g. regenerating the key ("mmauth genkey new") may fix the issue. 
Figuring out exactly what is going wrong is messy though, and requires looking at a number of debug data points, something that's awkward to do on a public mailing list. I don't think you want to post certificates et al on a mailing list. The PMR channel is more appropriate for this kind of thing. yuri From: "Simon Thompson (Research Computing - IT Services)" To: gpfsug main discussion list , Date: 09/09/2016 07:37 AM Subject: Re: [gpfsug-discuss] Remote cluster mount failing Sent by: gpfsug-discuss-bounces at spectrumscale.org That?s sorta what I was expecting. Though I was hoping someone might have said 'oh just run mmchconfig ....' or something easy. PMR on its way in. Thanks! Simon From: on behalf of Yuri L Volobuev Reply-To: "gpfsug-discuss at spectrumscale.org" < gpfsug-discuss at spectrumscale.org> Date: Wednesday, 7 September 2016 at 17:58 To: "gpfsug-discuss at spectrumscale.org" Subject: Re: [gpfsug-discuss] Remote cluster mount failing It's unclear what's wrong. I'd have two main suspects: (1) TLS protocol version confusion, due to a difference in GSKit version and/or configuration (e.g. NIST SP800 compliance) on two sides (2) firewall. TLS issues are usually messy and tedious to work though. I'd recommend opening a PMR to facilitate debug data collection and analysis. A lot of gory detail may be needed to figure out what's going on. yuri Inactive hide details for "Simon Thompson (Research Computing - IT Services)" ---09/07/2016 05:37:11 AM---Hi All, I'm trying to"Simon Thompson (Research Computing - IT Services)" ---09/07/2016 05:37:11 AM---Hi All, I'm trying to get some multi cluster thing working between two of our GPFS From: "Simon Thompson (Research Computing - IT Services)" < S.J.Thompson at bham.ac.uk> To: "gpfsug-discuss at spectrumscale.org" , Date: 09/07/2016 05:37 AM Subject: [gpfsug-discuss] Remote cluster mount failing Sent by: gpfsug-discuss-bounces at spectrumscale.org Hi All, I'm trying to get some multi cluster thing working between two of our GPFS clusters. In the "client" cluster, when trying to mount the "remote" cluster, I get: # mmmount gpfs Wed 7 Sep 13:33:06 BST 2016: mmmount: Mounting file systems ... mount: mount /dev/gpfs on /gpfs failed: Connection timed out mmmount: Command failed. Examine previous error messages to determine cause. And in the log file: Wed Sep 7 13:33:07.481 2016: [N] The client side TLS handshake with node 10.0.0.182 was cancelled: connection reset by peer (return code 420). Wed Sep 7 13:33:07.486 2016: [N] The client side TLS handshake with node 10.0.0.181 was cancelled: connection reset by peer (return code 420). Wed Sep 7 13:33:07.487 2016: [E] Failed to join remote cluster GPFS_STORAGE.CLUSTER Wed Sep 7 13:33:07.488 2016: [W] Command: err 78: mount GPFS_STORAGE.CLUSTER:gpfs Wed Sep 7 13:33:07.489 2016: Connection timed out In the remote cluster, I see: Wed Sep 7 13:33:07.487 2016: [W] The TLS handshake with node 10.0.0.222 failed with error 447 (server side). Wed Sep 7 13:33:07.488 2016: [X] Connection from 10.10.0.35 refused, authentication failed Wed Sep 7 13:33:07.489 2016: [E] Killing connection from 10.10.0.35, err 703 Wed Sep 7 13:33:07.490 2016: Operation not permitted Weirdly though on other nodes in the client cluster this succeeds fine and can mount, so I think I got all the bits in the mmauth and mmremotecluster configured correctly. Any suggestions? 
Thanks Simon _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss [attachment "graycol.gif" deleted by Yuri L Volobuev/Austin/IBM] -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: graycol.gif Type: image/gif Size: 105 bytes Desc: not available URL: From bbanister at jumptrading.com Sat Sep 10 22:50:25 2016 From: bbanister at jumptrading.com (Bryan Banister) Date: Sat, 10 Sep 2016 21:50:25 +0000 Subject: [gpfsug-discuss] Edge Attendees In-Reply-To: References: Message-ID: <21BC488F0AEA2245B2C3E83FC0B33DBB063297AB@CHI-EXCHANGEW1.w2k.jumptrading.com> Hi Doug, Found out that I get to attend this year. Please put me down for the SS NDA round-table, -Bryan From: gpfsug-discuss-bounces at spectrumscale.org [mailto:gpfsug-discuss-bounces at spectrumscale.org] On Behalf Of Douglas O'flaherty Sent: Monday, August 29, 2016 12:34 AM To: gpfsug-discuss at spectrumscale.org Subject: [gpfsug-discuss] Edge Attendees Greetings: I am organizing an NDA round-table with the IBM Offering Managers at IBM Edge on Tuesday, September 20th at 1pm. The subject will be "The Future of IBM Spectrum Scale." IBM Offering Managers are the Product Owners at IBM. There will be discussions covering licensing, the roadmap for IBM Spectrum Scale RAID (aka GNR), new hardware platforms, etc. This is a unique opportunity to get feedback to the drivers of the IBM Spectrum Scale business plans. It should be a great companion to the content we get from Engineering and Research at most User Group meetings. To get an invitation, please email me privately at douglasof us.ibm.com. All who have a valid NDA are invited. I only need an approximate headcount of attendees. Try not to spam the mailing list. I am pushing to get the Offering Managers to have a similar session at SC16 as an IBM Multi-client Briefing. You can add your voice to that call on this thread, or email me directly. Spectrum Scale User Group at SC16 will once again take place on Sunday afternoon with cocktails to follow. I hope we can blow out the attendance numbers and the number of site speakers we had last year! I know Simon, Bob, and Kristy are already working the agenda. Get your ideas in to them or to me. See you in Vegas, Vegas, SLC, Vegas this Fall... Maybe Australia in between? doug Douglas O'Flaherty IBM Spectrum Storage Marketing ________________________________ Note: This email is for the confidential use of the named addressee(s) only and may contain proprietary, confidential or privileged information. If you are not the intended recipient, you are hereby notified that any review, dissemination or copying of this email is strictly prohibited, and to please notify the sender immediately and destroy this email and any attachments. Email transmission cannot be guaranteed to be secure or error-free. The Company, therefore, does not make any guarantees as to the completeness or accuracy of this email or any attachments. This email is for informational purposes only and does not constitute a recommendation, offer, request or solicitation of any kind to buy, sell, subscribe, redeem or perform any type of transaction of a financial product. 
-------------- next part -------------- An HTML attachment was scrubbed... URL: From aaron.s.knister at nasa.gov Sun Sep 11 22:02:48 2016 From: aaron.s.knister at nasa.gov (Knister, Aaron S. (GSFC-606.2)[COMPUTER SCIENCE CORP]) Date: Sun, 11 Sep 2016 21:02:48 +0000 Subject: [gpfsug-discuss] GPFS Routers Message-ID: <5F910253243E6A47B81A9A2EB424BBA101D137D1@NDMSMBX404.ndc.nasa.gov> Hi Everyone, A while back I seem to recall hearing about a mechanism being developed that would function similarly to Lustre's LNET routers and effectively allow a single set of NSD servers to talk to multiple RDMA fabrics without requiring the NSD servers to have infiniband interfaces on each RDMA fabric. Rather, one would have a set of GPFS gateway nodes on each fabric that would in effect proxy the RDMA requests to the NSD server. Does anyone know what I'm talking about? Just curious if it's still on the roadmap. -Aaron -------------- next part -------------- An HTML attachment was scrubbed... URL: From Robert.Oesterlin at nuance.com Sun Sep 11 23:31:56 2016 From: Robert.Oesterlin at nuance.com (Oesterlin, Robert) Date: Sun, 11 Sep 2016 22:31:56 +0000 Subject: [gpfsug-discuss] Grafana Bridge Code - for GPFS Performance Sensors - Now on the IBM Wiki Message-ID: <2B003708-B2E3-474B-8035-F3A080CB2EAF@nuance.com> IBM has formally published this bridge code - and you can get the details and download it here: IBM Spectrum Scale Performance Monitoring Bridge https://www.ibm.com/developerworks/community/wikis/home?lang=en#!/wiki/General%20Parallel%20File%20System%20(GPFS)/page/IBM%20Spectrum%20Scale%20Performance Monitoring%20Bridge Also, see this Storage Community Blog Post (it references version 4.2.2, but I think they mean 4.2.1) http://storagecommunity.org/easyblog/entry/performance-data-graphical-visualization-for-ibm-spectrum-scale-environment I've been using it for a while - if you have any questions, let me know. Bob Oesterlin Sr Storage Engineer, Nuance HPC Grid -------------- next part -------------- An HTML attachment was scrubbed... URL: From aaron.s.knister at nasa.gov Mon Sep 12 01:00:32 2016 From: aaron.s.knister at nasa.gov (Aaron Knister) Date: Sun, 11 Sep 2016 20:00:32 -0400 Subject: [gpfsug-discuss] GPFS Routers In-Reply-To: <5F910253243E6A47B81A9A2EB424BBA101D137D1@NDMSMBX404.ndc.nasa.gov> References: <5F910253243E6A47B81A9A2EB424BBA101D137D1@NDMSMBX404.ndc.nasa.gov> Message-ID: <712e5024-e6eb-9195-d9cd-f59a9b145e60@nasa.gov> After some googling around, I wonder if perhaps what I'm thinking of was an I/O forwarding layer that I understood was being developed for x86_64 type machines rather than some type of GPFS protocol router or proxy. -Aaron On 9/11/16 5:02 PM, Knister, Aaron S. (GSFC-606.2)[COMPUTER SCIENCE CORP] wrote: > Hi Everyone, > > A while back I seem to recall hearing about a mechanism being developed > that would function similarly to Lustre's LNET routers and effectively > allow a single set of NSD servers to talk to multiple RDMA fabrics > without requiring the NSD servers to have infiniband interfaces on each > RDMA fabric. Rather, one would have a set of GPFS gateway nodes on each > fabric that would in effect proxy the RDMA requests to the NSD server. > Does anyone know what I'm talking about? Just curious if it's still on > the roadmap. 
> > -Aaron > > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > -- Aaron Knister NASA Center for Climate Simulation (Code 606.2) Goddard Space Flight Center (301) 286-2776 From douglasof at us.ibm.com Mon Sep 12 02:38:08 2016 From: douglasof at us.ibm.com (Douglas O'flaherty) Date: Sun, 11 Sep 2016 21:38:08 -0400 Subject: [gpfsug-discuss] gpfsug-discuss Digest, Vol 56, Issue 17 In-Reply-To: References: Message-ID: See you... and anyone else who can make it in Vegas in a couple weeks! From: gpfsug-discuss-request at spectrumscale.org To: gpfsug-discuss at spectrumscale.org Date: 09/11/2016 07:00 AM Subject: gpfsug-discuss Digest, Vol 56, Issue 17 Sent by: gpfsug-discuss-bounces at spectrumscale.org Send gpfsug-discuss mailing list submissions to gpfsug-discuss at spectrumscale.org To subscribe or unsubscribe via the World Wide Web, visit http://gpfsug.org/mailman/listinfo/gpfsug-discuss or, via email, send a message with subject or body 'help' to gpfsug-discuss-request at spectrumscale.org You can reach the person managing the list at gpfsug-discuss-owner at spectrumscale.org When replying, please edit your Subject line so it is more specific than "Re: Contents of gpfsug-discuss digest..." Today's Topics: 1. Re: Edge Attendees (Bryan Banister) ----- Message from Bryan Banister on Sat, 10 Sep 2016 21:50:25 +0000 ----- To: gpfsug main discussion list Subject: Re: [gpfsug-discuss] Edge Attendees Hi Doug, Found out that I get to attend this year. Please put me down for the SS NDA round-table, -Bryan From: gpfsug-discuss-bounces at spectrumscale.org [ mailto:gpfsug-discuss-bounces at spectrumscale.org] On Behalf Of Douglas O'flaherty Sent: Monday, August 29, 2016 12:34 AM To: gpfsug-discuss at spectrumscale.org Subject: [gpfsug-discuss] Edge Attendees Greetings: I am organizing an NDA round-table with the IBM Offering Managers at IBM Edge on Tuesday, September 20th at 1pm. The subject will be "The Future of IBM Spectrum Scale." IBM Offering Managers are the Product Owners at IBM. There will be discussions covering licensing, the roadmap for IBM Spectrum Scale RAID (aka GNR), new hardware platforms, etc. This is a unique opportunity to get feedback to the drivers of the IBM Spectrum Scale business plans. It should be a great companion to the content we get from Engineering and Research at most User Group meetings. To get an invitation, please email me privately at douglasof us.ibm.com. All who have a valid NDA are invited. I only need an approximate headcount of attendees. Try not to spam the mailing list. I am pushing to get the Offering Managers to have a similar session at SC16 as an IBM Multi-client Briefing. You can add your voice to that call on this thread, or email me directly. Spectrum Scale User Group at SC16 will once again take place on Sunday afternoon with cocktails to follow. I hope we can blow out the attendance numbers and the number of site speakers we had last year! I know Simon, Bob, and Kristy are already working the agenda. Get your ideas in to them or to me. See you in Vegas, Vegas, SLC, Vegas this Fall... Maybe Australia in between? doug Douglas O'Flaherty IBM Spectrum Storage Marketing Note: This email is for the confidential use of the named addressee(s) only and may contain proprietary, confidential or privileged information. 
If you are not the intended recipient, you are hereby notified that any review, dissemination or copying of this email is strictly prohibited, and to please notify the sender immediately and destroy this email and any attachments. Email transmission cannot be guaranteed to be secure or error-free. The Company, therefore, does not make any guarantees as to the completeness or accuracy of this email or any attachments. This email is for informational purposes only and does not constitute a recommendation, offer, request or solicitation of any kind to buy, sell, subscribe, redeem or perform any type of transaction of a financial product. _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss -------------- next part -------------- An HTML attachment was scrubbed... URL: From knop at us.ibm.com Mon Sep 12 06:17:05 2016 From: knop at us.ibm.com (Felipe Knop) Date: Mon, 12 Sep 2016 01:17:05 -0400 Subject: [gpfsug-discuss] Remote cluster mount failing In-Reply-To: References: Message-ID: There is a chance the problem might be related to an upgrade from 3.5 to 4.1, or perhaps a remote mount between versions 3.5 and 4.1. It would be useful to know details related to any such migration and different releases when the PMR is opened. Thanks, Felipe ---- Felipe Knop knop at us.ibm.com GPFS Development and Security IBM Systems IBM Building 008 2455 South Rd, Poughkeepsie, NY 12601 (845) 433-9314 T/L 293-9314 From: Yuri L Volobuev/Austin/IBM at IBMUS To: gpfsug main discussion list Date: 09/09/2016 12:30 PM Subject: Re: [gpfsug-discuss] Remote cluster mount failing Sent by: gpfsug-discuss-bounces at spectrumscale.org It could be "easy" in the end, e.g. regenerating the key ("mmauth genkey new") may fix the issue. Figuring out exactly what is going wrong is messy though, and requires looking at a number of debug data points, something that's awkward to do on a public mailing list. I don't think you want to post certificates et al on a mailing list. The PMR channel is more appropriate for this kind of thing. yuri "Simon Thompson (Research Computing - IT Services)" ---09/09/2016 07:37:52 AM---That?s sorta what I was expecting. Though I was hoping someone might have said 'oh just run mmchconf From: "Simon Thompson (Research Computing - IT Services)" To: gpfsug main discussion list , Date: 09/09/2016 07:37 AM Subject: Re: [gpfsug-discuss] Remote cluster mount failing Sent by: gpfsug-discuss-bounces at spectrumscale.org That?s sorta what I was expecting. Though I was hoping someone might have said 'oh just run mmchconfig ....' or something easy. PMR on its way in. Thanks! Simon From: on behalf of Yuri L Volobuev Reply-To: "gpfsug-discuss at spectrumscale.org" < gpfsug-discuss at spectrumscale.org> Date: Wednesday, 7 September 2016 at 17:58 To: "gpfsug-discuss at spectrumscale.org" Subject: Re: [gpfsug-discuss] Remote cluster mount failing It's unclear what's wrong. I'd have two main suspects: (1) TLS protocol version confusion, due to a difference in GSKit version and/or configuration (e.g. NIST SP800 compliance) on two sides (2) firewall. TLS issues are usually messy and tedious to work though. I'd recommend opening a PMR to facilitate debug data collection and analysis. A lot of gory detail may be needed to figure out what's going on. 
yuri "Simon Thompson (Research Computing - IT Services)" ---09/07/2016 05:37:11 AM---Hi All, I'm trying to get some multi cluster thing working between two of our GPFS From: "Simon Thompson (Research Computing - IT Services)" < S.J.Thompson at bham.ac.uk> To: "gpfsug-discuss at spectrumscale.org" , Date: 09/07/2016 05:37 AM Subject: [gpfsug-discuss] Remote cluster mount failing Sent by: gpfsug-discuss-bounces at spectrumscale.org Hi All, I'm trying to get some multi cluster thing working between two of our GPFS clusters. In the "client" cluster, when trying to mount the "remote" cluster, I get: # mmmount gpfs Wed 7 Sep 13:33:06 BST 2016: mmmount: Mounting file systems ... mount: mount /dev/gpfs on /gpfs failed: Connection timed out mmmount: Command failed. Examine previous error messages to determine cause. And in the log file: Wed Sep 7 13:33:07.481 2016: [N] The client side TLS handshake with node 10.0.0.182 was cancelled: connection reset by peer (return code 420). Wed Sep 7 13:33:07.486 2016: [N] The client side TLS handshake with node 10.0.0.181 was cancelled: connection reset by peer (return code 420). Wed Sep 7 13:33:07.487 2016: [E] Failed to join remote cluster GPFS_STORAGE.CLUSTER Wed Sep 7 13:33:07.488 2016: [W] Command: err 78: mount GPFS_STORAGE.CLUSTER:gpfs Wed Sep 7 13:33:07.489 2016: Connection timed out In the remote cluster, I see: Wed Sep 7 13:33:07.487 2016: [W] The TLS handshake with node 10.0.0.222 failed with error 447 (server side). Wed Sep 7 13:33:07.488 2016: [X] Connection from 10.10.0.35 refused, authentication failed Wed Sep 7 13:33:07.489 2016: [E] Killing connection from 10.10.0.35, err 703 Wed Sep 7 13:33:07.490 2016: Operation not permitted Weirdly though on other nodes in the client cluster this succeeds fine and can mount, so I think I got all the bits in the mmauth and mmremotecluster configured correctly. Any suggestions? Thanks Simon _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss [attachment "graycol.gif" deleted by Yuri L Volobuev/Austin/IBM] _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: not available Type: image/gif Size: 105 bytes Desc: not available URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: not available Type: image/gif Size: 105 bytes Desc: not available URL: From makaplan at us.ibm.com Mon Sep 12 15:48:56 2016 From: makaplan at us.ibm.com (Marc A Kaplan) Date: Mon, 12 Sep 2016 10:48:56 -0400 Subject: [gpfsug-discuss] GPFS Routers In-Reply-To: <712e5024-e6eb-9195-d9cd-f59a9b145e60@nasa.gov> References: <5F910253243E6A47B81A9A2EB424BBA101D137D1@NDMSMBX404.ndc.nasa.gov> <712e5024-e6eb-9195-d9cd-f59a9b145e60@nasa.gov> Message-ID: Perhaps if you clearly describe what equipment and connections you have in place and what you're trying to accomplish, someone on this board can propose a solution. In principle, it's always possible to insert proxies/routers to "fake" any two endpoints into "believing" they are communicating directly. 
From: Aaron Knister To: Date: 09/11/2016 08:01 PM Subject: Re: [gpfsug-discuss] GPFS Routers Sent by: gpfsug-discuss-bounces at spectrumscale.org After some googling around, I wonder if perhaps what I'm thinking of was an I/O forwarding layer that I understood was being developed for x86_64 type machines rather than some type of GPFS protocol router or proxy. -Aaron On 9/11/16 5:02 PM, Knister, Aaron S. (GSFC-606.2)[COMPUTER SCIENCE CORP] wrote: > Hi Everyone, > > A while back I seem to recall hearing about a mechanism being developed > that would function similarly to Lustre's LNET routers and effectively > allow a single set of NSD servers to talk to multiple RDMA fabrics > without requiring the NSD servers to have infiniband interfaces on each > RDMA fabric. Rather, one would have a set of GPFS gateway nodes on each > fabric that would in effect proxy the RDMA requests to the NSD server. > Does anyone know what I'm talking about? Just curious if it's still on > the roadmap. > > -Aaron > > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > -- Aaron Knister NASA Center for Climate Simulation (Code 606.2) Goddard Space Flight Center (301) 286-2776 _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss -------------- next part -------------- An HTML attachment was scrubbed... URL: From xhejtman at ics.muni.cz Mon Sep 12 15:57:55 2016 From: xhejtman at ics.muni.cz (Lukas Hejtmanek) Date: Mon, 12 Sep 2016 16:57:55 +0200 Subject: [gpfsug-discuss] gpfs 4.2.1 and samba export Message-ID: <20160912145755.xhx2du4c3aimkkxt@ics.muni.cz> Hello, I have GPFS version 4.2.1 on Centos 7.2 (kernel 3.10.0-327.22.2.el7.x86_64) and I have got some weird behavior of samba. Windows clients get stucked for almost 1 minute when copying files. I traced down the problematic syscall: 27887 16:39:28.000401 utimensat(AT_FDCWD, "000000-My_Documents/Windows/InfusedApps/Packages/Microsoft.Messaging_1.10.22012.0_x86__8wekyb3d8bbwe/SkypeApp/View/HomePage.xaml", {{1473691167, 940424000}, {1473691168, 295355}}, 0) = 0 <74.999775> [...] 27887 16:44:24.000310 utimensat(AT_FDCWD, "000000-My_Documents/Windows/InfusedApps/Packages/Microsoft.Windows.Photos_15.1001.16470.0_x64__8wekyb3d8bbwe/Assets/PhotosAppList.contrast-white_targetsize-16.png", {{1473691463, 931319000}, {1473691464, 96608}}, 0) = 0 <74.999841> [...] 27887 16:50:34.002274 utimensat(AT_FDCWD, "000000-My_Documents/Windows/InfusedApps/Packages/Microsoft.XboxApp_9.9.30030.0_x64__8wekyb3d8bbwe/_Resources/50.rsrc", {{1473691833, 952166000}, {1473691834, 2166223}}, 0) = 0 <74.997877> [...] 27887 16:53:11.000240 utimensat(AT_FDCWD, "000000-My_Documents/Windows/InfusedApps/Packages/Microsoft.ZuneVideo_3.6.13251.0_x64__8wekyb3d8bbwe/Styles/CommonBrushes.xbf", {{1473691990, 948668000}, {1473691991, 131221}}, 0) = 0 <74.999540> it seems that from time to time, utimensat(2) call takes over 70 (!!) seconds. Normal utimensat syscall looks like: 27887 16:55:16.238132 utimensat(AT_FDCWD, "000000-My_Documents/Windows/Installer/$PatchCache$/Managed/00004109210000000000000000F01FEC/14.0.7015/ACEODDBS.DLL", {{1473692116, 196458000}, {1351702318, 0}}, 0) = 0 <0.000065> At the same time, there is untar running. When samba freezes at utimensat call, untar continues to write data to GPFS (same fs as samba), so it does not seem to me as buffers flush. 
When the syscall is stucked, I/O utilization of all GPFS disks is below 10 %. mmfsadm dump waiters shows nothing waiting and any cluster node. So any ideas? Or should I just fire PMR? This is cluster config: clusterId 2745894253048382857 autoload no dmapiFileHandleSize 32 minReleaseLevel 4.2.1.0 ccrEnabled yes maxMBpS 20000 maxblocksize 8M cipherList AUTHONLY maxFilesToCache 10000 nsdSmallThreadRatio 1 nsdMaxWorkerThreads 480 ignorePrefetchLUNCount yes pagepool 48G prefetchThreads 320 worker1Threads 320 writebehindThreshhold 10485760 cifsBypassShareLocksOnRename yes cifsBypassTraversalChecking yes allowWriteWithDeleteChild yes adminMode central And this is file system config: flag value description ------------------- ------------------------ ----------------------------------- -f 65536 Minimum fragment size in bytes -i 4096 Inode size in bytes -I 32768 Indirect block size in bytes -m 1 Default number of metadata replicas -M 2 Maximum number of metadata replicas -r 1 Default number of data replicas -R 2 Maximum number of data replicas -j cluster Block allocation type -D nfs4 File locking semantics in effect -k all ACL semantics in effect -n 32 Estimated number of nodes that will mount file system -B 2097152 Block size -Q user;group;fileset Quotas accounting enabled user;group;fileset Quotas enforced none Default quotas enabled --perfileset-quota Yes Per-fileset quota enforcement --filesetdf Yes Fileset df enabled? -V 15.01 (4.2.0.0) File system version --create-time Wed Aug 24 17:38:40 2016 File system creation time -z No Is DMAPI enabled? -L 4194304 Logfile size -E Yes Exact mtime mount option -S No Suppress atime mount option -K whenpossible Strict replica allocation option --fastea Yes Fast external attributes enabled? --encryption No Encryption enabled? --inode-limit 402653184 Maximum number of inodes in all inode spaces --log-replicas 0 Number of log replicas --is4KAligned Yes is4KAligned? --rapid-repair Yes rapidRepair enabled? --write-cache-threshold 0 HAWC Threshold (max 65536) -P system Disk storage pools in file system -d nsd_A_m;nsd_B_m;nsd_C_m;nsd_D_m;nsd_A_LV1_d;nsd_A_LV2_d;nsd_A_LV3_d;nsd_A_LV4_d;nsd_B_LV1_d;nsd_B_LV2_d;nsd_B_LV3_d;nsd_B_LV4_d;nsd_C_LV1_d;nsd_C_LV2_d;nsd_C_LV3_d; -d nsd_C_LV4_d;nsd_D_LV1_d;nsd_D_LV2_d;nsd_D_LV3_d;nsd_D_LV4_d Disks in file system -A yes Automatic mount option -o none Additional mount options -T /gpfs/vol1 Default mount point --mount-priority 1 Mount priority -- Luk?? Hejtm?nek From chekh at stanford.edu Mon Sep 12 20:03:15 2016 From: chekh at stanford.edu (Alex Chekholko) Date: Mon, 12 Sep 2016 12:03:15 -0700 Subject: [gpfsug-discuss] big difference between output of 'mmlsquota' and 'du'? Message-ID: Hi, For a fileset with a quota on it, we have mmlsquota reporting 39TB utilization (out of 50TB quota), with 0 in_doubt. Running a 'du' on the same directory (where the fileset is junctioned) shows 21TB usage. I looked for sparse files (files that report different size via ls vs du). I looked at 'du --apparent-size ...' https://en.wikipedia.org/wiki/Sparse_file What else could it be? Is there some attribute I can scan for inside GPFS? Maybe where FILE_SIZE does not equal KB_ALLOCATED? 
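(For what it's worth, one way to run such a scan is an mmapplypolicy LIST rule comparing the two attributes. The rule below is only a sketch: the policy file name, list name and output prefix are invented, and with -I defer the matches should land in list files named after the -f prefix.)

/* sparse.pol: files whose allocated space is smaller than their apparent size */
RULE 'sparse' LIST 'sparsefiles'
     SHOW(VARCHAR(FILE_SIZE) || ' ' || VARCHAR(KB_ALLOCATED))
     WHERE (KB_ALLOCATED * 1024) < FILE_SIZE

# Scan just the fileset path and write the candidate lists to /tmp
mmapplypolicy /srv/gsfs0/projects/gbsc -P sparse.pol -f /tmp/gbsc -I defer
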
https://www.ibm.com/support/knowledgecenter/STXKQY_4.2.0/com.ibm.spectrum.scale.v4r2.adv.doc/bl1adv_usngfileattrbts.htm [root at scg-gs0 ~]# du -sm --apparent-size /srv/gsfs0/projects/gbsc/* 3977 /srv/gsfs0/projects/gbsc/Backups 1 /srv/gsfs0/projects/gbsc/benchmark 13109 /srv/gsfs0/projects/gbsc/Billing 198719 /srv/gsfs0/projects/gbsc/Clinical 1 /srv/gsfs0/projects/gbsc/Clinical_Vendors 1206523 /srv/gsfs0/projects/gbsc/Data 1 /srv/gsfs0/projects/gbsc/iPoP 123165 /srv/gsfs0/projects/gbsc/Macrogen 58676 /srv/gsfs0/projects/gbsc/Misc 6625890 /srv/gsfs0/projects/gbsc/mva 1 /srv/gsfs0/projects/gbsc/Proj 17 /srv/gsfs0/projects/gbsc/Projects 3290502 /srv/gsfs0/projects/gbsc/Resources 1 /srv/gsfs0/projects/gbsc/SeqCenter 1 /srv/gsfs0/projects/gbsc/share 514041 /srv/gsfs0/projects/gbsc/SNAP_Scoring 1 /srv/gsfs0/projects/gbsc/TCGA_Variants 267873 /srv/gsfs0/projects/gbsc/tools 9597797 /srv/gsfs0/projects/gbsc/workspace (adds up to about 21TB) [root at scg-gs0 ~]# mmlsquota -j projects.gbsc --block-size=G gsfs0 Block Limits | File Limits Filesystem type GB quota limit in_doubt grace | files quota limit in_doubt grace Remarks gsfs0 FILESET 39889 51200 51200 0 none | 1663212 0 0 4 none [root at scg-gs0 ~]# mmlsfileset gsfs0 |grep gbsc projects.gbsc Linked /srv/gsfs0/projects/gbsc Regards, -- Alex Chekholko chekh at stanford.edu From bbanister at jumptrading.com Mon Sep 12 20:06:59 2016 From: bbanister at jumptrading.com (Bryan Banister) Date: Mon, 12 Sep 2016 19:06:59 +0000 Subject: [gpfsug-discuss] big difference between output of 'mmlsquota' and 'du'? In-Reply-To: References: Message-ID: <21BC488F0AEA2245B2C3E83FC0B33DBB0632A645@CHI-EXCHANGEW1.w2k.jumptrading.com> I'd recommend running a mmcheckquota and then check mmlsquota again, -Bryan -----Original Message----- From: gpfsug-discuss-bounces at spectrumscale.org [mailto:gpfsug-discuss-bounces at spectrumscale.org] On Behalf Of Alex Chekholko Sent: Monday, September 12, 2016 2:03 PM To: gpfsug-discuss at spectrumscale.org Subject: [gpfsug-discuss] big difference between output of 'mmlsquota' and 'du'? Hi, For a fileset with a quota on it, we have mmlsquota reporting 39TB utilization (out of 50TB quota), with 0 in_doubt. Running a 'du' on the same directory (where the fileset is junctioned) shows 21TB usage. I looked for sparse files (files that report different size via ls vs du). I looked at 'du --apparent-size ...' https://en.wikipedia.org/wiki/Sparse_file What else could it be? Is there some attribute I can scan for inside GPFS? Maybe where FILE_SIZE does not equal KB_ALLOCATED? 
https://www.ibm.com/support/knowledgecenter/STXKQY_4.2.0/com.ibm.spectrum.scale.v4r2.adv.doc/bl1adv_usngfileattrbts.htm [root at scg-gs0 ~]# du -sm --apparent-size /srv/gsfs0/projects/gbsc/* 3977 /srv/gsfs0/projects/gbsc/Backups 1 /srv/gsfs0/projects/gbsc/benchmark 13109 /srv/gsfs0/projects/gbsc/Billing 198719 /srv/gsfs0/projects/gbsc/Clinical 1 /srv/gsfs0/projects/gbsc/Clinical_Vendors 1206523 /srv/gsfs0/projects/gbsc/Data 1 /srv/gsfs0/projects/gbsc/iPoP 123165 /srv/gsfs0/projects/gbsc/Macrogen 58676 /srv/gsfs0/projects/gbsc/Misc 6625890 /srv/gsfs0/projects/gbsc/mva 1 /srv/gsfs0/projects/gbsc/Proj 17 /srv/gsfs0/projects/gbsc/Projects 3290502 /srv/gsfs0/projects/gbsc/Resources 1 /srv/gsfs0/projects/gbsc/SeqCenter 1 /srv/gsfs0/projects/gbsc/share 514041 /srv/gsfs0/projects/gbsc/SNAP_Scoring 1 /srv/gsfs0/projects/gbsc/TCGA_Variants 267873 /srv/gsfs0/projects/gbsc/tools 9597797 /srv/gsfs0/projects/gbsc/workspace (adds up to about 21TB) [root at scg-gs0 ~]# mmlsquota -j projects.gbsc --block-size=G gsfs0 Block Limits | File Limits Filesystem type GB quota limit in_doubt grace | files quota limit in_doubt grace Remarks gsfs0 FILESET 39889 51200 51200 0 none | 1663212 0 0 4 none [root at scg-gs0 ~]# mmlsfileset gsfs0 |grep gbsc projects.gbsc Linked /srv/gsfs0/projects/gbsc Regards, -- Alex Chekholko chekh at stanford.edu _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss ________________________________ Note: This email is for the confidential use of the named addressee(s) only and may contain proprietary, confidential or privileged information. If you are not the intended recipient, you are hereby notified that any review, dissemination or copying of this email is strictly prohibited, and to please notify the sender immediately and destroy this email and any attachments. Email transmission cannot be guaranteed to be secure or error-free. The Company, therefore, does not make any guarantees as to the completeness or accuracy of this email or any attachments. This email is for informational purposes only and does not constitute a recommendation, offer, request or solicitation of any kind to buy, sell, subscribe, redeem or perform any type of transaction of a financial product. From Kevin.Buterbaugh at Vanderbilt.Edu Mon Sep 12 20:08:28 2016 From: Kevin.Buterbaugh at Vanderbilt.Edu (Buterbaugh, Kevin L) Date: Mon, 12 Sep 2016 19:08:28 +0000 Subject: [gpfsug-discuss] big difference between output of 'mmlsquota' and 'du'? In-Reply-To: References: Message-ID: Hi Alex, While the numbers don?t match exactly, they?re close enough to prompt me to ask if data replication is possibly set to two? Thanks? Kevin On Sep 12, 2016, at 2:03 PM, Alex Chekholko > wrote: Hi, For a fileset with a quota on it, we have mmlsquota reporting 39TB utilization (out of 50TB quota), with 0 in_doubt. Running a 'du' on the same directory (where the fileset is junctioned) shows 21TB usage. I looked for sparse files (files that report different size via ls vs du). I looked at 'du --apparent-size ...' https://en.wikipedia.org/wiki/Sparse_file What else could it be? Is there some attribute I can scan for inside GPFS? Maybe where FILE_SIZE does not equal KB_ALLOCATED? 
https://www.ibm.com/support/knowledgecenter/STXKQY_4.2.0/com.ibm.spectrum.scale.v4r2.adv.doc/bl1adv_usngfileattrbts.htm [root at scg-gs0 ~]# du -sm --apparent-size /srv/gsfs0/projects/gbsc/* 3977 /srv/gsfs0/projects/gbsc/Backups 1 /srv/gsfs0/projects/gbsc/benchmark 13109 /srv/gsfs0/projects/gbsc/Billing 198719 /srv/gsfs0/projects/gbsc/Clinical 1 /srv/gsfs0/projects/gbsc/Clinical_Vendors 1206523 /srv/gsfs0/projects/gbsc/Data 1 /srv/gsfs0/projects/gbsc/iPoP 123165 /srv/gsfs0/projects/gbsc/Macrogen 58676 /srv/gsfs0/projects/gbsc/Misc 6625890 /srv/gsfs0/projects/gbsc/mva 1 /srv/gsfs0/projects/gbsc/Proj 17 /srv/gsfs0/projects/gbsc/Projects 3290502 /srv/gsfs0/projects/gbsc/Resources 1 /srv/gsfs0/projects/gbsc/SeqCenter 1 /srv/gsfs0/projects/gbsc/share 514041 /srv/gsfs0/projects/gbsc/SNAP_Scoring 1 /srv/gsfs0/projects/gbsc/TCGA_Variants 267873 /srv/gsfs0/projects/gbsc/tools 9597797 /srv/gsfs0/projects/gbsc/workspace (adds up to about 21TB) [root at scg-gs0 ~]# mmlsquota -j projects.gbsc --block-size=G gsfs0 Block Limits | File Limits Filesystem type GB quota limit in_doubt grace | files quota limit in_doubt grace Remarks gsfs0 FILESET 39889 51200 51200 0 none | 1663212 0 0 4 none [root at scg-gs0 ~]# mmlsfileset gsfs0 |grep gbsc projects.gbsc Linked /srv/gsfs0/projects/gbsc Regards, -- Alex Chekholko chekh at stanford.edu _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss ? Kevin Buterbaugh - Senior System Administrator Vanderbilt University - Advanced Computing Center for Research and Education Kevin.Buterbaugh at vanderbilt.edu - (615)875-9633 -------------- next part -------------- An HTML attachment was scrubbed... URL: From r.sobey at imperial.ac.uk Mon Sep 12 21:26:51 2016 From: r.sobey at imperial.ac.uk (Sobey, Richard A) Date: Mon, 12 Sep 2016 20:26:51 +0000 Subject: [gpfsug-discuss] big difference between output of 'mmlsquota' and 'du'? In-Reply-To: References: Message-ID: My thoughts exactly. Richard From: gpfsug-discuss-bounces at spectrumscale.org [mailto:gpfsug-discuss-bounces at spectrumscale.org] On Behalf Of Buterbaugh, Kevin L Sent: 12 September 2016 20:08 To: gpfsug main discussion list Subject: Re: [gpfsug-discuss] big difference between output of 'mmlsquota' and 'du'? Hi Alex, While the numbers don?t match exactly, they?re close enough to prompt me to ask if data replication is possibly set to two? Thanks? Kevin On Sep 12, 2016, at 2:03 PM, Alex Chekholko > wrote: Hi, For a fileset with a quota on it, we have mmlsquota reporting 39TB utilization (out of 50TB quota), with 0 in_doubt. Running a 'du' on the same directory (where the fileset is junctioned) shows 21TB usage. I looked for sparse files (files that report different size via ls vs du). I looked at 'du --apparent-size ...' https://en.wikipedia.org/wiki/Sparse_file What else could it be? Is there some attribute I can scan for inside GPFS? Maybe where FILE_SIZE does not equal KB_ALLOCATED? 
https://www.ibm.com/support/knowledgecenter/STXKQY_4.2.0/com.ibm.spectrum.scale.v4r2.adv.doc/bl1adv_usngfileattrbts.htm [root at scg-gs0 ~]# du -sm --apparent-size /srv/gsfs0/projects/gbsc/* 3977 /srv/gsfs0/projects/gbsc/Backups 1 /srv/gsfs0/projects/gbsc/benchmark 13109 /srv/gsfs0/projects/gbsc/Billing 198719 /srv/gsfs0/projects/gbsc/Clinical 1 /srv/gsfs0/projects/gbsc/Clinical_Vendors 1206523 /srv/gsfs0/projects/gbsc/Data 1 /srv/gsfs0/projects/gbsc/iPoP 123165 /srv/gsfs0/projects/gbsc/Macrogen 58676 /srv/gsfs0/projects/gbsc/Misc 6625890 /srv/gsfs0/projects/gbsc/mva 1 /srv/gsfs0/projects/gbsc/Proj 17 /srv/gsfs0/projects/gbsc/Projects 3290502 /srv/gsfs0/projects/gbsc/Resources 1 /srv/gsfs0/projects/gbsc/SeqCenter 1 /srv/gsfs0/projects/gbsc/share 514041 /srv/gsfs0/projects/gbsc/SNAP_Scoring 1 /srv/gsfs0/projects/gbsc/TCGA_Variants 267873 /srv/gsfs0/projects/gbsc/tools 9597797 /srv/gsfs0/projects/gbsc/workspace (adds up to about 21TB) [root at scg-gs0 ~]# mmlsquota -j projects.gbsc --block-size=G gsfs0 Block Limits | File Limits Filesystem type GB quota limit in_doubt grace | files quota limit in_doubt grace Remarks gsfs0 FILESET 39889 51200 51200 0 none | 1663212 0 0 4 none [root at scg-gs0 ~]# mmlsfileset gsfs0 |grep gbsc projects.gbsc Linked /srv/gsfs0/projects/gbsc Regards, -- Alex Chekholko chekh at stanford.edu _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss ? Kevin Buterbaugh - Senior System Administrator Vanderbilt University - Advanced Computing Center for Research and Education Kevin.Buterbaugh at vanderbilt.edu - (615)875-9633 -------------- next part -------------- An HTML attachment was scrubbed... URL: From laurence at qsplace.co.uk Mon Sep 12 21:46:55 2016 From: laurence at qsplace.co.uk (Laurence Horrocks-Barlow) Date: Mon, 12 Sep 2016 21:46:55 +0100 Subject: [gpfsug-discuss] big difference between output of 'mmlsquota' and 'du'? In-Reply-To: References: Message-ID: <2C38B1C8-66DB-45C6-AA5D-E612F5BFE935@qsplace.co.uk> However replicated files should show up with ls as taking about double the space. I.e. "ls -lash" 49G -r-------- 1 root root 25G Sep 12 21:11 Somefile I know you've said you checked ls vs du for allocated space it might be worth a double check. Also check that you haven't got a load of snapshots, especially if you have high file churn which will create new blocks; although with your figures it'd have to be very high file churn. -- Lauz On 12 September 2016 21:26:51 BST, "Sobey, Richard A" wrote: >My thoughts exactly. > >Richard > >From: gpfsug-discuss-bounces at spectrumscale.org >[mailto:gpfsug-discuss-bounces at spectrumscale.org] On Behalf Of >Buterbaugh, Kevin L >Sent: 12 September 2016 20:08 >To: gpfsug main discussion list >Subject: Re: [gpfsug-discuss] big difference between output of >'mmlsquota' and 'du'? > >Hi Alex, > >While the numbers don?t match exactly, they?re close enough to prompt >me to ask if data replication is possibly set to two? Thanks? > >Kevin > >On Sep 12, 2016, at 2:03 PM, Alex Chekholko >> wrote: > >Hi, > >For a fileset with a quota on it, we have mmlsquota reporting 39TB >utilization (out of 50TB quota), with 0 in_doubt. > >Running a 'du' on the same directory (where the fileset is junctioned) >shows 21TB usage. > >I looked for sparse files (files that report different size via ls vs >du). I looked at 'du --apparent-size ...' > >https://en.wikipedia.org/wiki/Sparse_file > >What else could it be? 
> >Is there some attribute I can scan for inside GPFS? >Maybe where FILE_SIZE does not equal KB_ALLOCATED? >https://www.ibm.com/support/knowledgecenter/STXKQY_4.2.0/com.ibm.spectrum.scale.v4r2.adv.doc/bl1adv_usngfileattrbts.htm > > >[root at scg-gs0 ~]# du -sm --apparent-size /srv/gsfs0/projects/gbsc/* >3977 /srv/gsfs0/projects/gbsc/Backups >1 /srv/gsfs0/projects/gbsc/benchmark >13109 /srv/gsfs0/projects/gbsc/Billing >198719 /srv/gsfs0/projects/gbsc/Clinical >1 /srv/gsfs0/projects/gbsc/Clinical_Vendors >1206523 /srv/gsfs0/projects/gbsc/Data >1 /srv/gsfs0/projects/gbsc/iPoP >123165 /srv/gsfs0/projects/gbsc/Macrogen >58676 /srv/gsfs0/projects/gbsc/Misc >6625890 /srv/gsfs0/projects/gbsc/mva >1 /srv/gsfs0/projects/gbsc/Proj >17 /srv/gsfs0/projects/gbsc/Projects >3290502 /srv/gsfs0/projects/gbsc/Resources >1 /srv/gsfs0/projects/gbsc/SeqCenter >1 /srv/gsfs0/projects/gbsc/share >514041 /srv/gsfs0/projects/gbsc/SNAP_Scoring >1 /srv/gsfs0/projects/gbsc/TCGA_Variants >267873 /srv/gsfs0/projects/gbsc/tools >9597797 /srv/gsfs0/projects/gbsc/workspace > >(adds up to about 21TB) > >[root at scg-gs0 ~]# mmlsquota -j projects.gbsc --block-size=G gsfs0 > Block Limits | File Limits >Filesystem type GB quota limit in_doubt >grace | files quota limit in_doubt grace Remarks >gsfs0 FILESET 39889 51200 51200 0 >none | 1663212 0 0 4 none > > >[root at scg-gs0 ~]# mmlsfileset gsfs0 |grep gbsc >projects.gbsc Linked /srv/gsfs0/projects/gbsc > >Regards, >-- >Alex Chekholko chekh at stanford.edu > >_______________________________________________ >gpfsug-discuss mailing list >gpfsug-discuss at spectrumscale.org >http://gpfsug.org/mailman/listinfo/gpfsug-discuss > >? >Kevin Buterbaugh - Senior System Administrator >Vanderbilt University - Advanced Computing Center for Research and >Education >Kevin.Buterbaugh at vanderbilt.edu >- (615)875-9633 > > > > > >------------------------------------------------------------------------ > >_______________________________________________ >gpfsug-discuss mailing list >gpfsug-discuss at spectrumscale.org >http://gpfsug.org/mailman/listinfo/gpfsug-discuss -- Sent from my Android device with K-9 Mail. Please excuse my brevity. -------------- next part -------------- An HTML attachment was scrubbed... URL: From janfrode at tanso.net Mon Sep 12 22:37:08 2016 From: janfrode at tanso.net (Jan-Frode Myklebust) Date: Mon, 12 Sep 2016 21:37:08 +0000 Subject: [gpfsug-discuss] big difference between output of 'mmlsquota' and 'du'? In-Reply-To: References: Message-ID: Maybe you have a huge file open, that's been unlinked and still growing? -jf -------------- next part -------------- An HTML attachment was scrubbed... URL: From volobuev at us.ibm.com Mon Sep 12 22:59:36 2016 From: volobuev at us.ibm.com (Yuri L Volobuev) Date: Mon, 12 Sep 2016 14:59:36 -0700 Subject: [gpfsug-discuss] big difference between output of 'mmlsquota' and'du'? In-Reply-To: References: Message-ID: 'du' tallies up 'blocks allocated', not file sizes. So it shouldn't matter whether any sparse files are present. GPFS doesn't charge quota for data in snapshots (whether it should is a separate question). The observed discrepancy has two plausible causes: 1) Inaccuracy in quota accounting (more likely) 2) Artefacts of data replication (less likely) Running mmcheckquota in this situation would be a good idea. yuri From: Alex Chekholko To: gpfsug-discuss at spectrumscale.org, Date: 09/12/2016 12:04 PM Subject: [gpfsug-discuss] big difference between output of 'mmlsquota' and 'du'? 
Sent by: gpfsug-discuss-bounces at spectrumscale.org Hi, For a fileset with a quota on it, we have mmlsquota reporting 39TB utilization (out of 50TB quota), with 0 in_doubt. Running a 'du' on the same directory (where the fileset is junctioned) shows 21TB usage. I looked for sparse files (files that report different size via ls vs du). I looked at 'du --apparent-size ...' https://en.wikipedia.org/wiki/Sparse_file What else could it be? Is there some attribute I can scan for inside GPFS? Maybe where FILE_SIZE does not equal KB_ALLOCATED? https://www.ibm.com/support/knowledgecenter/STXKQY_4.2.0/com.ibm.spectrum.scale.v4r2.adv.doc/bl1adv_usngfileattrbts.htm [root at scg-gs0 ~]# du -sm --apparent-size /srv/gsfs0/projects/gbsc/* 3977 /srv/gsfs0/projects/gbsc/Backups 1 /srv/gsfs0/projects/gbsc/benchmark 13109 /srv/gsfs0/projects/gbsc/Billing 198719 /srv/gsfs0/projects/gbsc/Clinical 1 /srv/gsfs0/projects/gbsc/Clinical_Vendors 1206523 /srv/gsfs0/projects/gbsc/Data 1 /srv/gsfs0/projects/gbsc/iPoP 123165 /srv/gsfs0/projects/gbsc/Macrogen 58676 /srv/gsfs0/projects/gbsc/Misc 6625890 /srv/gsfs0/projects/gbsc/mva 1 /srv/gsfs0/projects/gbsc/Proj 17 /srv/gsfs0/projects/gbsc/Projects 3290502 /srv/gsfs0/projects/gbsc/Resources 1 /srv/gsfs0/projects/gbsc/SeqCenter 1 /srv/gsfs0/projects/gbsc/share 514041 /srv/gsfs0/projects/gbsc/SNAP_Scoring 1 /srv/gsfs0/projects/gbsc/TCGA_Variants 267873 /srv/gsfs0/projects/gbsc/tools 9597797 /srv/gsfs0/projects/gbsc/workspace (adds up to about 21TB) [root at scg-gs0 ~]# mmlsquota -j projects.gbsc --block-size=G gsfs0 Block Limits | File Limits Filesystem type GB quota limit in_doubt grace | files quota limit in_doubt grace Remarks gsfs0 FILESET 39889 51200 51200 0 none | 1663212 0 0 4 none [root at scg-gs0 ~]# mmlsfileset gsfs0 |grep gbsc projects.gbsc Linked /srv/gsfs0/projects/gbsc Regards, -- Alex Chekholko chekh at stanford.edu _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: graycol.gif Type: image/gif Size: 105 bytes Desc: not available URL: From chekh at stanford.edu Mon Sep 12 23:11:12 2016 From: chekh at stanford.edu (Alex Chekholko) Date: Mon, 12 Sep 2016 15:11:12 -0700 Subject: [gpfsug-discuss] big difference between output of 'mmlsquota' and 'du'? In-Reply-To: References: Message-ID: Thanks for all the responses. I will look through the filesystem clients for open file handles; we have definitely had deleted open log files of multi-TB size before. The filesystem has replication set to 1. We don't use snapshots. I'm running a 'mmrestripefs -r' (some files were ill-placed from aborted pool migrations) and then I will run an 'mmcheckquota'. On 9/12/16 2:37 PM, Jan-Frode Myklebust wrote: > Maybe you have a huge file open, that's been unlinked and still growing? 
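(As an aside, unlinked-but-still-open files can be hunted down with ordinary Linux tooling on each node that mounts the filesystem; nothing here is GPFS-specific, and the mount point is assumed from the paths above.)

# Open files on the mount point whose link count has dropped below 1 (i.e. deleted)
lsof +L1 /srv/gsfs0

# Per-process view of file descriptors that still point at deleted files
ls -l /proc/*/fd 2>/dev/null | grep '(deleted)'
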
> > > > -jf > > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > -- Alex Chekholko chekh at stanford.edu From xhejtman at ics.muni.cz Mon Sep 12 23:30:19 2016 From: xhejtman at ics.muni.cz (Lukas Hejtmanek) Date: Tue, 13 Sep 2016 00:30:19 +0200 Subject: [gpfsug-discuss] gpfs snapshots Message-ID: <20160912223019.s665s3ccoltdkq3l@ics.muni.cz> Hello, using gpfs 4.2.1, I do about 60 snapshots per day (one snapshot per 15 minutes during working hours). It seems that snapid is increasing only number. Should I be fine with such a number of snapshots per day? I guess we could reach snapid 100,000. I remove all these snapshots during night so I do not keep huge number of snapshots. -- Luk?? Hejtm?nek From volobuev at us.ibm.com Mon Sep 12 23:42:00 2016 From: volobuev at us.ibm.com (Yuri L Volobuev) Date: Mon, 12 Sep 2016 15:42:00 -0700 Subject: [gpfsug-discuss] gpfs snapshots In-Reply-To: <20160912223019.s665s3ccoltdkq3l@ics.muni.cz> References: <20160912223019.s665s3ccoltdkq3l@ics.muni.cz> Message-ID: The increasing value of snapId is not a problem. Creating snapshots every 15 min is somewhat more frequent than what is customary, but as long as you're able to delete filesets at the same rate you're creating them, this should work OK. yuri From: Lukas Hejtmanek To: gpfsug-discuss at spectrumscale.org, Date: 09/12/2016 03:30 PM Subject: [gpfsug-discuss] gpfs snapshots Sent by: gpfsug-discuss-bounces at spectrumscale.org Hello, using gpfs 4.2.1, I do about 60 snapshots per day (one snapshot per 15 minutes during working hours). It seems that snapid is increasing only number. Should I be fine with such a number of snapshots per day? I guess we could reach snapid 100,000. I remove all these snapshots during night so I do not keep huge number of snapshots. -- Luk?? Hejtm?nek _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: graycol.gif Type: image/gif Size: 105 bytes Desc: not available URL: From r.sobey at imperial.ac.uk Tue Sep 13 04:19:30 2016 From: r.sobey at imperial.ac.uk (Sobey, Richard A) Date: Tue, 13 Sep 2016 03:19:30 +0000 Subject: [gpfsug-discuss] gpfs snapshots In-Reply-To: <20160912223019.s665s3ccoltdkq3l@ics.muni.cz> References: <20160912223019.s665s3ccoltdkq3l@ics.muni.cz> Message-ID: Don't worry. We do 400+ snapshots every 4 hours and that number is only getting bigger. Don't know what our current snapid count is mind you, can find out when in the office. Get Outlook for Android On Mon, Sep 12, 2016 at 11:30 PM +0100, "Lukas Hejtmanek" > wrote: Hello, using gpfs 4.2.1, I do about 60 snapshots per day (one snapshot per 15 minutes during working hours). It seems that snapid is increasing only number. Should I be fine with such a number of snapshots per day? I guess we could reach snapid 100,000. I remove all these snapshots during night so I do not keep huge number of snapshots. -- Luk?? Hejtm?nek _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From laurence at qsplace.co.uk Tue Sep 13 05:06:42 2016 From: laurence at qsplace.co.uk (Laurence Horrocks-Barlow) Date: Tue, 13 Sep 2016 05:06:42 +0100 Subject: [gpfsug-discuss] gpfs snapshots In-Reply-To: References: <20160912223019.s665s3ccoltdkq3l@ics.muni.cz> Message-ID: <7EAC0DD4-6FC1-4DF5-825E-9E2DD966BA4E@qsplace.co.uk> There are many people doing the same thing so nothing to worry about. As your using 4.2.1 you can at least bulk delete the snapshots using a comma separated list, making life just that little bit easier. -- Lauz On 13 September 2016 04:19:30 BST, "Sobey, Richard A" wrote: >Don't worry. We do 400+ snapshots every 4 hours and that number is only >getting bigger. Don't know what our current snapid count is mind you, >can find out when in the office. > >Get Outlook for Android > > > >On Mon, Sep 12, 2016 at 11:30 PM +0100, "Lukas Hejtmanek" >> wrote: > >Hello, > >using gpfs 4.2.1, I do about 60 snapshots per day (one snapshot per 15 >minutes >during working hours). It seems that snapid is increasing only number. >Should >I be fine with such a number of snapshots per day? I guess we could >reach >snapid 100,000. I remove all these snapshots during night so I do not >keep >huge number of snapshots. > >-- >Luk?? Hejtm?nek >_______________________________________________ >gpfsug-discuss mailing list >gpfsug-discuss at spectrumscale.org >http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > >------------------------------------------------------------------------ > >_______________________________________________ >gpfsug-discuss mailing list >gpfsug-discuss at spectrumscale.org >http://gpfsug.org/mailman/listinfo/gpfsug-discuss -- Sent from my Android device with K-9 Mail. Please excuse my brevity. -------------- next part -------------- An HTML attachment was scrubbed... URL: From Valdis.Kletnieks at vt.edu Tue Sep 13 05:32:24 2016 From: Valdis.Kletnieks at vt.edu (Valdis.Kletnieks at vt.edu) Date: Tue, 13 Sep 2016 00:32:24 -0400 Subject: [gpfsug-discuss] gpfs snapshots In-Reply-To: <20160912223019.s665s3ccoltdkq3l@ics.muni.cz> References: <20160912223019.s665s3ccoltdkq3l@ics.muni.cz> Message-ID: <20635.1473741144@turing-police.cc.vt.edu> On Tue, 13 Sep 2016 00:30:19 +0200, Lukas Hejtmanek said: > I guess we could reach snapid 100,000. It probably stores the snap ID as a 32 or 64 bit int, so 100K is peanuts. What you *do* want to do is make the snap *name* meaningful, using a timestamp or something to keep your sanity. mmcrsnapshot fs923 `date +%y%m%d-%H%M` or similar. From jtucker at pixitmedia.com Tue Sep 13 10:10:02 2016 From: jtucker at pixitmedia.com (Jez Tucker) Date: Tue, 13 Sep 2016 10:10:02 +0100 Subject: [gpfsug-discuss] gpfs snapshots In-Reply-To: <20635.1473741144@turing-police.cc.vt.edu> References: <20160912223019.s665s3ccoltdkq3l@ics.muni.cz> <20635.1473741144@turing-police.cc.vt.edu> Message-ID: <00e74006-8f99-280f-8832-1cee019c4b07@pixitmedia.com> Hey Yuri, Perhaps an RFE here, but could I suggest there is much value in adding a -c option to mmcrsnapshot? Use cases: mmcrsnapshot myfsname @GMT-2016.09.13-10.00.00 -j myfilesetname -c "Before phase 2" and mmcrsnapshot myfsname @GMT-2016.09.13-10.00.00 -j myfilesetname -c "expire:GMT-2017.04.21-16.00.00" Ideally also: mmcrsnapshot fs1 fset1:snapA:expirestr,fset2:snapB:expirestr,fset3:snapC:expirestr Then it's easy to iterate over snapshots and subsequently mmdelsnapshot snaps which are no longer required. 
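(A rough sketch of that iteration, assuming global snapshots whose names carry the @GMT-YYYY.MM.DD-HH.MM.SS timestamp used above and parsing mmlssnapshot output naively; the filesystem name, retention window and parsing are illustrative only.)

#!/bin/bash
# Delete global snapshots of one filesystem whose @GMT- name is older than 7 days
FS=gpfs0
CUTOFF=$(date -u -d '7 days ago' +%s)

mmlssnapshot "$FS" | awk '$1 ~ /^@GMT-/ {print $1}' | while read -r snap; do
    ts=${snap#@GMT-}                 # e.g. 2016.09.13-10.00.00
    day=${ts:0:10}; day=${day//./-}  # 2016-09-13
    tod=${ts:11};   tod=${tod//./:}  # 10:00:00
    when=$(date -u -d "$day $tod" +%s 2>/dev/null) || continue
    if [ "$when" -lt "$CUTOFF" ]; then
        mmdelsnapshot "$FS" "$snap"
    fi
done
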
There are lots of methods to achieve this, but without external databases / suchlike, this is rather simple and effective for end users. Alternatively a second comment like -expire flag as user metadata may be preferential. Thoughts? Jez On 13/09/16 05:32, Valdis.Kletnieks at vt.edu wrote: > On Tue, 13 Sep 2016 00:30:19 +0200, Lukas Hejtmanek said: >> I guess we could reach snapid 100,000. > It probably stores the snap ID as a 32 or 64 bit int, so 100K is peanuts. > > What you *do* want to do is make the snap *name* meaningful, using > a timestamp or something to keep your sanity. > > mmcrsnapshot fs923 `date +%y%m%d-%H%M` or similar. > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss -- Jez Tucker Head of Research & Product Development Pixit Media www.pixitmedia.com -- This email is confidential in that it is intended for the exclusive attention of the addressee(s) indicated. If you are not the intended recipient, this email should not be read or disclosed to any other person. Please notify the sender immediately and delete this email from your computer system. Any opinions expressed are not necessarily those of the company from which this email was sent and, whilst to the best of our knowledge no viruses or defects exist, no responsibility can be accepted for any loss or damage arising from its receipt or subsequent use of this email. -------------- next part -------------- An HTML attachment was scrubbed... URL: From volobuev at us.ibm.com Tue Sep 13 21:51:16 2016 From: volobuev at us.ibm.com (Yuri L Volobuev) Date: Tue, 13 Sep 2016 13:51:16 -0700 Subject: [gpfsug-discuss] gpfs snapshots In-Reply-To: <00e74006-8f99-280f-8832-1cee019c4b07@pixitmedia.com> References: <20160912223019.s665s3ccoltdkq3l@ics.muni.cz><20635.1473741144@turing-police.cc.vt.edu> <00e74006-8f99-280f-8832-1cee019c4b07@pixitmedia.com> Message-ID: Hi Jez, It sounds to me like the functionality that you're _really_ looking for is an ability to to do automated snapshot management, similar to what's available on other storage systems. For example, "create a new snapshot of filesets X, Y, Z every 30 min, keep the last 16 snapshots". I've seen many examples of sysadmins rolling their own snapshot management system along those lines, and an ability to add an expiration string as a snapshot "comment" appears to be merely an aid in keeping such DIY snapshot management scripts a bit simpler -- not by much though. The end user would still be on the hook for some heavy lifting, in particular figuring out a way to run an equivalent of a cluster-aware cron with acceptable fault tolerance semantics. That is, if a snapshot creation is scheduled, only one node in the cluster should attempt to create the snapshot, but if that node fails, another node needs to step in (as opposed to skipping the scheduled snapshot creation). This is doable outside of GPFS, of course, but is not trivial. Architecturally, the right place to implement a fault-tolerant cluster-aware scheduling framework is GPFS itself, as the most complex pieces are already there. We have some plans for work along those lines, but if you want to reinforce the point with an RFE, that would be fine, too. yuri From: Jez Tucker To: gpfsug-discuss at spectrumscale.org, Date: 09/13/2016 02:10 AM Subject: Re: [gpfsug-discuss] gpfs snapshots Sent by: gpfsug-discuss-bounces at spectrumscale.org Hey Yuri, ? 
Perhaps an RFE here, but could I suggest there is much value in adding a -c option to mmcrsnapshot? Use cases: mmcrsnapshot myfsname @GMT-2016.09.13-10.00.00 -j myfilesetname -c "Before phase 2" and mmcrsnapshot myfsname @GMT-2016.09.13-10.00.00 -j myfilesetname -c "expire:GMT-2017.04.21-16.00.00" Ideally also: mmcrsnapshot fs1 fset1:snapA:expirestr,fset2:snapB:expirestr,fset3:snapC:expirestr Then it's easy to iterate over snapshots and subsequently mmdelsnapshot snaps which are no longer required. There are lots of methods to achieve this, but without external databases / suchlike, this is rather simple and effective for end users. Alternatively a second comment like -expire flag as user metadata may be preferential. Thoughts? Jez On 13/09/16 05:32, Valdis.Kletnieks at vt.edu wrote: On Tue, 13 Sep 2016 00:30:19 +0200, Lukas Hejtmanek said: I guess we could reach snapid 100,000. It probably stores the snap ID as a 32 or 64 bit int, so 100K is peanuts. What you *do* want to do is make the snap *name* meaningful, using a timestamp or something to keep your sanity. mmcrsnapshot fs923 `date +%y%m%d-%H%M` or similar. _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss -- Jez Tucker Head of Research & Product Development Pixit Media www.pixitmedia.com This email is confidential in that it is intended for the exclusive attention of the addressee(s) indicated. If you are not the intended recipient, this email should not be read or disclosed to any other person. Please notify the sender immediately and delete this email from your computer system. Any opinions expressed are not necessarily those of the company from which this email was sent and, whilst to the best of our knowledge no viruses or defects exist, no responsibility can be accepted for any loss or damage arising from its receipt or subsequent use of this email._______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: graycol.gif Type: image/gif Size: 105 bytes Desc: not available URL: From xhejtman at ics.muni.cz Tue Sep 13 21:57:52 2016 From: xhejtman at ics.muni.cz (Lukas Hejtmanek) Date: Tue, 13 Sep 2016 22:57:52 +0200 Subject: [gpfsug-discuss] gpfs snapshots In-Reply-To: References: <20160912223019.s665s3ccoltdkq3l@ics.muni.cz> Message-ID: <20160913205752.3lmmfbhm25mu77j4@ics.muni.cz> Yuri et al. thank you for answers, I should be fine with snapshots as you suggest. On Mon, Sep 12, 2016 at 03:42:00PM -0700, Yuri L Volobuev wrote: > The increasing value of snapId is not a problem. Creating snapshots every > 15 min is somewhat more frequent than what is customary, but as long as > you're able to delete filesets at the same rate you're creating them, this > should work OK. > > yuri > > > > From: Lukas Hejtmanek > To: gpfsug-discuss at spectrumscale.org, > Date: 09/12/2016 03:30 PM > Subject: [gpfsug-discuss] gpfs snapshots > Sent by: gpfsug-discuss-bounces at spectrumscale.org > > > > Hello, > > using gpfs 4.2.1, I do about 60 snapshots per day (one snapshot per 15 > minutes > during working hours). It seems that snapid is increasing only number. > Should > I be fine with such a number of snapshots per day? I guess we could reach > snapid 100,000. 
I remove all these snapshots during night so I do not keep > huge number of snapshots. > > -- > Luk?? Hejtm?nek > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss -- Luk?? Hejtm?nek From S.J.Thompson at bham.ac.uk Tue Sep 13 22:21:59 2016 From: S.J.Thompson at bham.ac.uk (Simon Thompson (Research Computing - IT Services)) Date: Tue, 13 Sep 2016 21:21:59 +0000 Subject: [gpfsug-discuss] gpfs snapshots In-Reply-To: References: <20160912223019.s665s3ccoltdkq3l@ics.muni.cz><20635.1473741144@turing-police.cc.vt.edu> <00e74006-8f99-280f-8832-1cee019c4b07@pixitmedia.com>, Message-ID: I thought the GUI implemented some form of snapshot scheduler. Personal opinion is that is the wrong place and I agree that is should be core functionality to ensure that the scheduler is running properly. But I would suggest that it might be more than just snapshots people might want to schedule. E.g. An ilm pool flush. Simon ________________________________________ From: gpfsug-discuss-bounces at spectrumscale.org [gpfsug-discuss-bounces at spectrumscale.org] on behalf of Yuri L Volobuev [volobuev at us.ibm.com] Sent: 13 September 2016 21:51 To: gpfsug main discussion list Subject: Re: [gpfsug-discuss] gpfs snapshots Hi Jez, It sounds to me like the functionality that you're _really_ looking for is an ability to to do automated snapshot management, similar to what's available on other storage systems. For example, "create a new snapshot of filesets X, Y, Z every 30 min, keep the last 16 snapshots". I've seen many examples of sysadmins rolling their own snapshot management system along those lines, and an ability to add an expiration string as a snapshot "comment" appears to be merely an aid in keeping such DIY snapshot management scripts a bit simpler -- not by much though. The end user would still be on the hook for some heavy lifting, in particular figuring out a way to run an equivalent of a cluster-aware cron with acceptable fault tolerance semantics. That is, if a snapshot creation is scheduled, only one node in the cluster should attempt to create the snapshot, but if that node fails, another node needs to step in (as opposed to skipping the scheduled snapshot creation). This is doable outside of GPFS, of course, but is not trivial. Architecturally, the right place to implement a fault-tolerant cluster-aware scheduling framework is GPFS itself, as the most complex pieces are already there. We have some plans for work along those lines, but if you want to reinforce the point with an RFE, that would be fine, too. yuri [Inactive hide details for Jez Tucker ---09/13/2016 02:10:31 AM---Hey Yuri, Perhaps an RFE here, but could I suggest there is]Jez Tucker ---09/13/2016 02:10:31 AM---Hey Yuri, Perhaps an RFE here, but could I suggest there is much value in From: Jez Tucker To: gpfsug-discuss at spectrumscale.org, Date: 09/13/2016 02:10 AM Subject: Re: [gpfsug-discuss] gpfs snapshots Sent by: gpfsug-discuss-bounces at spectrumscale.org ________________________________ Hey Yuri, Perhaps an RFE here, but could I suggest there is much value in adding a -c option to mmcrsnapshot? 
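(By way of illustration, the sort of job such a scheduler would run for a pool flush is a threshold-driven migration; the pool names, thresholds and policy file below are invented.)

/* flush.pol: when 'fast' passes 85% full, push the largest files to 'capacity'
   until occupancy drops back below 60% */
RULE 'flushfast' MIGRATE FROM POOL 'fast'
     THRESHOLD(85,60)
     WEIGHT(KB_ALLOCATED)
     TO POOL 'capacity'

# Run periodically, or wire it to a lowDiskSpace callback via mmaddcallback
mmapplypolicy gpfs0 -P flush.pol -I yes
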
Use cases: mmcrsnapshot myfsname @GMT-2016.09.13-10.00.00 -j myfilesetname -c "Before phase 2" and mmcrsnapshot myfsname @GMT-2016.09.13-10.00.00 -j myfilesetname -c "expire:GMT-2017.04.21-16.00.00" Ideally also: mmcrsnapshot fs1 fset1:snapA:expirestr,fset2:snapB:expirestr,fset3:snapC:expirestr Then it's easy to iterate over snapshots and subsequently mmdelsnapshot snaps which are no longer required. There are lots of methods to achieve this, but without external databases / suchlike, this is rather simple and effective for end users. Alternatively a second comment like -expire flag as user metadata may be preferential. Thoughts? Jez On 13/09/16 05:32, Valdis.Kletnieks at vt.edu wrote: On Tue, 13 Sep 2016 00:30:19 +0200, Lukas Hejtmanek said: I guess we could reach snapid 100,000. It probably stores the snap ID as a 32 or 64 bit int, so 100K is peanuts. What you *do* want to do is make the snap *name* meaningful, using a timestamp or something to keep your sanity. mmcrsnapshot fs923 `date +%y%m%d-%H%M` or similar. _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss -- Jez Tucker Head of Research & Product Development Pixit Media www.pixitmedia.com This email is confidential in that it is intended for the exclusive attention of the addressee(s) indicated. If you are not the intended recipient, this email should not be read or disclosed to any other person. Please notify the sender immediately and delete this email from your computer system. Any opinions expressed are not necessarily those of the company from which this email was sent and, whilst to the best of our knowledge no viruses or defects exist, no responsibility can be accepted for any loss or damage arising from its receipt or subsequent use of this email._______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss -------------- next part -------------- A non-text attachment was scrubbed... Name: graycol.gif Type: image/gif Size: 105 bytes Desc: graycol.gif URL: From mark.bergman at uphs.upenn.edu Tue Sep 13 22:23:57 2016 From: mark.bergman at uphs.upenn.edu (mark.bergman at uphs.upenn.edu) Date: Tue, 13 Sep 2016 17:23:57 -0400 Subject: [gpfsug-discuss] gpfs snapshots In-Reply-To: Your message of "Tue, 13 Sep 2016 13:51:16 -0700." References: <20160912223019.s665s3ccoltdkq3l@ics.muni.cz><20635.1473741144@turing-police.cc.vt.edu> <00e74006-8f99-280f-8832-1cee019c4b07@pixitmedia.com> Message-ID: <19294-1473801837.563347@J_5h.TM7K.YXzn> In the message dated: Tue, 13 Sep 2016 13:51:16 -0700, The pithy ruminations from Yuri L Volobuev on were: => => Hi Jez, => => It sounds to me like the functionality that you're _really_ looking for is => an ability to to do automated snapshot management, similar to what's Yep. => available on other storage systems. For example, "create a new snapshot of => filesets X, Y, Z every 30 min, keep the last 16 snapshots". I've seen many Or, take a snapshot every 15min, keep the 4 most recent, expire all except 4 that were created within 6hrs, only 4 created between 6:01-24:00 hh:mm ago, and expire all-but-2 created between 24:01-48:00, etc, as we do. => examples of sysadmins rolling their own snapshot management system along => those lines, and an ability to add an expiration string as a snapshot I'd be glad to distribute our local example of this exercise. 
=> "comment" appears to be merely an aid in keeping such DIY snapshot => management scripts a bit simpler -- not by much though. The end user would => still be on the hook for some heavy lifting, in particular figuring out a => way to run an equivalent of a cluster-aware cron with acceptable fault => tolerance semantics. That is, if a snapshot creation is scheduled, only => one node in the cluster should attempt to create the snapshot, but if that => node fails, another node needs to step in (as opposed to skipping the => scheduled snapshot creation). This is doable outside of GPFS, of course, => but is not trivial. Architecturally, the right place to implement a Ah, that part really is trivial....In our case, the snapshot program takes the filesystem name as an argument... we simply rely on the GPFS fault detection/failover. The job itself runs (via cron) on every GPFS server node, but only creates the snapshot on the server that is the active manager for the specified filesystem: ############################################################################## # Check if the node where this script is running is the GPFS manager node for the # specified filesystem manager=`/usr/lpp/mmfs/bin/mmlsmgr $filesys | grep -w "^$filesys" |awk '{print $2}'` ip addr list | grep -qw "$manager" if [ $? != 0 ] ; then # This node is not the manager...exit exit fi # else ... continue and create the snapshot ################################################################################################### => => yuri => => -- Mark Bergman voice: 215-746-4061 mark.bergman at uphs.upenn.edu fax: 215-614-0266 http://www.cbica.upenn.edu/ IT Technical Director, Center for Biomedical Image Computing and Analytics Department of Radiology University of Pennsylvania PGP Key: http://www.cbica.upenn.edu/sbia/bergman From jtolson at us.ibm.com Tue Sep 13 22:47:02 2016 From: jtolson at us.ibm.com (John T Olson) Date: Tue, 13 Sep 2016 14:47:02 -0700 Subject: [gpfsug-discuss] gpfs snapshots In-Reply-To: References: <20160912223019.s665s3ccoltdkq3l@ics.muni.cz><20635.1473741144@turing-police.cc.vt.edu><00e74006-8f99-280f-8832-1cee019c4b07@pixitmedia.com>, Message-ID: We do have a general-purpose scheduler on the books as an item that is needed for a future release and as Yuri mentioned it would be cluster wide to avoid the single point of failure with tools like Cron. However, it's one of many things we want to try to get into the product and so we don't have any definite timeline yet. Thanks, John John T. Olson, Ph.D., MI.C., K.EY. Master Inventor, Software Defined Storage 957/9032-1 Tucson, AZ, 85744 (520) 799-5185, tie 321-5185 (FAX: 520-799-4237) Email: jtolson at us.ibm.com "Do or do not. There is no try." - Yoda Olson's Razor: Any situation that we, as humans, can encounter in life can be modeled by either an episode of The Simpsons or Seinfeld. From: "Simon Thompson (Research Computing - IT Services)" To: gpfsug main discussion list Date: 09/13/2016 02:22 PM Subject: Re: [gpfsug-discuss] gpfs snapshots Sent by: gpfsug-discuss-bounces at spectrumscale.org I thought the GUI implemented some form of snapshot scheduler. Personal opinion is that is the wrong place and I agree that is should be core functionality to ensure that the scheduler is running properly. But I would suggest that it might be more than just snapshots people might want to schedule. E.g. An ilm pool flush. 
Simon ________________________________________ From: gpfsug-discuss-bounces at spectrumscale.org [gpfsug-discuss-bounces at spectrumscale.org] on behalf of Yuri L Volobuev [volobuev at us.ibm.com] Sent: 13 September 2016 21:51 To: gpfsug main discussion list Subject: Re: [gpfsug-discuss] gpfs snapshots Hi Jez, It sounds to me like the functionality that you're _really_ looking for is an ability to to do automated snapshot management, similar to what's available on other storage systems. For example, "create a new snapshot of filesets X, Y, Z every 30 min, keep the last 16 snapshots". I've seen many examples of sysadmins rolling their own snapshot management system along those lines, and an ability to add an expiration string as a snapshot "comment" appears to be merely an aid in keeping such DIY snapshot management scripts a bit simpler -- not by much though. The end user would still be on the hook for some heavy lifting, in particular figuring out a way to run an equivalent of a cluster-aware cron with acceptable fault tolerance semantics. That is, if a snapshot creation is scheduled, only one node in the cluster should attempt to create the snapshot, but if that node fails, another node needs to step in (as opposed to skipping the scheduled snapshot creation). This is doable outside of GPFS, of course, but is not trivial. Architecturally, the right place to implement a fault-tolerant cluster-aware scheduling framework is GPFS itself, as the most complex pieces are already there. We have some plans for work along those lines, but if you want to reinforce the point with an RFE, that would be fine, too. yuri [Inactive hide details for Jez Tucker ---09/13/2016 02:10:31 AM---Hey Yuri, Perhaps an RFE here, but could I suggest there is]Jez Tucker ---09/13/2016 02:10:31 AM---Hey Yuri, Perhaps an RFE here, but could I suggest there is much value in From: Jez Tucker To: gpfsug-discuss at spectrumscale.org, Date: 09/13/2016 02:10 AM Subject: Re: [gpfsug-discuss] gpfs snapshots Sent by: gpfsug-discuss-bounces at spectrumscale.org ________________________________ Hey Yuri, Perhaps an RFE here, but could I suggest there is much value in adding a -c option to mmcrsnapshot? Use cases: mmcrsnapshot myfsname @GMT-2016.09.13-10.00.00 -j myfilesetname -c "Before phase 2" and mmcrsnapshot myfsname @GMT-2016.09.13-10.00.00 -j myfilesetname -c "expire:GMT-2017.04.21-16.00.00" Ideally also: mmcrsnapshot fs1 fset1:snapA:expirestr,fset2:snapB:expirestr,fset3:snapC:expirestr Then it's easy to iterate over snapshots and subsequently mmdelsnapshot snaps which are no longer required. There are lots of methods to achieve this, but without external databases / suchlike, this is rather simple and effective for end users. Alternatively a second comment like -expire flag as user metadata may be preferential. Thoughts? Jez On 13/09/16 05:32, Valdis.Kletnieks at vt.edu wrote: On Tue, 13 Sep 2016 00:30:19 +0200, Lukas Hejtmanek said: I guess we could reach snapid 100,000. It probably stores the snap ID as a 32 or 64 bit int, so 100K is peanuts. What you *do* want to do is make the snap *name* meaningful, using a timestamp or something to keep your sanity. mmcrsnapshot fs923 `date +%y%m%d-%H%M` or similar. 
_______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss -- Jez Tucker Head of Research & Product Development Pixit Media www.pixitmedia.com This email is confidential in that it is intended for the exclusive attention of the addressee(s) indicated. If you are not the intended recipient, this email should not be read or disclosed to any other person. Please notify the sender immediately and delete this email from your computer system. Any opinions expressed are not necessarily those of the company from which this email was sent and, whilst to the best of our knowledge no viruses or defects exist, no responsibility can be accepted for any loss or damage arising from its receipt or subsequent use of this email._______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss (See attached file: graycol.gif) _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: graycol.gif Type: image/gif Size: 105 bytes Desc: not available URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: graycol.gif Type: image/gif Size: 105 bytes Desc: not available URL: From jtucker at pixitmedia.com Tue Sep 13 23:28:22 2016 From: jtucker at pixitmedia.com (Jez Tucker) Date: Tue, 13 Sep 2016 23:28:22 +0100 Subject: [gpfsug-discuss] gpfs snapshots In-Reply-To: References: <20160912223019.s665s3ccoltdkq3l@ics.muni.cz> <20635.1473741144@turing-police.cc.vt.edu> <00e74006-8f99-280f-8832-1cee019c4b07@pixitmedia.com> Message-ID: <2336bbd5-39ca-dc0d-e1b4-7a301c6b9f2e@pixitmedia.com> Hey So yes, you're quite right - we have higher order fault tolerant cluster wide methods of dealing with such requirements already. However, I still think the end user should be empowered to be able construct such methods themselves if needs be. Yes, the comment is merely an aid [but also useful as a generic comment field] and as such could be utilised to encode basic metadata into the comment field. I'll log an RFE and see where we go from here. Cheers Jez On 13/09/16 21:51, Yuri L Volobuev wrote: > > Hi Jez, > > It sounds to me like the functionality that you're _really_ looking > for is an ability to to do automated snapshot management, similar to > what's available on other storage systems. For example, "create a new > snapshot of filesets X, Y, Z every 30 min, keep the last 16 > snapshots". I've seen many examples of sysadmins rolling their own > snapshot management system along those lines, and an ability to add an > expiration string as a snapshot "comment" appears to be merely an aid > in keeping such DIY snapshot management scripts a bit simpler -- not > by much though. The end user would still be on the hook for some heavy > lifting, in particular figuring out a way to run an equivalent of a > cluster-aware cron with acceptable fault tolerance semantics. That is, > if a snapshot creation is scheduled, only one node in the cluster > should attempt to create the snapshot, but if that node fails, another > node needs to step in (as opposed to skipping the scheduled snapshot > creation). 
This is doable outside of GPFS, of course, but is not > trivial. Architecturally, the right place to implement a > fault-tolerant cluster-aware scheduling framework is GPFS itself, as > the most complex pieces are already there. We have some plans for work > along those lines, but if you want to reinforce the point with an RFE, > that would be fine, too. > > yuri > > Inactive hide details for Jez Tucker ---09/13/2016 02:10:31 AM---Hey > Yuri, Perhaps an RFE here, but could I suggest there isJez Tucker > ---09/13/2016 02:10:31 AM---Hey Yuri, Perhaps an RFE here, but could I > suggest there is much value in > > From: Jez Tucker > To: gpfsug-discuss at spectrumscale.org, > Date: 09/13/2016 02:10 AM > Subject: Re: [gpfsug-discuss] gpfs snapshots > Sent by: gpfsug-discuss-bounces at spectrumscale.org > > ------------------------------------------------------------------------ > > > > Hey Yuri, > > Perhaps an RFE here, but could I suggest there is much value in > adding a -c option to mmcrsnapshot? > > Use cases: > > mmcrsnapshot myfsname @GMT-2016.09.13-10.00.00 -j myfilesetname -c > "Before phase 2" > > and > > mmcrsnapshot myfsname @GMT-2016.09.13-10.00.00 -j myfilesetname -c > "expire:GMT-2017.04.21-16.00.00" > > Ideally also: mmcrsnapshot fs1 > fset1:snapA:expirestr,fset2:snapB:expirestr,fset3:snapC:expirestr > > Then it's easy to iterate over snapshots and subsequently > mmdelsnapshot snaps which are no longer required. > There are lots of methods to achieve this, but without external > databases / suchlike, this is rather simple and effective for end users. > > Alternatively a second comment like -expire flag as user metadata may > be preferential. > > Thoughts? > > Jez > > > On 13/09/16 05:32, _Valdis.Kletnieks at vt.edu_ > wrote: > > On Tue, 13 Sep 2016 00:30:19 +0200, Lukas Hejtmanek said: > I guess we could reach snapid 100,000. > > It probably stores the snap ID as a 32 or 64 bit int, so 100K > is peanuts. > > What you *do* want to do is make the snap *name* meaningful, using > a timestamp or something to keep your sanity. > > mmcrsnapshot fs923 `date +%y%m%d-%H%M` or similar. > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > _http://gpfsug.org/mailman/listinfo/gpfsug-discuss_ > > > > -- > Jez Tucker > Head of Research & Product Development > Pixit Media_ > __www.pixitmedia.com_ > > > This email is confidential in that it is intended for the exclusive > attention of the addressee(s) indicated. If you are not the intended > recipient, this email should not be read or disclosed to any other > person. Please notify the sender immediately and delete this email > from your computer system. 
Any opinions expressed are not necessarily > those of the company from which this email was sent and, whilst to the > best of our knowledge no viruses or defects exist, no responsibility > can be accepted for any loss or damage arising from its receipt or > subsequent use of this > email._______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > > > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss -- Jez Tucker Head of Research & Product Development Pixit Media Mobile: +44 (0) 776 419 3820 www.pixitmedia.com -- This email is confidential in that it is intended for the exclusive attention of the addressee(s) indicated. If you are not the intended recipient, this email should not be read or disclosed to any other person. Please notify the sender immediately and delete this email from your computer system. Any opinions expressed are not necessarily those of the company from which this email was sent and, whilst to the best of our knowledge no viruses or defects exist, no responsibility can be accepted for any loss or damage arising from its receipt or subsequent use of this email. -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: not available Type: image/gif Size: 105 bytes Desc: not available URL: From service at metamodul.com Wed Sep 14 19:10:37 2016 From: service at metamodul.com (service at metamodul.com) Date: Wed, 14 Sep 2016 20:10:37 +0200 Subject: [gpfsug-discuss] gpfs snapshots Message-ID: Why not use a GPFS user extented attribut for that ? In a certain way i see GPFS as a database. ^_^ Hajo Von Samsung Mobile gesendet
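To picture that suggestion: GPFS exposes user extended attributes through the standard Linux tools, and since snapshots themselves are read-only the attribute would have to live on the live fileset (for example its junction directory). A hypothetical illustration, with the attribute name, value format and path all invented, and assuming the installed release accepts user.* attributes via setfattr:

  # Record the intended expiry of the most recent snapshot as a user xattr
  # on the fileset junction (placeholder names throughout):
  setfattr -n user.snapexpire -v "GMT-2017.04.21-16.00.00" /gpfs/gpfs0/myfilesetname

  # ...and read it back when deciding what to delete:
  getfattr -n user.snapexpire /gpfs/gpfs0/myfilesetname

Tracking more than one snapshot this way would mean encoding the snapshot name into the attribute name, at which point a comment field on the snapshot itself starts to look simpler.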

-------- Original message --------
From: Jez Tucker
Date: 2016.09.13 11:10 (GMT+01:00)
To: gpfsug-discuss at spectrumscale.org
Subject: Re: [gpfsug-discuss] gpfs snapshots
Hey Yuri, Perhaps an RFE here, but could I suggest there is much value in adding a -c option to mmcrsnapshot? Use cases: mmcrsnapshot myfsname @GMT-2016.09.13-10.00.00 -j myfilesetname -c "Before phase 2" and mmcrsnapshot myfsname @GMT-2016.09.13-10.00.00 -j myfilesetname -c "expire:GMT-2017.04.21-16.00.00" Ideally also: mmcrsnapshot fs1 fset1:snapA:expirestr,fset2:snapB:expirestr,fset3:snapC:expirestr Then it's easy to iterate over snapshots and subsequently mmdelsnapshot snaps which are no longer required. There are lots of methods to achieve this, but without external databases / suchlike, this is rather simple and effective for end users. Alternatively a second comment like -expire flag as user metadata may be preferential. Thoughts? Jez On 13/09/16 05:32, Valdis.Kletnieks at vt.edu wrote: On Tue, 13 Sep 2016 00:30:19 +0200, Lukas Hejtmanek said: I guess we could reach snapid 100,000. It probably stores the snap ID as a 32 or 64 bit int, so 100K is peanuts. What you *do* want to do is make the snap *name* meaningful, using a timestamp or something to keep your sanity. mmcrsnapshot fs923 `date +%y%m%d-%H%M` or similar. _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss -- Jez Tucker Head of Research & Product Development Pixit Media www.pixitmedia.com This email is confidential in that it is intended for the exclusive attention of the addressee(s) indicated. If you are not the intended recipient, this email should not be read or disclosed to any other person. Please notify the sender immediately and delete this email from your computer system. Any opinions expressed are not necessarily those of the company from which this email was sent and, whilst to the best of our knowledge no viruses or defects exist, no responsibility can be accepted for any loss or damage arising from its receipt or subsequent use of this email. -------------- next part -------------- An HTML attachment was scrubbed... URL: From service at metamodul.com Wed Sep 14 19:21:20 2016 From: service at metamodul.com (service at metamodul.com) Date: Wed, 14 Sep 2016 20:21:20 +0200 Subject: [gpfsug-discuss] gpfs snapshots Message-ID: <4fojjlpuwqoalkffaahy7snf.1473877280415@email.android.com> I am missing since ages such a framework. I had my simple one devoloped on the gpfs callbacks which allowed me to have a centralized cron (HA) up to oracle also ?high available and ha nfs on Aix. Hajo Universal Inventor? -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From jtucker at pixitmedia.com Wed Sep 14 19:49:36 2016 From: jtucker at pixitmedia.com (Jez Tucker) Date: Wed, 14 Sep 2016 19:49:36 +0100 Subject: [gpfsug-discuss] gpfs snapshots In-Reply-To: References: Message-ID: Hi I still think I'm coming down on the side of simplistic ease of use: Example: [jtucker at pixstor ~]# mmlssnapshot mmfs1 Snapshots in file system mmfs1: Directory SnapId Status Created Fileset Comment @GMT-2016.09.13-23.00.14 551 Valid Wed Sep 14 00:00:02 2016 myproject Prior to phase 1 @GMT-2016.09.14-05.00.14 552 Valid Wed Sep 14 06:00:01 2016 myproject Added this and that @GMT-2016.09.14-11.00.14 553 Valid Wed Sep 14 12:00:01 2016 myproject Merged project2 @GMT-2016.09.14-17.00.14 554 Valid Wed Sep 14 18:00:02 2016 myproject Before clean of .xmp @GMT-2016.09.14-17.05.30 555 Valid Wed Sep 14 18:05:03 2016 myproject Archival Jez On 14/09/16 19:10, service at metamodul.com wrote: > Why not use a GPFS user extented attribut for that ? > In a certain way i see GPFS as a database. ^_^ > Hajo > > > > Von Samsung Mobile gesendet > > > -------- Urspr?ngliche Nachricht -------- > Von: Jez Tucker > Datum:2016.09.13 11:10 (GMT+01:00) > An: gpfsug-discuss at spectrumscale.org > Betreff: Re: [gpfsug-discuss] gpfs snapshots > > Hey Yuri, > > Perhaps an RFE here, but could I suggest there is much value in > adding a -c option to mmcrsnapshot? > > Use cases: > > mmcrsnapshot myfsname @GMT-2016.09.13-10.00.00 -j myfilesetname -c > "Before phase 2" > > and > > mmcrsnapshot myfsname @GMT-2016.09.13-10.00.00 -j myfilesetname -c > "expire:GMT-2017.04.21-16.00.00" > > Ideally also: mmcrsnapshot fs1 > fset1:snapA:expirestr,fset2:snapB:expirestr,fset3:snapC:expirestr > > Then it's easy to iterate over snapshots and subsequently > mmdelsnapshot snaps which are no longer required. > There are lots of methods to achieve this, but without external > databases / suchlike, this is rather simple and effective for end users. > > Alternatively a second comment like -expire flag as user metadata may > be preferential. > > Thoughts? > > Jez > > > On 13/09/16 05:32, Valdis.Kletnieks at vt.edu wrote: >> On Tue, 13 Sep 2016 00:30:19 +0200, Lukas Hejtmanek said: >>> I guess we could reach snapid 100,000. >> It probably stores the snap ID as a 32 or 64 bit int, so 100K is peanuts. >> >> What you *do* want to do is make the snap *name* meaningful, using >> a timestamp or something to keep your sanity. >> >> mmcrsnapshot fs923 `date +%y%m%d-%H%M` or similar. >> _______________________________________________ >> gpfsug-discuss mailing list >> gpfsug-discuss at spectrumscale.org >> http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > > -- > Jez Tucker > Head of Research & Product Development > Pixit Media > www.pixitmedia.com > -- Jez Tucker Head of Research & Product Development Pixit Media www.pixitmedia.com -- This email is confidential in that it is intended for the exclusive attention of the addressee(s) indicated. If you are not the intended recipient, this email should not be read or disclosed to any other person. Please notify the sender immediately and delete this email from your computer system. Any opinions expressed are not necessarily those of the company from which this email was sent and, whilst to the best of our knowledge no viruses or defects exist, no responsibility can be accepted for any loss or damage arising from its receipt or subsequent use of this email. -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From secretary at gpfsug.org Thu Sep 15 09:42:54 2016 From: secretary at gpfsug.org (Secretary GPFS UG) Date: Thu, 15 Sep 2016 09:42:54 +0100 Subject: [gpfsug-discuss] SSUG Meet the Devs - Cloud Workshop Message-ID: Hi everyone, Back by popular demand! We are holding a UK 'Meet the Developers' event to focus on Cloud topics. We are very lucky to have Dean Hildebrand, Master Inventor, Cloud Storage Software from IBM over in the UK to lead this session. IS IT FOR ME? Slightly different to the past meet the devs format, this is a cloud workshop aimed at looking how Spectrum Scale fits in the world of cloud. Rather than being a series of presentations and discussions led by IBM, this workshop aims to look at how Spectrum Scale can be used in cloud environments. This will include using Spectrum Scale as an infrastructure tool to power private cloud deployments. We will also look at the challenges of accessing data from cloud deployments and discuss ways in which this might be accomplished. If you are currently deploying OpenStack on Spectrum Scale, or plan to in the near future, then this workshop is for you. Also if you currently have Spectrum Scale and are wondering how you might get that data into cloud-enabled workloads or are currently doing so, then again you should attend. To ensure that the workshop is focused, numbers are limited and we will initially be limiting to 2 people per organisation/project/site. WHAT WILL BE DISCUSSED? Our topics for the day will include on-premise (private) clouds, on-premise self-service (public) clouds, off-premise clouds (Amazon etc.) as well as covering technologies including OpenStack, Docker, Kubernetes and security requirements around multi-tenancy. We probably don't have all the answers for these, but we'd like to understand the requirements and hear people's ideas. Please let us know what you would like to discuss when you register. Arrival is from 10:00 with discussion kicking off from 10:30. The agenda is open discussion though we do aim to talk over a number of key topics. We hope to have the ever popular (though usually late!) pizza for lunch. WHEN Thursday 20th October 2016 from 10:00 AM to 3:30 PM WHERE IT Services, University of Birmingham - Elms Road Edgbaston, Birmingham, B15 2TT REGISTER Please register for the event in advance: https://www.eventbrite.com/e/ssug-meet-the-devs-cloud-workshop-tickets-27725390389 [1] Numbers are limited and we will initially be limiting to 2 people per organisation/project/site. We look forward to seeing you there! -- Claire O'Toole Spectrum Scale/GPFS User Group Secretary +44 (0)7508 033896 www.spectrumscaleug.org Links: ------ [1] https://www.eventbrite.com/e/ssug-meet-the-devs-cloud-workshop-tickets-27725390389 -------------- next part -------------- An HTML attachment was scrubbed... URL: From peter.botcherby at kcl.ac.uk Thu Sep 15 09:45:47 2016 From: peter.botcherby at kcl.ac.uk (Botcherby, Peter) Date: Thu, 15 Sep 2016 08:45:47 +0000 Subject: [gpfsug-discuss] SSUG Meet the Devs - Cloud Workshop In-Reply-To: References: Message-ID: Hi Claire, Hope you are well - I will be away for this as going to Indonesia on the 18th October for my nephew?s wedding. Regards Peter From: gpfsug-discuss-bounces at spectrumscale.org [mailto:gpfsug-discuss-bounces at spectrumscale.org] On Behalf Of Secretary GPFS UG Sent: 15 September 2016 09:43 To: gpfsug main discussion list Subject: [gpfsug-discuss] SSUG Meet the Devs - Cloud Workshop Hi everyone, Back by popular demand! 
We are holding a UK 'Meet the Developers' event to focus on Cloud topics. We are very lucky to have Dean Hildebrand, Master Inventor, Cloud Storage Software from IBM over in the UK to lead this session. IS IT FOR ME? Slightly different to the past meet the devs format, this is a cloud workshop aimed at looking how Spectrum Scale fits in the world of cloud. Rather than being a series of presentations and discussions led by IBM, this workshop aims to look at how Spectrum Scale can be used in cloud environments. This will include using Spectrum Scale as an infrastructure tool to power private cloud deployments. We will also look at the challenges of accessing data from cloud deployments and discuss ways in which this might be accomplished. If you are currently deploying OpenStack on Spectrum Scale, or plan to in the near future, then this workshop is for you. Also if you currently have Spectrum Scale and are wondering how you might get that data into cloud-enabled workloads or are currently doing so, then again you should attend. To ensure that the workshop is focused, numbers are limited and we will initially be limiting to 2 people per organisation/project/site. WHAT WILL BE DISCUSSED? Our topics for the day will include on-premise (private) clouds, on-premise self-service (public) clouds, off-premise clouds (Amazon etc.) as well as covering technologies including OpenStack, Docker, Kubernetes and security requirements around multi-tenancy. We probably don't have all the answers for these, but we'd like to understand the requirements and hear people's ideas. Please let us know what you would like to discuss when you register. Arrival is from 10:00 with discussion kicking off from 10:30. The agenda is open discussion though we do aim to talk over a number of key topics. We hope to have the ever popular (though usually late!) pizza for lunch. WHEN Thursday 20th October 2016 from 10:00 AM to 3:30 PM WHERE IT Services, University of Birmingham - Elms Road Edgbaston, Birmingham, B15 2TT REGISTER Please register for the event in advance: https://www.eventbrite.com/e/ssug-meet-the-devs-cloud-workshop-tickets-27725390389 Numbers are limited and we will initially be limiting to 2 people per organisation/project/site. We look forward to seeing you there! -- Claire O'Toole Spectrum Scale/GPFS User Group Secretary +44 (0)7508 033896 www.spectrumscaleug.org -------------- next part -------------- An HTML attachment was scrubbed... URL: From mimarsh2 at vt.edu Thu Sep 15 17:49:27 2016 From: mimarsh2 at vt.edu (Brian Marshall) Date: Thu, 15 Sep 2016 12:49:27 -0400 Subject: [gpfsug-discuss] EDR and omnipath Message-ID: All, I see in the GPFS FAQ A6.3 the statement below. Is it possible to have GPFS do RDMA over EDR infiniband and non-RDMA communication over omnipath (IP over fabric) when each NSD server has an EDR card and a OPA card installed? RDMA is not supported on a node when both Mellanox HCAs and Intel Omni-Path HFIs are enabled for RDMA. -------------- next part -------------- An HTML attachment was scrubbed... URL: From r.sobey at imperial.ac.uk Thu Sep 15 16:33:17 2016 From: r.sobey at imperial.ac.uk (Sobey, Richard A) Date: Thu, 15 Sep 2016 15:33:17 +0000 Subject: [gpfsug-discuss] Bit of a rant about snapshot command syntax changes Message-ID: Why was mmdelsnapshot (and possibly other snapshot related commands) changed from: Mmdelsnapshot device snapshotname -j filesetname TO Mmdelsnapshot device filesetname:snapshotname ..between 4.2.0 and 4.2.1? It's mildly irritating to say the least! 
Richard -------------- next part -------------- An HTML attachment was scrubbed... URL: From makaplan at us.ibm.com Fri Sep 16 15:21:58 2016 From: makaplan at us.ibm.com (Marc A Kaplan) Date: Fri, 16 Sep 2016 10:21:58 -0400 Subject: [gpfsug-discuss] Bit of a rant about snapshot command syntax changes In-Reply-To: References: Message-ID: I think at least the most popular old forms still work, even if the documentation and usage messages were scrubbed. So generally, for example, neither your scripts nor your fingers will break. ;-) From: "Sobey, Richard A" To: "'gpfsug-discuss at spectrumscale.org'" Date: 09/16/2016 07:02 AM Subject: [gpfsug-discuss] Bit of a rant about snapshot command syntax changes Sent by: gpfsug-discuss-bounces at spectrumscale.org Why was mmdelsnapshot (and possibly other snapshot related commands) changed from: Mmdelsnapshot device snapshotname ?j filesetname TO Mmdelsnapshot device filesetname:snapshotname ..between 4.2.0 and 4.2.1? It?s mildly irritating to say the least! Richard _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss -------------- next part -------------- An HTML attachment was scrubbed... URL: From r.sobey at imperial.ac.uk Fri Sep 16 15:40:52 2016 From: r.sobey at imperial.ac.uk (Sobey, Richard A) Date: Fri, 16 Sep 2016 14:40:52 +0000 Subject: [gpfsug-discuss] Bit of a rant about snapshot command syntax changes In-Reply-To: References: Message-ID: Thanks Marc. Regrettably in this case, the only way I knew to delete a snapshot (listed below) has broken going from 3.5 to 4.2.1. Creating snaps has suffered the same fate. From: gpfsug-discuss-bounces at spectrumscale.org [mailto:gpfsug-discuss-bounces at spectrumscale.org] On Behalf Of Marc A Kaplan Sent: 16 September 2016 15:22 To: gpfsug main discussion list Subject: Re: [gpfsug-discuss] Bit of a rant about snapshot command syntax changes I think at least the most popular old forms still work, even if the documentation and usage messages were scrubbed. So generally, for example, neither your scripts nor your fingers will break. ;-) From: "Sobey, Richard A" > To: "'gpfsug-discuss at spectrumscale.org'" > Date: 09/16/2016 07:02 AM Subject: [gpfsug-discuss] Bit of a rant about snapshot command syntax changes Sent by: gpfsug-discuss-bounces at spectrumscale.org ________________________________ Why was mmdelsnapshot (and possibly other snapshot related commands) changed from: Mmdelsnapshot device snapshotname ?j filesetname TO Mmdelsnapshot device filesetname:snapshotname ..between 4.2.0 and 4.2.1? It?s mildly irritating to say the least! Richard _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss -------------- next part -------------- An HTML attachment was scrubbed... URL: From Paul.Sanchez at deshaw.com Fri Sep 16 20:49:14 2016 From: Paul.Sanchez at deshaw.com (Sanchez, Paul) Date: Fri, 16 Sep 2016 19:49:14 +0000 Subject: [gpfsug-discuss] Bit of a rant about snapshot command syntax changes In-Reply-To: References: Message-ID: <3e1f02b30e1a49ef950de7910801f5d1@mbxtoa1.winmail.deshaw.com> The old syntax works unless have a colon in your snapshot names. In that case, the portion before the first colon will be interpreted as a fileset name. 
So if you use RFC 3339/ISO 8601 date/times, that?s a problem: The syntax for creating and deleting snapshots goes from this: mm{cr|del}snapshot fs100 SNAP at 2016-07-31T13:00:07Z ?j 1000466 to this: mm{cr|del}snapshot fs100 1000466:SNAP at 2016-07-31T13:00:07Z If you are dealing with filesystem level snapshots then you just need a leading colon: mm{cr|del}snapshot fs100 :SNAP at 2016-07-31T13:00:07Z Thx Paul From: gpfsug-discuss-bounces at spectrumscale.org [mailto:gpfsug-discuss-bounces at spectrumscale.org] On Behalf Of Marc A Kaplan Sent: Friday, September 16, 2016 10:22 AM To: gpfsug main discussion list Subject: Re: [gpfsug-discuss] Bit of a rant about snapshot command syntax changes I think at least the most popular old forms still work, even if the documentation and usage messages were scrubbed. So generally, for example, neither your scripts nor your fingers will break. ;-) From: "Sobey, Richard A" > To: "'gpfsug-discuss at spectrumscale.org'" > Date: 09/16/2016 07:02 AM Subject: [gpfsug-discuss] Bit of a rant about snapshot command syntax changes Sent by: gpfsug-discuss-bounces at spectrumscale.org ________________________________ Why was mmdelsnapshot (and possibly other snapshot related commands) changed from: Mmdelsnapshot device snapshotname ?j filesetname TO Mmdelsnapshot device filesetname:snapshotname ..between 4.2.0 and 4.2.1? It?s mildly irritating to say the least! Richard _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss -------------- next part -------------- An HTML attachment was scrubbed... URL: From viccornell at gmail.com Mon Sep 19 08:11:38 2016 From: viccornell at gmail.com (Vic Cornell) Date: Mon, 19 Sep 2016 08:11:38 +0100 Subject: [gpfsug-discuss] EDR and omnipath In-Reply-To: References: Message-ID: <87E46193-4D65-41A3-AB0E-B12987F6FFC3@gmail.com> Bump I can see no reason why that wouldn't work. But it would be nice to a have an official answer or evidence that it works. Vic > On 15 Sep 2016, at 5:49 pm, Brian Marshall wrote: > > All, > > I see in the GPFS FAQ A6.3 the statement below. Is it possible to have GPFS do RDMA over EDR infiniband and non-RDMA communication over omnipath (IP over fabric) when each NSD server has an EDR card and a OPA card installed? > > > > RDMA is not supported on a node when both Mellanox HCAs and Intel Omni-Path HFIs are enabled for RDMA. > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss -------------- next part -------------- An HTML attachment was scrubbed... URL: From mweil at wustl.edu Mon Sep 19 20:18:18 2016 From: mweil at wustl.edu (Matt Weil) Date: Mon, 19 Sep 2016 14:18:18 -0500 Subject: [gpfsug-discuss] increasing inode Message-ID: All, What exactly happens that makes the clients hang when a file set inodes are increased? ________________________________ The materials in this message are private and may contain Protected Healthcare Information or other information of a sensitive nature. If you are not the intended recipient, be advised that any unauthorized use, disclosure, copying or the taking of any action in reliance on the contents of this information is strictly prohibited. If you have received this email in error, please immediately notify the sender via telephone or return mail. 
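On the inode question just above: rather than explaining the mechanism, one commonly suggested mitigation is to move the expansion out of the busy window by pre-allocating inodes ahead of time. A hedged sketch, with the device and fileset names as placeholders:

  # Show the fileset's inode space, including maximum and allocated inodes:
  mmlsfileset gpfs0 scratchfset -L

  # Raise the limit and pre-allocate in one step during a quiet period, so
  # clients are not waiting on on-demand inode expansion later.
  # Format is MaxInodes[:InodesToPreallocate].
  mmchfileset gpfs0 scratchfset --inode-limit 4000000:2000000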
From aaron.s.knister at nasa.gov Mon Sep 19 21:34:53 2016 From: aaron.s.knister at nasa.gov (Aaron Knister) Date: Mon, 19 Sep 2016 16:34:53 -0400 Subject: [gpfsug-discuss] EDR and omnipath In-Reply-To: <87E46193-4D65-41A3-AB0E-B12987F6FFC3@gmail.com> References: <87E46193-4D65-41A3-AB0E-B12987F6FFC3@gmail.com> Message-ID: <6f33c20f-ff8f-9ccc-4609-920b35058138@nasa.gov> I must admit, I'm curious as to why one cannot use GPFS with IB and OPA both in RDMA mode. Granted, I know very little about OPA but if it just presents as another verbs device I wonder why it wouldn't "Just work" as long as GPFS is configured correctly. -Aaron On 9/19/16 3:11 AM, Vic Cornell wrote: > Bump > > I can see no reason why that wouldn't work. But it would be nice to a > have an official answer or evidence that it works. > > Vic > > >> On 15 Sep 2016, at 5:49 pm, Brian Marshall > > wrote: >> >> All, >> >> I see in the GPFS FAQ A6.3 the statement below. Is it possible to >> have GPFS do RDMA over EDR infiniband and non-RDMA communication over >> omnipath (IP over fabric) when each NSD server has an EDR card and a >> OPA card installed? >> >> >> >> RDMA is not supported on a node when both Mellanox HCAs and Intel >> Omni-Path HFIs are enabled for RDMA. >> >> _______________________________________________ >> gpfsug-discuss mailing list >> gpfsug-discuss at spectrumscale.org >> http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > -- Aaron Knister NASA Center for Climate Simulation (Code 606.2) Goddard Space Flight Center (301) 286-2776 From oehmes at us.ibm.com Mon Sep 19 21:43:31 2016 From: oehmes at us.ibm.com (Sven Oehme) Date: Mon, 19 Sep 2016 20:43:31 +0000 Subject: [gpfsug-discuss] EDR and omnipath In-Reply-To: <6f33c20f-ff8f-9ccc-4609-920b35058138@nasa.gov> Message-ID: Because they both require a different distribution of OFED, which are mutual exclusive to install. in theory if you deploy plain OFED it might work, but that will be hard to find somebody to support. Sent from IBM Verse Aaron Knister --- Re: [gpfsug-discuss] EDR and omnipath --- From:"Aaron Knister" To:gpfsug-discuss at spectrumscale.orgDate:Mon, Sep 19, 2016 1:35 PMSubject:Re: [gpfsug-discuss] EDR and omnipath I must admit, I'm curious as to why one cannot use GPFS with IB and OPA both in RDMA mode. Granted, I know very little about OPA but if it just presents as another verbs device I wonder why it wouldn't "Just work" as long as GPFS is configured correctly.-AaronOn 9/19/16 3:11 AM, Vic Cornell wrote:> Bump>> I can see no reason why that wouldn't work. But it would be nice to a> have an official answer or evidence that it works.>> Vic>>>> On 15 Sep 2016, at 5:49 pm, Brian Marshall > > wrote:>>>> All,>>>> I see in the GPFS FAQ A6.3 the statement below. 
Is it possible to>> have GPFS do RDMA over EDR infiniband and non-RDMA communication over>> omnipath (IP over fabric) when each NSD server has an EDR card and a>> OPA card installed?>>>>>>>> RDMA is not supported on a node when both Mellanox HCAs and Intel>> Omni-Path HFIs are enabled for RDMA.>>>> _______________________________________________>> gpfsug-discuss mailing list>> gpfsug-discuss at spectrumscale.org >> http://gpfsug.org/mailman/listinfo/gpfsug-discuss>>>> _______________________________________________> gpfsug-discuss mailing list> gpfsug-discuss at spectrumscale.org> http://gpfsug.org/mailman/listinfo/gpfsug-discuss>-- Aaron KnisterNASA Center for Climate Simulation (Code 606.2)Goddard Space Flight Center(301) 286-2776_______________________________________________gpfsug-discuss mailing listgpfsug-discuss at spectrumscale.orghttp://gpfsug.org/mailman/listinfo/gpfsug-discuss -------------- next part -------------- An HTML attachment was scrubbed... URL: From aaron.s.knister at nasa.gov Mon Sep 19 21:55:32 2016 From: aaron.s.knister at nasa.gov (Aaron Knister) Date: Mon, 19 Sep 2016 16:55:32 -0400 Subject: [gpfsug-discuss] EDR and omnipath In-Reply-To: References: Message-ID: Ah, that makes complete sense. Thanks! I had been doing some reading about OmniPath and for some reason was under the impression the OmniPath adapter could load itself as a driver under the verbs stack of OFED. Even so, that raises support concerns as you say. I wonder what folks are doing who have IB-based block storage fabrics but wanting to connect to OmniPath-based fabrics? I'm also curious how GNR customers would be able to serve both IB-based and an OmniPath-based fabrics over RDMA where performance is best? This is is along the lines of my GPFS protocol router question from the other day. -Aaron On 9/19/16 4:43 PM, Sven Oehme wrote: > Because they both require a different distribution of OFED, which are > mutual exclusive to install. > in theory if you deploy plain OFED it might work, but that will be hard > to find somebody to support. > > > Sent from IBM Verse > > Aaron Knister --- Re: [gpfsug-discuss] EDR and omnipath --- > > From: "Aaron Knister" > To: gpfsug-discuss at spectrumscale.org > Date: Mon, Sep 19, 2016 1:35 PM > Subject: Re: [gpfsug-discuss] EDR and omnipath > > ------------------------------------------------------------------------ > > I must admit, I'm curious as to why one cannot use GPFS with IB and OPA > both in RDMA mode. Granted, I know very little about OPA but if it just > presents as another verbs device I wonder why it wouldn't "Just work" as > long as GPFS is configured correctly. > > -Aaron > > On 9/19/16 3:11 AM, Vic Cornell wrote: >> Bump >> >> I can see no reason why that wouldn't work. But it would be nice to a >> have an official answer or evidence that it works. >> >> Vic >> >> >>> On 15 Sep 2016, at 5:49 pm, Brian Marshall >> > wrote: >>> >>> All, >>> >>> I see in the GPFS FAQ A6.3 the statement below. Is it possible to >>> have GPFS do RDMA over EDR infiniband and non-RDMA communication over >>> omnipath (IP over fabric) when each NSD server has an EDR card and a >>> OPA card installed? >>> >>> >>> >>> RDMA is not supported on a node when both Mellanox HCAs and Intel >>> Omni-Path HFIs are enabled for RDMA. 
>>> >>> _______________________________________________ >>> gpfsug-discuss mailing list >>> gpfsug-discuss at spectrumscale.org >>> http://gpfsug.org/mailman/listinfo/gpfsug-discuss >> >> >> >> _______________________________________________ >> gpfsug-discuss mailing list >> gpfsug-discuss at spectrumscale.org >> http://gpfsug.org/mailman/listinfo/gpfsug-discuss >> > > -- > Aaron Knister > NASA Center for Climate Simulation (Code 606.2) > Goddard Space Flight Center > (301) 286-2776 > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > > > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > -- Aaron Knister NASA Center for Climate Simulation (Code 606.2) Goddard Space Flight Center (301) 286-2776 From aaron.s.knister at nasa.gov Mon Sep 19 22:03:51 2016 From: aaron.s.knister at nasa.gov (Aaron Knister) Date: Mon, 19 Sep 2016 17:03:51 -0400 Subject: [gpfsug-discuss] EDR and omnipath In-Reply-To: References: Message-ID: <99103c73-baf0-f421-f64d-1d5ee916d340@nasa.gov> Here's where I read about the inter-operability of the two: http://www.intel.com/content/dam/www/public/us/en/documents/white-papers/omni-path-storage-white-paper.pdf This is what Intel says: > In a multi-homed file system server, or in a Lustre Networking (LNet) or IP router, a single OpenFabrics Al- liance (OFA) software environment supporting both an Intel OPA HFI and a Mellanox* InfiniBand HCA is required. The OFA software stack is architected to support multiple tar- geted network types. Currently, the OFA stack simultaneously supports iWARP for Ethernet, RDMA over Converged Ethernet (RoCE), and InfiniBand networks, and the Intel OPA network has been added to that list. As the OS distributions implement their OFA stacks, it will be validated to simultaneously support both Intel OPA Host > Intel is working closely with the major Linux distributors, including Red Hat* and SUSE*, to ensure that Intel OPA support is integrated into their OFA implementation. Once this is accomplished, then simultaneous Mellanox InfiniBand and Intel OPA support will be present in the standard Linux distributions. So it seems as though Intel is relying on the OS vendors to bridge the support gap between them and Mellanox. -Aaron On 9/19/16 4:55 PM, Aaron Knister wrote: > Ah, that makes complete sense. Thanks! > > I had been doing some reading about OmniPath and for some reason was > under the impression the OmniPath adapter could load itself as a driver > under the verbs stack of OFED. Even so, that raises support concerns as > you say. > > I wonder what folks are doing who have IB-based block storage fabrics > but wanting to connect to OmniPath-based fabrics? > > I'm also curious how GNR customers would be able to serve both IB-based > and an OmniPath-based fabrics over RDMA where performance is best? This > is is along the lines of my GPFS protocol router question from the other > day. > > -Aaron > > On 9/19/16 4:43 PM, Sven Oehme wrote: >> Because they both require a different distribution of OFED, which are >> mutual exclusive to install. >> in theory if you deploy plain OFED it might work, but that will be hard >> to find somebody to support. 
>> >> >> Sent from IBM Verse >> >> Aaron Knister --- Re: [gpfsug-discuss] EDR and omnipath --- >> >> From: "Aaron Knister" >> To: gpfsug-discuss at spectrumscale.org >> Date: Mon, Sep 19, 2016 1:35 PM >> Subject: Re: [gpfsug-discuss] EDR and omnipath >> >> ------------------------------------------------------------------------ >> >> I must admit, I'm curious as to why one cannot use GPFS with IB and OPA >> both in RDMA mode. Granted, I know very little about OPA but if it just >> presents as another verbs device I wonder why it wouldn't "Just work" as >> long as GPFS is configured correctly. >> >> -Aaron >> >> On 9/19/16 3:11 AM, Vic Cornell wrote: >>> Bump >>> >>> I can see no reason why that wouldn't work. But it would be nice to a >>> have an official answer or evidence that it works. >>> >>> Vic >>> >>> >>>> On 15 Sep 2016, at 5:49 pm, Brian Marshall >>> > wrote: >>>> >>>> All, >>>> >>>> I see in the GPFS FAQ A6.3 the statement below. Is it possible to >>>> have GPFS do RDMA over EDR infiniband and non-RDMA communication over >>>> omnipath (IP over fabric) when each NSD server has an EDR card and a >>>> OPA card installed? >>>> >>>> >>>> >>>> RDMA is not supported on a node when both Mellanox HCAs and Intel >>>> Omni-Path HFIs are enabled for RDMA. >>>> >>>> _______________________________________________ >>>> gpfsug-discuss mailing list >>>> gpfsug-discuss at spectrumscale.org >>>> http://gpfsug.org/mailman/listinfo/gpfsug-discuss >>> >>> >>> >>> _______________________________________________ >>> gpfsug-discuss mailing list >>> gpfsug-discuss at spectrumscale.org >>> http://gpfsug.org/mailman/listinfo/gpfsug-discuss >>> >> >> -- >> Aaron Knister >> NASA Center for Climate Simulation (Code 606.2) >> Goddard Space Flight Center >> (301) 286-2776 >> _______________________________________________ >> gpfsug-discuss mailing list >> gpfsug-discuss at spectrumscale.org >> http://gpfsug.org/mailman/listinfo/gpfsug-discuss >> >> >> >> >> _______________________________________________ >> gpfsug-discuss mailing list >> gpfsug-discuss at spectrumscale.org >> http://gpfsug.org/mailman/listinfo/gpfsug-discuss >> > -- Aaron Knister NASA Center for Climate Simulation (Code 606.2) Goddard Space Flight Center (301) 286-2776 From aaron.s.knister at nasa.gov Tue Sep 20 14:22:51 2016 From: aaron.s.knister at nasa.gov (Aaron Knister) Date: Tue, 20 Sep 2016 09:22:51 -0400 Subject: [gpfsug-discuss] GPFS Routers In-Reply-To: References: <5F910253243E6A47B81A9A2EB424BBA101D137D1@NDMSMBX404.ndc.nasa.gov> <712e5024-e6eb-9195-d9cd-f59a9b145e60@nasa.gov> Message-ID: Hi Marc, Currently we serve three disparate infiniband fabrics with three separate sets of NSD servers all connected via FC to backend storage. I was exploring the idea of flipping that on its head and having one set of NSD servers but would like something akin to Lustre LNET routers to connect each fabric to the back-end NSD servers over IB. I know there's IB routers out there now but I'm quite drawn to the idea of a GPFS equivalent of Lustre LNET routers, having used them in the past. I suppose I could always smush some extra HCAs in the NSD servers and do it that way but that got really ugly when I started factoring in omnipath. Something like an LNET router would also be useful for GNR users who would like to present to both an IB and an OmniPath fabric over RDMA. 
-Aaron On 9/12/16 10:48 AM, Marc A Kaplan wrote: > Perhaps if you clearly describe what equipment and connections you have > in place and what you're trying to accomplish, someone on this board can > propose a solution. > > In principle, it's always possible to insert proxies/routers to "fake" > any two endpoints into "believing" they are communicating directly. > > > > > > From: Aaron Knister > To: > Date: 09/11/2016 08:01 PM > Subject: Re: [gpfsug-discuss] GPFS Routers > Sent by: gpfsug-discuss-bounces at spectrumscale.org > ------------------------------------------------------------------------ > > > > After some googling around, I wonder if perhaps what I'm thinking of was > an I/O forwarding layer that I understood was being developed for x86_64 > type machines rather than some type of GPFS protocol router or proxy. > > -Aaron > > On 9/11/16 5:02 PM, Knister, Aaron S. (GSFC-606.2)[COMPUTER SCIENCE > CORP] wrote: >> Hi Everyone, >> >> A while back I seem to recall hearing about a mechanism being developed >> that would function similarly to Lustre's LNET routers and effectively >> allow a single set of NSD servers to talk to multiple RDMA fabrics >> without requiring the NSD servers to have infiniband interfaces on each >> RDMA fabric. Rather, one would have a set of GPFS gateway nodes on each >> fabric that would in effect proxy the RDMA requests to the NSD server. >> Does anyone know what I'm talking about? Just curious if it's still on >> the roadmap. >> >> -Aaron >> >> >> _______________________________________________ >> gpfsug-discuss mailing list >> gpfsug-discuss at spectrumscale.org >> http://gpfsug.org/mailman/listinfo/gpfsug-discuss >> > > -- > Aaron Knister > NASA Center for Climate Simulation (Code 606.2) > Goddard Space Flight Center > (301) 286-2776 > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > > > > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > -- Aaron Knister NASA Center for Climate Simulation (Code 606.2) Goddard Space Flight Center (301) 286-2776 From makaplan at us.ibm.com Tue Sep 20 15:01:49 2016 From: makaplan at us.ibm.com (Marc A Kaplan) Date: Tue, 20 Sep 2016 10:01:49 -0400 Subject: [gpfsug-discuss] GPFS Routers In-Reply-To: References: <5F910253243E6A47B81A9A2EB424BBA101D137D1@NDMSMBX404.ndc.nasa.gov><712e5024-e6eb-9195-d9cd-f59a9b145e60@nasa.gov> Message-ID: Thanks for spelling out the situation more clearly. This is beyond my knowledge and expertise. But perhaps some other participants on this forum will chime in! I may be missing something, but asking "What is Lustre LNET?" via google does not yield good answers. It would be helpful to have some graphics (pictures!) of typical, useful configurations. Limiting myself to a few minutes of searching, I couldn't find any. I "get" that Lustre users/admin with lots of nodes and several switching fabrics find it useful, but beyond that... I guess the answer will be "Performance!" -- but the obvious question is: Why not "just" use IP - that is the Internetworking Protocol! So rather than sweat over LNET, why not improve IP to work better over several IBs? >From a user/customer point of view where "I needed this yesterday", short of having an "LNET for GPFS", I suggest considering reconfiguring your nodes, switches, storage to get better performance. 
If you need to buy some more hardware, so be it. --marc From: Aaron Knister To: Date: 09/20/2016 09:23 AM Subject: Re: [gpfsug-discuss] GPFS Routers Sent by: gpfsug-discuss-bounces at spectrumscale.org Hi Marc, Currently we serve three disparate infiniband fabrics with three separate sets of NSD servers all connected via FC to backend storage. I was exploring the idea of flipping that on its head and having one set of NSD servers but would like something akin to Lustre LNET routers to connect each fabric to the back-end NSD servers over IB. I know there's IB routers out there now but I'm quite drawn to the idea of a GPFS equivalent of Lustre LNET routers, having used them in the past. I suppose I could always smush some extra HCAs in the NSD servers and do it that way but that got really ugly when I started factoring in omnipath. Something like an LNET router would also be useful for GNR users who would like to present to both an IB and an OmniPath fabric over RDMA. -Aaron On 9/12/16 10:48 AM, Marc A Kaplan wrote: > Perhaps if you clearly describe what equipment and connections you have > in place and what you're trying to accomplish, someone on this board can > propose a solution. > > In principle, it's always possible to insert proxies/routers to "fake" > any two endpoints into "believing" they are communicating directly. > > > > > > From: Aaron Knister > To: > Date: 09/11/2016 08:01 PM > Subject: Re: [gpfsug-discuss] GPFS Routers > Sent by: gpfsug-discuss-bounces at spectrumscale.org > ------------------------------------------------------------------------ > > > > After some googling around, I wonder if perhaps what I'm thinking of was > an I/O forwarding layer that I understood was being developed for x86_64 > type machines rather than some type of GPFS protocol router or proxy. > > -Aaron > > On 9/11/16 5:02 PM, Knister, Aaron S. (GSFC-606.2)[COMPUTER SCIENCE > CORP] wrote: >> Hi Everyone, >> >> A while back I seem to recall hearing about a mechanism being developed >> that would function similarly to Lustre's LNET routers and effectively >> allow a single set of NSD servers to talk to multiple RDMA fabrics >> without requiring the NSD servers to have infiniband interfaces on each >> RDMA fabric. Rather, one would have a set of GPFS gateway nodes on each >> fabric that would in effect proxy the RDMA requests to the NSD server. >> Does anyone know what I'm talking about? Just curious if it's still on >> the roadmap. >> >> -Aaron >> >> >> _______________________________________________ >> gpfsug-discuss mailing list >> gpfsug-discuss at spectrumscale.org >> http://gpfsug.org/mailman/listinfo/gpfsug-discuss >> > > -- > Aaron Knister > NASA Center for Climate Simulation (Code 606.2) > Goddard Space Flight Center > (301) 286-2776 > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > > > > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > -- Aaron Knister NASA Center for Climate Simulation (Code 606.2) Goddard Space Flight Center (301) 286-2776 _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From aaron.s.knister at nasa.gov Tue Sep 20 15:07:38 2016 From: aaron.s.knister at nasa.gov (Knister, Aaron S. (GSFC-606.2)[COMPUTER SCIENCE CORP]) Date: Tue, 20 Sep 2016 14:07:38 +0000 Subject: [gpfsug-discuss] GPFS Routers References: [gpfsug-discuss] GPFS Routers Message-ID: <5F910253243E6A47B81A9A2EB424BBA101D24844@NDMSMBX404.ndc.nasa.gov> Not sure if this image will go through but here's one I found: [X] The "Routers" are LNET routers. LNET is just the name of lustre's network stack. The LNET routers "route" the Lustre protocol between disparate network types (quadrics, Ethernet, myrinet, carrier pigeon). Packet loss on carrier pigeon is particularly brutal, though. From: Marc A Kaplan Sent: 9/20/16, 10:02 AM To: gpfsug main discussion list Subject: Re: [gpfsug-discuss] GPFS Routers Thanks for spelling out the situation more clearly. This is beyond my knowledge and expertise. But perhaps some other participants on this forum will chime in! I may be missing something, but asking "What is Lustre LNET?" via google does not yield good answers. It would be helpful to have some graphics (pictures!) of typical, useful configurations. Limiting myself to a few minutes of searching, I couldn't find any. I "get" that Lustre users/admin with lots of nodes and several switching fabrics find it useful, but beyond that... I guess the answer will be "Performance!" -- but the obvious question is: Why not "just" use IP - that is the Internetworking Protocol! So rather than sweat over LNET, why not improve IP to work better over several IBs? >From a user/customer point of view where "I needed this yesterday", short of having an "LNET for GPFS", I suggest considering reconfiguring your nodes, switches, storage to get better performance. If you need to buy some more hardware, so be it. --marc From: Aaron Knister To: Date: 09/20/2016 09:23 AM Subject: Re: [gpfsug-discuss] GPFS Routers Sent by: gpfsug-discuss-bounces at spectrumscale.org ________________________________ Hi Marc, Currently we serve three disparate infiniband fabrics with three separate sets of NSD servers all connected via FC to backend storage. I was exploring the idea of flipping that on its head and having one set of NSD servers but would like something akin to Lustre LNET routers to connect each fabric to the back-end NSD servers over IB. I know there's IB routers out there now but I'm quite drawn to the idea of a GPFS equivalent of Lustre LNET routers, having used them in the past. I suppose I could always smush some extra HCAs in the NSD servers and do it that way but that got really ugly when I started factoring in omnipath. Something like an LNET router would also be useful for GNR users who would like to present to both an IB and an OmniPath fabric over RDMA. -Aaron On 9/12/16 10:48 AM, Marc A Kaplan wrote: > Perhaps if you clearly describe what equipment and connections you have > in place and what you're trying to accomplish, someone on this board can > propose a solution. > > In principle, it's always possible to insert proxies/routers to "fake" > any two endpoints into "believing" they are communicating directly. 
> > > > > > From: Aaron Knister > To: > Date: 09/11/2016 08:01 PM > Subject: Re: [gpfsug-discuss] GPFS Routers > Sent by: gpfsug-discuss-bounces at spectrumscale.org > ------------------------------------------------------------------------ > > > > After some googling around, I wonder if perhaps what I'm thinking of was > an I/O forwarding layer that I understood was being developed for x86_64 > type machines rather than some type of GPFS protocol router or proxy. > > -Aaron > > On 9/11/16 5:02 PM, Knister, Aaron S. (GSFC-606.2)[COMPUTER SCIENCE > CORP] wrote: >> Hi Everyone, >> >> A while back I seem to recall hearing about a mechanism being developed >> that would function similarly to Lustre's LNET routers and effectively >> allow a single set of NSD servers to talk to multiple RDMA fabrics >> without requiring the NSD servers to have infiniband interfaces on each >> RDMA fabric. Rather, one would have a set of GPFS gateway nodes on each >> fabric that would in effect proxy the RDMA requests to the NSD server. >> Does anyone know what I'm talking about? Just curious if it's still on >> the roadmap. >> >> -Aaron >> >> >> _______________________________________________ >> gpfsug-discuss mailing list >> gpfsug-discuss at spectrumscale.org >> http://gpfsug.org/mailman/listinfo/gpfsug-discuss >> > > -- > Aaron Knister > NASA Center for Climate Simulation (Code 606.2) > Goddard Space Flight Center > (301) 286-2776 > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > > > > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > -- Aaron Knister NASA Center for Climate Simulation (Code 606.2) Goddard Space Flight Center (301) 286-2776 _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss -------------- next part -------------- An HTML attachment was scrubbed... URL: From aaron.s.knister at nasa.gov Tue Sep 20 15:08:46 2016 From: aaron.s.knister at nasa.gov (Knister, Aaron S. (GSFC-606.2)[COMPUTER SCIENCE CORP]) Date: Tue, 20 Sep 2016 14:08:46 +0000 Subject: [gpfsug-discuss] GPFS Routers References: [gpfsug-discuss] GPFS Routers Message-ID: <5F910253243E6A47B81A9A2EB424BBA101D24881@NDMSMBX404.ndc.nasa.gov> Looks like the attachment got scrubbed. Here's the link http://docplayer.net/docs-images/39/19199001/images/7-0.png[X] From: aaron.s.knister at nasa.gov Sent: 9/20/16, 10:07 AM To: gpfsug main discussion list, gpfsug main discussion list Subject: Re: [gpfsug-discuss] GPFS Routers Not sure if this image will go through but here's one I found: [X] The "Routers" are LNET routers. LNET is just the name of lustre's network stack. The LNET routers "route" the Lustre protocol between disparate network types (quadrics, Ethernet, myrinet, carrier pigeon). Packet loss on carrier pigeon is particularly brutal, though. From: Marc A Kaplan Sent: 9/20/16, 10:02 AM To: gpfsug main discussion list Subject: Re: [gpfsug-discuss] GPFS Routers Thanks for spelling out the situation more clearly. This is beyond my knowledge and expertise. But perhaps some other participants on this forum will chime in! I may be missing something, but asking "What is Lustre LNET?" via google does not yield good answers. It would be helpful to have some graphics (pictures!) 
of typical, useful configurations. Limiting myself to a few minutes of searching, I couldn't find any. I "get" that Lustre users/admin with lots of nodes and several switching fabrics find it useful, but beyond that... I guess the answer will be "Performance!" -- but the obvious question is: Why not "just" use IP - that is the Internetworking Protocol! So rather than sweat over LNET, why not improve IP to work better over several IBs? >From a user/customer point of view where "I needed this yesterday", short of having an "LNET for GPFS", I suggest considering reconfiguring your nodes, switches, storage to get better performance. If you need to buy some more hardware, so be it. --marc From: Aaron Knister To: Date: 09/20/2016 09:23 AM Subject: Re: [gpfsug-discuss] GPFS Routers Sent by: gpfsug-discuss-bounces at spectrumscale.org ________________________________ Hi Marc, Currently we serve three disparate infiniband fabrics with three separate sets of NSD servers all connected via FC to backend storage. I was exploring the idea of flipping that on its head and having one set of NSD servers but would like something akin to Lustre LNET routers to connect each fabric to the back-end NSD servers over IB. I know there's IB routers out there now but I'm quite drawn to the idea of a GPFS equivalent of Lustre LNET routers, having used them in the past. I suppose I could always smush some extra HCAs in the NSD servers and do it that way but that got really ugly when I started factoring in omnipath. Something like an LNET router would also be useful for GNR users who would like to present to both an IB and an OmniPath fabric over RDMA. -Aaron On 9/12/16 10:48 AM, Marc A Kaplan wrote: > Perhaps if you clearly describe what equipment and connections you have > in place and what you're trying to accomplish, someone on this board can > propose a solution. > > In principle, it's always possible to insert proxies/routers to "fake" > any two endpoints into "believing" they are communicating directly. > > > > > > From: Aaron Knister > To: > Date: 09/11/2016 08:01 PM > Subject: Re: [gpfsug-discuss] GPFS Routers > Sent by: gpfsug-discuss-bounces at spectrumscale.org > ------------------------------------------------------------------------ > > > > After some googling around, I wonder if perhaps what I'm thinking of was > an I/O forwarding layer that I understood was being developed for x86_64 > type machines rather than some type of GPFS protocol router or proxy. > > -Aaron > > On 9/11/16 5:02 PM, Knister, Aaron S. (GSFC-606.2)[COMPUTER SCIENCE > CORP] wrote: >> Hi Everyone, >> >> A while back I seem to recall hearing about a mechanism being developed >> that would function similarly to Lustre's LNET routers and effectively >> allow a single set of NSD servers to talk to multiple RDMA fabrics >> without requiring the NSD servers to have infiniband interfaces on each >> RDMA fabric. Rather, one would have a set of GPFS gateway nodes on each >> fabric that would in effect proxy the RDMA requests to the NSD server. >> Does anyone know what I'm talking about? Just curious if it's still on >> the roadmap. 
>> >> -Aaron >> >> >> _______________________________________________ >> gpfsug-discuss mailing list >> gpfsug-discuss at spectrumscale.org >> http://gpfsug.org/mailman/listinfo/gpfsug-discuss >> > > -- > Aaron Knister > NASA Center for Climate Simulation (Code 606.2) > Goddard Space Flight Center > (301) 286-2776 > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > > > > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > -- Aaron Knister NASA Center for Climate Simulation (Code 606.2) Goddard Space Flight Center (301) 286-2776 _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss -------------- next part -------------- An HTML attachment was scrubbed... URL: From makaplan at us.ibm.com Tue Sep 20 15:30:43 2016 From: makaplan at us.ibm.com (Marc A Kaplan) Date: Tue, 20 Sep 2016 10:30:43 -0400 Subject: [gpfsug-discuss] GPFS Routers In-Reply-To: <5F910253243E6A47B81A9A2EB424BBA101D24881@NDMSMBX404.ndc.nasa.gov> References: [gpfsug-discuss] GPFS Routers <5F910253243E6A47B81A9A2EB424BBA101D24881@NDMSMBX404.ndc.nasa.gov> Message-ID: Thanks. That example is simpler than I imagined. Question: If that was indeed your situation and you could afford it, why not just go totally with infiniband switching/routing? Are not the routers just a hack to connect Intel OPA to IB? Ref: https://community.mellanox.com/docs/DOC-2384#jive_content_id_Network_Topology_Design -------------- next part -------------- An HTML attachment was scrubbed... URL: From xhejtman at ics.muni.cz Tue Sep 20 16:07:12 2016 From: xhejtman at ics.muni.cz (Lukas Hejtmanek) Date: Tue, 20 Sep 2016 17:07:12 +0200 Subject: [gpfsug-discuss] CES and nfs pseudo root Message-ID: <20160920150712.2v73hsf7pzrqb3g4@ics.muni.cz> Hello, ganesha allows to specify pseudo root for each export using Pseudo="path". mmnfs export sets pseudo path the same as export dir, e.g., I want to export /mnt/nfs, Pseudo is set to '/mnt/nfs' as well. Can I set somehow Pseudo to '/'? -- Luk?? Hejtm?nek From stef.coene at docum.org Tue Sep 20 18:42:57 2016 From: stef.coene at docum.org (Stef Coene) Date: Tue, 20 Sep 2016 19:42:57 +0200 Subject: [gpfsug-discuss] Ubuntu client Message-ID: <5985d614-ebe5-8c85-ec4b-02961e074502@docum.org> Hi, I just installed 4.2.1 on 2 RHEL 7.2 servers without any issue. But I also need 2 clients on Ubuntu 14.04. I installed the GPFS client on the Ubuntu server and used mmbuildgpl to build the required kernel modules. ssh keys are exchanged between GPFS servers and the client. But I can't add the node: [root at gpfs01 ~]# mmaddnode -N client1 Tue Sep 20 19:40:09 CEST 2016: mmaddnode: Processing node client1 mmremote: The CCR environment could not be initialized on node client1. mmaddnode: The CCR environment could not be initialized on node client1. mmaddnode: mmaddnode quitting. None of the specified nodes are valid. mmaddnode: Command failed. Examine previous error messages to determine cause. I don't see any error in /var/mmfs on client and server. What can I try to debug this error? 
Stef From stef.coene at docum.org Tue Sep 20 18:47:47 2016 From: stef.coene at docum.org (Stef Coene) Date: Tue, 20 Sep 2016 19:47:47 +0200 Subject: [gpfsug-discuss] Ubuntu client In-Reply-To: <5985d614-ebe5-8c85-ec4b-02961e074502@docum.org> References: <5985d614-ebe5-8c85-ec4b-02961e074502@docum.org> Message-ID: <3727524d-aa94-a09e-ebf7-a5d4e1c6f301@docum.org> On 09/20/2016 07:42 PM, Stef Coene wrote: > Hi, > > I just installed 4.2.1 on 2 RHEL 7.2 servers without any issue. > But I also need 2 clients on Ubuntu 14.04. > I installed the GPFS client on the Ubuntu server and used mmbuildgpl to > build the required kernel modules. > ssh keys are exchanged between GPFS servers and the client. > > But I can't add the node: > [root at gpfs01 ~]# mmaddnode -N client1 > Tue Sep 20 19:40:09 CEST 2016: mmaddnode: Processing node client1 > mmremote: The CCR environment could not be initialized on node client1. > mmaddnode: The CCR environment could not be initialized on node client1. > mmaddnode: mmaddnode quitting. None of the specified nodes are valid. > mmaddnode: Command failed. Examine previous error messages to determine > cause. > > I don't see any error in /var/mmfs on client and server. > > What can I try to debug this error? Pfff, problem solved. I tailed the logs in /var/adm/ras and found out there was a type in /etc/hosts so the hostname of the client was unresolvable. Stef From YARD at il.ibm.com Tue Sep 20 20:03:39 2016 From: YARD at il.ibm.com (Yaron Daniel) Date: Tue, 20 Sep 2016 22:03:39 +0300 Subject: [gpfsug-discuss] Ubuntu client In-Reply-To: <5985d614-ebe5-8c85-ec4b-02961e074502@docum.org> References: <5985d614-ebe5-8c85-ec4b-02961e074502@docum.org> Message-ID: Hi Check that kernel symbols are installed too Regards Yaron Daniel 94 Em Ha'Moshavot Rd Server, Storage and Data Services - Team Leader Petach Tiqva, 49527 Global Technology Services Israel Phone: +972-3-916-5672 Fax: +972-3-916-5672 Mobile: +972-52-8395593 e-mail: yard at il.ibm.com IBM Israel From: Stef Coene To: gpfsug main discussion list Date: 09/20/2016 08:43 PM Subject: [gpfsug-discuss] Ubuntu client Sent by: gpfsug-discuss-bounces at spectrumscale.org Hi, I just installed 4.2.1 on 2 RHEL 7.2 servers without any issue. But I also need 2 clients on Ubuntu 14.04. I installed the GPFS client on the Ubuntu server and used mmbuildgpl to build the required kernel modules. ssh keys are exchanged between GPFS servers and the client. But I can't add the node: [root at gpfs01 ~]# mmaddnode -N client1 Tue Sep 20 19:40:09 CEST 2016: mmaddnode: Processing node client1 mmremote: The CCR environment could not be initialized on node client1. mmaddnode: The CCR environment could not be initialized on node client1. mmaddnode: mmaddnode quitting. None of the specified nodes are valid. mmaddnode: Command failed. Examine previous error messages to determine cause. I don't see any error in /var/mmfs on client and server. What can I try to debug this error? Stef _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... 
Name: not available Type: image/gif Size: 1851 bytes Desc: not available URL: From olaf.weiser at de.ibm.com Wed Sep 21 04:35:57 2016 From: olaf.weiser at de.ibm.com (Olaf Weiser) Date: Wed, 21 Sep 2016 05:35:57 +0200 Subject: [gpfsug-discuss] Ubuntu client In-Reply-To: References: <5985d614-ebe5-8c85-ec4b-02961e074502@docum.org> Message-ID: An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: not available Type: image/gif Size: 1851 bytes Desc: not available URL: From stef.coene at docum.org Wed Sep 21 07:03:05 2016 From: stef.coene at docum.org (Stef Coene) Date: Wed, 21 Sep 2016 08:03:05 +0200 Subject: [gpfsug-discuss] Ubuntu client In-Reply-To: References: <5985d614-ebe5-8c85-ec4b-02961e074502@docum.org> Message-ID: <01a37d7a-b5ef-cb3e-5ccb-d5f942df6487@docum.org> On 09/21/2016 05:35 AM, Olaf Weiser wrote: > CCR issues are often related to DNS issues, so check, that you Ubuntu > nodes can resolve the existing nodes accordingly and vise versa > in one line: .. all nodes must be resolvable on every node It was a type in the hostname and /etc/hosts. So problem solved. Stef From xhejtman at ics.muni.cz Wed Sep 21 20:09:32 2016 From: xhejtman at ics.muni.cz (Lukas Hejtmanek) Date: Wed, 21 Sep 2016 21:09:32 +0200 Subject: [gpfsug-discuss] CES NFS with Kerberos Message-ID: <20160921190932.fibmmccccs5kit6x@ics.muni.cz> Hello, does nfs server (ganesha) work for someone with Kerberos authentication? I got random permission denied: :/mnt/nfs-test/tmp# for i in `seq 1 20`; do rm testf; dd if=/dev/zero of=testf bs=1M count=100000; done 100000+0 records in 100000+0 records out 104857600000 bytes (105 GB) copied, 642.849 s, 163 MB/s 100000+0 records in 100000+0 records out 104857600000 bytes (105 GB) copied, 925.326 s, 113 MB/s 100000+0 records in 100000+0 records out 104857600000 bytes (105 GB) copied, 762.749 s, 137 MB/s 100000+0 records in 100000+0 records out 104857600000 bytes (105 GB) copied, 860.608 s, 122 MB/s 100000+0 records in 100000+0 records out 104857600000 bytes (105 GB) copied, 788.62 s, 133 MB/s dd: error writing ?testf?: Permission denied 51949+0 records in 51948+0 records out 54471426048 bytes (54 GB) copied, 566.667 s, 96.1 MB/s 100000+0 records in 100000+0 records out 104857600000 bytes (105 GB) copied, 1082.63 s, 96.9 MB/s 100000+0 records in 100000+0 records out 104857600000 bytes (105 GB) copied, 1080.65 s, 97.0 MB/s 100000+0 records in 100000+0 records out 104857600000 bytes (105 GB) copied, 949.683 s, 110 MB/s dd: error writing ?testf?: Permission denied 30076+0 records in 30075+0 records out 31535923200 bytes (32 GB) copied, 308.009 s, 102 MB/s dd: error writing ?testf?: Permission denied 89837+0 records in 89836+0 records out 94199873536 bytes (94 GB) copied, 976.368 s, 96.5 MB/s It seems that it is a bug in ganesha: http://permalink.gmane.org/gmane.comp.file-systems.nfs.ganesha.devel/2000 but it is still not resolved. -- Luk?? Hejtm?nek From Greg.Lehmann at csiro.au Wed Sep 21 23:34:09 2016 From: Greg.Lehmann at csiro.au (Greg.Lehmann at csiro.au) Date: Wed, 21 Sep 2016 22:34:09 +0000 Subject: [gpfsug-discuss] CES NFS with Kerberos In-Reply-To: <20160921190932.fibmmccccs5kit6x@ics.muni.cz> References: <20160921190932.fibmmccccs5kit6x@ics.muni.cz> Message-ID: <94bedafe10a9473c93b0bcc5d34cbea6@exch1-cdc.nexus.csiro.au> It may not be NFS. Check your GPFS logs too. 
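For example, something along these lines on the CES node serving the mount (a sketch only; mmfs.log.latest under /var/adm/ras is the usual spot):

tail -100 /var/adm/ras/mmfs.log.latest
# if the log shows nothing useful, a gpfs.snap from that node collects far more detail
gpfs.snap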
-----Original Message----- From: gpfsug-discuss-bounces at spectrumscale.org [mailto:gpfsug-discuss-bounces at spectrumscale.org] On Behalf Of Lukas Hejtmanek Sent: Thursday, 22 September 2016 5:10 AM To: gpfsug-discuss at spectrumscale.org Subject: [gpfsug-discuss] CES NFS with Kerberos Hello, does nfs server (ganesha) work for someone with Kerberos authentication? I got random permission denied: :/mnt/nfs-test/tmp# for i in `seq 1 20`; do rm testf; dd if=/dev/zero of=testf bs=1M count=100000; done 100000+0 records in 100000+0 records out 104857600000 bytes (105 GB) copied, 642.849 s, 163 MB/s 100000+0 records in 100000+0 records out 104857600000 bytes (105 GB) copied, 925.326 s, 113 MB/s 100000+0 records in 100000+0 records out 104857600000 bytes (105 GB) copied, 762.749 s, 137 MB/s 100000+0 records in 100000+0 records out 104857600000 bytes (105 GB) copied, 860.608 s, 122 MB/s 100000+0 records in 100000+0 records out 104857600000 bytes (105 GB) copied, 788.62 s, 133 MB/s dd: error writing ?testf?: Permission denied 51949+0 records in 51948+0 records out 54471426048 bytes (54 GB) copied, 566.667 s, 96.1 MB/s 100000+0 records in 100000+0 records out 104857600000 bytes (105 GB) copied, 1082.63 s, 96.9 MB/s 100000+0 records in 100000+0 records out 104857600000 bytes (105 GB) copied, 1080.65 s, 97.0 MB/s 100000+0 records in 100000+0 records out 104857600000 bytes (105 GB) copied, 949.683 s, 110 MB/s dd: error writing ?testf?: Permission denied 30076+0 records in 30075+0 records out 31535923200 bytes (32 GB) copied, 308.009 s, 102 MB/s dd: error writing ?testf?: Permission denied 89837+0 records in 89836+0 records out 94199873536 bytes (94 GB) copied, 976.368 s, 96.5 MB/s It seems that it is a bug in ganesha: http://permalink.gmane.org/gmane.comp.file-systems.nfs.ganesha.devel/2000 but it is still not resolved. -- Luk?? Hejtm?nek _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss From xhejtman at ics.muni.cz Thu Sep 22 09:25:09 2016 From: xhejtman at ics.muni.cz (Lukas Hejtmanek) Date: Thu, 22 Sep 2016 10:25:09 +0200 Subject: [gpfsug-discuss] CES NFS with Kerberos In-Reply-To: <94bedafe10a9473c93b0bcc5d34cbea6@exch1-cdc.nexus.csiro.au> References: <20160921190932.fibmmccccs5kit6x@ics.muni.cz> <94bedafe10a9473c93b0bcc5d34cbea6@exch1-cdc.nexus.csiro.au> Message-ID: <20160922082509.rc53tseeovjnixtz@ics.muni.cz> Hello, thanks, I do not see any error in GPFS logs. The link, I posted below is not related to GPFS at all, it seems that it is bug in ganesha. On Wed, Sep 21, 2016 at 10:34:09PM +0000, Greg.Lehmann at csiro.au wrote: > It may not be NFS. Check your GPFS logs too. > > -----Original Message----- > From: gpfsug-discuss-bounces at spectrumscale.org [mailto:gpfsug-discuss-bounces at spectrumscale.org] On Behalf Of Lukas Hejtmanek > Sent: Thursday, 22 September 2016 5:10 AM > To: gpfsug-discuss at spectrumscale.org > Subject: [gpfsug-discuss] CES NFS with Kerberos > > Hello, > > does nfs server (ganesha) work for someone with Kerberos authentication? 
> > I got random permission denied: > :/mnt/nfs-test/tmp# for i in `seq 1 20`; do rm testf; dd if=/dev/zero of=testf bs=1M count=100000; done > 100000+0 records in > 100000+0 records out > 104857600000 bytes (105 GB) copied, 642.849 s, 163 MB/s > 100000+0 records in > 100000+0 records out > 104857600000 bytes (105 GB) copied, 925.326 s, 113 MB/s > 100000+0 records in > 100000+0 records out > 104857600000 bytes (105 GB) copied, 762.749 s, 137 MB/s > 100000+0 records in > 100000+0 records out > 104857600000 bytes (105 GB) copied, 860.608 s, 122 MB/s > 100000+0 records in > 100000+0 records out > 104857600000 bytes (105 GB) copied, 788.62 s, 133 MB/s > dd: error writing ?testf?: Permission denied > 51949+0 records in > 51948+0 records out > 54471426048 bytes (54 GB) copied, 566.667 s, 96.1 MB/s > 100000+0 records in > 100000+0 records out > 104857600000 bytes (105 GB) copied, 1082.63 s, 96.9 MB/s > 100000+0 records in > 100000+0 records out > 104857600000 bytes (105 GB) copied, 1080.65 s, 97.0 MB/s > 100000+0 records in > 100000+0 records out > 104857600000 bytes (105 GB) copied, 949.683 s, 110 MB/s > dd: error writing ?testf?: Permission denied > 30076+0 records in > 30075+0 records out > 31535923200 bytes (32 GB) copied, 308.009 s, 102 MB/s > dd: error writing ?testf?: Permission denied > 89837+0 records in > 89836+0 records out > 94199873536 bytes (94 GB) copied, 976.368 s, 96.5 MB/s > > It seems that it is a bug in ganesha: > http://permalink.gmane.org/gmane.comp.file-systems.nfs.ganesha.devel/2000 > > but it is still not resolved. > > -- > Luk?? Hejtm?nek > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss -- Luk?? Hejtm?nek From stef.coene at docum.org Thu Sep 22 19:36:48 2016 From: stef.coene at docum.org (Stef Coene) Date: Thu, 22 Sep 2016 20:36:48 +0200 Subject: [gpfsug-discuss] Blocksize Message-ID: <4890d1b0-2f12-f8b6-684d-c98ca2b71ab7@docum.org> Hi, Is it needed to specify a different blocksize for the system pool that holds the metadata? IBM recommends a 1 MB blocksize for the file system. But I wonder a smaller blocksize (256 KB or so) for metadata is a good idea or not... Stef From eric.wonderley at vt.edu Thu Sep 22 20:07:30 2016 From: eric.wonderley at vt.edu (J. Eric Wonderley) Date: Thu, 22 Sep 2016 15:07:30 -0400 Subject: [gpfsug-discuss] Blocksize In-Reply-To: <4890d1b0-2f12-f8b6-684d-c98ca2b71ab7@docum.org> References: <4890d1b0-2f12-f8b6-684d-c98ca2b71ab7@docum.org> Message-ID: It defaults to 4k: mmlsfs testbs8M -i flag value description ------------------- ------------------------ ----------------------------------- -i 4096 Inode size in bytes I think you can make as small as 512b. Gpfs will store very small files in the inode. Typically you want your average file size to be your blocksize and your filesystem has one blocksize and one inodesize. On Thu, Sep 22, 2016 at 2:36 PM, Stef Coene wrote: > Hi, > > Is it needed to specify a different blocksize for the system pool that > holds the metadata? > > IBM recommends a 1 MB blocksize for the file system. > But I wonder a smaller blocksize (256 KB or so) for metadata is a good > idea or not... 
> > > Stef > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > -------------- next part -------------- An HTML attachment was scrubbed... URL: From ewahl at osc.edu Thu Sep 22 20:19:00 2016 From: ewahl at osc.edu (Wahl, Edward) Date: Thu, 22 Sep 2016 19:19:00 +0000 Subject: [gpfsug-discuss] Blocksize In-Reply-To: <4890d1b0-2f12-f8b6-684d-c98ca2b71ab7@docum.org> References: <4890d1b0-2f12-f8b6-684d-c98ca2b71ab7@docum.org> Message-ID: <9DA9EC7A281AC7428A9618AFDC49049958EFBB06@CIO-KRC-D1MBX02.osuad.osu.edu> This is a great idea. However there are quite a few other things to consider: -max file count? If you need say a couple of billion files, this will affect things. -wish to store small files in the system pool in late model SS/GPFS? -encryption? No data will be stored in the system pool so large blocks for small file storage in system is pointless. -system pool replication? -HDD vs SSD for system pool? -xxD or array tuning recommendations from your vendor? -streaming vs random IO? Do you have a single dedicated app that has performance like xxx? -probably more I can't think of off the top of my head. etc etc Ed ________________________________________ From: gpfsug-discuss-bounces at spectrumscale.org [gpfsug-discuss-bounces at spectrumscale.org] on behalf of Stef Coene [stef.coene at docum.org] Sent: Thursday, September 22, 2016 2:36 PM To: gpfsug main discussion list Subject: [gpfsug-discuss] Blocksize Hi, Is it needed to specify a different blocksize for the system pool that holds the metadata? IBM recommends a 1 MB blocksize for the file system. But I wonder a smaller blocksize (256 KB or so) for metadata is a good idea or not... Stef _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss From janfrode at tanso.net Thu Sep 22 20:25:03 2016 From: janfrode at tanso.net (Jan-Frode Myklebust) Date: Thu, 22 Sep 2016 21:25:03 +0200 Subject: [gpfsug-discuss] Blocksize In-Reply-To: <4890d1b0-2f12-f8b6-684d-c98ca2b71ab7@docum.org> References: <4890d1b0-2f12-f8b6-684d-c98ca2b71ab7@docum.org> Message-ID: https://www.ibm.com/developerworks/community/forums/html/topic?id=77777777-0000-0000-0000-000014774266 "Use 256K. Anything smaller makes allocation blocks for the inode file inefficient. Anything larger wastes space for directories. These are the two largest consumers of metadata space." --dlmcnabb A bit old, but I would assume it still applies. -jf On Thu, Sep 22, 2016 at 8:36 PM, Stef Coene wrote: > Hi, > > Is it needed to specify a different blocksize for the system pool that > holds the metadata? > > IBM recommends a 1 MB blocksize for the file system. > But I wonder a smaller blocksize (256 KB or so) for metadata is a good > idea or not... > > > Stef > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > -------------- next part -------------- An HTML attachment was scrubbed... URL: From stef.coene at docum.org Thu Sep 22 20:29:43 2016 From: stef.coene at docum.org (Stef Coene) Date: Thu, 22 Sep 2016 21:29:43 +0200 Subject: [gpfsug-discuss] Blocksize In-Reply-To: References: <4890d1b0-2f12-f8b6-684d-c98ca2b71ab7@docum.org> Message-ID: On 09/22/2016 09:07 PM, J. 
Eric Wonderley wrote: > It defaults to 4k: > mmlsfs testbs8M -i > flag value description > ------------------- ------------------------ > ----------------------------------- > -i 4096 Inode size in bytes > > I think you can make as small as 512b. Gpfs will store very small > files in the inode. > > Typically you want your average file size to be your blocksize and your > filesystem has one blocksize and one inodesize. The files are not small, but around 20 MB on average. So I calculated with IBM that a 1 MB or 2 MB block size is best. But I'm not sure if it's better to use a smaller block size for the metadata. The file system is not that large (400 TB) and will hold backup data from CommVault. Stef From luis.bolinches at fi.ibm.com Thu Sep 22 20:37:02 2016 From: luis.bolinches at fi.ibm.com (Luis Bolinches) Date: Thu, 22 Sep 2016 19:37:02 +0000 Subject: [gpfsug-discuss] Blocksize In-Reply-To: References: , <4890d1b0-2f12-f8b6-684d-c98ca2b71ab7@docum.org> Message-ID: An HTML attachment was scrubbed... URL: From S.J.Thompson at bham.ac.uk Thu Sep 22 21:02:24 2016 From: S.J.Thompson at bham.ac.uk (Simon Thompson (Research Computing - IT Services)) Date: Thu, 22 Sep 2016 20:02:24 +0000 Subject: [gpfsug-discuss] SSUG Meet the Devs - Cloud Workshop In-Reply-To: References: Message-ID: We are down to our last few places, so if you do intend to attend, I encourage you to register now! Simon ________________________________ From: gpfsug-discuss-bounces at spectrumscale.org [gpfsug-discuss-bounces at spectrumscale.org] on behalf of Secretary GPFS UG [secretary at gpfsug.org] Sent: 15 September 2016 09:42 To: gpfsug main discussion list Subject: [gpfsug-discuss] SSUG Meet the Devs - Cloud Workshop Hi everyone, Back by popular demand! We are holding a UK 'Meet the Developers' event to focus on Cloud topics. We are very lucky to have Dean Hildebrand, Master Inventor, Cloud Storage Software from IBM over in the UK to lead this session. IS IT FOR ME? Slightly different to the past meet the devs format, this is a cloud workshop aimed at looking how Spectrum Scale fits in the world of cloud. Rather than being a series of presentations and discussions led by IBM, this workshop aims to look at how Spectrum Scale can be used in cloud environments. This will include using Spectrum Scale as an infrastructure tool to power private cloud deployments. We will also look at the challenges of accessing data from cloud deployments and discuss ways in which this might be accomplished. If you are currently deploying OpenStack on Spectrum Scale, or plan to in the near future, then this workshop is for you. Also if you currently have Spectrum Scale and are wondering how you might get that data into cloud-enabled workloads or are currently doing so, then again you should attend. To ensure that the workshop is focused, numbers are limited and we will initially be limiting to 2 people per organisation/project/site. WHAT WILL BE DISCUSSED? Our topics for the day will include on-premise (private) clouds, on-premise self-service (public) clouds, off-premise clouds (Amazon etc.) as well as covering technologies including OpenStack, Docker, Kubernetes and security requirements around multi-tenancy. We probably don't have all the answers for these, but we'd like to understand the requirements and hear people's ideas. Please let us know what you would like to discuss when you register. Arrival is from 10:00 with discussion kicking off from 10:30. The agenda is open discussion though we do aim to talk over a number of key topics. 
We hope to have the ever popular (though usually late!) pizza for lunch. WHEN Thursday 20th October 2016 from 10:00 AM to 3:30 PM WHERE IT Services, University of Birmingham - Elms Road Edgbaston, Birmingham, B15 2TT REGISTER Please register for the event in advance: https://www.eventbrite.com/e/ssug-meet-the-devs-cloud-workshop-tickets-27725390389 Numbers are limited and we will initially be limiting to 2 people per organisation/project/site. We look forward to seeing you there! -- Claire O'Toole Spectrum Scale/GPFS User Group Secretary +44 (0)7508 033896 www.spectrumscaleug.org -------------- next part -------------- An HTML attachment was scrubbed... URL:
From makaplan at us.ibm.com Thu Sep 22 21:25:10 2016 From: makaplan at us.ibm.com (Marc A Kaplan) Date: Thu, 22 Sep 2016 16:25:10 -0400 Subject: [gpfsug-discuss] Blocksize and space and performance for Metadata, release 4.2.x In-Reply-To: References: <4890d1b0-2f12-f8b6-684d-c98ca2b71ab7@docum.org> Message-ID: There have been a few changes over the years that may invalidate some of the old advice about metadata and disk allocations there for. These have been phased in over the last few years, I am discussing the present situation for release 4.2.x 1) Inode size. Used to be 512. Now you can set the inodesize at mmcrfs time. Defaults to 4096. 2) Data in inode. If it fits, then the inode holds the data. Since a 512 byte inode still works, you can have more than 3.5KB of data in a 4KB inode. 3) Extended Attributes in Inode. Again, if it fits... Extended attributes used to be stored in a separate file of metadata. So extended attributes performance is way better than the old days. 4) (small) Directories in Inode. If it fits, the inode of a directory can hold the directory entries. That gives you about 2x performance on directory reads, for smallish directories. 5) Big directory blocks. Directories used to use a maximum of 32KB per block, potentially wasting a lot of space and yielding poor performance for large directories. Now directory blocks are the lesser of metadata-blocksize and 256KB. 6) Big directories are shrinkable. Used to be directories would grow in 32KB chunks but never shrink. Yup, even an almost(?) "empty" directory would remain the size the directory had to be at its lifetime maximum. That means just a few remaining entries could be "sprinkled" over many directory blocks. (See also 5.) But now directories autoshrink to avoid wasteful sparsity. Last I looked, the implementation just stopped short of "pushing" tiny directories back into the inode. But a huge directory can be shrunk down to a single (meta)data block. (See --compact in the docs.) --marc of GPFS -------------- next part -------------- An HTML attachment was scrubbed... URL:
From volobuev at us.ibm.com Thu Sep 22 21:49:32 2016 From: volobuev at us.ibm.com (Yuri L Volobuev) Date: Thu, 22 Sep 2016 13:49:32 -0700 Subject: Re: [gpfsug-discuss] Blocksize In-Reply-To: References: <4890d1b0-2f12-f8b6-684d-c98ca2b71ab7@docum.org> Message-ID: The current (V4.2+) levels of code support bigger directory block sizes, so it's no longer an issue with something like 1M metadata block size. In fact, there isn't a whole lot of difference between 256K and 1M metadata block sizes, either would work fine. There isn't really a downside in selecting a different block size for metadata though. Inode size (mmcrfs -i option) is orthogonal to the metadata block size selection. We do strongly recommend using 4K inodes to anyone. There's the obvious downside of needing more metadata storage for inodes, but the advantages are significant.
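As a minimal sketch only (device name, stanza file and block sizes are placeholders, not a recommendation for any particular array):

# create with an independent metadata block size and 4K inodes
mmcrfs gpfs0 -F nsd.stanza -B 1M --metadata-block-size 1M -i 4096
# confirm what was actually set
mmlsfs gpfs0 -B -i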
yuri From: Jan-Frode Myklebust To: gpfsug main discussion list , Date: 09/22/2016 12:25 PM Subject: Re: [gpfsug-discuss] Blocksize Sent by: gpfsug-discuss-bounces at spectrumscale.org https://www.ibm.com/developerworks/community/forums/html/topic?id=77777777-0000-0000-0000-000014774266 "Use 256K. Anything smaller makes allocation blocks for the inode file inefficient. Anything larger wastes space for directories. These are the two largest consumers of metadata space." --dlmcnabb A bit old, but I would assume it still applies. ? -jf On Thu, Sep 22, 2016 at 8:36 PM, Stef Coene wrote: Hi, Is it needed to specify a different blocksize for the system pool that holds the metadata? IBM recommends a 1 MB blocksize for the file system. But I wonder a smaller blocksize (256 KB or so) for metadata is a good idea or not... Stef _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: graycol.gif Type: image/gif Size: 105 bytes Desc: not available URL: From Mark.Bush at siriuscom.com Fri Sep 23 02:48:44 2016 From: Mark.Bush at siriuscom.com (Mark.Bush at siriuscom.com) Date: Fri, 23 Sep 2016 01:48:44 +0000 Subject: [gpfsug-discuss] Learn a new cluster Message-ID: <7E08F93E-F018-4C00-A78E-71EFDBEAC87C@siriuscom.com> What commands would you run to learn all you need to know about a cluster you?ve never seen before? Captain Obvious (me) says: mmlscluster mmlsconfig mmlsnode mmlsnsd mmlsfs all What others? Mark R. Bush | Solutions Architect This message (including any attachments) is intended only for the use of the individual or entity to which it is addressed and may contain information that is non-public, proprietary, privileged, confidential, and exempt from disclosure under applicable law. If you are not the intended recipient, you are hereby notified that any use, dissemination, distribution, or copying of this communication is strictly prohibited. This message may be viewed by parties at Sirius Computer Solutions other than those named in the message header. This message does not contain an official representation of Sirius Computer Solutions. If you have received this communication in error, notify Sirius Computer Solutions immediately and (i) destroy this message if a facsimile or (ii) delete this message immediately if this is an electronic communication. Thank you. Sirius Computer Solutions -------------- next part -------------- An HTML attachment was scrubbed... URL: From aaron.s.knister at nasa.gov Fri Sep 23 02:50:52 2016 From: aaron.s.knister at nasa.gov (Aaron Knister) Date: Thu, 22 Sep 2016 21:50:52 -0400 Subject: [gpfsug-discuss] Learn a new cluster In-Reply-To: <7E08F93E-F018-4C00-A78E-71EFDBEAC87C@siriuscom.com> References: <7E08F93E-F018-4C00-A78E-71EFDBEAC87C@siriuscom.com> Message-ID: <1ff27b42-aead-fafa-5415-520334d299c1@nasa.gov> Perhaps a gpfs.snap? This could tell you a *lot* about a cluster. -Aaron On 9/22/16 9:48 PM, Mark.Bush at siriuscom.com wrote: > What commands would you run to learn all you need to know about a > cluster you?ve never seen before? 
> > Captain Obvious (me) says: > > mmlscluster > > mmlsconfig > > mmlsnode > > mmlsnsd > > mmlsfs all > > > > What others? > > > > > > Mark R. Bush | Solutions Architect > > > > This message (including any attachments) is intended only for the use of > the individual or entity to which it is addressed and may contain > information that is non-public, proprietary, privileged, confidential, > and exempt from disclosure under applicable law. If you are not the > intended recipient, you are hereby notified that any use, dissemination, > distribution, or copying of this communication is strictly prohibited. > This message may be viewed by parties at Sirius Computer Solutions other > than those named in the message header. This message does not contain an > official representation of Sirius Computer Solutions. If you have > received this communication in error, notify Sirius Computer Solutions > immediately and (i) destroy this message if a facsimile or (ii) delete > this message immediately if this is an electronic communication. Thank you. > > Sirius Computer Solutions > > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > -- Aaron Knister NASA Center for Climate Simulation (Code 606.2) Goddard Space Flight Center (301) 286-2776 From Greg.Lehmann at csiro.au Fri Sep 23 02:53:14 2016 From: Greg.Lehmann at csiro.au (Greg.Lehmann at csiro.au) Date: Fri, 23 Sep 2016 01:53:14 +0000 Subject: [gpfsug-discuss] Learn a new cluster In-Reply-To: <7E08F93E-F018-4C00-A78E-71EFDBEAC87C@siriuscom.com> References: <7E08F93E-F018-4C00-A78E-71EFDBEAC87C@siriuscom.com> Message-ID: <40b22b40d6ed4e38be115e9f6ae8d48d@exch1-cdc.nexus.csiro.au> Nice question. I?d also look at the non-GPFS settings IBM recommend in various places like the FAQ for things like ssh, network, etc. The importance of these is variable depending on cluster size/network configuration etc. Cheers, Greg From: gpfsug-discuss-bounces at spectrumscale.org [mailto:gpfsug-discuss-bounces at spectrumscale.org] On Behalf Of Mark.Bush at siriuscom.com Sent: Friday, 23 September 2016 11:49 AM To: gpfsug main discussion list Subject: [gpfsug-discuss] Learn a new cluster What commands would you run to learn all you need to know about a cluster you?ve never seen before? Captain Obvious (me) says: mmlscluster mmlsconfig mmlsnode mmlsnsd mmlsfs all What others? Mark R. Bush | Solutions Architect This message (including any attachments) is intended only for the use of the individual or entity to which it is addressed and may contain information that is non-public, proprietary, privileged, confidential, and exempt from disclosure under applicable law. If you are not the intended recipient, you are hereby notified that any use, dissemination, distribution, or copying of this communication is strictly prohibited. This message may be viewed by parties at Sirius Computer Solutions other than those named in the message header. This message does not contain an official representation of Sirius Computer Solutions. If you have received this communication in error, notify Sirius Computer Solutions immediately and (i) destroy this message if a facsimile or (ii) delete this message immediately if this is an electronic communication. Thank you. Sirius Computer Solutions -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From ulmer at ulmer.org Fri Sep 23 17:31:59 2016 From: ulmer at ulmer.org (Stephen Ulmer) Date: Fri, 23 Sep 2016 12:31:59 -0400 Subject: [gpfsug-discuss] Learn a new cluster In-Reply-To: <1ff27b42-aead-fafa-5415-520334d299c1@nasa.gov> References: <7E08F93E-F018-4C00-A78E-71EFDBEAC87C@siriuscom.com> <1ff27b42-aead-fafa-5415-520334d299c1@nasa.gov> Message-ID: <078081B8-E50E-46BE-B3AC-4C1DB6D963E1@ulmer.org> This was going to be my exact suggestion. My short to-learn list includes learn how to look inside a gpfs.snap for what I want to know. I?ve found the ability to do this with other snapshot bundles very useful in the past (for example I?ve used snap on AIX rather than my own scripts in some cases). Do be aware the gpfs.snap (and actually most ?create a bundle for support? commands on most platforms) are a little heavy. Liberty, -- Stephen > On Sep 22, 2016, at 9:50 PM, Aaron Knister wrote: > > Perhaps a gpfs.snap? This could tell you a *lot* about a cluster. > > -Aaron > > On 9/22/16 9:48 PM, Mark.Bush at siriuscom.com wrote: >> What commands would you run to learn all you need to know about a >> cluster you?ve never seen before? >> >> Captain Obvious (me) says: >> >> mmlscluster >> >> mmlsconfig >> >> mmlsnode >> >> mmlsnsd >> >> mmlsfs all >> >> >> >> What others? >> >> >> >> >> >> Mark R. Bush | Solutions Architect >> >> >> >> This message (including any attachments) is intended only for the use of >> the individual or entity to which it is addressed and may contain >> information that is non-public, proprietary, privileged, confidential, >> and exempt from disclosure under applicable law. If you are not the >> intended recipient, you are hereby notified that any use, dissemination, >> distribution, or copying of this communication is strictly prohibited. >> This message may be viewed by parties at Sirius Computer Solutions other >> than those named in the message header. This message does not contain an >> official representation of Sirius Computer Solutions. If you have >> received this communication in error, notify Sirius Computer Solutions >> immediately and (i) destroy this message if a facsimile or (ii) delete >> this message immediately if this is an electronic communication. Thank you. >> >> Sirius Computer Solutions > >> >> >> _______________________________________________ >> gpfsug-discuss mailing list >> gpfsug-discuss at spectrumscale.org >> http://gpfsug.org/mailman/listinfo/gpfsug-discuss >> > > -- > Aaron Knister > NASA Center for Climate Simulation (Code 606.2) > Goddard Space Flight Center > (301) 286-2776 > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss -------------- next part -------------- An HTML attachment was scrubbed... URL: From ulmer at ulmer.org Fri Sep 23 20:16:06 2016 From: ulmer at ulmer.org (Stephen Ulmer) Date: Fri, 23 Sep 2016 15:16:06 -0400 Subject: [gpfsug-discuss] Blocksize In-Reply-To: References: <4890d1b0-2f12-f8b6-684d-c98ca2b71ab7@docum.org> Message-ID: <17781503-26B3-4448-B7B9-1EE27ABE6D1F@ulmer.org> Not to be too pedantic, but I believe the the subblock size is 1/32 of the block size (which strengthens Luis?s arguments below). I thought the the original question was NOT about inode size, but about metadata block size. You can specify that the system pool have a different block size from the rest of the filesystem, providing that it ONLY holds metadata (?metadata-block-size option to mmcrfs). 
So with 4K inodes (which should be used for all new filesystems without some counter-indication), I would think that we?d want to use a metadata block size of 4K*32=128K. This is independent of the regular block size, which you can calculate based on the workload if you?re lucky. There could be a great reason NOT to use 128K metadata block size, but I don?t know what it is. I?d be happy to be corrected about this if it?s out of whack. -- Stephen > On Sep 22, 2016, at 3:37 PM, Luis Bolinches wrote: > > Hi > > My 2 cents. > > Leave at least 4K inodes, then you get massive improvement on small files (less 3.5K minus whatever you use on xattr) > > About blocksize for data, unless you have actual data that suggest that you will actually benefit from smaller than 1MB block, leave there. GPFS uses sublocks where 1/16th of the BS can be allocated to different files, so the "waste" is much less than you think on 1MB and you get the throughput and less structures of much more data blocks. > > No warranty at all but I try to do this when the BS talk comes in: (might need some clean up it could not be last note but you get the idea) > > POSIX > find . -type f -name '*' -exec ls -l {} \; > find_ls_files.out > GPFS > cd /usr/lpp/mmfs/samples/ilm > gcc mmfindUtil_processOutputFile.c -o mmfindUtil_processOutputFile > ./mmfind /gpfs/shared -ls -type f > find_ls_files.out > CONVERT to CSV > > POSIX > cat find_ls_files.out | awk '{print $5","}' > find_ls_files.out.csv > GPFS > cat find_ls_files.out | awk '{print $7","}' > find_ls_files.out.csv > LOAD in octave > > FILESIZE = int32 (dlmread ("find_ls_files.out.csv", ",")); > Clean the second column (OPTIONAL as the next clean up will do the same) > > FILESIZE(:,[2]) = []; > If we are on 4K aligment we need to clean the files that go to inodes (WELL not exactly ... extended attributes! so maybe use a lower number!) > > FILESIZE(FILESIZE<=3584) =[]; > If we are not we need to clean the 0 size files > > FILESIZE(FILESIZE==0) =[]; > Median > > FILESIZEMEDIAN = int32 (median (FILESIZE)) > Mean > > FILESIZEMEAN = int32 (mean (FILESIZE)) > Variance > > int32 (var (FILESIZE)) > iqr interquartile range, i.e., the difference between the upper and lower quartile, of the input data. > > int32 (iqr (FILESIZE)) > Standard deviation > > > For some FS with lots of files you might need a rather powerful machine to run the calculations on octave, I never hit anything could not manage on a 64GB RAM Power box. Most of the times it is enough with my laptop. > > > > -- > Yst?v?llisin terveisin / Kind regards / Saludos cordiales / Salutations > > Luis Bolinches > Lab Services > http://www-03.ibm.com/systems/services/labservices/ > > IBM Laajalahdentie 23 (main Entrance) Helsinki, 00330 Finland > Phone: +358 503112585 > > "If you continually give you will continually have." Anonymous > > > ----- Original message ----- > From: Stef Coene > Sent by: gpfsug-discuss-bounces at spectrumscale.org > To: gpfsug main discussion list > Cc: > Subject: Re: [gpfsug-discuss] Blocksize > Date: Thu, Sep 22, 2016 10:30 PM > > On 09/22/2016 09:07 PM, J. Eric Wonderley wrote: > > It defaults to 4k: > > mmlsfs testbs8M -i > > flag value description > > ------------------- ------------------------ > > ----------------------------------- > > -i 4096 Inode size in bytes > > > > I think you can make as small as 512b. Gpfs will store very small > > files in the inode. > > > > Typically you want your average file size to be your blocksize and your > > filesystem has one blocksize and one inodesize. 
> > The files are not small, but around 20 MB on average. > So I calculated with IBM that a 1 MB or 2 MB block size is best. > > But I'm not sure if it's better to use a smaller block size for the > metadata. > > The file system is not that large (400 TB) and will hold backup data > from CommVault. > > > Stef > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > > > Ellei edell? ole toisin mainittu: / Unless stated otherwise above: > Oy IBM Finland Ab > PL 265, 00101 Helsinki, Finland > Business ID, Y-tunnus: 0195876-3 > Registered in Finland > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss -------------- next part -------------- An HTML attachment was scrubbed... URL: From oehmes at us.ibm.com Fri Sep 23 22:35:12 2016 From: oehmes at us.ibm.com (Sven Oehme) Date: Fri, 23 Sep 2016 14:35:12 -0700 Subject: [gpfsug-discuss] Blocksize In-Reply-To: <17781503-26B3-4448-B7B9-1EE27ABE6D1F@ulmer.org> References: <4890d1b0-2f12-f8b6-684d-c98ca2b71ab7@docum.org> <17781503-26B3-4448-B7B9-1EE27ABE6D1F@ulmer.org> Message-ID: your metadata block size these days should be 1 MB and there are only very few workloads for which you should run with a filesystem blocksize below 1 MB. so if you don't know exactly what to pick, 1 MB is a good starting point. the general rule still applies that your filesystem blocksize (metadata or data pool) should match your raid controller (or GNR vdisk) stripe size of the particular pool. so if you use a 128k strip size(defaut in many midrange storage controllers) in a 8+2p raid array, your stripe or track size is 1 MB and therefore the blocksize of this pool should be 1 MB. i see many customers in the field using 1MB or even smaller blocksize on RAID stripes of 2 MB or above and your performance will be significant impacted by that. Sven ------------------------------------------ Sven Oehme Scalable Storage Research email: oehmes at us.ibm.com Phone: +1 (408) 824-8904 IBM Almaden Research Lab ------------------------------------------ From: Stephen Ulmer To: gpfsug main discussion list Date: 09/23/2016 12:16 PM Subject: Re: [gpfsug-discuss] Blocksize Sent by: gpfsug-discuss-bounces at spectrumscale.org Not to be too pedantic, but I believe the the subblock size is 1/32 of the block size (which strengthens Luis?s arguments below). I thought the the original question was NOT about inode size, but about metadata block size. You can specify that the system pool have a different block size from the rest of the filesystem, providing that it ONLY holds metadata (?metadata-block-size option to mmcrfs). So with 4K inodes (which should be used for all new filesystems without some counter-indication), I would think that we?d want to use a metadata block size of 4K*32=128K. This is independent of the regular block size, which you can calculate based on the workload if you?re lucky. There could be a great reason NOT to use 128K metadata block size, but I don?t know what it is. I?d be happy to be corrected about this if it?s out of whack. -- Stephen On Sep 22, 2016, at 3:37 PM, Luis Bolinches < luis.bolinches at fi.ibm.com> wrote: Hi My 2 cents. 
Leave at least 4K inodes, then you get massive improvement on small files (less 3.5K minus whatever you use on xattr) About blocksize for data, unless you have actual data that suggest that you will actually benefit from smaller than 1MB block, leave there. GPFS uses sublocks where 1/16th of the BS can be allocated to different files, so the "waste" is much less than you think on 1MB and you get the throughput and less structures of much more data blocks. No warranty at all but I try to do this when the BS talk comes in: (might need some clean up it could not be last note but you get the idea) POSIX find . -type f -name '*' -exec ls -l {} \; > find_ls_files.out GPFS cd /usr/lpp/mmfs/samples/ilm gcc mmfindUtil_processOutputFile.c -o mmfindUtil_processOutputFile ./mmfind /gpfs/shared -ls -type f > find_ls_files.out CONVERT to CSV POSIX cat find_ls_files.out | awk '{print $5","}' > find_ls_files.out.csv GPFS cat find_ls_files.out | awk '{print $7","}' > find_ls_files.out.csv LOAD in octave FILESIZE = int32 (dlmread ("find_ls_files.out.csv", ",")); Clean the second column (OPTIONAL as the next clean up will do the same) FILESIZE(:,[2]) = []; If we are on 4K aligment we need to clean the files that go to inodes (WELL not exactly ... extended attributes! so maybe use a lower number!) FILESIZE(FILESIZE<=3584) =[]; If we are not we need to clean the 0 size files FILESIZE(FILESIZE==0) =[]; Median FILESIZEMEDIAN = int32 (median (FILESIZE)) Mean FILESIZEMEAN = int32 (mean (FILESIZE)) Variance int32 (var (FILESIZE)) iqr interquartile range, i.e., the difference between the upper and lower quartile, of the input data. int32 (iqr (FILESIZE)) Standard deviation For some FS with lots of files you might need a rather powerful machine to run the calculations on octave, I never hit anything could not manage on a 64GB RAM Power box. Most of the times it is enough with my laptop. -- Yst?v?llisin terveisin / Kind regards / Saludos cordiales / Salutations Luis Bolinches Lab Services http://www-03.ibm.com/systems/services/labservices/ IBM Laajalahdentie 23 (main Entrance) Helsinki, 00330 Finland Phone: +358 503112585 "If you continually give you will continually have." Anonymous ----- Original message ----- From: Stef Coene Sent by: gpfsug-discuss-bounces at spectrumscale.org To: gpfsug main discussion list Cc: Subject: Re: [gpfsug-discuss] Blocksize Date: Thu, Sep 22, 2016 10:30 PM On 09/22/2016 09:07 PM, J. Eric Wonderley wrote: > It defaults to 4k: > mmlsfs testbs8M -i > flag value description > ------------------- ------------------------ > ----------------------------------- > -i 4096 Inode size in bytes > > I think you can make as small as 512b. Gpfs will store very small > files in the inode. > > Typically you want your average file size to be your blocksize and your > filesystem has one blocksize and one inodesize. The files are not small, but around 20 MB on average. So I calculated with IBM that a 1 MB or 2 MB block size is best. But I'm not sure if it's better to use a smaller block size for the metadata. The file system is not that large (400 TB) and will hold backup data from CommVault. Stef _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss Ellei edell? 
ole toisin mainittu: / Unless stated otherwise above: Oy IBM Finland Ab PL 265, 00101 Helsinki, Finland Business ID, Y-tunnus: 0195876-3 Registered in Finland _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: graycol.gif Type: image/gif Size: 105 bytes Desc: not available URL: From stef.coene at docum.org Fri Sep 23 23:34:49 2016 From: stef.coene at docum.org (Stef Coene) Date: Sat, 24 Sep 2016 00:34:49 +0200 Subject: [gpfsug-discuss] Blocksize In-Reply-To: <4890d1b0-2f12-f8b6-684d-c98ca2b71ab7@docum.org> References: <4890d1b0-2f12-f8b6-684d-c98ca2b71ab7@docum.org> Message-ID: <1d7481bd-c08f-df14-4708-1c2e2a4ac1c0@docum.org> On 09/22/2016 08:36 PM, Stef Coene wrote: > Hi, > > Is it needed to specify a different blocksize for the system pool that > holds the metadata? > > IBM recommends a 1 MB blocksize for the file system. > But I wonder a smaller blocksize (256 KB or so) for metadata is a good > idea or not... I have read the replies and at the end, this is what we will do: Since the back-end storage will be V5000 with a default stripe size of 256KB and we use 8 data disk in an array, this means that 256KB * 8 = 2M is the best choice for block size. So 2 MB block size for data is the best choice. Since the block size for metadata is not that important in the latest releases, we will also go for 2 MB block size for metadata. Inode size will be left at the default: 4 KB. Stef From mimarsh2 at vt.edu Sat Sep 24 02:21:30 2016 From: mimarsh2 at vt.edu (Brian Marshall) Date: Fri, 23 Sep 2016 21:21:30 -0400 Subject: [gpfsug-discuss] Blocksize In-Reply-To: <1d7481bd-c08f-df14-4708-1c2e2a4ac1c0@docum.org> References: <4890d1b0-2f12-f8b6-684d-c98ca2b71ab7@docum.org> <1d7481bd-c08f-df14-4708-1c2e2a4ac1c0@docum.org> Message-ID: To keep this great chain going: If my metadata is on FLASH, would having a smaller blocksize for the system pool (metadata only) be helpful. My filesystem blocksize is 8MB On Fri, Sep 23, 2016 at 6:34 PM, Stef Coene wrote: > On 09/22/2016 08:36 PM, Stef Coene wrote: > >> Hi, >> >> Is it needed to specify a different blocksize for the system pool that >> holds the metadata? >> >> IBM recommends a 1 MB blocksize for the file system. >> But I wonder a smaller blocksize (256 KB or so) for metadata is a good >> idea or not... >> > I have read the replies and at the end, this is what we will do: > Since the back-end storage will be V5000 with a default stripe size of > 256KB and we use 8 data disk in an array, this means that 256KB * 8 = 2M is > the best choice for block size. > So 2 MB block size for data is the best choice. > > Since the block size for metadata is not that important in the latest > releases, we will also go for 2 MB block size for metadata. > > Inode size will be left at the default: 4 KB. > > > > Stef > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From luis.bolinches at fi.ibm.com Sat Sep 24 05:07:02 2016 From: luis.bolinches at fi.ibm.com (Luis Bolinches) Date: Sat, 24 Sep 2016 04:07:02 +0000 Subject: [gpfsug-discuss] Blocksize In-Reply-To: <17781503-26B3-4448-B7B9-1EE27ABE6D1F@ulmer.org> Message-ID: Not pendant but correct I flip there it is 1/32 -- Cheers > On 23 Sep 2016, at 22.16, Stephen Ulmer wrote: > > Not to be too pedantic, but I believe the the subblock size is 1/32 of the block size (which strengthens Luis?s arguments below). > > I thought the the original question was NOT about inode size, but about metadata block size. You can specify that the system pool have a different block size from the rest of the filesystem, providing that it ONLY holds metadata (?metadata-block-size option to mmcrfs). > > So with 4K inodes (which should be used for all new filesystems without some counter-indication), I would think that we?d want to use a metadata block size of 4K*32=128K. This is independent of the regular block size, which you can calculate based on the workload if you?re lucky. > > There could be a great reason NOT to use 128K metadata block size, but I don?t know what it is. I?d be happy to be corrected about this if it?s out of whack. > > -- > Stephen > > > >> On Sep 22, 2016, at 3:37 PM, Luis Bolinches wrote: >> >> Hi >> >> My 2 cents. >> >> Leave at least 4K inodes, then you get massive improvement on small files (less 3.5K minus whatever you use on xattr) >> >> About blocksize for data, unless you have actual data that suggest that you will actually benefit from smaller than 1MB block, leave there. GPFS uses sublocks where 1/16th of the BS can be allocated to different files, so the "waste" is much less than you think on 1MB and you get the throughput and less structures of much more data blocks. >> >> No warranty at all but I try to do this when the BS talk comes in: (might need some clean up it could not be last note but you get the idea) >> >> POSIX >> find . -type f -name '*' -exec ls -l {} \; > find_ls_files.out >> GPFS >> cd /usr/lpp/mmfs/samples/ilm >> gcc mmfindUtil_processOutputFile.c -o mmfindUtil_processOutputFile >> ./mmfind /gpfs/shared -ls -type f > find_ls_files.out >> CONVERT to CSV >> >> POSIX >> cat find_ls_files.out | awk '{print $5","}' > find_ls_files.out.csv >> GPFS >> cat find_ls_files.out | awk '{print $7","}' > find_ls_files.out.csv >> LOAD in octave >> >> FILESIZE = int32 (dlmread ("find_ls_files.out.csv", ",")); >> Clean the second column (OPTIONAL as the next clean up will do the same) >> >> FILESIZE(:,[2]) = []; >> If we are on 4K aligment we need to clean the files that go to inodes (WELL not exactly ... extended attributes! so maybe use a lower number!) >> >> FILESIZE(FILESIZE<=3584) =[]; >> If we are not we need to clean the 0 size files >> >> FILESIZE(FILESIZE==0) =[]; >> Median >> >> FILESIZEMEDIAN = int32 (median (FILESIZE)) >> Mean >> >> FILESIZEMEAN = int32 (mean (FILESIZE)) >> Variance >> >> int32 (var (FILESIZE)) >> iqr interquartile range, i.e., the difference between the upper and lower quartile, of the input data. >> >> int32 (iqr (FILESIZE)) >> Standard deviation >> >> >> For some FS with lots of files you might need a rather powerful machine to run the calculations on octave, I never hit anything could not manage on a 64GB RAM Power box. Most of the times it is enough with my laptop. 
>> >> >> >> -- >> Yst?v?llisin terveisin / Kind regards / Saludos cordiales / Salutations >> >> Luis Bolinches >> Lab Services >> http://www-03.ibm.com/systems/services/labservices/ >> >> IBM Laajalahdentie 23 (main Entrance) Helsinki, 00330 Finland >> Phone: +358 503112585 >> >> "If you continually give you will continually have." Anonymous >> >> >> ----- Original message ----- >> From: Stef Coene >> Sent by: gpfsug-discuss-bounces at spectrumscale.org >> To: gpfsug main discussion list >> Cc: >> Subject: Re: [gpfsug-discuss] Blocksize >> Date: Thu, Sep 22, 2016 10:30 PM >> >> On 09/22/2016 09:07 PM, J. Eric Wonderley wrote: >> > It defaults to 4k: >> > mmlsfs testbs8M -i >> > flag value description >> > ------------------- ------------------------ >> > ----------------------------------- >> > -i 4096 Inode size in bytes >> > >> > I think you can make as small as 512b. Gpfs will store very small >> > files in the inode. >> > >> > Typically you want your average file size to be your blocksize and your >> > filesystem has one blocksize and one inodesize. >> >> The files are not small, but around 20 MB on average. >> So I calculated with IBM that a 1 MB or 2 MB block size is best. >> >> But I'm not sure if it's better to use a smaller block size for the >> metadata. >> >> The file system is not that large (400 TB) and will hold backup data >> from CommVault. >> >> >> Stef >> _______________________________________________ >> gpfsug-discuss mailing list >> gpfsug-discuss at spectrumscale.org >> http://gpfsug.org/mailman/listinfo/gpfsug-discuss >> >> >> >> Ellei edell? ole toisin mainittu: / Unless stated otherwise above: >> Oy IBM Finland Ab >> PL 265, 00101 Helsinki, Finland >> Business ID, Y-tunnus: 0195876-3 >> Registered in Finland >> >> _______________________________________________ >> gpfsug-discuss mailing list >> gpfsug-discuss at spectrumscale.org >> http://gpfsug.org/mailman/listinfo/gpfsug-discuss > Ellei edell? ole toisin mainittu: / Unless stated otherwise above: Oy IBM Finland Ab PL 265, 00101 Helsinki, Finland Business ID, Y-tunnus: 0195876-3 Registered in Finland -------------- next part -------------- An HTML attachment was scrubbed... URL: From Kevin.Buterbaugh at Vanderbilt.Edu Sat Sep 24 15:18:38 2016 From: Kevin.Buterbaugh at Vanderbilt.Edu (Buterbaugh, Kevin L) Date: Sat, 24 Sep 2016 14:18:38 +0000 Subject: [gpfsug-discuss] Blocksize In-Reply-To: References: <4890d1b0-2f12-f8b6-684d-c98ca2b71ab7@docum.org> <17781503-26B3-4448-B7B9-1EE27ABE6D1F@ulmer.org> Message-ID: Hi Sven, I am confused by your statement that the metadata block size should be 1 MB and am very interested in learning the rationale behind this as I am currently looking at all aspects of our current GPFS configuration and the possibility of making major changes. If you have a filesystem with only metadataOnly disks in the system pool and the default size of an inode is 4K (which we would do, since we have recently discovered that even on our scratch filesystem we have a bazillion files that are 4K or smaller and could therefore have their data stored in the inode, right?), then why would you set the metadata block size to anything larger than 128K when a sub-block is 1/32nd of a block? I.e., with a 1 MB block size for metadata wouldn?t you be wasting a massive amount of space? What am I missing / confused about there? Oh, and here?s a related question ? let?s just say I have the above configuration ? my system pool is metadata only and is on SSD?s. 
Then I have two other dataOnly pools that are spinning disk. One is for ?regular? access and the other is the ?capacity? pool ? i.e. a pool of slower storage where we move files with large access times. I have a policy that says something like ?move all files with an access time > 6 months to the capacity pool.? Of those bazillion files less than 4K in size that are fitting in the inode currently, probably half a bazillion () of them would be subject to that rule. Will they get moved to the spinning disk capacity pool or will they stay in the inode?? Thanks! This is a very timely and interesting discussion for me as well... Kevin On Sep 23, 2016, at 4:35 PM, Sven Oehme > wrote: your metadata block size these days should be 1 MB and there are only very few workloads for which you should run with a filesystem blocksize below 1 MB. so if you don't know exactly what to pick, 1 MB is a good starting point. the general rule still applies that your filesystem blocksize (metadata or data pool) should match your raid controller (or GNR vdisk) stripe size of the particular pool. so if you use a 128k strip size(defaut in many midrange storage controllers) in a 8+2p raid array, your stripe or track size is 1 MB and therefore the blocksize of this pool should be 1 MB. i see many customers in the field using 1MB or even smaller blocksize on RAID stripes of 2 MB or above and your performance will be significant impacted by that. Sven ------------------------------------------ Sven Oehme Scalable Storage Research email: oehmes at us.ibm.com Phone: +1 (408) 824-8904 IBM Almaden Research Lab ------------------------------------------ Stephen Ulmer ---09/23/2016 12:16:34 PM---Not to be too pedantic, but I believe the the subblock size is 1/32 of the block size (which strengt From: Stephen Ulmer > To: gpfsug main discussion list > Date: 09/23/2016 12:16 PM Subject: Re: [gpfsug-discuss] Blocksize Sent by: gpfsug-discuss-bounces at spectrumscale.org ________________________________ Not to be too pedantic, but I believe the the subblock size is 1/32 of the block size (which strengthens Luis?s arguments below). I thought the the original question was NOT about inode size, but about metadata block size. You can specify that the system pool have a different block size from the rest of the filesystem, providing that it ONLY holds metadata (?metadata-block-size option to mmcrfs). So with 4K inodes (which should be used for all new filesystems without some counter-indication), I would think that we?d want to use a metadata block size of 4K*32=128K. This is independent of the regular block size, which you can calculate based on the workload if you?re lucky. There could be a great reason NOT to use 128K metadata block size, but I don?t know what it is. I?d be happy to be corrected about this if it?s out of whack. -- Stephen On Sep 22, 2016, at 3:37 PM, Luis Bolinches > wrote: Hi My 2 cents. Leave at least 4K inodes, then you get massive improvement on small files (less 3.5K minus whatever you use on xattr) About blocksize for data, unless you have actual data that suggest that you will actually benefit from smaller than 1MB block, leave there. GPFS uses sublocks where 1/16th of the BS can be allocated to different files, so the "waste" is much less than you think on 1MB and you get the throughput and less structures of much more data blocks. No warranty at all but I try to do this when the BS talk comes in: (might need some clean up it could not be last note but you get the idea) POSIX find . 
-type f -name '*' -exec ls -l {} \; > find_ls_files.out GPFS cd /usr/lpp/mmfs/samples/ilm gcc mmfindUtil_processOutputFile.c -o mmfindUtil_processOutputFile ./mmfind /gpfs/shared -ls -type f > find_ls_files.out CONVERT to CSV POSIX cat find_ls_files.out | awk '{print $5","}' > find_ls_files.out.csv GPFS cat find_ls_files.out | awk '{print $7","}' > find_ls_files.out.csv LOAD in octave FILESIZE = int32 (dlmread ("find_ls_files.out.csv", ",")); Clean the second column (OPTIONAL as the next clean up will do the same) FILESIZE(:,[2]) = []; If we are on 4K aligment we need to clean the files that go to inodes (WELL not exactly ... extended attributes! so maybe use a lower number!) FILESIZE(FILESIZE<=3584) =[]; If we are not we need to clean the 0 size files FILESIZE(FILESIZE==0) =[]; Median FILESIZEMEDIAN = int32 (median (FILESIZE)) Mean FILESIZEMEAN = int32 (mean (FILESIZE)) Variance int32 (var (FILESIZE)) iqr interquartile range, i.e., the difference between the upper and lower quartile, of the input data. int32 (iqr (FILESIZE)) Standard deviation For some FS with lots of files you might need a rather powerful machine to run the calculations on octave, I never hit anything could not manage on a 64GB RAM Power box. Most of the times it is enough with my laptop. -- Yst?v?llisin terveisin / Kind regards / Saludos cordiales / Salutations Luis Bolinches Lab Services http://www-03.ibm.com/systems/services/labservices/ IBM Laajalahdentie 23 (main Entrance) Helsinki, 00330 Finland Phone: +358 503112585 "If you continually give you will continually have." Anonymous ----- Original message ----- From: Stef Coene > Sent by: gpfsug-discuss-bounces at spectrumscale.org To: gpfsug main discussion list > Cc: Subject: Re: [gpfsug-discuss] Blocksize Date: Thu, Sep 22, 2016 10:30 PM On 09/22/2016 09:07 PM, J. Eric Wonderley wrote: > It defaults to 4k: > mmlsfs testbs8M -i > flag value description > ------------------- ------------------------ > ----------------------------------- > -i 4096 Inode size in bytes > > I think you can make as small as 512b. Gpfs will store very small > files in the inode. > > Typically you want your average file size to be your blocksize and your > filesystem has one blocksize and one inodesize. The files are not small, but around 20 MB on average. So I calculated with IBM that a 1 MB or 2 MB block size is best. But I'm not sure if it's better to use a smaller block size for the metadata. The file system is not that large (400 TB) and will hold backup data from CommVault. Stef _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss Ellei edell? ole toisin mainittu: / Unless stated otherwise above: Oy IBM Finland Ab PL 265, 00101 Helsinki, Finland Business ID, Y-tunnus: 0195876-3 Registered in Finland _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From makaplan at us.ibm.com Sat Sep 24 17:18:11 2016 From: makaplan at us.ibm.com (Marc A Kaplan) Date: Sat, 24 Sep 2016 12:18:11 -0400 Subject: [gpfsug-discuss] Blocksize and MetaData Blocksizes - FORGET the old advice In-Reply-To: References: <17781503-26B3-4448-B7B9-1EE27ABE6D1F@ulmer.org> Message-ID: Metadata is inodes, directories, indirect blocks (indices). Spectrum Scale (GPFS) Version 4.1 introduced significant improvements to the data structures used to represent directories. Larger inodes supporting data and extended attributes in the inode are other significant relatively recent improvements. Now small directories are stored in the inode, while for large directories blocks can be bigger than 32MB, and any and all directory blocks that are smaller than the metadata-blocksize, are allocated just like "fragments" - so directories are now space efficient. SO MUCH SO, that THE OLD ADVICE, about using smallish blocksizes for metadata, GOES "OUT THE WINDOW". Period. FORGET most of what you thought you knew about "best" or "optimal" metadata-blocksize. The new advice is, as Sven wrote: Use a blocksize that optimizes IO transfer efficiency and speed. This is true for BOTH data and metadata. Now, IF you have system pool set up as metadata only AND system pool is on devices that have a different "optimal" block size than your other pools, THEN, it may make sense to use two different blocksizes, one for data and another for metadata. For example, maybe you have massively striped RAID or RAID-LIKE (GSS or ESS)) storage for huge files - so maybe 8MB is a good blocksize for that. But maybe you have your metadata on SSD devices and maybe 1MB is the "best" blocksize for that. -------------- next part -------------- An HTML attachment was scrubbed... URL: From makaplan at us.ibm.com Sat Sep 24 18:31:37 2016 From: makaplan at us.ibm.com (Marc A Kaplan) Date: Sat, 24 Sep 2016 13:31:37 -0400 Subject: [gpfsug-discuss] Blocksize - consider IO transfer efficiency above your other prejudices In-Reply-To: References: <4890d1b0-2f12-f8b6-684d-c98ca2b71ab7@docum.org><17781503-26B3-4448-B7B9-1EE27ABE6D1F@ulmer.org> Message-ID: (I can answer your basic questions, Sven has more experience with tuning very large file systems, so perhaps he will have more to say...) 1. Inodes are packed into the file of inodes. (There is one file of all the inodes in a filesystem). If you have metadata-blocksize 1MB you will have 256 of 4KB inodes per block. Forget about sub-blocks when it comes to the file of inodes. 2. IF a file's data fits in its inode, then migrating that file from one pool to another just changes the preferred pool name in the inode. No data movement. Should the file later "grow" to require a data block, that data block will be allocated from whatever pool is named in the inode at that time. See the email I posted earlier today. Basically: FORGET what you thought you knew about optimal metadata blocksize (perhaps based on how you thought metadata was laid out on disk) and just stick to optimal IO transfer blocksizes. Yes, there may be contrived scenarios or even a few real live special cases, but those would be few and far between. Try following the newer general, easier, rule and see how well it works. 
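For illustration only, here is a minimal sketch of the kind of migration rule being discussed. The pool names 'fast' and 'capacity', the policy file name age-out.pol, the device name gpfs0 and the 180-day threshold are assumptions for the example, not anyone's actual configuration:

RULE 'age_out' MIGRATE FROM POOL 'fast' TO POOL 'capacity'
     WHERE (DAYS(CURRENT_TIMESTAMP) - DAYS(ACCESS_TIME)) > 180    /* in file age-out.pol */

mmapplypolicy gpfs0 -P age-out.pol -I yes

Per point 2 above, any file selected by such a rule whose data still fits in its inode only gets the preferred pool name in its inode updated; no data blocks move unless the file later grows.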
From: "Buterbaugh, Kevin L" To: gpfsug main discussion list Date: 09/24/2016 10:19 AM Subject: Re: [gpfsug-discuss] Blocksize Sent by: gpfsug-discuss-bounces at spectrumscale.org Hi Sven, I am confused by your statement that the metadata block size should be 1 MB and am very interested in learning the rationale behind this as I am currently looking at all aspects of our current GPFS configuration and the possibility of making major changes. If you have a filesystem with only metadataOnly disks in the system pool and the default size of an inode is 4K (which we would do, since we have recently discovered that even on our scratch filesystem we have a bazillion files that are 4K or smaller and could therefore have their data stored in the inode, right?), then why would you set the metadata block size to anything larger than 128K when a sub-block is 1/32nd of a block? I.e., with a 1 MB block size for metadata wouldn?t you be wasting a massive amount of space? What am I missing / confused about there? Oh, and here?s a related question ? let?s just say I have the above configuration ? my system pool is metadata only and is on SSD?s. Then I have two other dataOnly pools that are spinning disk. One is for ?regular? access and the other is the ?capacity? pool ? i.e. a pool of slower storage where we move files with large access times. I have a policy that says something like ?move all files with an access time > 6 months to the capacity pool.? Of those bazillion files less than 4K in size that are fitting in the inode currently, probably half a bazillion () of them would be subject to that rule. Will they get moved to the spinning disk capacity pool or will they stay in the inode?? Thanks! This is a very timely and interesting discussion for me as well... Kevin On Sep 23, 2016, at 4:35 PM, Sven Oehme wrote: your metadata block size these days should be 1 MB and there are only very few workloads for which you should run with a filesystem blocksize below 1 MB. so if you don't know exactly what to pick, 1 MB is a good starting point. the general rule still applies that your filesystem blocksize (metadata or data pool) should match your raid controller (or GNR vdisk) stripe size of the particular pool. so if you use a 128k strip size(defaut in many midrange storage controllers) in a 8+2p raid array, your stripe or track size is 1 MB and therefore the blocksize of this pool should be 1 MB. i see many customers in the field using 1MB or even smaller blocksize on RAID stripes of 2 MB or above and your performance will be significant impacted by that. Sven ------------------------------------------ Sven Oehme Scalable Storage Research email: oehmes at us.ibm.com Phone: +1 (408) 824-8904 IBM Almaden Research Lab ------------------------------------------ Stephen Ulmer ---09/23/2016 12:16:34 PM---Not to be too pedantic, but I believe the the subblock size is 1/32 of the block size (which strengt From: Stephen Ulmer To: gpfsug main discussion list Date: 09/23/2016 12:16 PM Subject: Re: [gpfsug-discuss] Blocksize Sent by: gpfsug-discuss-bounces at spectrumscale.org Not to be too pedantic, but I believe the the subblock size is 1/32 of the block size (which strengthens Luis?s arguments below). I thought the the original question was NOT about inode size, but about metadata block size. You can specify that the system pool have a different block size from the rest of the filesystem, providing that it ONLY holds metadata (?metadata-block-size option to mmcrfs). 
So with 4K inodes (which should be used for all new filesystems without some counter-indication), I would think that we?d want to use a metadata block size of 4K*32=128K. This is independent of the regular block size, which you can calculate based on the workload if you?re lucky. There could be a great reason NOT to use 128K metadata block size, but I don?t know what it is. I?d be happy to be corrected about this if it?s out of whack. -- Stephen On Sep 22, 2016, at 3:37 PM, Luis Bolinches wrote: Hi My 2 cents. Leave at least 4K inodes, then you get massive improvement on small files (less 3.5K minus whatever you use on xattr) About blocksize for data, unless you have actual data that suggest that you will actually benefit from smaller than 1MB block, leave there. GPFS uses sublocks where 1/16th of the BS can be allocated to different files, so the "waste" is much less than you think on 1MB and you get the throughput and less structures of much more data blocks. No warranty at all but I try to do this when the BS talk comes in: (might need some clean up it could not be last note but you get the idea) POSIX find . -type f -name '*' -exec ls -l {} \; > find_ls_files.out GPFS cd /usr/lpp/mmfs/samples/ilm gcc mmfindUtil_processOutputFile.c -o mmfindUtil_processOutputFile ./mmfind /gpfs/shared -ls -type f > find_ls_files.out CONVERT to CSV POSIX cat find_ls_files.out | awk '{print $5","}' > find_ls_files.out.csv GPFS cat find_ls_files.out | awk '{print $7","}' > find_ls_files.out.csv LOAD in octave FILESIZE = int32 (dlmread ("find_ls_files.out.csv", ",")); Clean the second column (OPTIONAL as the next clean up will do the same) FILESIZE(:,[2]) = []; If we are on 4K aligment we need to clean the files that go to inodes (WELL not exactly ... extended attributes! so maybe use a lower number!) FILESIZE(FILESIZE<=3584) =[]; If we are not we need to clean the 0 size files FILESIZE(FILESIZE==0) =[]; Median FILESIZEMEDIAN = int32 (median (FILESIZE)) Mean FILESIZEMEAN = int32 (mean (FILESIZE)) Variance int32 (var (FILESIZE)) iqr interquartile range, i.e., the difference between the upper and lower quartile, of the input data. int32 (iqr (FILESIZE)) Standard deviation For some FS with lots of files you might need a rather powerful machine to run the calculations on octave, I never hit anything could not manage on a 64GB RAM Power box. Most of the times it is enough with my laptop. -- Yst?v?llisin terveisin / Kind regards / Saludos cordiales / Salutations Luis Bolinches Lab Services http://www-03.ibm.com/systems/services/labservices/ IBM Laajalahdentie 23 (main Entrance) Helsinki, 00330 Finland Phone: +358 503112585 "If you continually give you will continually have." Anonymous ----- Original message ----- From: Stef Coene Sent by: gpfsug-discuss-bounces at spectrumscale.org To: gpfsug main discussion list Cc: Subject: Re: [gpfsug-discuss] Blocksize Date: Thu, Sep 22, 2016 10:30 PM On 09/22/2016 09:07 PM, J. Eric Wonderley wrote: > It defaults to 4k: > mmlsfs testbs8M -i > flag value description > ------------------- ------------------------ > ----------------------------------- > -i 4096 Inode size in bytes > > I think you can make as small as 512b. Gpfs will store very small > files in the inode. > > Typically you want your average file size to be your blocksize and your > filesystem has one blocksize and one inodesize. The files are not small, but around 20 MB on average. So I calculated with IBM that a 1 MB or 2 MB block size is best. 
But I'm not sure if it's better to use a smaller block size for the metadata. The file system is not that large (400 TB) and will hold backup data from CommVault. Stef _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss Ellei edellä ole toisin mainittu: / Unless stated otherwise above: Oy IBM Finland Ab PL 265, 00101 Helsinki, Finland Business ID, Y-tunnus: 0195876-3 Registered in Finland _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss -------------- next part -------------- An HTML attachment was scrubbed... URL:
From stef.coene at docum.org Sat Sep 24 19:16:49 2016 From: stef.coene at docum.org (Stef Coene) Date: Sat, 24 Sep 2016 20:16:49 +0200 Subject: [gpfsug-discuss] Maximum NSD size Message-ID: <239fd428-3544-917d-5439-d40ea36f0668@docum.org> Hi, When formatting the NSDs for a new file system, I noticed a warning about a maximum size: Formatting file system ... Disks up to size 8.8 TB can be added to storage pool system. Disks up to size 9.0 TB can be added to storage pool V5000. I searched the docs, but I couldn't find any reference regarding the maximum size of NSDs? Stef
From oehmes at gmail.com Sun Sep 25 17:25:40 2016 From: oehmes at gmail.com (Sven Oehme) Date: Sun, 25 Sep 2016 16:25:40 +0000 Subject: [gpfsug-discuss] Maximum NSD size In-Reply-To: <239fd428-3544-917d-5439-d40ea36f0668@docum.org> References: <239fd428-3544-917d-5439-d40ea36f0668@docum.org> Message-ID: The limit you see above is NOT the max NSD limit for Scale/GPFS; it's rather the limit on the NSD size you can add to this file system's pool. Depending on which version of code you are running, we limit the maximum size of an NSD that can be added to a pool so you don't have mixtures of, let's say, 1 TB and 100 TB disks in one pool, as this will negatively affect performance. In older versions we were more restrictive than in newer versions. Sven On Sat, Sep 24, 2016 at 11:16 AM Stef Coene wrote: > Hi, > > When formatting the NSDs for a new file system, I noticed a warning about > a maximum size: > > Formatting file system ... > Disks up to size 8.8 TB can be added to storage pool system. > Disks up to size 9.0 TB can be added to storage pool V5000. > > I searched the docs, but I couldn't find any reference regarding the > maximum size of NSDs? > > > Stef > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > -------------- next part -------------- An HTML attachment was scrubbed... 
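As a quick sanity check before adding disks (a sketch only; gpfs0 is a placeholder device name):

mmdf gpfs0

mmdf lists each NSD with its size, grouped by storage pool, so you can see how far a new, larger NSD would be from the disks already in a pool; I believe recent releases also print the maximum disk size currently allowed for each pool in the mmdf header, which is the same limit Sven describes above.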
URL: From oehmes at gmail.com Sun Sep 25 18:11:12 2016 From: oehmes at gmail.com (Sven Oehme) Date: Sun, 25 Sep 2016 17:11:12 +0000 Subject: [gpfsug-discuss] Blocksize In-Reply-To: References: <4890d1b0-2f12-f8b6-684d-c98ca2b71ab7@docum.org> <17781503-26B3-4448-B7B9-1EE27ABE6D1F@ulmer.org> Message-ID: well, its not that easy and there is no perfect answer here. so lets start with some data points that might help decide: inodes, directory blocks, allocation maps for data as well as metadata don't follow the same restrictions as data 'fragments' or subblocks, means they are not bond to the 1/32 of the blocksize. they rather get organized on calculated sized blocks which can be very small (significant smaller than 1/32th) or close to the max of the blocksize for a single object. therefore the space waste concern doesn't really apply here. policy scans loves larger blocks as the blocks will be randomly scattered across the NSD's and therefore larger contiguous blocks for inode scan will perform significantly faster on larger metadata blocksizes than on smaller (assuming this is disk, with SSD's this doesn't matter that much) so for disk based systems it is advantageous to use larger blocks , for SSD based its less of an issue. you shouldn't choose on the other hand too large blocks even for disk drive based systems as there is one catch to all this. small updates on metadata typically end up writing the whole metadata block e.g. 256k for a directory block which now need to be destaged and read back from another node changing the same block. hope this helps. Sven On Sat, Sep 24, 2016 at 7:18 AM Buterbaugh, Kevin L < Kevin.Buterbaugh at vanderbilt.edu> wrote: > Hi Sven, > > I am confused by your statement that the metadata block size should be 1 > MB and am very interested in learning the rationale behind this as I am > currently looking at all aspects of our current GPFS configuration and the > possibility of making major changes. > > If you have a filesystem with only metadataOnly disks in the system pool > and the default size of an inode is 4K (which we would do, since we have > recently discovered that even on our scratch filesystem we have a bazillion > files that are 4K or smaller and could therefore have their data stored in > the inode, right?), then why would you set the metadata block size to > anything larger than 128K when a sub-block is 1/32nd of a block? I.e., > with a 1 MB block size for metadata wouldn?t you be wasting a massive > amount of space? > > What am I missing / confused about there? > > Oh, and here?s a related question ? let?s just say I have the above > configuration ? my system pool is metadata only and is on SSD?s. Then I > have two other dataOnly pools that are spinning disk. One is for ?regular? > access and the other is the ?capacity? pool ? i.e. a pool of slower storage > where we move files with large access times. I have a policy that says > something like ?move all files with an access time > 6 months to the > capacity pool.? Of those bazillion files less than 4K in size that are > fitting in the inode currently, probably half a bazillion () of them > would be subject to that rule. Will they get moved to the spinning disk > capacity pool or will they stay in the inode?? > > Thanks! This is a very timely and interesting discussion for me as well... 
> > Kevin > > On Sep 23, 2016, at 4:35 PM, Sven Oehme wrote: > > your metadata block size these days should be 1 MB and there are only very > few workloads for which you should run with a filesystem blocksize below 1 > MB. so if you don't know exactly what to pick, 1 MB is a good starting > point. > the general rule still applies that your filesystem blocksize (metadata or > data pool) should match your raid controller (or GNR vdisk) stripe size of > the particular pool. > > so if you use a 128k strip size(defaut in many midrange storage > controllers) in a 8+2p raid array, your stripe or track size is 1 MB and > therefore the blocksize of this pool should be 1 MB. i see many customers > in the field using 1MB or even smaller blocksize on RAID stripes of 2 MB or > above and your performance will be significant impacted by that. > > Sven > > ------------------------------------------ > Sven Oehme > Scalable Storage Research > email: oehmes at us.ibm.com > Phone: +1 (408) 824-8904 > IBM Almaden Research Lab > ------------------------------------------ > > Stephen Ulmer ---09/23/2016 12:16:34 PM---Not to be too > pedantic, but I believe the the subblock size is 1/32 of the block size > (which strengt > > > > From: Stephen Ulmer > To: gpfsug main discussion list > Date: 09/23/2016 12:16 PM > Subject: Re: [gpfsug-discuss] Blocksize > Sent by: gpfsug-discuss-bounces at spectrumscale.org > > ------------------------------ > > > > Not to be too pedantic, but I believe the the subblock size is 1/32 of the > block size (which strengthens Luis?s arguments below). > > I thought the the original question was NOT about inode size, but about > metadata block size. You can specify that the system pool have a different > block size from the rest of the filesystem, providing that it ONLY holds > metadata (?metadata-block-size option to mmcrfs). > > So with 4K inodes (which should be used for all new filesystems without > some counter-indication), I would think that we?d want to use a metadata > block size of 4K*32=128K. This is independent of the regular block size, > which you can calculate based on the workload if you?re lucky. > > There could be a great reason NOT to use 128K metadata block size, but I > don?t know what it is. I?d be happy to be corrected about this if it?s out > of whack. > > -- > Stephen > > > > On Sep 22, 2016, at 3:37 PM, Luis Bolinches < > *luis.bolinches at fi.ibm.com* > wrote: > > Hi > > My 2 cents. > > Leave at least 4K inodes, then you get massive improvement on small > files (less 3.5K minus whatever you use on xattr) > > About blocksize for data, unless you have actual data that suggest > that you will actually benefit from smaller than 1MB block, leave there. > GPFS uses sublocks where 1/16th of the BS can be allocated to different > files, so the "waste" is much less than you think on 1MB and you get the > throughput and less structures of much more data blocks. > > No* warranty at all* but I try to do this when the BS talk comes > in: (might need some clean up it could not be last note but you get the > idea) > > POSIX > find . 
-type f -name '*' -exec ls -l {} \; > find_ls_files.out > GPFS > cd /usr/lpp/mmfs/samples/ilm > gcc mmfindUtil_processOutputFile.c -o mmfindUtil_processOutputFile > ./mmfind /gpfs/shared -ls -type f > find_ls_files.out > CONVERT to CSV > > POSIX > cat find_ls_files.out | awk '{print $5","}' > find_ls_files.out.csv > GPFS > cat find_ls_files.out | awk '{print $7","}' > find_ls_files.out.csv > LOAD in octave > > FILESIZE = int32 (dlmread ("find_ls_files.out.csv", ",")); > Clean the second column (OPTIONAL as the next clean up will do the > same) > > FILESIZE(:,[2]) = []; > If we are on 4K aligment we need to clean the files that go to > inodes (WELL not exactly ... extended attributes! so maybe use a lower > number!) > > FILESIZE(FILESIZE<=3584) =[]; > If we are not we need to clean the 0 size files > > FILESIZE(FILESIZE==0) =[]; > Median > > FILESIZEMEDIAN = int32 (median (FILESIZE)) > Mean > > FILESIZEMEAN = int32 (mean (FILESIZE)) > Variance > > int32 (var (FILESIZE)) > iqr interquartile range, i.e., the difference between the upper and > lower quartile, of the input data. > > int32 (iqr (FILESIZE)) > Standard deviation > > > For some FS with lots of files you might need a rather powerful > machine to run the calculations on octave, I never hit anything could not > manage on a 64GB RAM Power box. Most of the times it is enough with my > laptop. > > > > -- > Yst?v?llisin terveisin / Kind regards / Saludos cordiales / > Salutations > > Luis Bolinches > Lab Services > *http://www-03.ibm.com/systems/services/labservices/* > > > IBM Laajalahdentie 23 (main Entrance) Helsinki, 00330 Finland > Phone: +358 503112585 > > "If you continually give you will continually have." Anonymous > > > ----- Original message ----- > From: Stef Coene <*stef.coene at docum.org* > > Sent by: *gpfsug-discuss-bounces at spectrumscale.org* > > To: gpfsug main discussion list <*gpfsug-discuss at spectrumscale.org* > > > Cc: > Subject: Re: [gpfsug-discuss] Blocksize > Date: Thu, Sep 22, 2016 10:30 PM > > On 09/22/2016 09:07 PM, J. Eric Wonderley wrote: > > It defaults to 4k: > > mmlsfs testbs8M -i > > flag value description > > ------------------- ------------------------ > > ----------------------------------- > > -i 4096 Inode size in bytes > > > > I think you can make as small as 512b. Gpfs will store very > small > > files in the inode. > > > > Typically you want your average file size to be your blocksize > and your > > filesystem has one blocksize and one inodesize. > > The files are not small, but around 20 MB on average. > So I calculated with IBM that a 1 MB or 2 MB block size is best. > > But I'm not sure if it's better to use a smaller block size for the > metadata. > > The file system is not that large (400 TB) and will hold backup data > from CommVault. > > > Stef > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at *spectrumscale.org* > *http://gpfsug.org/mailman/listinfo/gpfsug-discuss* > > > > > Ellei edell? 
ole toisin mainittu: / Unless stated otherwise above: > Oy IBM Finland Ab > PL 265, 00101 Helsinki, Finland > Business ID, Y-tunnus: 0195876-3 > Registered in Finland > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > -------------- next part -------------- An HTML attachment was scrubbed... URL:
From alandhae at gmx.de Mon Sep 26 08:53:48 2016 From: alandhae at gmx.de (Andreas Landhäußer) Date: Mon, 26 Sep 2016 09:53:48 +0200 (CEST) Subject: [gpfsug-discuss] File-Access Reporting Message-ID: Hello all GPFS, ehm, Spectrum Scale experts out there, we are using GPFS as the file system for a new data application. They have defined the need for reports on transfer volume [or file access]: by user, ..., by service, by product type ... at least on a daily basis. They need a report covering: file open, file close, or requestEndTime, requestDuration, fileProductName [path and filename], dataSize, userId. I could think of using sysstat (sar) to get some of the numbers, but I am not sure whether the numbers we would receive are correct. Andreas -- Andreas Landhäußer +49 151 12133027 (mobile) alandhae at gmx.de
From alandhae at gmx.de Mon Sep 26 13:12:18 2016 From: alandhae at gmx.de (Andreas Landhäußer) Date: Mon, 26 Sep 2016 14:12:18 +0200 (CEST) Subject: [gpfsug-discuss] File_heat for GPFS File Systems Questions over Questions ... Message-ID: Hello GPFS experts, a customer wants a report about usage, including file_heat, in a large file system. The report should be produced every month. mmchconfig fileHeatLossPercent=10,fileHeatPeriodMinutes=30240 -i fileHeatPeriodMinutes=30240 equals 21 days. I'm wondering about the behavior of fileHeatLossPercent. - If it is set to 10, will file_heat decrease from 1 to 0 in 10 steps? - Or does file_heat behave asymptotically, so that heat 0 is never reached? Either way the results will be similar ;-) the latter just takes longer. We want to produce the following file lists: - File_Heat > 50% -> rather hot data - File_Heat between 20% and 50% -> lukewarm data - File_Heat between 0% and 20% -> ice cold data We will have to work on the limits between the File_Heat classes, depending on the customer's wishes. Are there better parameter settings for achieving this? Do any scripts/programs exist for analyzing the file_heat data? We have observed that during policy runs on a large GPFS file system the metadata performance drops significantly until the job is finished. A run took about 15 minutes on an 880 TB GPFS file system with 150 million entries. What is the behavior when file_heat is first switched on? Do all files in the GPFS have the same temperature? 
Thanks for your help Ciao Andreas -- Andreas Landh?u?er +49 151 12133027 (mobile) alandhae at gmx.de From makaplan at us.ibm.com Mon Sep 26 16:11:52 2016 From: makaplan at us.ibm.com (Marc A Kaplan) Date: Mon, 26 Sep 2016 11:11:52 -0400 Subject: [gpfsug-discuss] File_heat for GPFS File Systems In-Reply-To: References: Message-ID: fileHeatLossPercent=10, fileHeatPeriodMinutes=1440 means any file that has not been accessed for 1440 minutes (24 hours = 1 day) will lose 10% of its Heat. So if it's heat was X at noon today, tomorrow 0.90 X, the next day 0.81X, on the k'th day (.90)**k * X. After 63 fileHeatPeriods, we always round down and compute file heat as 0.0. The computation (in floating point with some approximations) is done "on demand" based on a heat value stored in the Inode the last time the unix access "atime" and the current time. So the cost of maintaining FILE_HEAT for a file is some bit twiddling, but only when the file is accessed and the atime would be updated in the inode anyway. File heat increases by approximately 1.0 each time the entire file is read from disk. This is done proportionately so if you read in half of the blocks the increase is 0.5. If you read all the blocks twice FROM DISK the file heat is increased by 2. And so on. But only IOPs are charged. If you repeatedly do posix read()s but the data is in cache, no heat is added. The easiest way to observe FILE_HEAT is with the mmapplypolicy directory -I test -L 2 -P fileheatrule.policy RULE 'fileheatrule' LIST 'hot' SHOW('Heat=' || varchar(FILE_HEAT)) /* in file fileheatfule.policy */ Because policy reads metadata from inodes as stored on disk, when experimenting/testing you may need to mmfsctl fs suspend-write; mmfsctl fs resume to see results immediately. From: Andreas Landh?u?er To: gpfsug-discuss at spectrumscale.org Date: 09/26/2016 08:12 AM Subject: [gpfsug-discuss] File_heat for GPFS File Systems Questions over Questions ... Sent by: gpfsug-discuss-bounces at spectrumscale.org Hello GPFS experts, customer wanting a report about the usage of the usage including file_heat in a large Filesystem. The report should be taken every month. mmchconfig fileHeatLossPercent=10,fileHeatPeriodMinutes=30240 -i fileHeatPeriodMinutes=30240 equals to 21 days. I#m wondering about the behavior of fileHeatLossPercent. - If it is set to 10, will file_heat decrease from 1 to 0 in 10 steps? - Or does file_heat have an asymptotic behavior, and heat 0 will never be reached? Anyways the results will be similar ;-) latter taking longer. We want to achieve following file lists: - File_Heat > 50% -> rather hot data - File_Heat 50% < x < 20 -> lukewarm data - File_Heat 20% <= x <= 0% -> ice cold data We will have to work on the limits between the File_Heat classes, depending on customers wishes. Are there better parameter settings for achieving this? Do any scripts/programs exist for analyzing the file_heat data? We have observed when taking policy runs on a large GPFS file system, the meta data performance significantly dropped, until job was finished. It took about 15 minutes on a 880 TB with 150 Mio entries GPFS file system. How is the behavior, when file_heat is being switched on? Do all files in the GPFS have the same temperature? 
Thanks for your help Ciao Andreas -- Andreas Landh?u?er +49 151 12133027 (mobile) alandhae at gmx.de _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss -------------- next part -------------- An HTML attachment was scrubbed... URL: From volobuev at us.ibm.com Mon Sep 26 19:18:15 2016 From: volobuev at us.ibm.com (Yuri L Volobuev) Date: Mon, 26 Sep 2016 11:18:15 -0700 Subject: [gpfsug-discuss] Blocksize In-Reply-To: References: <4890d1b0-2f12-f8b6-684d-c98ca2b71ab7@docum.org><17781503-26B3-4448-B7B9-1EE27ABE6D1F@ulmer.org> Message-ID: It's important to understand the differences between different metadata types, in particular where it comes to space allocation. System metadata files (inode file, inode and block allocation maps, ACL file, fileset metadata file, EA file in older versions) are allocated at well-defined moments (file system format, new storage pool creation in the case of block allocation map, etc), and those contain multiple records packed into a single block. From the block allocator point of view, the individual metadata record size is invisible, only larger blocks get actually allocated, and space usage efficiency generally isn't an issue. For user metadata (indirect blocks, directory blocks, EA overflow blocks) the situation is different. Those get allocated as the need arises, generally one at a time. So the size of an individual metadata structure matters, a lot. The smallest unit of allocation in GPFS is a subblock (1/32nd of a block). If an IB or a directory block is smaller than a subblock, the unused space in the subblock is wasted. So if one chooses to use, say, 16 MiB block size for metadata, the smallest unit of space that can be allocated is 512 KiB. If one chooses 1 MiB block size, the smallest allocation unit is 32 KiB. IBs are generally 16 KiB or 32 KiB in size (32 KiB with any reasonable data block size); directory blocks used to be limited to 32 KiB, but in the current code can be as large as 256 KiB. As one can observe, using 16 MiB metadata block size would lead to a considerable amount of wasted space for IBs and large directories (small directories can live in inodes). On the other hand, with 1 MiB block size, there'll be no wasted metadata space. Does any of this actually make a practical difference? That depends on the file system composition, namely the number of IBs (which is a function of the number of large files) and larger directories. Calculating this scientifically can be pretty involved, and really should be the job of a tool that ought to exist, but doesn't (yet). A more practical approach is doing a ballpark estimate using local file counts and typical fractions of large files and directories, using statistics available from published papers. The performance implications of a given metadata block size choice is a subject of nearly infinite depth, and this question ultimately can only be answered by doing experiments with a specific workload on specific hardware. The metadata space utilization efficiency is something that can be answered conclusively though. 
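To put rough numbers on the space side (illustrative figures, not measurements from any system in this thread): with a 16 MiB metadata block size the smallest allocation is 16 MiB / 32 = 512 KiB, so each 32 KiB indirect block strands 480 KiB, about 94% of its subblock; with a 1 MiB block size the same 32 KiB indirect block fills its 32 KiB subblock exactly. Ten million such indirect blocks would then waste roughly 10,000,000 x 480 KiB, or about 4.5 TiB, in the first case and nothing in the second.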
yuri From: "Buterbaugh, Kevin L" To: gpfsug main discussion list , Date: 09/24/2016 07:19 AM Subject: Re: [gpfsug-discuss] Blocksize Sent by: gpfsug-discuss-bounces at spectrumscale.org Hi Sven, I am confused by your statement that the metadata block size should be 1 MB and am very interested in learning the rationale behind this as I am currently looking at all aspects of our current GPFS configuration and the possibility of making major changes. If you have a filesystem with only metadataOnly disks in the system pool and the default size of an inode is 4K (which we would do, since we have recently discovered that even on our scratch filesystem we have a bazillion files that are 4K or smaller and could therefore have their data stored in the inode, right?), then why would you set the metadata block size to anything larger than 128K when a sub-block is 1/32nd of a block? I.e., with a 1 MB block size for metadata wouldn?t you be wasting a massive amount of space? What am I missing / confused about there? Oh, and here?s a related question ? let?s just say I have the above configuration ? my system pool is metadata only and is on SSD?s. Then I have two other dataOnly pools that are spinning disk. One is for ?regular? access and the other is the ?capacity? pool ? i.e. a pool of slower storage where we move files with large access times. I have a policy that says something like ?move all files with an access time > 6 months to the capacity pool.? Of those bazillion files less than 4K in size that are fitting in the inode currently, probably half a bazillion () of them would be subject to that rule. Will they get moved to the spinning disk capacity pool or will they stay in the inode?? Thanks! This is a very timely and interesting discussion for me as well... Kevin On Sep 23, 2016, at 4:35 PM, Sven Oehme wrote: your metadata block size these days should be 1 MB and there are only very few workloads for which you should run with a filesystem blocksize below 1 MB. so if you don't know exactly what to pick, 1 MB is a good starting point. the general rule still applies that your filesystem blocksize (metadata or data pool) should match your raid controller (or GNR vdisk) stripe size of the particular pool. so if you use a 128k strip size(defaut in many midrange storage controllers) in a 8+2p raid array, your stripe or track size is 1 MB and therefore the blocksize of this pool should be 1 MB. i see many customers in the field using 1MB or even smaller blocksize on RAID stripes of 2 MB or above and your performance will be significant impacted by that. Sven ------------------------------------------ Sven Oehme Scalable Storage Research email: oehmes at us.ibm.com Phone: +1 (408) 824-8904 IBM Almaden Research Lab ------------------------------------------ Stephen Ulmer ---09/23/2016 12:16:34 PM---Not to be too pedantic, but I believe the the subblock size is 1/32 of the block size (which strengt From: Stephen Ulmer To: gpfsug main discussion list Date: 09/23/2016 12:16 PM Subject: Re: [gpfsug-discuss] Blocksize Sent by: gpfsug-discuss-bounces at spectrumscale.org Not to be too pedantic, but I believe the the subblock size is 1/32 of the block size (which strengthens Luis?s arguments below). I thought the the original question was NOT about inode size, but about metadata block size. You can specify that the system pool have a different block size from the rest of the filesystem, providing that it ONLY holds metadata (?metadata-block-size option to mmcrfs). 
So with 4K inodes (which should be used for all new filesystems without some counter-indication), I would think that we?d want to use a metadata block size of 4K*32=128K. This is independent of the regular block size, which you can calculate based on the workload if you?re lucky. There could be a great reason NOT to use 128K metadata block size, but I don?t know what it is. I?d be happy to be corrected about this if it?s out of whack. -- Stephen On Sep 22, 2016, at 3:37 PM, Luis Bolinches < luis.bolinches at fi.ibm.com> wrote: Hi My 2 cents. Leave at least 4K inodes, then you get massive improvement on small files (less 3.5K minus whatever you use on xattr) About blocksize for data, unless you have actual data that suggest that you will actually benefit from smaller than 1MB block, leave there. GPFS uses sublocks where 1/16th of the BS can be allocated to different files, so the "waste" is much less than you think on 1MB and you get the throughput and less structures of much more data blocks. No warranty at all but I try to do this when the BS talk comes in: (might need some clean up it could not be last note but you get the idea) POSIX find . -type f -name '*' -exec ls -l {} \; > find_ls_files.out GPFS cd /usr/lpp/mmfs/samples/ilm gcc mmfindUtil_processOutputFile.c -o mmfindUtil_processOutputFile ./mmfind /gpfs/shared -ls -type f > find_ls_files.out CONVERT to CSV POSIX cat find_ls_files.out | awk '{print $5","}' > find_ls_files.out.csv GPFS cat find_ls_files.out | awk '{print $7","}' > find_ls_files.out.csv LOAD in octave FILESIZE = int32 (dlmread ("find_ls_files.out.csv", ",")); Clean the second column (OPTIONAL as the next clean up will do the same) FILESIZE(:,[2]) = []; If we are on 4K aligment we need to clean the files that go to inodes (WELL not exactly ... extended attributes! so maybe use a lower number!) FILESIZE(FILESIZE<=3584) =[]; If we are not we need to clean the 0 size files FILESIZE(FILESIZE==0) =[]; Median FILESIZEMEDIAN = int32 (median (FILESIZE)) Mean FILESIZEMEAN = int32 (mean (FILESIZE)) Variance int32 (var (FILESIZE)) iqr interquartile range, i.e., the difference between the upper and lower quartile, of the input data. int32 (iqr (FILESIZE)) Standard deviation For some FS with lots of files you might need a rather powerful machine to run the calculations on octave, I never hit anything could not manage on a 64GB RAM Power box. Most of the times it is enough with my laptop. -- Yst?v?llisin terveisin / Kind regards / Saludos cordiales / Salutations Luis Bolinches Lab Services http://www-03.ibm.com/systems/services/labservices/ IBM Laajalahdentie 23 (main Entrance) Helsinki, 00330 Finland Phone: +358 503112585 "If you continually give you will continually have." Anonymous ----- Original message ----- From: Stef Coene Sent by: gpfsug-discuss-bounces at spectrumscale.org To: gpfsug main discussion list < gpfsug-discuss at spectrumscale.org> Cc: Subject: Re: [gpfsug-discuss] Blocksize Date: Thu, Sep 22, 2016 10:30 PM On 09/22/2016 09:07 PM, J. Eric Wonderley wrote: > It defaults to 4k: > mmlsfs testbs8M -i > flag value description > ------------------- ------------------------ > ----------------------------------- > -i 4096 Inode size in bytes > > I think you can make as small as 512b. Gpfs will store very small > files in the inode. > > Typically you want your average file size to be your blocksize and your > filesystem has one blocksize and one inodesize. The files are not small, but around 20 MB on average. 
So I calculated with IBM that a 1 MB or 2 MB block size is best. But I'm not sure if it's better to use a smaller block size for the metadata. The file system is not that large (400 TB) and will hold backup data from CommVault. Stef _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss Ellei edell? ole toisin mainittu: / Unless stated otherwise above: Oy IBM Finland Ab PL 265, 00101 Helsinki, Finland Business ID, Y-tunnus: 0195876-3 Registered in Finland _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: graycol.gif Type: image/gif Size: 105 bytes Desc: not available URL: From ulmer at ulmer.org Mon Sep 26 20:01:56 2016 From: ulmer at ulmer.org (Stephen Ulmer) Date: Mon, 26 Sep 2016 15:01:56 -0400 Subject: [gpfsug-discuss] Blocksize In-Reply-To: References: <4890d1b0-2f12-f8b6-684d-c98ca2b71ab7@docum.org> <17781503-26B3-4448-B7B9-1EE27ABE6D1F@ulmer.org> Message-ID: <13272A1B-425D-4DDD-A931-490604F92D61@ulmer.org> Now I?ve got anther question? which I?ll let bake for a while. Okay, to (poorly) summarize: There are items OTHER THAN INODES stored as metadata in GPFS. These items have a VARIETY OF SIZES, but are packed in such a way that we should just not worry about wasted space unless we pick a LARGE metadata block size ? or if we don?t pick a ?reasonable? metadata block size after picking a ?large? file system block size that applies to both. Performance is hard, and the gain from calculating exactly the best metadata block size is much smaller than performance gains attained through code optimization. If we were to try and calculate the appropriate metadata block size we would likely be wrong anyway, since none of us get our data at the idealized physics shop that sells massless rulers and frictionless pulleys. We should probably all use a metadata block size around 1MB. Nobody has said this outright, but it?s been the example as the ?good? size at least three times in this thread. Under no circumstances should we do what many of us would have done and pick 128K, which made sense based on all of our previous education that is no longer applicable. Did I miss anything? :) Liberty, -- Stephen > On Sep 26, 2016, at 2:18 PM, Yuri L Volobuev wrote: > > It's important to understand the differences between different metadata types, in particular where it comes to space allocation. > > System metadata files (inode file, inode and block allocation maps, ACL file, fileset metadata file, EA file in older versions) are allocated at well-defined moments (file system format, new storage pool creation in the case of block allocation map, etc), and those contain multiple records packed into a single block. 
From the block allocator point of view, the individual metadata record size is invisible, only larger blocks get actually allocated, and space usage efficiency generally isn't an issue. > > For user metadata (indirect blocks, directory blocks, EA overflow blocks) the situation is different. Those get allocated as the need arises, generally one at a time. So the size of an individual metadata structure matters, a lot. The smallest unit of allocation in GPFS is a subblock (1/32nd of a block). If an IB or a directory block is smaller than a subblock, the unused space in the subblock is wasted. So if one chooses to use, say, 16 MiB block size for metadata, the smallest unit of space that can be allocated is 512 KiB. If one chooses 1 MiB block size, the smallest allocation unit is 32 KiB. IBs are generally 16 KiB or 32 KiB in size (32 KiB with any reasonable data block size); directory blocks used to be limited to 32 KiB, but in the current code can be as large as 256 KiB. As one can observe, using 16 MiB metadata block size would lead to a considerable amount of wasted space for IBs and large directories (small directories can live in inodes). On the other hand, with 1 MiB block size, there'll be no wasted metadata space. Does any of this actually make a practical difference? That depends on the file system composition, namely the number of IBs (which is a function of the number of large files) and larger directories. Calculating this scientifically can be pretty involved, and really should be the job of a tool that ought to exist, but doesn't (yet). A more practical approach is doing a ballpark estimate using local file counts and typical fractions of large files and directories, using statistics available from published papers. > > The performance implications of a given metadata block size choice is a subject of nearly infinite depth, and this question ultimately can only be answered by doing experiments with a specific workload on specific hardware. The metadata space utilization efficiency is something that can be answered conclusively though. > > yuri > > "Buterbaugh, Kevin L" ---09/24/2016 07:19:09 AM---Hi Sven, I am confused by your statement that the metadata block size should be 1 MB and am very int > > From: "Buterbaugh, Kevin L" > To: gpfsug main discussion list , > Date: 09/24/2016 07:19 AM > Subject: Re: [gpfsug-discuss] Blocksize > Sent by: gpfsug-discuss-bounces at spectrumscale.org > > > > > Hi Sven, > > I am confused by your statement that the metadata block size should be 1 MB and am very interested in learning the rationale behind this as I am currently looking at all aspects of our current GPFS configuration and the possibility of making major changes. > > If you have a filesystem with only metadataOnly disks in the system pool and the default size of an inode is 4K (which we would do, since we have recently discovered that even on our scratch filesystem we have a bazillion files that are 4K or smaller and could therefore have their data stored in the inode, right?), then why would you set the metadata block size to anything larger than 128K when a sub-block is 1/32nd of a block? I.e., with a 1 MB block size for metadata wouldn?t you be wasting a massive amount of space? > > What am I missing / confused about there? > > Oh, and here?s a related question ? let?s just say I have the above configuration ? my system pool is metadata only and is on SSD?s. Then I have two other dataOnly pools that are spinning disk. One is for ?regular? access and the other is the ?capacity? 
pool ? i.e. a pool of slower storage where we move files with large access times. I have a policy that says something like ?move all files with an access time > 6 months to the capacity pool.? Of those bazillion files less than 4K in size that are fitting in the inode currently, probably half a bazillion () of them would be subject to that rule. Will they get moved to the spinning disk capacity pool or will they stay in the inode?? > > Thanks! This is a very timely and interesting discussion for me as well... > > Kevin > On Sep 23, 2016, at 4:35 PM, Sven Oehme > wrote: > your metadata block size these days should be 1 MB and there are only very few workloads for which you should run with a filesystem blocksize below 1 MB. so if you don't know exactly what to pick, 1 MB is a good starting point. > the general rule still applies that your filesystem blocksize (metadata or data pool) should match your raid controller (or GNR vdisk) stripe size of the particular pool. > > so if you use a 128k strip size(defaut in many midrange storage controllers) in a 8+2p raid array, your stripe or track size is 1 MB and therefore the blocksize of this pool should be 1 MB. i see many customers in the field using 1MB or even smaller blocksize on RAID stripes of 2 MB or above and your performance will be significant impacted by that. > > Sven > > ------------------------------------------ > Sven Oehme > Scalable Storage Research > email: oehmes at us.ibm.com > Phone: +1 (408) 824-8904 > IBM Almaden Research Lab > ------------------------------------------ > > Stephen Ulmer ---09/23/2016 12:16:34 PM---Not to be too pedantic, but I believe the the subblock size is 1/32 of the block size (which strengt > > From: Stephen Ulmer > > To: gpfsug main discussion list > > Date: 09/23/2016 12:16 PM > Subject: Re: [gpfsug-discuss] Blocksize > Sent by: gpfsug-discuss-bounces at spectrumscale.org > > > > > Not to be too pedantic, but I believe the the subblock size is 1/32 of the block size (which strengthens Luis?s arguments below). > > I thought the the original question was NOT about inode size, but about metadata block size. You can specify that the system pool have a different block size from the rest of the filesystem, providing that it ONLY holds metadata (?metadata-block-size option to mmcrfs). > > So with 4K inodes (which should be used for all new filesystems without some counter-indication), I would think that we?d want to use a metadata block size of 4K*32=128K. This is independent of the regular block size, which you can calculate based on the workload if you?re lucky. > > There could be a great reason NOT to use 128K metadata block size, but I don?t know what it is. I?d be happy to be corrected about this if it?s out of whack. > > -- > Stephen > > On Sep 22, 2016, at 3:37 PM, Luis Bolinches > wrote: > > Hi > > My 2 cents. > > Leave at least 4K inodes, then you get massive improvement on small files (less 3.5K minus whatever you use on xattr) > > About blocksize for data, unless you have actual data that suggest that you will actually benefit from smaller than 1MB block, leave there. GPFS uses sublocks where 1/16th of the BS can be allocated to different files, so the "waste" is much less than you think on 1MB and you get the throughput and less structures of much more data blocks. > > No warranty at all but I try to do this when the BS talk comes in: (might need some clean up it could not be last note but you get the idea) > > POSIX > find . 
-type f -name '*' -exec ls -l {} \; > find_ls_files.out > GPFS > cd /usr/lpp/mmfs/samples/ilm > gcc mmfindUtil_processOutputFile.c -o mmfindUtil_processOutputFile > ./mmfind /gpfs/shared -ls -type f > find_ls_files.out > CONVERT to CSV > > POSIX > cat find_ls_files.out | awk '{print $5","}' > find_ls_files.out.csv > GPFS > cat find_ls_files.out | awk '{print $7","}' > find_ls_files.out.csv > LOAD in octave > > FILESIZE = int32 (dlmread ("find_ls_files.out.csv", ",")); > Clean the second column (OPTIONAL as the next clean up will do the same) > > FILESIZE(:,[2]) = []; > If we are on 4K aligment we need to clean the files that go to inodes (WELL not exactly ... extended attributes! so maybe use a lower number!) > > FILESIZE(FILESIZE<=3584) =[]; > If we are not we need to clean the 0 size files > > FILESIZE(FILESIZE==0) =[]; > Median > > FILESIZEMEDIAN = int32 (median (FILESIZE)) > Mean > > FILESIZEMEAN = int32 (mean (FILESIZE)) > Variance > > int32 (var (FILESIZE)) > iqr interquartile range, i.e., the difference between the upper and lower quartile, of the input data. > > int32 (iqr (FILESIZE)) > Standard deviation > > > For some FS with lots of files you might need a rather powerful machine to run the calculations on octave, I never hit anything could not manage on a 64GB RAM Power box. Most of the times it is enough with my laptop. > > > > -- > Yst?v?llisin terveisin / Kind regards / Saludos cordiales / Salutations > > Luis Bolinches > Lab Services > http://www-03.ibm.com/systems/services/labservices/ > > IBM Laajalahdentie 23 (main Entrance) Helsinki, 00330 Finland > Phone: +358 503112585 > > "If you continually give you will continually have." Anonymous > > > ----- Original message ----- > From: Stef Coene > > Sent by: gpfsug-discuss-bounces at spectrumscale.org > To: gpfsug main discussion list > > Cc: > Subject: Re: [gpfsug-discuss] Blocksize > Date: Thu, Sep 22, 2016 10:30 PM > > On 09/22/2016 09:07 PM, J. Eric Wonderley wrote: > > It defaults to 4k: > > mmlsfs testbs8M -i > > flag value description > > ------------------- ------------------------ > > ----------------------------------- > > -i 4096 Inode size in bytes > > > > I think you can make as small as 512b. Gpfs will store very small > > files in the inode. > > > > Typically you want your average file size to be your blocksize and your > > filesystem has one blocksize and one inodesize. > > The files are not small, but around 20 MB on average. > So I calculated with IBM that a 1 MB or 2 MB block size is best. > > But I'm not sure if it's better to use a smaller block size for the > metadata. > > The file system is not that large (400 TB) and will hold backup data > from CommVault. > > > Stef > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > > > Ellei edell? 
ole toisin mainittu: / Unless stated otherwise above: > Oy IBM Finland Ab > PL 265, 00101 Helsinki, Finland > Business ID, Y-tunnus: 0195876-3 > Registered in Finland > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss -------------- next part -------------- An HTML attachment was scrubbed... URL: From volobuev at us.ibm.com Mon Sep 26 20:29:18 2016 From: volobuev at us.ibm.com (Yuri L Volobuev) Date: Mon, 26 Sep 2016 12:29:18 -0700 Subject: [gpfsug-discuss] Blocksize In-Reply-To: <13272A1B-425D-4DDD-A931-490604F92D61@ulmer.org> References: <4890d1b0-2f12-f8b6-684d-c98ca2b71ab7@docum.org><17781503-26B3-4448-B7B9-1EE27ABE6D1F@ulmer.org> <13272A1B-425D-4DDD-A931-490604F92D61@ulmer.org> Message-ID: I would put the net summary this way: in GPFS, the "Goldilocks zone" for metadata block size is 256K - 1M. If one plans to create a new file system using GPFS V4.2+, 1M is a sound choice. In an ideal world, block size choice shouldn't really be a choice. It's a low-level implementation detail that one day should go the way of the manual ignition timing adjustment -- something that used to be necessary in the olden days, and something that select enthusiasts like to tweak to this day, but something that's irrelevant for the overwhelming majority of the folks who just want the engine to run. There's work being done in that general direction in GPFS, but we aren't there yet. yuri From: Stephen Ulmer To: gpfsug main discussion list , Date: 09/26/2016 12:02 PM Subject: Re: [gpfsug-discuss] Blocksize Sent by: gpfsug-discuss-bounces at spectrumscale.org Now I?ve got anther question? which I?ll let bake for a while. Okay, to (poorly) summarize: There are items OTHER THAN INODES stored as metadata in GPFS. These items have a VARIETY OF SIZES, but are packed in such a way that we should just not worry about wasted space unless we pick a LARGE metadata block size ? or if we don?t pick a ?reasonable? metadata block size after picking a ?large? file system block size that applies to both. Performance is hard, and the gain from calculating exactly the best metadata block size is much smaller than performance gains attained through code optimization. If we were to try and calculate the appropriate metadata block size we would likely be wrong anyway, since none of us get our data at the idealized physics shop that sells massless rulers and frictionless pulleys. We should probably all use a metadata block size around 1MB. Nobody has said this outright, but it?s been the example as the ?good? size at least three times in this thread. Under no circumstances should we do what many of us would have done and pick 128K, which made sense based on all of our previous education that is no longer applicable. Did I miss anything? 
:) Liberty, -- Stephen On Sep 26, 2016, at 2:18 PM, Yuri L Volobuev wrote: It's important to understand the differences between different metadata types, in particular where it comes to space allocation. System metadata files (inode file, inode and block allocation maps, ACL file, fileset metadata file, EA file in older versions) are allocated at well-defined moments (file system format, new storage pool creation in the case of block allocation map, etc), and those contain multiple records packed into a single block. From the block allocator point of view, the individual metadata record size is invisible, only larger blocks get actually allocated, and space usage efficiency generally isn't an issue. For user metadata (indirect blocks, directory blocks, EA overflow blocks) the situation is different. Those get allocated as the need arises, generally one at a time. So the size of an individual metadata structure matters, a lot. The smallest unit of allocation in GPFS is a subblock (1/32nd of a block). If an IB or a directory block is smaller than a subblock, the unused space in the subblock is wasted. So if one chooses to use, say, 16 MiB block size for metadata, the smallest unit of space that can be allocated is 512 KiB. If one chooses 1 MiB block size, the smallest allocation unit is 32 KiB. IBs are generally 16 KiB or 32 KiB in size (32 KiB with any reasonable data block size); directory blocks used to be limited to 32 KiB, but in the current code can be as large as 256 KiB. As one can observe, using 16 MiB metadata block size would lead to a considerable amount of wasted space for IBs and large directories (small directories can live in inodes). On the other hand, with 1 MiB block size, there'll be no wasted metadata space. Does any of this actually make a practical difference? That depends on the file system composition, namely the number of IBs (which is a function of the number of large files) and larger directories. Calculating this scientifically can be pretty involved, and really should be the job of a tool that ought to exist, but doesn't (yet). A more practical approach is doing a ballpark estimate using local file counts and typical fractions of large files and directories, using statistics available from published papers. The performance implications of a given metadata block size choice is a subject of nearly infinite depth, and this question ultimately can only be answered by doing experiments with a specific workload on specific hardware. The metadata space utilization efficiency is something that can be answered conclusively though. yuri "Buterbaugh, Kevin L" ---09/24/2016 07:19:09 AM---Hi Sven, I am confused by your statement that the metadata block size should be 1 MB and am very int From: "Buterbaugh, Kevin L" To: gpfsug main discussion list , Date: 09/24/2016 07:19 AM Subject: Re: [gpfsug-discuss] Blocksize Sent by: gpfsug-discuss-bounces at spectrumscale.org Hi Sven, I am confused by your statement that the metadata block size should be 1 MB and am very interested in learning the rationale behind this as I am currently looking at all aspects of our current GPFS configuration and the possibility of making major changes. 
If you have a filesystem with only metadataOnly disks in the system pool and the default size of an inode is 4K (which we would do, since we have recently discovered that even on our scratch filesystem we have a bazillion files that are 4K or smaller and could therefore have their data stored in the inode, right?), then why would you set the metadata block size to anything larger than 128K when a sub-block is 1/32nd of a block? I.e., with a 1 MB block size for metadata wouldn?t you be wasting a massive amount of space? What am I missing / confused about there? Oh, and here?s a related question ? let?s just say I have the above configuration ? my system pool is metadata only and is on SSD?s. Then I have two other dataOnly pools that are spinning disk. One is for ?regular? access and the other is the ?capacity? pool ? i.e. a pool of slower storage where we move files with large access times. I have a policy that says something like ?move all files with an access time > 6 months to the capacity pool.? Of those bazillion files less than 4K in size that are fitting in the inode currently, probably half a bazillion () of them would be subject to that rule. Will they get moved to the spinning disk capacity pool or will they stay in the inode?? Thanks! This is a very timely and interesting discussion for me as well... Kevin On Sep 23, 2016, at 4:35 PM, Sven Oehme < oehmes at us.ibm.com> wrote: your metadata block size these days should be 1 MB and there are only very few workloads for which you should run with a filesystem blocksize below 1 MB. so if you don't know exactly what to pick, 1 MB is a good starting point. the general rule still applies that your filesystem blocksize (metadata or data pool) should match your raid controller (or GNR vdisk) stripe size of the particular pool. so if you use a 128k strip size(defaut in many midrange storage controllers) in a 8+2p raid array, your stripe or track size is 1 MB and therefore the blocksize of this pool should be 1 MB. i see many customers in the field using 1MB or even smaller blocksize on RAID stripes of 2 MB or above and your performance will be significant impacted by that. Sven ------------------------------------------ Sven Oehme Scalable Storage Research email: oehmes at us.ibm.com Phone: +1 (408) 824-8904 IBM Almaden Research Lab ------------------------------------------ Stephen Ulmer ---09/23/2016 12:16:34 PM---Not to be too pedantic, but I believe the the subblock size is 1/32 of the block size (which strengt From: Stephen Ulmer To: gpfsug main discussion list < gpfsug-discuss at spectrumscale.org> Date: 09/23/2016 12:16 PM Subject: Re: [gpfsug-discuss] Blocksize Sent by: gpfsug-discuss-bounces at spectrumscale.org Not to be too pedantic, but I believe the the subblock size is 1/32 of the block size (which strengthens Luis?s arguments below). I thought the the original question was NOT about inode size, but about metadata block size. You can specify that the system pool have a different block size from the rest of the filesystem, providing that it ONLY holds metadata (?metadata-block-size option to mmcrfs). So with 4K inodes (which should be used for all new filesystems without some counter-indication), I would think that we?d want to use a metadata block size of 4K*32=128K. This is independent of the regular block size, which you can calculate based on the workload if you?re lucky. There could be a great reason NOT to use 128K metadata block size, but I don?t know what it is. 
I?d be happy to be corrected about this if it?s out of whack. -- Stephen On Sep 22, 2016, at 3:37 PM, Luis Bolinches < luis.bolinches at fi.ibm.com> wrote: Hi My 2 cents. Leave at least 4K inodes, then you get massive improvement on small files (less 3.5K minus whatever you use on xattr) About blocksize for data, unless you have actual data that suggest that you will actually benefit from smaller than 1MB block, leave there. GPFS uses sublocks where 1/16th of the BS can be allocated to different files, so the "waste" is much less than you think on 1MB and you get the throughput and less structures of much more data blocks. No warranty at all but I try to do this when the BS talk comes in: (might need some clean up it could not be last note but you get the idea) POSIX find . -type f -name '*' -exec ls -l {} \; > find_ls_files.out GPFS cd /usr/lpp/mmfs/samples/ilm gcc mmfindUtil_processOutputFile.c -o mmfindUtil_processOutputFile ./mmfind /gpfs/shared -ls -type f > find_ls_files.out CONVERT to CSV POSIX cat find_ls_files.out | awk '{print $5","}' > find_ls_files.out.csv GPFS cat find_ls_files.out | awk '{print $7","}' > find_ls_files.out.csv LOAD in octave FILESIZE = int32 (dlmread ("find_ls_files.out.csv", ",")); Clean the second column (OPTIONAL as the next clean up will do the same) FILESIZE(:,[2]) = []; If we are on 4K aligment we need to clean the files that go to inodes (WELL not exactly ... extended attributes! so maybe use a lower number!) FILESIZE(FILESIZE<=3584) =[]; If we are not we need to clean the 0 size files FILESIZE(FILESIZE==0) =[]; Median FILESIZEMEDIAN = int32 (median (FILESIZE)) Mean FILESIZEMEAN = int32 (mean (FILESIZE)) Variance int32 (var (FILESIZE)) iqr interquartile range, i.e., the difference between the upper and lower quartile, of the input data. int32 (iqr (FILESIZE)) Standard deviation For some FS with lots of files you might need a rather powerful machine to run the calculations on octave, I never hit anything could not manage on a 64GB RAM Power box. Most of the times it is enough with my laptop. -- Yst?v?llisin terveisin / Kind regards / Saludos cordiales / Salutations Luis Bolinches Lab Services http://www-03.ibm.com/systems/services/labservices/ IBM Laajalahdentie 23 (main Entrance) Helsinki, 00330 Finland Phone: +358 503112585 "If you continually give you will continually have." Anonymous ----- Original message ----- From: Stef Coene < stef.coene at docum.org> Sent by: gpfsug-discuss-bounces at spectrumscale.org To: gpfsug main discussion list < gpfsug-discuss at spectrumscale.org> Cc: Subject: Re: [gpfsug-discuss] Blocksize Date: Thu, Sep 22, 2016 10:30 PM On 09/22/2016 09:07 PM, J. Eric Wonderley wrote: > It defaults to 4k: > mmlsfs testbs8M -i > flag value description > ------------------- ------------------------ > ----------------------------------- > -i 4096 Inode size in bytes > > I think you can make as small as 512b. Gpfs will store very small > files in the inode. > > Typically you want your average file size to be your blocksize and your > filesystem has one blocksize and one inodesize. The files are not small, but around 20 MB on average. So I calculated with IBM that a 1 MB or 2 MB block size is best. But I'm not sure if it's better to use a smaller block size for the metadata. The file system is not that large (400 TB) and will hold backup data from CommVault. Stef _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss Ellei edell? 
ole toisin mainittu: / Unless stated otherwise above: Oy IBM Finland Ab PL 265, 00101 Helsinki, Finland Business ID, Y-tunnus: 0195876-3 Registered in Finland _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss From alandhae at gmx.de Tue Sep 27 10:04:02 2016 From: alandhae at gmx.de (Andreas Landhäußer) Date: Tue, 27 Sep 2016 11:04:02 +0200 (CEST) Subject: [gpfsug-discuss] File_heat for GPFS File Systems In-Reply-To: References: Message-ID: On Mon, 26 Sep 2016, Marc A Kaplan wrote: Marc, thanks for your explanation, > fileHeatLossPercent=10, fileHeatPeriodMinutes=1440 > > means any file that has not been accessed for 1440 minutes (24 hours = 1 day) will lose 10% of its Heat. > > So if its heat was X at noon today, it is 0.90 X tomorrow, 0.81 X the next day, and (0.90)**k * X on the k'th day. > After 63 fileHeatPeriods, we always round down and compute file heat as 0.0. > > The computation (in floating point, with some approximations) is done "on demand", based on the heat value stored in the inode, the time of the last unix access ("atime") and the current time. So the cost of maintaining FILE_HEAT for a file is some bit twiddling, but only when the file is accessed and the atime would be updated in the inode anyway. > > File heat increases by approximately 1.0 each time the entire file is read from disk. This is done proportionately, so if you read in half of the blocks the increase is 0.5. > If you read all the blocks twice FROM DISK the file heat is increased by 2. And so on. But only IOPs are charged. If you repeatedly do posix read()s but the data is in cache, no heat is added. With the above definition file heat >= 0.0, i.e. any non-negative floating point value is valid. I need to categorize the files into hot, warm, lukewarm and cold. How do I achieve this, since the maximum heat varies and would need to be redefined every time a report is requested?
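(A minimal sketch of such a report, patterned on the LIST rule from Marc that is quoted just below: the cut-off values here are hypothetical, since FILE_HEAT has no fixed upper bound, so the boundaries have to be tuned per site -- for instance after first listing files ordered by WEIGHT(FILE_HEAT) and inspecting the distribution.)

/* hypothetical heat boundaries -- adjust after inspecting a ranked listing */
RULE 'hot' LIST 'hot' SHOW('Heat=' || varchar(FILE_HEAT)) WHERE FILE_HEAT >= 5.0
RULE 'warm' LIST 'warm' SHOW('Heat=' || varchar(FILE_HEAT)) WHERE FILE_HEAT >= 1.0 AND FILE_HEAT < 5.0
RULE 'lukewarm' LIST 'lukewarm' SHOW('Heat=' || varchar(FILE_HEAT)) WHERE FILE_HEAT >= 0.1 AND FILE_HEAT < 1.0
RULE 'cold' LIST 'cold' SHOW('Heat=' || varchar(FILE_HEAT)) WHERE FILE_HEAT < 0.1

Run with something like "mmapplypolicy gpfs0 -P heatreport.pol -I test -L 2" (file system and policy file names are placeholders); -I test only reports the matches, it moves nothing.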
We are wishing to migrate data according to the heat onto different storage categories (expensive --> cheap devices) > The easiest way to observe FILE_HEAT is with the mmapplypolicy directory > -I test -L 2 -P fileheatrule.policy > > RULE 'fileheatrule' LIST 'hot' SHOW('Heat=' || varchar(FILE_HEAT)) /* in > file fileheatfule.policy */ > > Because policy reads metadata from inodes as stored on disk, when > experimenting/testing you may need to > > mmfsctl fs suspend-write; mmfsctl fs resume Doing this on a production file system, a valid change request need to be filed, and description of the risks for customers data and so on have to be defined (ITIL) ... Any help and ideas will be appreciated Andreas -- Andreas Landh?u?er +49 151 12133027 (mobile) alandhae at gmx.de From makaplan at us.ibm.com Tue Sep 27 15:25:04 2016 From: makaplan at us.ibm.com (Marc A Kaplan) Date: Tue, 27 Sep 2016 10:25:04 -0400 Subject: [gpfsug-discuss] File_heat for GPFS File Systems In-Reply-To: References: Message-ID: You asked ... "We are wishing to migrate data according to the heat onto different storage categories (expensive --> cheap devices)" We suggest a policy rule like this: Rule 'm' Migrate From Pool 'Expensive' To Pool 'Thrifty' Threshold(90,75) Weight(-FILE_HEAT) /* minus sign! */ Which you can interpret as: When The 'Expensive' pool is 90% or more full, Migrate the lowest heat (coldest!) files to pool 'Thrifty', until the occupancy of 'Expensive' has been reduced to 75%. The concepts of Threshold and Weight have been in the produce since the MIGRATE rule was introduced. Another concept we introduced at the same time as FILE_HEAT was GROUP POOL. We've had little feedback and very few questions about this, so either it works great or is not being used much. (Maybe both are true ;-) ) GROUP POOL migration is documented in the Information Lifecycle Management chapter along with the other elements of the policy rules. In the 4.2.1 doc we suggest you can "repack" several pools with one GROUP POOL rule and one MIGRATE rule like this: You can ?repack? a group pool by WEIGHT. Migrate files of higher weight to preferred disk pools by specifying a group pool as both the source and the target of a MIGRATE rule. rule ?grpdef? GROUP POOL ?gpool? IS ?ssd? LIMIT(90) THEN ?fast? LIMIT(85) THEN ?sata? rule ?repack? MIGRATE FROM POOL ?gpool? TO POOL ?gpool? WEIGHT(FILE_HEAT) This should rank all the files in the three pools from hottest to coldest, and migrate them as necessary (if feasible) so that 'ssd' is up to 90% full of the hottest, 'fast' is up to 85% full of the next most hot, and the coolest files will be migrated to 'sata'. -------------- next part -------------- An HTML attachment was scrubbed... URL: From Kevin.Buterbaugh at Vanderbilt.Edu Tue Sep 27 18:02:45 2016 From: Kevin.Buterbaugh at Vanderbilt.Edu (Buterbaugh, Kevin L) Date: Tue, 27 Sep 2016 17:02:45 +0000 Subject: [gpfsug-discuss] Blocksize In-Reply-To: References: <4890d1b0-2f12-f8b6-684d-c98ca2b71ab7@docum.org> <17781503-26B3-4448-B7B9-1EE27ABE6D1F@ulmer.org> <13272A1B-425D-4DDD-A931-490604F92D61@ulmer.org> Message-ID: <6C51B04A-9097-4598-8B4A-C484A3D98EE2@vanderbilt.edu> Yuri / Sven / anyone else who wants to jump in, First off, thank you very much for your answers. I?d like to follow up with a couple of more questions. 1) Let?s assume that our overarching goal in configuring the block size for metadata is performance from the user perspective ? i.e. how fast is an ?ls -l? on my directory? 
Space savings aren?t important, and how long policy scans or other ?administrative? type tasks take is not nearly as important as that directory listing. Does that change the recommended metadata block size? 2) Let?s assume we have 3 filesystems, /home, /scratch (traditional HPC use for those two) and /data (project space). Our storage arrays are 24-bay units with two 8+2P RAID 6 LUNs, one RAID 1 mirror, and two hot spare drives. The RAID 1 mirrors are for /home, the RAID 6 LUNs are for /scratch or /data. /home has tons of small files - so small that a 64K block size is currently used. /scratch and /data have a mixture, but a 1 MB block size is the ?sweet spot? there. If you could ?start all over? with the same hardware being the only restriction, would you: a) merge /scratch and /data into one filesystem but keep /home separate since the LUN sizes are so very different, or b) merge all three into one filesystem and use storage pools so that /home is just a separate pool within the one filesystem? And if you chose this option would you assign different block sizes to the pools? Again, I?m asking these questions because I may have the opportunity to effectively ?start all over? and want to make sure I?m doing things as optimally as possible. Thanks? Kevin On Sep 26, 2016, at 2:29 PM, Yuri L Volobuev > wrote: I would put the net summary this way: in GPFS, the "Goldilocks zone" for metadata block size is 256K - 1M. If one plans to create a new file system using GPFS V4.2+, 1M is a sound choice. In an ideal world, block size choice shouldn't really be a choice. It's a low-level implementation detail that one day should go the way of the manual ignition timing adjustment -- something that used to be necessary in the olden days, and something that select enthusiasts like to tweak to this day, but something that's irrelevant for the overwhelming majority of the folks who just want the engine to run. There's work being done in that general direction in GPFS, but we aren't there yet. yuri Stephen Ulmer ---09/26/2016 12:02:25 PM---Now I?ve got anther question? which I?ll let bake for a while. Okay, to (poorly) summarize: From: Stephen Ulmer > To: gpfsug main discussion list >, Date: 09/26/2016 12:02 PM Subject: Re: [gpfsug-discuss] Blocksize Sent by: gpfsug-discuss-bounces at spectrumscale.org ________________________________ Now I?ve got anther question? which I?ll let bake for a while. Okay, to (poorly) summarize: * There are items OTHER THAN INODES stored as metadata in GPFS. * These items have a VARIETY OF SIZES, but are packed in such a way that we should just not worry about wasted space unless we pick a LARGE metadata block size ? or if we don?t pick a ?reasonable? metadata block size after picking a ?large? file system block size that applies to both. * Performance is hard, and the gain from calculating exactly the best metadata block size is much smaller than performance gains attained through code optimization. * If we were to try and calculate the appropriate metadata block size we would likely be wrong anyway, since none of us get our data at the idealized physics shop that sells massless rulers and frictionless pulleys. * We should probably all use a metadata block size around 1MB. Nobody has said this outright, but it?s been the example as the ?good? size at least three times in this thread. * Under no circumstances should we do what many of us would have done and pick 128K, which made sense based on all of our previous education that is no longer applicable. Did I miss anything? 
:) Liberty, -- Stephen On Sep 26, 2016, at 2:18 PM, Yuri L Volobuev > wrote: It's important to understand the differences between different metadata types, in particular where it comes to space allocation. System metadata files (inode file, inode and block allocation maps, ACL file, fileset metadata file, EA file in older versions) are allocated at well-defined moments (file system format, new storage pool creation in the case of block allocation map, etc), and those contain multiple records packed into a single block. From the block allocator point of view, the individual metadata record size is invisible, only larger blocks get actually allocated, and space usage efficiency generally isn't an issue. For user metadata (indirect blocks, directory blocks, EA overflow blocks) the situation is different. Those get allocated as the need arises, generally one at a time. So the size of an individual metadata structure matters, a lot. The smallest unit of allocation in GPFS is a subblock (1/32nd of a block). If an IB or a directory block is smaller than a subblock, the unused space in the subblock is wasted. So if one chooses to use, say, 16 MiB block size for metadata, the smallest unit of space that can be allocated is 512 KiB. If one chooses 1 MiB block size, the smallest allocation unit is 32 KiB. IBs are generally 16 KiB or 32 KiB in size (32 KiB with any reasonable data block size); directory blocks used to be limited to 32 KiB, but in the current code can be as large as 256 KiB. As one can observe, using 16 MiB metadata block size would lead to a considerable amount of wasted space for IBs and large directories (small directories can live in inodes). On the other hand, with 1 MiB block size, there'll be no wasted metadata space. Does any of this actually make a practical difference? That depends on the file system composition, namely the number of IBs (which is a function of the number of large files) and larger directories. Calculating this scientifically can be pretty involved, and really should be the job of a tool that ought to exist, but doesn't (yet). A more practical approach is doing a ballpark estimate using local file counts and typical fractions of large files and directories, using statistics available from published papers. The performance implications of a given metadata block size choice is a subject of nearly infinite depth, and this question ultimately can only be answered by doing experiments with a specific workload on specific hardware. The metadata space utilization efficiency is something that can be answered conclusively though. yuri "Buterbaugh, Kevin L" ---09/24/2016 07:19:09 AM---Hi Sven, I am confused by your statement that the metadata block size should be 1 MB and am very int From: "Buterbaugh, Kevin L" > To: gpfsug main discussion list >, Date: 09/24/2016 07:19 AM Subject: Re: [gpfsug-discuss] Blocksize Sent by: gpfsug-discuss-bounces at spectrumscale.org ________________________________ Hi Sven, I am confused by your statement that the metadata block size should be 1 MB and am very interested in learning the rationale behind this as I am currently looking at all aspects of our current GPFS configuration and the possibility of making major changes. 
If you have a filesystem with only metadataOnly disks in the system pool and the default size of an inode is 4K (which we would do, since we have recently discovered that even on our scratch filesystem we have a bazillion files that are 4K or smaller and could therefore have their data stored in the inode, right?), then why would you set the metadata block size to anything larger than 128K when a sub-block is 1/32nd of a block? I.e., with a 1 MB block size for metadata wouldn?t you be wasting a massive amount of space? What am I missing / confused about there? Oh, and here?s a related question ? let?s just say I have the above configuration ? my system pool is metadata only and is on SSD?s. Then I have two other dataOnly pools that are spinning disk. One is for ?regular? access and the other is the ?capacity? pool ? i.e. a pool of slower storage where we move files with large access times. I have a policy that says something like ?move all files with an access time > 6 months to the capacity pool.? Of those bazillion files less than 4K in size that are fitting in the inode currently, probably half a bazillion () of them would be subject to that rule. Will they get moved to the spinning disk capacity pool or will they stay in the inode?? Thanks! This is a very timely and interesting discussion for me as well... Kevin On Sep 23, 2016, at 4:35 PM, Sven Oehme > wrote: your metadata block size these days should be 1 MB and there are only very few workloads for which you should run with a filesystem blocksize below 1 MB. so if you don't know exactly what to pick, 1 MB is a good starting point. the general rule still applies that your filesystem blocksize (metadata or data pool) should match your raid controller (or GNR vdisk) stripe size of the particular pool. so if you use a 128k strip size(defaut in many midrange storage controllers) in a 8+2p raid array, your stripe or track size is 1 MB and therefore the blocksize of this pool should be 1 MB. i see many customers in the field using 1MB or even smaller blocksize on RAID stripes of 2 MB or above and your performance will be significant impacted by that. Sven ------------------------------------------ Sven Oehme Scalable Storage Research email: oehmes at us.ibm.com Phone: +1 (408) 824-8904 IBM Almaden Research Lab ------------------------------------------ Stephen Ulmer ---09/23/2016 12:16:34 PM---Not to be too pedantic, but I believe the the subblock size is 1/32 of the block size (which strengt From: Stephen Ulmer > To: gpfsug main discussion list > Date: 09/23/2016 12:16 PM Subject: Re: [gpfsug-discuss] Blocksize Sent by: gpfsug-discuss-bounces at spectrumscale.org ________________________________ Not to be too pedantic, but I believe the the subblock size is 1/32 of the block size (which strengthens Luis?s arguments below). I thought the the original question was NOT about inode size, but about metadata block size. You can specify that the system pool have a different block size from the rest of the filesystem, providing that it ONLY holds metadata (?metadata-block-size option to mmcrfs). So with 4K inodes (which should be used for all new filesystems without some counter-indication), I would think that we?d want to use a metadata block size of 4K*32=128K. This is independent of the regular block size, which you can calculate based on the workload if you?re lucky. There could be a great reason NOT to use 128K metadata block size, but I don?t know what it is. I?d be happy to be corrected about this if it?s out of whack. 
-- Stephen On Sep 22, 2016, at 3:37 PM, Luis Bolinches > wrote: Hi My 2 cents. Leave at least 4K inodes, then you get massive improvement on small files (less 3.5K minus whatever you use on xattr) About blocksize for data, unless you have actual data that suggest that you will actually benefit from smaller than 1MB block, leave there. GPFS uses sublocks where 1/16th of the BS can be allocated to different files, so the "waste" is much less than you think on 1MB and you get the throughput and less structures of much more data blocks. No warranty at all but I try to do this when the BS talk comes in: (might need some clean up it could not be last note but you get the idea) POSIX find . -type f -name '*' -exec ls -l {} \; > find_ls_files.out GPFS cd /usr/lpp/mmfs/samples/ilm gcc mmfindUtil_processOutputFile.c -o mmfindUtil_processOutputFile ./mmfind /gpfs/shared -ls -type f > find_ls_files.out CONVERT to CSV POSIX cat find_ls_files.out | awk '{print $5","}' > find_ls_files.out.csv GPFS cat find_ls_files.out | awk '{print $7","}' > find_ls_files.out.csv LOAD in octave FILESIZE = int32 (dlmread ("find_ls_files.out.csv", ",")); Clean the second column (OPTIONAL as the next clean up will do the same) FILESIZE(:,[2]) = []; If we are on 4K aligment we need to clean the files that go to inodes (WELL not exactly ... extended attributes! so maybe use a lower number!) FILESIZE(FILESIZE<=3584) =[]; If we are not we need to clean the 0 size files FILESIZE(FILESIZE==0) =[]; Median FILESIZEMEDIAN = int32 (median (FILESIZE)) Mean FILESIZEMEAN = int32 (mean (FILESIZE)) Variance int32 (var (FILESIZE)) iqr interquartile range, i.e., the difference between the upper and lower quartile, of the input data. int32 (iqr (FILESIZE)) Standard deviation For some FS with lots of files you might need a rather powerful machine to run the calculations on octave, I never hit anything could not manage on a 64GB RAM Power box. Most of the times it is enough with my laptop. -- Yst?v?llisin terveisin / Kind regards / Saludos cordiales / Salutations Luis Bolinches Lab Services http://www-03.ibm.com/systems/services/labservices/ IBM Laajalahdentie 23 (main Entrance) Helsinki, 00330 Finland Phone: +358 503112585 "If you continually give you will continually have." Anonymous ----- Original message ----- From: Stef Coene > Sent by: gpfsug-discuss-bounces at spectrumscale.org To: gpfsug main discussion list > Cc: Subject: Re: [gpfsug-discuss] Blocksize Date: Thu, Sep 22, 2016 10:30 PM On 09/22/2016 09:07 PM, J. Eric Wonderley wrote: > It defaults to 4k: > mmlsfs testbs8M -i > flag value description > ------------------- ------------------------ > ----------------------------------- > -i 4096 Inode size in bytes > > I think you can make as small as 512b. Gpfs will store very small > files in the inode. > > Typically you want your average file size to be your blocksize and your > filesystem has one blocksize and one inodesize. The files are not small, but around 20 MB on average. So I calculated with IBM that a 1 MB or 2 MB block size is best. But I'm not sure if it's better to use a smaller block size for the metadata. The file system is not that large (400 TB) and will hold backup data from CommVault. Stef _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss Ellei edell? 
ole toisin mainittu: / Unless stated otherwise above: Oy IBM Finland Ab PL 265, 00101 Helsinki, Finland Business ID, Y-tunnus: 0195876-3 Registered in Finland _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss From makaplan at us.ibm.com Tue Sep 27 18:16:52 2016 From: makaplan at us.ibm.com (Marc A Kaplan) Date: Tue, 27 Sep 2016 13:16:52 -0400 Subject: [gpfsug-discuss] Blocksize, yea, inode size! In-Reply-To: <6C51B04A-9097-4598-8B4A-C484A3D98EE2@vanderbilt.edu> References: <4890d1b0-2f12-f8b6-684d-c98ca2b71ab7@docum.org> <17781503-26B3-4448-B7B9-1EE27ABE6D1F@ulmer.org> <13272A1B-425D-4DDD-A931-490604F92D61@ulmer.org> <6C51B04A-9097-4598-8B4A-C484A3D98EE2@vanderbilt.edu> Message-ID: Inode size will be a crucial choice in the scenario you describe. Consider the conflict: a large inode can hold a complete file or a complete directory. But the bigger the inode size, the fewer inodes fit in any given block -- so when you have to read several inodes you need more I/O, and it is less likely that the inodes you want are in the same block. From chekh at stanford.edu Tue Sep 27 18:23:34 2016 From: chekh at stanford.edu (Alex Chekholko) Date: Tue, 27 Sep 2016 10:23:34 -0700 Subject: [gpfsug-discuss] Blocksize In-Reply-To: <6C51B04A-9097-4598-8B4A-C484A3D98EE2@vanderbilt.edu> References: <4890d1b0-2f12-f8b6-684d-c98ca2b71ab7@docum.org> <17781503-26B3-4448-B7B9-1EE27ABE6D1F@ulmer.org> <13272A1B-425D-4DDD-A931-490604F92D61@ulmer.org> <6C51B04A-9097-4598-8B4A-C484A3D98EE2@vanderbilt.edu> Message-ID: On 09/27/2016 10:02 AM, Buterbaugh, Kevin L wrote: > 1) Let's assume that our overarching goal in configuring the block size > for metadata is performance from the user perspective -- i.e. how fast is > an "ls -l" on my directory? Space savings aren't important, and how > long policy scans or other "administrative" type tasks take is not > nearly as important as that directory listing. Does that change the > recommended metadata block size? You need to put your metadata on SSDs. Make your SSDs the only members in your 'system' pool and put your other devices into another pool, and make that pool 'dataOnly'.
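(A minimal stanza-file sketch of that layout -- the NSD names, server names and the data pool name below are placeholders, not anything from this thread:)

# SSD NSDs: metadata only, in the system pool
%nsd: nsd=ssd_md_01 servers=nsdsrv01,nsdsrv02 usage=metadataOnly failureGroup=1 pool=system
%nsd: nsd=ssd_md_02 servers=nsdsrv02,nsdsrv01 usage=metadataOnly failureGroup=2 pool=system
# spinning-disk NSDs: data only, in a separate pool
%nsd: nsd=hdd_data_01 servers=nsdsrv01,nsdsrv02 usage=dataOnly failureGroup=1 pool=data
%nsd: nsd=hdd_data_02 servers=nsdsrv02,nsdsrv01 usage=dataOnly failureGroup=2 pool=data

fed to something like "mmcrfs gpfs0 -F nsd.stanzas -B <data block size> --metadata-block-size <metadata block size> -i 4096", so the system pool ends up metadata-only on the SSDs (which is also what permits a separate --metadata-block-size) and everything else lands in the dataOnly pool.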
If your SSDs are large enough to also hold some data, that's great; I typically do a migration policy to copy files smaller than filesystem block size (or definitely smaller than sub-block size) to the SSDs. Also, files smaller than 4k will usually fit into the inode (if you are using the 4k inode size). I have a system where the SSDs are regularly doing 6-7k IOPS for metadata stuff. If those same 7k IOPS were spread out over the slow data LUNs... which only have like 100 IOPS per 8+2P LUN... I'd be consuming 700 disks just for metadata IOPS. -- Alex Chekholko chekh at stanford.edu From kevindjo at us.ibm.com Tue Sep 27 18:33:29 2016 From: kevindjo at us.ibm.com (Kevin D Johnson) Date: Tue, 27 Sep 2016 17:33:29 +0000 Subject: [gpfsug-discuss] Blocksize In-Reply-To: <6C51B04A-9097-4598-8B4A-C484A3D98EE2@vanderbilt.edu> Message-ID: An HTML attachment was scrubbed... URL: From alandhae at gmx.de Tue Sep 27 19:04:06 2016 From: alandhae at gmx.de (=?UTF-8?Q?Andreas_Landh=c3=a4u=c3=9fer?=) Date: Tue, 27 Sep 2016 20:04:06 +0200 Subject: [gpfsug-discuss] File_heat for GPFS File Systems In-Reply-To: References: Message-ID: as far as I understand, if a file gets hot again, there is no rule for putting the file back into a faster storage device? We would like having something like a storage elevator depending on the fileheat. In our setup, customer likes to migrate/move data even when the the threshold is not hit, just because it's cold and the price of the storage is less. On 27.09.2016 16:25, Marc A Kaplan wrote: > > You asked ... "We are wishing to migrate data according to the heat > onto different > storage categories (expensive --> cheap devices)" > > > We suggest a policy rule like this: > > Rule 'm' Migrate From Pool 'Expensive' To Pool 'Thrifty' > Threshold(90,75) Weight(-FILE_HEAT) /* minus sign! */ > > > Which you can interpret as: > > When The 'Expensive' pool is 90% or more full, Migrate the lowest heat > (coldest!) files to pool 'Thrifty', until > the occupancy of 'Expensive' has been reduced to 75%. > > The concepts of Threshold and Weight have been in the produce since > the MIGRATE rule was introduced. > > Another concept we introduced at the same time as FILE_HEAT was GROUP > POOL. We've had little feedback and very > few questions about this, so either it works great or is not being > used much. (Maybe both are true ;-) ) > > GROUP POOL migration is documented in the Information Lifecycle > Management chapter along with the other elements of the policy rules. > > In the 4.2.1 doc we suggest you can "repack" several pools with one > GROUP POOL rule and one MIGRATE rule like this: > > You can ?repack? a group pool by *WEIGHT*. Migrate files of higher > weight to preferred disk pools > by specifying a group pool as both the source and the target of a > *MIGRATE *rule. > > rule ?grpdef? GROUP POOL ?gpool? IS ?ssd? LIMIT(90) THEN ?fast? > LIMIT(85) THEN ?sata? > rule ?repack? MIGRATE FROM POOL ?gpool? TO POOL ?gpool? WEIGHT(FILE_HEAT) > > > This should rank all the files in the three pools from hottest to > coldest, and migrate them > as necessary (if feasible) so that 'ssd' is up to 90% full of the > hottest, 'fast' is up to 85% full of the next > most hot, and the coolest files will be migrated to 'sata'. > > > > -- Andreas Landh?u?er +49 151 12133027 (mobile) alandhae at gmx.de -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From Robert.Oesterlin at nuance.com Tue Sep 27 19:12:19 2016 From: Robert.Oesterlin at nuance.com (Oesterlin, Robert) Date: Tue, 27 Sep 2016 18:12:19 +0000 Subject: [gpfsug-discuss] File_heat for GPFS File Systems Message-ID: <0217AC60-11F0-4CEB-AE91-22D25E4649DC@nuance.com> Sure, if you use a policy to migrate between two tiers, it will move files up or down based on heat. Something like this (flas and disk pools): rule grpdef GROUP POOL gpool IS flash LIMIT(75) THEN Disk rule repack MIGRATE FROM POOL gpool TO POOL gpool WEIGHT(FILE_HEAT) Bob Oesterlin Sr Storage Engineer, Nuance HPC Grid 507-269-0413 From: on behalf of Andreas Landh?u?er Reply-To: gpfsug main discussion list Date: Tuesday, September 27, 2016 at 1:04 PM To: Marc A Kaplan , gpfsug main discussion list Subject: [EXTERNAL] Re: [gpfsug-discuss] File_heat for GPFS File Systems as far as I understand, if a file gets hot again, there is no rule for putting the file back into a faster storage device? -------------- next part -------------- An HTML attachment was scrubbed... URL: From volobuev at us.ibm.com Tue Sep 27 19:26:46 2016 From: volobuev at us.ibm.com (Yuri L Volobuev) Date: Tue, 27 Sep 2016 11:26:46 -0700 Subject: [gpfsug-discuss] Blocksize In-Reply-To: <6C51B04A-9097-4598-8B4A-C484A3D98EE2@vanderbilt.edu> References: <4890d1b0-2f12-f8b6-684d-c98ca2b71ab7@docum.org><17781503-26B3-4448-B7B9-1EE27ABE6D1F@ulmer.org><13272A1B-425D-4DDD-A931-490604F92D61@ulmer.org> <6C51B04A-9097-4598-8B4A-C484A3D98EE2@vanderbilt.edu> Message-ID: > 1) Let?s assume that our overarching goal in configuring the block > size for metadata is performance from the user perspective ? i.e. > how fast is an ?ls -l? on my directory? Space savings aren?t > important, and how long policy scans or other ?administrative? type > tasks take is not nearly as important as that directory listing. > Does that change the recommended metadata block size? The performance challenges for the "ls -l" scenario are quite different from the policy scan scenario, so the same rules do not necessarily apply. During "ls -l" the code has to read inodes one by one (there's some prefetching going on, to take the edge off for the actual 'ls' thread, but prefetching is still one inode at a time). Metadata block size doesn't really come into the picture in this case, but inode size can be important -- depending on the storage performance characteristics. Does the storage you use for metadata exhibit a meaningfully different latency for 4K random reads vs 512 byte random reads? In my personal experience, on any modern storage device the difference is non-existent; in fact many devices (like all flash-based storage) use 4K native physical block size, and merely emulate 512 byte "sectors", so there's no way to read less than 4K. So from the inode read latency point of view 4K vs 512B is most likely a wash, but then 4K inodes can help improve performance of other operations, e.g. readdir of a small directory which fits entirely into the inode. If you use xattrs (e.g. as a side effect of using HSM), 4K inodes definitely help, but allowing xattrs to be stored in the inode. Policy scans reads inodes in full blocks, and there both metadata block size and inode size matter. Larger blocks could improve the inode read performance, while larger inodes mean that the same number of blocks hold fewer inodes and thus more blocks need to be read. So the policy inode scan performance is benefited by larger metadata block size and smaller inodes. 
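(As a back-of-the-envelope illustration of that trade-off -- simple arithmetic, not a measurement, ignoring partially populated blocks and prefetch:)

inodes per full-block read = metadata block size / inode size
1 MiB / 4 KiB = 256 inodes per block read
1 MiB / 512 B = 2048 inodes per block read

So for, say, 100 million allocated inodes the inode-scan phase needs on the order of 390,000 block reads with 4K inodes versus roughly 49,000 with 512-byte inodes; the directory-traversal phase discussed next is a separate cost.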
However, policy scans also have to perform a directory traversal step, and that step tends to dominate the runtime of the overall run, and using larger inodes actually helps to speed up traversal of smaller directories. So whether larger inodes help or hurt the policy scan performance depends, yet again, on your file system composition. Overall, I believe that with all angles considered, larger inodes help with performance, and that was one of the considerations for making 4K inodes the default in V4.2+ versions. > 2) Let?s assume we have 3 filesystems, /home, /scratch (traditional > HPC use for those two) and /data (project space). Our storage > arrays are 24-bay units with two 8+2P RAID 6 LUNs, one RAID 1 > mirror, and two hot spare drives. The RAID 1 mirrors are for /home, > the RAID 6 LUNs are for /scratch or /data. /home has tons of small > files - so small that a 64K block size is currently used. /scratch > and /data have a mixture, but a 1 MB block size is the ?sweet spot? there. > > If you could ?start all over? with the same hardware being the only > restriction, would you: > > a) merge /scratch and /data into one filesystem but keep /home > separate since the LUN sizes are so very different, or > b) merge all three into one filesystem and use storage pools so that > /home is just a separate pool within the one filesystem? And if you > chose this option would you assign different block sizes to the pools? It's not possible to have different block sizes for different data pools. We are very aware that many people would like to be able to do just that, but this is counter to where the product is going. Supporting different block sizes for different pools is actually pretty hard: it's tricky to describe a large file that has some blocks in poolA and some in poolB where poolB has a different block size (perhaps during a migration) with the existing inode/indirect block format where each disk address pointer addresses a block of fixed size. With some effort, and some changes to how block addressing works, it would be possible to implement the support for this. However, as I mentioned in another post in this thread, we don't really want to glorify manual block size selection any further, we want to move away from it, by addressing the reasons that drive different block size selection today (like disk space utilization and performance). I'd recommend calculating a file size distribution histogram for your file systems. You may, for example, discover that 80% of the small files you have in /home would fit into 4K inodes, and then the storage space efficiency gains for the remaining 20% don't justify the complexity of managing an extra file system with a small block size. We don't recommend using block sizes smaller than 256K, because smaller block size is not good for disk space allocation code efficiency. It's a quadratic dependency: with a smaller block size, one block worth of the block allocation map covers that much less disk space, because each bit in the map covers fewer disk sectors, and fewer bits fit into a block. This means having to create a lot more block allocation map segments than what is needed for an ample level of parallelism. This hurts performance of many block allocation-related operations. I don't see a reason for /scratch and /data to be separate file systems, aside from perhaps failure domain considerations. 
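(To make the quadratic dependency concrete -- a rough sketch that assumes one allocation bit per subblock (1/32nd of a block) and ignores per-block header overhead:)

disk space covered by one allocation-map block
  ~= (bits per map block) x (space tracked per bit)
  ~= (blockSize x 8) x (blockSize / 32)
  =  blockSize^2 / 4

Going from a 256 KiB block size to 1 MiB (4x) therefore raises the coverage of each map block roughly 16x (on the order of 16 GiB versus 256 GiB per map block under these assumptions), which is why small block sizes multiply the number of allocation map segments.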
yuri > On Sep 26, 2016, at 2:29 PM, Yuri L Volobuev wrote: > > I would put the net summary this way: in GPFS, the "Goldilocks zone" > for metadata block size is 256K - 1M. If one plans to create a new > file system using GPFS V4.2+, 1M is a sound choice. > > In an ideal world, block size choice shouldn't really be a choice. > It's a low-level implementation detail that one day should go the > way of the manual ignition timing adjustment -- something that used > to be necessary in the olden days, and something that select > enthusiasts like to tweak to this day, but something that's > irrelevant for the overwhelming majority of the folks who just want > the engine to run. There's work being done in that general direction > in GPFS, but we aren't there yet. > > yuri > > Stephen Ulmer ---09/26/2016 12:02:25 PM---Now I?ve got > anther question? which I?ll let bake for a while. Okay, to (poorly) summarize: > > From: Stephen Ulmer > To: gpfsug main discussion list , > Date: 09/26/2016 12:02 PM > Subject: Re: [gpfsug-discuss] Blocksize > Sent by: gpfsug-discuss-bounces at spectrumscale.org > > > > Now I?ve got anther question? which I?ll let bake for a while. > > Okay, to (poorly) summarize: > There are items OTHER THAN INODES stored as metadata in GPFS. > These items have a VARIETY OF SIZES, but are packed in such a way > that we should just not worry about wasted space unless we pick a > LARGE metadata block size ? or if we don?t pick a ?reasonable? > metadata block size after picking a ?large? file system block size > that applies to both. > Performance is hard, and the gain from calculating exactly the best > metadata block size is much smaller than performance gains attained > through code optimization. > If we were to try and calculate the appropriate metadata block size > we would likely be wrong anyway, since none of us get our data at > the idealized physics shop that sells massless rulers and > frictionless pulleys. > We should probably all use a metadata block size around 1MB. Nobody > has said this outright, but it?s been the example as the ?good? size > at least three times in this thread. > Under no circumstances should we do what many of us would have done > and pick 128K, which made sense based on all of our previous > education that is no longer applicable. > > Did I miss anything? :) > > Liberty, > > -- > Stephen > > On Sep 26, 2016, at 2:18 PM, Yuri L Volobuev wrote: > It's important to understand the differences between different > metadata types, in particular where it comes to space allocation. > > System metadata files (inode file, inode and block allocation maps, > ACL file, fileset metadata file, EA file in older versions) are > allocated at well-defined moments (file system format, new storage > pool creation in the case of block allocation map, etc), and those > contain multiple records packed into a single block. From the block > allocator point of view, the individual metadata record size is > invisible, only larger blocks get actually allocated, and space > usage efficiency generally isn't an issue. > > For user metadata (indirect blocks, directory blocks, EA overflow > blocks) the situation is different. Those get allocated as the need > arises, generally one at a time. So the size of an individual > metadata structure matters, a lot. The smallest unit of allocation > in GPFS is a subblock (1/32nd of a block). If an IB or a directory > block is smaller than a subblock, the unused space in the subblock > is wasted. 
So if one chooses to use, say, 16 MiB block size for > metadata, the smallest unit of space that can be allocated is 512 > KiB. If one chooses 1 MiB block size, the smallest allocation unit > is 32 KiB. IBs are generally 16 KiB or 32 KiB in size (32 KiB with > any reasonable data block size); directory blocks used to be limited > to 32 KiB, but in the current code can be as large as 256 KiB. As > one can observe, using 16 MiB metadata block size would lead to a > considerable amount of wasted space for IBs and large directories > (small directories can live in inodes). On the other hand, with 1 > MiB block size, there'll be no wasted metadata space. Does any of > this actually make a practical difference? That depends on the file > system composition, namely the number of IBs (which is a function of > the number of large files) and larger directories. Calculating this > scientifically can be pretty involved, and really should be the job > of a tool that ought to exist, but doesn't (yet). A more practical > approach is doing a ballpark estimate using local file counts and > typical fractions of large files and directories, using statistics > available from published papers. > > The performance implications of a given metadata block size choice > is a subject of nearly infinite depth, and this question ultimately > can only be answered by doing experiments with a specific workload > on specific hardware. The metadata space utilization efficiency is > something that can be answered conclusively though. > > yuri > > "Buterbaugh, Kevin L" ---09/24/2016 07:19:09 AM---Hi > Sven, I am confused by your statement that the metadata block size > should be 1 MB and am very int > > From: "Buterbaugh, Kevin L" > To: gpfsug main discussion list , > Date: 09/24/2016 07:19 AM > Subject: Re: [gpfsug-discuss] Blocksize > Sent by: gpfsug-discuss-bounces at spectrumscale.org > > > > > Hi Sven, > > I am confused by your statement that the metadata block size should > be 1 MB and am very interested in learning the rationale behind this > as I am currently looking at all aspects of our current GPFS > configuration and the possibility of making major changes. > > If you have a filesystem with only metadataOnly disks in the system > pool and the default size of an inode is 4K (which we would do, > since we have recently discovered that even on our scratch > filesystem we have a bazillion files that are 4K or smaller and > could therefore have their data stored in the inode, right?), then > why would you set the metadata block size to anything larger than > 128K when a sub-block is 1/32nd of a block? I.e., with a 1 MB block > size for metadata wouldn?t you be wasting a massive amount of space? > > What am I missing / confused about there? > > Oh, and here?s a related question ? let?s just say I have the above > configuration ? my system pool is metadata only and is on SSD?s. > Then I have two other dataOnly pools that are spinning disk. One is > for ?regular? access and the other is the ?capacity? pool ? i.e. a > pool of slower storage where we move files with large access times. > I have a policy that says something like ?move all files with an > access time > 6 months to the capacity pool.? Of those bazillion > files less than 4K in size that are fitting in the inode currently, > probably half a bazillion () of them would be subject to that > rule. Will they get moved to the spinning disk capacity pool or will > they stay in the inode?? > > Thanks! This is a very timely and interesting discussion for me as well... 
> > Kevin > On Sep 23, 2016, at 4:35 PM, Sven Oehme wrote: > your metadata block size these days should be 1 MB and there are > only very few workloads for which you should run with a filesystem > blocksize below 1 MB. so if you don't know exactly what to pick, 1 > MB is a good starting point. > the general rule still applies that your filesystem blocksize > (metadata or data pool) should match your raid controller (or GNR > vdisk) stripe size of the particular pool. > > so if you use a 128k strip size(defaut in many midrange storage > controllers) in a 8+2p raid array, your stripe or track size is 1 MB > and therefore the blocksize of this pool should be 1 MB. i see many > customers in the field using 1MB or even smaller blocksize on RAID > stripes of 2 MB or above and your performance will be significant > impacted by that. > > Sven > > ------------------------------------------ > Sven Oehme > Scalable Storage Research > email: oehmes at us.ibm.com > Phone: +1 (408) 824-8904 > IBM Almaden Research Lab > ------------------------------------------ > > Stephen Ulmer ---09/23/2016 12:16:34 PM---Not to be too > pedantic, but I believe the the subblock size is 1/32 of the block > size (which strengt > > From: Stephen Ulmer > To: gpfsug main discussion list > Date: 09/23/2016 12:16 PM > Subject: Re: [gpfsug-discuss] Blocksize > Sent by: gpfsug-discuss-bounces at spectrumscale.org > > > > > Not to be too pedantic, but I believe the the subblock size is 1/32 > of the block size (which strengthens Luis?s arguments below). > > I thought the the original question was NOT about inode size, but > about metadata block size. You can specify that the system pool have > a different block size from the rest of the filesystem, providing > that it ONLY holds metadata (?metadata-block-size option to mmcrfs). > > So with 4K inodes (which should be used for all new filesystems > without some counter-indication), I would think that we?d want to > use a metadata block size of 4K*32=128K. This is independent of the > regular block size, which you can calculate based on the workload if > you?re lucky. > > There could be a great reason NOT to use 128K metadata block size, > but I don?t know what it is. I?d be happy to be corrected about this > if it?s out of whack. > > -- > Stephen > On Sep 22, 2016, at 3:37 PM, Luis Bolinches wrote: > > Hi > > My 2 cents. > > Leave at least 4K inodes, then you get massive improvement on small > files (less 3.5K minus whatever you use on xattr) > > About blocksize for data, unless you have actual data that suggest > that you will actually benefit from smaller than 1MB block, leave > there. GPFS uses sublocks where 1/16th of the BS can be allocated to > different files, so the "waste" is much less than you think on 1MB > and you get the throughput and less structures of much more data blocks. > > No warranty at all but I try to do this when the BS talk comes in: > (might need some clean up it could not be last note but you get the idea) > > POSIX > find . 
-type f -name '*' -exec ls -l {} \; > find_ls_files.out > GPFS > cd /usr/lpp/mmfs/samples/ilm > gcc mmfindUtil_processOutputFile.c -o mmfindUtil_processOutputFile > ./mmfind /gpfs/shared -ls -type f > find_ls_files.out > CONVERT to CSV > > POSIX > cat find_ls_files.out | awk '{print $5","}' > find_ls_files.out.csv > GPFS > cat find_ls_files.out | awk '{print $7","}' > find_ls_files.out.csv > LOAD in octave > > FILESIZE = int32 (dlmread ("find_ls_files.out.csv", ",")); > Clean the second column (OPTIONAL as the next clean up will do the same) > > FILESIZE(:,[2]) = []; > If we are on 4K aligment we need to clean the files that go to > inodes (WELL not exactly ... extended attributes! so maybe use a > lower number!) > > FILESIZE(FILESIZE<=3584) =[]; > If we are not we need to clean the 0 size files > > FILESIZE(FILESIZE==0) =[]; > Median > > FILESIZEMEDIAN = int32 (median (FILESIZE)) > Mean > > FILESIZEMEAN = int32 (mean (FILESIZE)) > Variance > > int32 (var (FILESIZE)) > iqr interquartile range, i.e., the difference between the upper and > lower quartile, of the input data. > > int32 (iqr (FILESIZE)) > Standard deviation > > > For some FS with lots of files you might need a rather powerful > machine to run the calculations on octave, I never hit anything > could not manage on a 64GB RAM Power box. Most of the times it is > enough with my laptop. > > > > -- > Yst?v?llisin terveisin / Kind regards / Saludos cordiales / Salutations > > Luis Bolinches > Lab Services > http://www-03.ibm.com/systems/services/labservices/ > > IBM Laajalahdentie 23 (main Entrance) Helsinki, 00330 Finland > Phone: +358 503112585 > > "If you continually give you will continually have." Anonymous > > > ----- Original message ----- > From: Stef Coene > Sent by: gpfsug-discuss-bounces at spectrumscale.org > To: gpfsug main discussion list > Cc: > Subject: Re: [gpfsug-discuss] Blocksize > Date: Thu, Sep 22, 2016 10:30 PM > > On 09/22/2016 09:07 PM, J. Eric Wonderley wrote: > > It defaults to 4k: > > mmlsfs testbs8M -i > > flag value description > > ------------------- ------------------------ > > ----------------------------------- > > -i 4096 Inode size in bytes > > > > I think you can make as small as 512b. Gpfs will store very small > > files in the inode. > > > > Typically you want your average file size to be your blocksize and your > > filesystem has one blocksize and one inodesize. > > The files are not small, but around 20 MB on average. > So I calculated with IBM that a 1 MB or 2 MB block size is best. > > But I'm not sure if it's better to use a smaller block size for the > metadata. > > The file system is not that large (400 TB) and will hold backup data > from CommVault. > > > Stef > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > > > Ellei edell? 
ole toisin mainittu: / Unless stated otherwise above: > Oy IBM Finland Ab > PL 265, 00101 Helsinki, Finland > Business ID, Y-tunnus: 0195876-3 > Registered in Finland > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss -------------- next part -------------- An HTML attachment was scrubbed... URL: From chekh at stanford.edu Tue Sep 27 19:51:50 2016 From: chekh at stanford.edu (Alex Chekholko) Date: Tue, 27 Sep 2016 11:51:50 -0700 Subject: [gpfsug-discuss] File_heat for GPFS File Systems In-Reply-To: References: Message-ID: <8c7fd395-1efc-7197-4a98-763ba784cafd@stanford.edu> On 09/27/2016 11:04 AM, Andreas Landh?u?er wrote: > if a file gets hot again, there is no rule for putting the file back > into a faster storage device? The file will get moved when you run the policy again. You can run the policy as often as you like. There is also a way to use a GPFS hook to trigger policy run. Check 'mmaddcallback' But I think you have to be careful and think through the complexity. e.g. load spikes and pool fills up and your callback kicks in and starts a migration which increases the I/O load further, etc... Regards, -- Alex Chekholko chekh at stanford.edu From makaplan at us.ibm.com Tue Sep 27 20:27:47 2016 From: makaplan at us.ibm.com (Marc A Kaplan) Date: Tue, 27 Sep 2016 15:27:47 -0400 Subject: [gpfsug-discuss] File_heat for GPFS File Systems In-Reply-To: References: Message-ID: Read about GROUP POOL - you can call as often as you like to "repack" the files into several pools from hot to cold. Of course, there is a cost to running mmapplypolicy... So maybe you'd just run it once every day or so... -------------- next part -------------- An HTML attachment was scrubbed... URL: From xhejtman at ics.muni.cz Tue Sep 27 20:38:16 2016 From: xhejtman at ics.muni.cz (Lukas Hejtmanek) Date: Tue, 27 Sep 2016 21:38:16 +0200 Subject: [gpfsug-discuss] Samba via CES Message-ID: <20160927193815.kpppiy76wpudg6cj@ics.muni.cz> Hello, does CES offer high availability of SMB? I.e., does used samba server provide cluster wide persistent handles? Or failover from node to node is currently not supported without the client disruption? -- Luk?? 
Hejtm?nek From erich at uw.edu Tue Sep 27 21:56:20 2016 From: erich at uw.edu (Eric Horst) Date: Tue, 27 Sep 2016 13:56:20 -0700 Subject: [gpfsug-discuss] File_heat for GPFS File Systems In-Reply-To: <8c7fd395-1efc-7197-4a98-763ba784cafd@stanford.edu> References: <8c7fd395-1efc-7197-4a98-763ba784cafd@stanford.edu> Message-ID: >> >> if a file gets hot again, there is no rule for putting the file back >> into a faster storage device? > > > The file will get moved when you run the policy again. You can run the > policy as often as you like. I think its worth stating clearly that if a file is in the Thrifty slow pool and a user opens and reads/writes the file there is nothing that moves this file to a different tier. A policy run is the only action that relocates files. So if you apply the policy daily and over the course of the day users access many cold files, the performance accessing those cold files may not be ideal until the next day when they are repacked by heat. A file is not automatically moved to the fast tier on access read or write. I mention this because this aspect of tiering was not immediately clear from the docs when I was a neophyte GPFS admin and I had to learn by observation. It is easy for one to make an assumption that it is a more dynamic tiering system than it is. -Eric -- Eric Horst University of Washington From Kevin.Buterbaugh at Vanderbilt.Edu Tue Sep 27 22:21:23 2016 From: Kevin.Buterbaugh at Vanderbilt.Edu (Buterbaugh, Kevin L) Date: Tue, 27 Sep 2016 21:21:23 +0000 Subject: [gpfsug-discuss] Blocksize In-Reply-To: References: <4890d1b0-2f12-f8b6-684d-c98ca2b71ab7@docum.org> <17781503-26B3-4448-B7B9-1EE27ABE6D1F@ulmer.org> <13272A1B-425D-4DDD-A931-490604F92D61@ulmer.org> <6C51B04A-9097-4598-8B4A-C484A3D98EE2@vanderbilt.edu> Message-ID: <4BDC0E5A-176A-48CC-8DE6-93C7B5A3F138@vanderbilt.edu> Hi All, Again, my thanks to all who responded to my last post. Let me begin by stating something I unintentionally omitted in my last post ? we already use SSDs for our metadata. Which leads me to yet another question ? of my three filesystems, two (/home and /scratch) are much older (created in 2010) and therefore currently have a 512 byte inode size. /data is newer and has a 4K inode size. Now if I combine /scratch and /data into one filesystem with a 4K inode size, the amount of space used by all the inodes coming over from /scratch is going to grow by a factor of eight unless I?m horribly confused. And I would assume I need to count the amount of space taken up by allocated inodes, not just used inodes. Therefore ? how much space my metadata takes up just grew significantly in importance since: 1) metadata is on very expensive enterprise class, vendor certified SSDs, 2) we use RAID 1 mirrors of those SSDs, and 3) we have metadata replication set to two. Some of the information presented by Sven and Yuri seems to contradict each other in regards to how much space inodes take up ? or I?m misunderstanding one or both of them! Leaving aside replication, if I use a 256K block size for my metadata and I use 4K inodes, are those inodes going to take up 4K each or are they going to take up 8K each (1/32nd of a 256K block)? By the way, I do have a file size / file age spreadsheet for each of my filesystems (which I would be willing to share with interested parties) and while I was not surprised to learn that I have over 10 million sub-1K files on /home, I was a bit surprised to find that I have almost as many sub-1K files on /scratch (and a few million more on /data). 
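To put a rough number on that, counting allocated rather than just used inodes, a sketch along these lines works; the device name is illustrative and the awk pattern matches the "Number of allocated inodes" line of mmdf's Inode Information section as printed on the releases we see, so adjust it if your output differs:

fs=scratch                                            # illustrative device name
alloc=$(mmdf "$fs" | awk '/allocated inodes/ {print $NF}')
# allocated inodes x 4096-byte inodes x 2 metadata replicas, reported in GiB
echo "$alloc" | awk '{ printf "approx. inode metadata at 4K inodes, 2 replicas: %.1f GiB\n", $1*4096*2/2^30 }'

Whatever the exact figure, it is the allocated-inode count that drives the SSD and replication cost described above.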
So there?s a huge potential win in having those files in the inode on SSD as opposed to on spinning disk, but there?s also a huge potential $$$ cost. Thanks again ? I hope others are gaining useful information from this thread. I sure am! Kevin On Sep 27, 2016, at 1:26 PM, Yuri L Volobuev > wrote: > 1) Let?s assume that our overarching goal in configuring the block > size for metadata is performance from the user perspective ? i.e. > how fast is an ?ls -l? on my directory? Space savings aren?t > important, and how long policy scans or other ?administrative? type > tasks take is not nearly as important as that directory listing. > Does that change the recommended metadata block size? The performance challenges for the "ls -l" scenario are quite different from the policy scan scenario, so the same rules do not necessarily apply. During "ls -l" the code has to read inodes one by one (there's some prefetching going on, to take the edge off for the actual 'ls' thread, but prefetching is still one inode at a time). Metadata block size doesn't really come into the picture in this case, but inode size can be important -- depending on the storage performance characteristics. Does the storage you use for metadata exhibit a meaningfully different latency for 4K random reads vs 512 byte random reads? In my personal experience, on any modern storage device the difference is non-existent; in fact many devices (like all flash-based storage) use 4K native physical block size, and merely emulate 512 byte "sectors", so there's no way to read less than 4K. So from the inode read latency point of view 4K vs 512B is most likely a wash, but then 4K inodes can help improve performance of other operations, e.g. readdir of a small directory which fits entirely into the inode. If you use xattrs (e.g. as a side effect of using HSM), 4K inodes definitely help, but allowing xattrs to be stored in the inode. Policy scans reads inodes in full blocks, and there both metadata block size and inode size matter. Larger blocks could improve the inode read performance, while larger inodes mean that the same number of blocks hold fewer inodes and thus more blocks need to be read. So the policy inode scan performance is benefited by larger metadata block size and smaller inodes. However, policy scans also have to perform a directory traversal step, and that step tends to dominate the runtime of the overall run, and using larger inodes actually helps to speed up traversal of smaller directories. So whether larger inodes help or hurt the policy scan performance depends, yet again, on your file system composition. Overall, I believe that with all angles considered, larger inodes help with performance, and that was one of the considerations for making 4K inodes the default in V4.2+ versions. > 2) Let?s assume we have 3 filesystems, /home, /scratch (traditional > HPC use for those two) and /data (project space). Our storage > arrays are 24-bay units with two 8+2P RAID 6 LUNs, one RAID 1 > mirror, and two hot spare drives. The RAID 1 mirrors are for /home, > the RAID 6 LUNs are for /scratch or /data. /home has tons of small > files - so small that a 64K block size is currently used. /scratch > and /data have a mixture, but a 1 MB block size is the ?sweet spot? there. > > If you could ?start all over? 
with the same hardware being the only > restriction, would you: > > a) merge /scratch and /data into one filesystem but keep /home > separate since the LUN sizes are so very different, or > b) merge all three into one filesystem and use storage pools so that > /home is just a separate pool within the one filesystem? And if you > chose this option would you assign different block sizes to the pools? It's not possible to have different block sizes for different data pools. We are very aware that many people would like to be able to do just that, but this is counter to where the product is going. Supporting different block sizes for different pools is actually pretty hard: it's tricky to describe a large file that has some blocks in poolA and some in poolB where poolB has a different block size (perhaps during a migration) with the existing inode/indirect block format where each disk address pointer addresses a block of fixed size. With some effort, and some changes to how block addressing works, it would be possible to implement the support for this. However, as I mentioned in another post in this thread, we don't really want to glorify manual block size selection any further, we want to move away from it, by addressing the reasons that drive different block size selection today (like disk space utilization and performance). I'd recommend calculating a file size distribution histogram for your file systems. You may, for example, discover that 80% of the small files you have in /home would fit into 4K inodes, and then the storage space efficiency gains for the remaining 20% don't justify the complexity of managing an extra file system with a small block size. We don't recommend using block sizes smaller than 256K, because smaller block size is not good for disk space allocation code efficiency. It's a quadratic dependency: with a smaller block size, one block worth of the block allocation map covers that much less disk space, because each bit in the map covers fewer disk sectors, and fewer bits fit into a block. This means having to create a lot more block allocation map segments than what is needed for an ample level of parallelism. This hurts performance of many block allocation-related operations. I don't see a reason for /scratch and /data to be separate file systems, aside from perhaps failure domain considerations. yuri > On Sep 26, 2016, at 2:29 PM, Yuri L Volobuev > wrote: > > I would put the net summary this way: in GPFS, the "Goldilocks zone" > for metadata block size is 256K - 1M. If one plans to create a new > file system using GPFS V4.2+, 1M is a sound choice. > > In an ideal world, block size choice shouldn't really be a choice. > It's a low-level implementation detail that one day should go the > way of the manual ignition timing adjustment -- something that used > to be necessary in the olden days, and something that select > enthusiasts like to tweak to this day, but something that's > irrelevant for the overwhelming majority of the folks who just want > the engine to run. There's work being done in that general direction > in GPFS, but we aren't there yet. > > yuri > > Stephen Ulmer ---09/26/2016 12:02:25 PM---Now I?ve got > anther question? which I?ll let bake for a while. Okay, to (poorly) summarize: > > From: Stephen Ulmer > > To: gpfsug main discussion list >, > Date: 09/26/2016 12:02 PM > Subject: Re: [gpfsug-discuss] Blocksize > Sent by: gpfsug-discuss-bounces at spectrumscale.org > > > > Now I?ve got anther question? which I?ll let bake for a while. 
> > Okay, to (poorly) summarize: > There are items OTHER THAN INODES stored as metadata in GPFS. > These items have a VARIETY OF SIZES, but are packed in such a way > that we should just not worry about wasted space unless we pick a > LARGE metadata block size ? or if we don?t pick a ?reasonable? > metadata block size after picking a ?large? file system block size > that applies to both. > Performance is hard, and the gain from calculating exactly the best > metadata block size is much smaller than performance gains attained > through code optimization. > If we were to try and calculate the appropriate metadata block size > we would likely be wrong anyway, since none of us get our data at > the idealized physics shop that sells massless rulers and > frictionless pulleys. > We should probably all use a metadata block size around 1MB. Nobody > has said this outright, but it?s been the example as the ?good? size > at least three times in this thread. > Under no circumstances should we do what many of us would have done > and pick 128K, which made sense based on all of our previous > education that is no longer applicable. > > Did I miss anything? :) > > Liberty, > > -- > Stephen > > On Sep 26, 2016, at 2:18 PM, Yuri L Volobuev > wrote: > It's important to understand the differences between different > metadata types, in particular where it comes to space allocation. > > System metadata files (inode file, inode and block allocation maps, > ACL file, fileset metadata file, EA file in older versions) are > allocated at well-defined moments (file system format, new storage > pool creation in the case of block allocation map, etc), and those > contain multiple records packed into a single block. From the block > allocator point of view, the individual metadata record size is > invisible, only larger blocks get actually allocated, and space > usage efficiency generally isn't an issue. > > For user metadata (indirect blocks, directory blocks, EA overflow > blocks) the situation is different. Those get allocated as the need > arises, generally one at a time. So the size of an individual > metadata structure matters, a lot. The smallest unit of allocation > in GPFS is a subblock (1/32nd of a block). If an IB or a directory > block is smaller than a subblock, the unused space in the subblock > is wasted. So if one chooses to use, say, 16 MiB block size for > metadata, the smallest unit of space that can be allocated is 512 > KiB. If one chooses 1 MiB block size, the smallest allocation unit > is 32 KiB. IBs are generally 16 KiB or 32 KiB in size (32 KiB with > any reasonable data block size); directory blocks used to be limited > to 32 KiB, but in the current code can be as large as 256 KiB. As > one can observe, using 16 MiB metadata block size would lead to a > considerable amount of wasted space for IBs and large directories > (small directories can live in inodes). On the other hand, with 1 > MiB block size, there'll be no wasted metadata space. Does any of > this actually make a practical difference? That depends on the file > system composition, namely the number of IBs (which is a function of > the number of large files) and larger directories. Calculating this > scientifically can be pretty involved, and really should be the job > of a tool that ought to exist, but doesn't (yet). A more practical > approach is doing a ballpark estimate using local file counts and > typical fractions of large files and directories, using statistics > available from published papers. 
> > The performance implications of a given metadata block size choice > is a subject of nearly infinite depth, and this question ultimately > can only be answered by doing experiments with a specific workload > on specific hardware. The metadata space utilization efficiency is > something that can be answered conclusively though. > > yuri > > "Buterbaugh, Kevin L" ---09/24/2016 07:19:09 AM---Hi > Sven, I am confused by your statement that the metadata block size > should be 1 MB and am very int > > From: "Buterbaugh, Kevin L" > > To: gpfsug main discussion list >, > Date: 09/24/2016 07:19 AM > Subject: Re: [gpfsug-discuss] Blocksize > Sent by: gpfsug-discuss-bounces at spectrumscale.org > > > > > Hi Sven, > > I am confused by your statement that the metadata block size should > be 1 MB and am very interested in learning the rationale behind this > as I am currently looking at all aspects of our current GPFS > configuration and the possibility of making major changes. > > If you have a filesystem with only metadataOnly disks in the system > pool and the default size of an inode is 4K (which we would do, > since we have recently discovered that even on our scratch > filesystem we have a bazillion files that are 4K or smaller and > could therefore have their data stored in the inode, right?), then > why would you set the metadata block size to anything larger than > 128K when a sub-block is 1/32nd of a block? I.e., with a 1 MB block > size for metadata wouldn?t you be wasting a massive amount of space? > > What am I missing / confused about there? > > Oh, and here?s a related question ? let?s just say I have the above > configuration ? my system pool is metadata only and is on SSD?s. > Then I have two other dataOnly pools that are spinning disk. One is > for ?regular? access and the other is the ?capacity? pool ? i.e. a > pool of slower storage where we move files with large access times. > I have a policy that says something like ?move all files with an > access time > 6 months to the capacity pool.? Of those bazillion > files less than 4K in size that are fitting in the inode currently, > probably half a bazillion () of them would be subject to that > rule. Will they get moved to the spinning disk capacity pool or will > they stay in the inode?? > > Thanks! This is a very timely and interesting discussion for me as well... > > Kevin > On Sep 23, 2016, at 4:35 PM, Sven Oehme > wrote: > your metadata block size these days should be 1 MB and there are > only very few workloads for which you should run with a filesystem > blocksize below 1 MB. so if you don't know exactly what to pick, 1 > MB is a good starting point. > the general rule still applies that your filesystem blocksize > (metadata or data pool) should match your raid controller (or GNR > vdisk) stripe size of the particular pool. > > so if you use a 128k strip size(defaut in many midrange storage > controllers) in a 8+2p raid array, your stripe or track size is 1 MB > and therefore the blocksize of this pool should be 1 MB. i see many > customers in the field using 1MB or even smaller blocksize on RAID > stripes of 2 MB or above and your performance will be significant > impacted by that. 
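Expressed as a create command, and purely as an illustration (device name, stanza file and mount point are made up), Sven's sizing works out to something like:

# 8+2P RAID6 with a 128 KiB strip -> 1 MiB full stripe, so a 1 MiB data block size;
# metadataOnly system pool on SSD, 4 KiB inodes, metadata replicated twice.
mmcrfs gpfs1 -F /tmp/nsd.stanza -B 1M --metadata-block-size 1M \
       -i 4096 -m 2 -M 2 -r 1 -R 2 -A yes -T /gpfs/gpfs1

The --metadata-block-size option only applies when the system pool holds metadata only, which is the configuration being discussed here.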
> > Sven > > ------------------------------------------ > Sven Oehme > Scalable Storage Research > email: oehmes at us.ibm.com > Phone: +1 (408) 824-8904 > IBM Almaden Research Lab > ------------------------------------------ > > Stephen Ulmer ---09/23/2016 12:16:34 PM---Not to be too > pedantic, but I believe the the subblock size is 1/32 of the block > size (which strengt > > From: Stephen Ulmer > > To: gpfsug main discussion list > > Date: 09/23/2016 12:16 PM > Subject: Re: [gpfsug-discuss] Blocksize > Sent by: gpfsug-discuss-bounces at spectrumscale.org > > > > > Not to be too pedantic, but I believe the the subblock size is 1/32 > of the block size (which strengthens Luis?s arguments below). > > I thought the the original question was NOT about inode size, but > about metadata block size. You can specify that the system pool have > a different block size from the rest of the filesystem, providing > that it ONLY holds metadata (?metadata-block-size option to mmcrfs). > > So with 4K inodes (which should be used for all new filesystems > without some counter-indication), I would think that we?d want to > use a metadata block size of 4K*32=128K. This is independent of the > regular block size, which you can calculate based on the workload if > you?re lucky. > > There could be a great reason NOT to use 128K metadata block size, > but I don?t know what it is. I?d be happy to be corrected about this > if it?s out of whack. > > -- > Stephen > On Sep 22, 2016, at 3:37 PM, Luis Bolinches > wrote: > > Hi > > My 2 cents. > > Leave at least 4K inodes, then you get massive improvement on small > files (less 3.5K minus whatever you use on xattr) > > About blocksize for data, unless you have actual data that suggest > that you will actually benefit from smaller than 1MB block, leave > there. GPFS uses sublocks where 1/16th of the BS can be allocated to > different files, so the "waste" is much less than you think on 1MB > and you get the throughput and less structures of much more data blocks. > > No warranty at all but I try to do this when the BS talk comes in: > (might need some clean up it could not be last note but you get the idea) > > POSIX > find . -type f -name '*' -exec ls -l {} \; > find_ls_files.out > GPFS > cd /usr/lpp/mmfs/samples/ilm > gcc mmfindUtil_processOutputFile.c -o mmfindUtil_processOutputFile > ./mmfind /gpfs/shared -ls -type f > find_ls_files.out > CONVERT to CSV > > POSIX > cat find_ls_files.out | awk '{print $5","}' > find_ls_files.out.csv > GPFS > cat find_ls_files.out | awk '{print $7","}' > find_ls_files.out.csv > LOAD in octave > > FILESIZE = int32 (dlmread ("find_ls_files.out.csv", ",")); > Clean the second column (OPTIONAL as the next clean up will do the same) > > FILESIZE(:,[2]) = []; > If we are on 4K aligment we need to clean the files that go to > inodes (WELL not exactly ... extended attributes! so maybe use a > lower number!) > > FILESIZE(FILESIZE<=3584) =[]; > If we are not we need to clean the 0 size files > > FILESIZE(FILESIZE==0) =[]; > Median > > FILESIZEMEDIAN = int32 (median (FILESIZE)) > Mean > > FILESIZEMEAN = int32 (mean (FILESIZE)) > Variance > > int32 (var (FILESIZE)) > iqr interquartile range, i.e., the difference between the upper and > lower quartile, of the input data. > > int32 (iqr (FILESIZE)) > Standard deviation > > > For some FS with lots of files you might need a rather powerful > machine to run the calculations on octave, I never hit anything > could not manage on a 64GB RAM Power box. Most of the times it is > enough with my laptop. 
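For what it is worth, the mean and median can also be pulled straight out of that CSV with awk, which avoids loading everything into octave; this assumes the find_ls_files.out.csv produced by the recipe above:

# strip the trailing commas, sort the byte sizes, then take mean and middle element
tr -d ',' < find_ls_files.out.csv | sort -n |
awk '{ a[NR] = $1; sum += $1 }
     END { if (NR) printf "files=%d  mean=%.0f  median=%d\n", NR, sum/NR, a[int((NR+1)/2)] }'

For an even file count this takes the lower of the two middle values, which is close enough for a block-size discussion.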
> > > > -- > Yst?v?llisin terveisin / Kind regards / Saludos cordiales / Salutations > > Luis Bolinches > Lab Services > http://www-03.ibm.com/systems/services/labservices/ > > IBM Laajalahdentie 23 (main Entrance) Helsinki, 00330 Finland > Phone: +358 503112585 > > "If you continually give you will continually have." Anonymous > > > ----- Original message ----- > From: Stef Coene > > Sent by: gpfsug-discuss-bounces at spectrumscale.org > To: gpfsug main discussion list > > Cc: > Subject: Re: [gpfsug-discuss] Blocksize > Date: Thu, Sep 22, 2016 10:30 PM > > On 09/22/2016 09:07 PM, J. Eric Wonderley wrote: > > It defaults to 4k: > > mmlsfs testbs8M -i > > flag value description > > ------------------- ------------------------ > > ----------------------------------- > > -i 4096 Inode size in bytes > > > > I think you can make as small as 512b. Gpfs will store very small > > files in the inode. > > > > Typically you want your average file size to be your blocksize and your > > filesystem has one blocksize and one inodesize. > > The files are not small, but around 20 MB on average. > So I calculated with IBM that a 1 MB or 2 MB block size is best. > > But I'm not sure if it's better to use a smaller block size for the > metadata. > > The file system is not that large (400 TB) and will hold backup data > from CommVault. > > > Stef > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > > > Ellei edell? ole toisin mainittu: / Unless stated otherwise above: > Oy IBM Finland Ab > PL 265, 00101 Helsinki, Finland > Business ID, Y-tunnus: 0195876-3 > Registered in Finland > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From christof.schmitt at us.ibm.com Tue Sep 27 22:36:37 2016 From: christof.schmitt at us.ibm.com (Christof Schmitt) Date: Tue, 27 Sep 2016 14:36:37 -0700 Subject: [gpfsug-discuss] Samba via CES In-Reply-To: <20160927193815.kpppiy76wpudg6cj@ics.muni.cz> References: <20160927193815.kpppiy76wpudg6cj@ics.muni.cz> Message-ID: When a CES node fails, protocol clients have to reconnect to one of the remaining nodes. Samba in CES does not support persistent handles. This is indicated in the documentation: http://www.ibm.com/support/knowledgecenter/en/STXKQY_4.2.1/com.ibm.spectrum.scale.v4r21.doc/bl1adm_smbexportlimits.htm#bl1adm_smbexportlimits "Only mandatory SMB3 protocol features are supported. " Regards, Christof Schmitt || IBM || Spectrum Scale Development || Tucson, AZ christof.schmitt at us.ibm.com || +1-520-799-2469 (T/L: 321-2469) From: Lukas Hejtmanek To: gpfsug-discuss at spectrumscale.org Date: 09/27/2016 12:38 PM Subject: [gpfsug-discuss] Samba via CES Sent by: gpfsug-discuss-bounces at spectrumscale.org Hello, does CES offer high availability of SMB? I.e., does used samba server provide cluster wide persistent handles? Or failover from node to node is currently not supported without the client disruption? -- Luk?? Hejtm?nek _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss From xhejtman at ics.muni.cz Tue Sep 27 22:42:57 2016 From: xhejtman at ics.muni.cz (Lukas Hejtmanek) Date: Tue, 27 Sep 2016 23:42:57 +0200 Subject: [gpfsug-discuss] Samba via CES In-Reply-To: References: <20160927193815.kpppiy76wpudg6cj@ics.muni.cz> Message-ID: <20160927214257.z4ezmssnpwhmm4rk@ics.muni.cz> On Tue, Sep 27, 2016 at 02:36:37PM -0700, Christof Schmitt wrote: > When a CES node fails, protocol clients have to reconnect to one of the > remaining nodes. > > Samba in CES does not support persistent handles. This is indicated in the > documentation: > > http://www.ibm.com/support/knowledgecenter/en/STXKQY_4.2.1/com.ibm.spectrum.scale.v4r21.doc/bl1adm_smbexportlimits.htm#bl1adm_smbexportlimits > > "Only mandatory SMB3 protocol features are supported. " well, but in this case, HA feature is a bit pointless as node fail results in a client failure as well as reconnect does not seem to be automatic if there is on going traffic.. more precisely reconnect is automatic but without persistent handles, the client receives write protect error immediately. -- Luk?? Hejtm?nek From Greg.Lehmann at csiro.au Wed Sep 28 08:40:35 2016 From: Greg.Lehmann at csiro.au (Greg.Lehmann at csiro.au) Date: Wed, 28 Sep 2016 07:40:35 +0000 Subject: [gpfsug-discuss] Blocksize In-Reply-To: <4BDC0E5A-176A-48CC-8DE6-93C7B5A3F138@vanderbilt.edu> References: <4890d1b0-2f12-f8b6-684d-c98ca2b71ab7@docum.org> <17781503-26B3-4448-B7B9-1EE27ABE6D1F@ulmer.org> <13272A1B-425D-4DDD-A931-490604F92D61@ulmer.org> <6C51B04A-9097-4598-8B4A-C484A3D98EE2@vanderbilt.edu> <4BDC0E5A-176A-48CC-8DE6-93C7B5A3F138@vanderbilt.edu> Message-ID: <428599f3d6cb47ebb74d05178eeba2b8@exch1-cdc.nexus.csiro.au> I am wondering what people use to produce a file size distribution report for their filesystems. Has everyone rolled their own or is there some goto app to use. 
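One roll-your-own sketch, assuming GNU find and an illustrative mount path; on a big filesystem a policy-engine scan, or the filehist sample mentioned later in this digest, will be much quicker than walking the tree:

# bucket file sizes into powers of two and count each bucket
find /gpfs/scratch -type f -printf '%s\n' 2>/dev/null |
awk '{ b = 1; while ($1 > b) b *= 2; h[b]++ }
     END { for (k in h) printf "%14d %d\n", k, h[k] }' | sort -n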
Cheers,

Greg

From: gpfsug-discuss-bounces at spectrumscale.org [mailto:gpfsug-discuss-bounces at spectrumscale.org] On Behalf Of Buterbaugh, Kevin L
Sent: Wednesday, 28 September 2016 7:21 AM
To: gpfsug main discussion list
Subject: Re: [gpfsug-discuss] Blocksize

[Kevin Buterbaugh's 27 September "Blocksize" message and the full quoted Volobuev / Oehme / Ulmer / Bolinches / Coene thread follow here, identical to the copies earlier in this digest.]

_______________________________________________
gpfsug-discuss mailing list
gpfsug-discuss at spectrumscale.org
http://gpfsug.org/mailman/listinfo/gpfsug-discuss
-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From alandhae at gmx.de Wed Sep 28 10:13:55 2016
From: alandhae at gmx.de (Andreas Landhäußer)
Date: Wed, 28 Sep 2016 11:13:55 +0200 (CEST)
Subject: [gpfsug-discuss] Proposal for dynamic heat assisted file tiering
Message-ID:
So if you apply the policy daily and over > the course of the day users access many cold files, the performance > accessing those cold files may not be ideal until the next day when > they are repacked by heat. A file is not automatically moved to the > fast tier on access read or write. I mention this because this aspect > of tiering was not immediately clear from the docs when I was a > neophyte GPFS admin and I had to learn by observation. It is easy for > one to make an assumption that it is a more dynamic tiering system > than it is. -- Andreas Landh?u?er +49 151 12133027 (mobile) alandhae at gmx.de From Robert.Oesterlin at nuance.com Wed Sep 28 11:56:51 2016 From: Robert.Oesterlin at nuance.com (Oesterlin, Robert) Date: Wed, 28 Sep 2016 10:56:51 +0000 Subject: [gpfsug-discuss] Blocksize - file size distribution Message-ID: /usr/lpp/mmfs/samples/debugtools/filehist Look at the README in that directory. Bob Oesterlin Sr Storage Engineer, Nuance HPC Grid From: on behalf of "Greg.Lehmann at csiro.au" Reply-To: gpfsug main discussion list Date: Wednesday, September 28, 2016 at 2:40 AM To: "gpfsug-discuss at spectrumscale.org" Subject: [EXTERNAL] Re: [gpfsug-discuss] Blocksize I am wondering what people use to produce a file size distribution report for their filesystems. Has everyone rolled their own or is there some goto app to use. Cheers, Greg From: gpfsug-discuss-bounces at spectrumscale.org [mailto:gpfsug-discuss-bounces at spectrumscale.org] On Behalf Of Buterbaugh, Kevin L Sent: Wednesday, 28 September 2016 7:21 AM To: gpfsug main discussion list Subject: Re: [gpfsug-discuss] Blocksize Hi All, Again, my thanks to all who responded to my last post. Let me begin by stating something I unintentionally omitted in my last post ? we already use SSDs for our metadata. Which leads me to yet another question ? of my three filesystems, two (/home and /scratch) are much older (created in 2010) and therefore currently have a 512 byte inode size. /data is newer and has a 4K inode size. Now if I combine /scratch and /data into one filesystem with a 4K inode size, the amount of space used by all the inodes coming over from /scratch is going to grow by a factor of eight unless I?m horribly confused. And I would assume I need to count the amount of space taken up by allocated inodes, not just used inodes. Therefore ? how much space my metadata takes up just grew significantly in importance since: 1) metadata is on very expensive enterprise class, vendor certified SSDs, 2) we use RAID 1 mirrors of those SSDs, and 3) we have metadata replication set to two. Some of the information presented by Sven and Yuri seems to contradict each other in regards to how much space inodes take up ? or I?m misunderstanding one or both of them! Leaving aside replication, if I use a 256K block size for my metadata and I use 4K inodes, are those inodes going to take up 4K each or are they going to take up 8K each (1/32nd of a 256K block)? By the way, I do have a file size / file age spreadsheet for each of my filesystems (which I would be willing to share with interested parties) and while I was not surprised to learn that I have over 10 million sub-1K files on /home, I was a bit surprised to find that I have almost as many sub-1K files on /scratch (and a few million more on /data). So there?s a huge potential win in having those files in the inode on SSD as opposed to on spinning disk, but there?s also a huge potential $$$ cost. Thanks again ? I hope others are gaining useful information from this thread. I sure am! 
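For a rough sense of scale, that inode-file growth can be sketched with simple arithmetic; the counts below are placeholders rather than real numbers (substitute the allocated-inode figures from mmdf -F or similar), and, as noted later in this thread, inodes are packed into the inode file, so the metadata block size does not multiply this estimate:

allocated_inodes=100000000   # allocated, not just used
inode_size=4096              # bytes
replicas=2                   # metadata replication factor
echo "$(( allocated_inodes * inode_size * replicas / 1024**3 )) GiB for the inode file alone"

With those placeholder values the answer is on the order of 760 GiB, which is why the allocated (not merely used) inode count is the number to watch when metadata lives on mirrored SSD.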
Kevin On Sep 27, 2016, at 1:26 PM, Yuri L Volobuev > wrote: > 1) Let?s assume that our overarching goal in configuring the block > size for metadata is performance from the user perspective ? i.e. > how fast is an ?ls -l? on my directory? Space savings aren?t > important, and how long policy scans or other ?administrative? type > tasks take is not nearly as important as that directory listing. > Does that change the recommended metadata block size? The performance challenges for the "ls -l" scenario are quite different from the policy scan scenario, so the same rules do not necessarily apply. During "ls -l" the code has to read inodes one by one (there's some prefetching going on, to take the edge off for the actual 'ls' thread, but prefetching is still one inode at a time). Metadata block size doesn't really come into the picture in this case, but inode size can be important -- depending on the storage performance characteristics. Does the storage you use for metadata exhibit a meaningfully different latency for 4K random reads vs 512 byte random reads? In my personal experience, on any modern storage device the difference is non-existent; in fact many devices (like all flash-based storage) use 4K native physical block size, and merely emulate 512 byte "sectors", so there's no way to read less than 4K. So from the inode read latency point of view 4K vs 512B is most likely a wash, but then 4K inodes can help improve performance of other operations, e.g. readdir of a small directory which fits entirely into the inode. If you use xattrs (e.g. as a side effect of using HSM), 4K inodes definitely help, but allowing xattrs to be stored in the inode. Policy scans reads inodes in full blocks, and there both metadata block size and inode size matter. Larger blocks could improve the inode read performance, while larger inodes mean that the same number of blocks hold fewer inodes and thus more blocks need to be read. So the policy inode scan performance is benefited by larger metadata block size and smaller inodes. However, policy scans also have to perform a directory traversal step, and that step tends to dominate the runtime of the overall run, and using larger inodes actually helps to speed up traversal of smaller directories. So whether larger inodes help or hurt the policy scan performance depends, yet again, on your file system composition. Overall, I believe that with all angles considered, larger inodes help with performance, and that was one of the considerations for making 4K inodes the default in V4.2+ versions. > 2) Let?s assume we have 3 filesystems, /home, /scratch (traditional > HPC use for those two) and /data (project space). Our storage > arrays are 24-bay units with two 8+2P RAID 6 LUNs, one RAID 1 > mirror, and two hot spare drives. The RAID 1 mirrors are for /home, > the RAID 6 LUNs are for /scratch or /data. /home has tons of small > files - so small that a 64K block size is currently used. /scratch > and /data have a mixture, but a 1 MB block size is the ?sweet spot? there. > > If you could ?start all over? with the same hardware being the only > restriction, would you: > > a) merge /scratch and /data into one filesystem but keep /home > separate since the LUN sizes are so very different, or > b) merge all three into one filesystem and use storage pools so that > /home is just a separate pool within the one filesystem? And if you > chose this option would you assign different block sizes to the pools? It's not possible to have different block sizes for different data pools. 
We are very aware that many people would like to be able to do just that, but this is counter to where the product is going. Supporting different block sizes for different pools is actually pretty hard: it's tricky to describe a large file that has some blocks in poolA and some in poolB where poolB has a different block size (perhaps during a migration) with the existing inode/indirect block format where each disk address pointer addresses a block of fixed size. With some effort, and some changes to how block addressing works, it would be possible to implement the support for this. However, as I mentioned in another post in this thread, we don't really want to glorify manual block size selection any further, we want to move away from it, by addressing the reasons that drive different block size selection today (like disk space utilization and performance). I'd recommend calculating a file size distribution histogram for your file systems. You may, for example, discover that 80% of the small files you have in /home would fit into 4K inodes, and then the storage space efficiency gains for the remaining 20% don't justify the complexity of managing an extra file system with a small block size. We don't recommend using block sizes smaller than 256K, because smaller block size is not good for disk space allocation code efficiency. It's a quadratic dependency: with a smaller block size, one block worth of the block allocation map covers that much less disk space, because each bit in the map covers fewer disk sectors, and fewer bits fit into a block. This means having to create a lot more block allocation map segments than what is needed for an ample level of parallelism. This hurts performance of many block allocation-related operations. I don't see a reason for /scratch and /data to be separate file systems, aside from perhaps failure domain considerations. yuri > On Sep 26, 2016, at 2:29 PM, Yuri L Volobuev > wrote: > > I would put the net summary this way: in GPFS, the "Goldilocks zone" > for metadata block size is 256K - 1M. If one plans to create a new > file system using GPFS V4.2+, 1M is a sound choice. > > In an ideal world, block size choice shouldn't really be a choice. > It's a low-level implementation detail that one day should go the > way of the manual ignition timing adjustment -- something that used > to be necessary in the olden days, and something that select > enthusiasts like to tweak to this day, but something that's > irrelevant for the overwhelming majority of the folks who just want > the engine to run. There's work being done in that general direction > in GPFS, but we aren't there yet. > > yuri > > Stephen Ulmer ---09/26/2016 12:02:25 PM---Now I?ve got > anther question? which I?ll let bake for a while. Okay, to (poorly) summarize: > > From: Stephen Ulmer > > To: gpfsug main discussion list >, > Date: 09/26/2016 12:02 PM > Subject: Re: [gpfsug-discuss] Blocksize > Sent by: gpfsug-discuss-bounces at spectrumscale.org > > > > Now I?ve got anther question? which I?ll let bake for a while. > > Okay, to (poorly) summarize: > There are items OTHER THAN INODES stored as metadata in GPFS. > These items have a VARIETY OF SIZES, but are packed in such a way > that we should just not worry about wasted space unless we pick a > LARGE metadata block size ? or if we don?t pick a ?reasonable? > metadata block size after picking a ?large? file system block size > that applies to both. 
> Performance is hard, and the gain from calculating exactly the best > metadata block size is much smaller than performance gains attained > through code optimization. > If we were to try and calculate the appropriate metadata block size > we would likely be wrong anyway, since none of us get our data at > the idealized physics shop that sells massless rulers and > frictionless pulleys. > We should probably all use a metadata block size around 1MB. Nobody > has said this outright, but it?s been the example as the ?good? size > at least three times in this thread. > Under no circumstances should we do what many of us would have done > and pick 128K, which made sense based on all of our previous > education that is no longer applicable. > > Did I miss anything? :) > > Liberty, > > -- > Stephen > > On Sep 26, 2016, at 2:18 PM, Yuri L Volobuev > wrote: > It's important to understand the differences between different > metadata types, in particular where it comes to space allocation. > > System metadata files (inode file, inode and block allocation maps, > ACL file, fileset metadata file, EA file in older versions) are > allocated at well-defined moments (file system format, new storage > pool creation in the case of block allocation map, etc), and those > contain multiple records packed into a single block. From the block > allocator point of view, the individual metadata record size is > invisible, only larger blocks get actually allocated, and space > usage efficiency generally isn't an issue. > > For user metadata (indirect blocks, directory blocks, EA overflow > blocks) the situation is different. Those get allocated as the need > arises, generally one at a time. So the size of an individual > metadata structure matters, a lot. The smallest unit of allocation > in GPFS is a subblock (1/32nd of a block). If an IB or a directory > block is smaller than a subblock, the unused space in the subblock > is wasted. So if one chooses to use, say, 16 MiB block size for > metadata, the smallest unit of space that can be allocated is 512 > KiB. If one chooses 1 MiB block size, the smallest allocation unit > is 32 KiB. IBs are generally 16 KiB or 32 KiB in size (32 KiB with > any reasonable data block size); directory blocks used to be limited > to 32 KiB, but in the current code can be as large as 256 KiB. As > one can observe, using 16 MiB metadata block size would lead to a > considerable amount of wasted space for IBs and large directories > (small directories can live in inodes). On the other hand, with 1 > MiB block size, there'll be no wasted metadata space. Does any of > this actually make a practical difference? That depends on the file > system composition, namely the number of IBs (which is a function of > the number of large files) and larger directories. Calculating this > scientifically can be pretty involved, and really should be the job > of a tool that ought to exist, but doesn't (yet). A more practical > approach is doing a ballpark estimate using local file counts and > typical fractions of large files and directories, using statistics > available from published papers. > > The performance implications of a given metadata block size choice > is a subject of nearly infinite depth, and this question ultimately > can only be answered by doing experiments with a specific workload > on specific hardware. The metadata space utilization efficiency is > something that can be answered conclusively though. 
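To make the 1/32nd arithmetic above concrete, a small sketch; the 32 KiB indirect block size is taken from the preceding paragraph and everything else is just the division:

# smallest allocatable unit is blocksize/32; compare it with a 32 KiB indirect block
for bs_kib in 256 1024 4096 16384; do
    sub_kib=$(( bs_kib / 32 ))
    waste_kib=$(( sub_kib > 32 ? sub_kib - 32 : 0 ))
    echo "${bs_kib} KiB block -> ${sub_kib} KiB subblock -> ${waste_kib} KiB wasted per 32 KiB IB"
done

At a 1 MiB block size the subblock is exactly 32 KiB, so nothing is stranded; at 16 MiB each indirect block wastes 480 KiB, which is the case being warned about.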
> > yuri > > "Buterbaugh, Kevin L" ---09/24/2016 07:19:09 AM---Hi > Sven, I am confused by your statement that the metadata block size > should be 1 MB and am very int > > From: "Buterbaugh, Kevin L" > > To: gpfsug main discussion list >, > Date: 09/24/2016 07:19 AM > Subject: Re: [gpfsug-discuss] Blocksize > Sent by: gpfsug-discuss-bounces at spectrumscale.org > > > > > Hi Sven, > > I am confused by your statement that the metadata block size should > be 1 MB and am very interested in learning the rationale behind this > as I am currently looking at all aspects of our current GPFS > configuration and the possibility of making major changes. > > If you have a filesystem with only metadataOnly disks in the system > pool and the default size of an inode is 4K (which we would do, > since we have recently discovered that even on our scratch > filesystem we have a bazillion files that are 4K or smaller and > could therefore have their data stored in the inode, right?), then > why would you set the metadata block size to anything larger than > 128K when a sub-block is 1/32nd of a block? I.e., with a 1 MB block > size for metadata wouldn?t you be wasting a massive amount of space? > > What am I missing / confused about there? > > Oh, and here?s a related question ? let?s just say I have the above > configuration ? my system pool is metadata only and is on SSD?s. > Then I have two other dataOnly pools that are spinning disk. One is > for ?regular? access and the other is the ?capacity? pool ? i.e. a > pool of slower storage where we move files with large access times. > I have a policy that says something like ?move all files with an > access time > 6 months to the capacity pool.? Of those bazillion > files less than 4K in size that are fitting in the inode currently, > probably half a bazillion () of them would be subject to that > rule. Will they get moved to the spinning disk capacity pool or will > they stay in the inode?? > > Thanks! This is a very timely and interesting discussion for me as well... > > Kevin > On Sep 23, 2016, at 4:35 PM, Sven Oehme > wrote: > your metadata block size these days should be 1 MB and there are > only very few workloads for which you should run with a filesystem > blocksize below 1 MB. so if you don't know exactly what to pick, 1 > MB is a good starting point. > the general rule still applies that your filesystem blocksize > (metadata or data pool) should match your raid controller (or GNR > vdisk) stripe size of the particular pool. > > so if you use a 128k strip size(defaut in many midrange storage > controllers) in a 8+2p raid array, your stripe or track size is 1 MB > and therefore the blocksize of this pool should be 1 MB. i see many > customers in the field using 1MB or even smaller blocksize on RAID > stripes of 2 MB or above and your performance will be significant > impacted by that. 
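As an illustration of that rule of thumb: with a 128 KiB strip on an 8+2P array, the full stripe is 8 x 128 KiB = 1 MiB, so a matching create might look like the sketch below. The device name and stanza file are placeholders, and --metadata-block-size only applies when the system pool holds metadata only:

# 8 data disks x 128 KiB strip = 1 MiB full stripe, so match the block size to it
mmcrfs gpfs1 -F /tmp/nsd.stanza -B 1M --metadata-block-size 1M -i 4096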
> > Sven > > ------------------------------------------ > Sven Oehme > Scalable Storage Research > email: oehmes at us.ibm.com > Phone: +1 (408) 824-8904 > IBM Almaden Research Lab > ------------------------------------------ > > Stephen Ulmer ---09/23/2016 12:16:34 PM---Not to be too > pedantic, but I believe the the subblock size is 1/32 of the block > size (which strengt > > From: Stephen Ulmer > > To: gpfsug main discussion list > > Date: 09/23/2016 12:16 PM > Subject: Re: [gpfsug-discuss] Blocksize > Sent by: gpfsug-discuss-bounces at spectrumscale.org > > > > > Not to be too pedantic, but I believe the the subblock size is 1/32 > of the block size (which strengthens Luis?s arguments below). > > I thought the the original question was NOT about inode size, but > about metadata block size. You can specify that the system pool have > a different block size from the rest of the filesystem, providing > that it ONLY holds metadata (?metadata-block-size option to mmcrfs). > > So with 4K inodes (which should be used for all new filesystems > without some counter-indication), I would think that we?d want to > use a metadata block size of 4K*32=128K. This is independent of the > regular block size, which you can calculate based on the workload if > you?re lucky. > > There could be a great reason NOT to use 128K metadata block size, > but I don?t know what it is. I?d be happy to be corrected about this > if it?s out of whack. > > -- > Stephen > On Sep 22, 2016, at 3:37 PM, Luis Bolinches > wrote: > > Hi > > My 2 cents. > > Leave at least 4K inodes, then you get massive improvement on small > files (less 3.5K minus whatever you use on xattr) > > About blocksize for data, unless you have actual data that suggest > that you will actually benefit from smaller than 1MB block, leave > there. GPFS uses sublocks where 1/16th of the BS can be allocated to > different files, so the "waste" is much less than you think on 1MB > and you get the throughput and less structures of much more data blocks. > > No warranty at all but I try to do this when the BS talk comes in: > (might need some clean up it could not be last note but you get the idea) > > POSIX > find . -type f -name '*' -exec ls -l {} \; > find_ls_files.out > GPFS > cd /usr/lpp/mmfs/samples/ilm > gcc mmfindUtil_processOutputFile.c -o mmfindUtil_processOutputFile > ./mmfind /gpfs/shared -ls -type f > find_ls_files.out > CONVERT to CSV > > POSIX > cat find_ls_files.out | awk '{print $5","}' > find_ls_files.out.csv > GPFS > cat find_ls_files.out | awk '{print $7","}' > find_ls_files.out.csv > LOAD in octave > > FILESIZE = int32 (dlmread ("find_ls_files.out.csv", ",")); > Clean the second column (OPTIONAL as the next clean up will do the same) > > FILESIZE(:,[2]) = []; > If we are on 4K aligment we need to clean the files that go to > inodes (WELL not exactly ... extended attributes! so maybe use a > lower number!) > > FILESIZE(FILESIZE<=3584) =[]; > If we are not we need to clean the 0 size files > > FILESIZE(FILESIZE==0) =[]; > Median > > FILESIZEMEDIAN = int32 (median (FILESIZE)) > Mean > > FILESIZEMEAN = int32 (mean (FILESIZE)) > Variance > > int32 (var (FILESIZE)) > iqr interquartile range, i.e., the difference between the upper and > lower quartile, of the input data. > > int32 (iqr (FILESIZE)) > Standard deviation > > > For some FS with lots of files you might need a rather powerful > machine to run the calculations on octave, I never hit anything > could not manage on a 64GB RAM Power box. Most of the times it is > enough with my laptop. 
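If octave is not to hand, the mean/median step can be approximated with a plain pipeline over the same CSV; a sketch only, and for an even count the median is taken as the lower middle value:

# strip the trailing commas, sort numerically (sort spills to disk, so memory use stays modest)
tr -d ',' < find_ls_files.out.csv | sort -n > /tmp/sizes.sorted
n=$(wc -l < /tmp/sizes.sorted)
awk '{ s += $1 } END { if (NR) printf "count=%d  mean=%.0f\n", NR, s/NR }' /tmp/sizes.sorted
echo "median=$(sed -n "$(( (n + 1) / 2 ))p" /tmp/sizes.sorted)"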
> > > > -- > Yst?v?llisin terveisin / Kind regards / Saludos cordiales / Salutations > > Luis Bolinches > Lab Services > http://www-03.ibm.com/systems/services/labservices/ > > IBM Laajalahdentie 23 (main Entrance) Helsinki, 00330 Finland > Phone: +358 503112585 > > "If you continually give you will continually have." Anonymous > > > ----- Original message ----- > From: Stef Coene > > Sent by: gpfsug-discuss-bounces at spectrumscale.org > To: gpfsug main discussion list > > Cc: > Subject: Re: [gpfsug-discuss] Blocksize > Date: Thu, Sep 22, 2016 10:30 PM > > On 09/22/2016 09:07 PM, J. Eric Wonderley wrote: > > It defaults to 4k: > > mmlsfs testbs8M -i > > flag value description > > ------------------- ------------------------ > > ----------------------------------- > > -i 4096 Inode size in bytes > > > > I think you can make as small as 512b. Gpfs will store very small > > files in the inode. > > > > Typically you want your average file size to be your blocksize and your > > filesystem has one blocksize and one inodesize. > > The files are not small, but around 20 MB on average. > So I calculated with IBM that a 1 MB or 2 MB block size is best. > > But I'm not sure if it's better to use a smaller block size for the > metadata. > > The file system is not that large (400 TB) and will hold backup data > from CommVault. > > > Stef > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > > > Ellei edell? ole toisin mainittu: / Unless stated otherwise above: > Oy IBM Finland Ab > PL 265, 00101 Helsinki, Finland > Business ID, Y-tunnus: 0195876-3 > Registered in Finland > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From Kevin.Buterbaugh at Vanderbilt.Edu Wed Sep 28 14:45:14 2016 From: Kevin.Buterbaugh at Vanderbilt.Edu (Buterbaugh, Kevin L) Date: Wed, 28 Sep 2016 13:45:14 +0000 Subject: [gpfsug-discuss] Blocksize - file size distribution In-Reply-To: References: Message-ID: Greg, Not saying this is the right way to go, but I rolled my own. I wrote a very simple Perl script that essentially does the Perl equivalent of a find on my GPFS filesystems, then stat?s files and directories and writes the output to a text file. I run that one overnight or on the weekends. Takes about 6 hours to run across our 3 GPFS filesystems with metadata on SSDs. Then I?ve written a couple of different Perl scripts to analyze that data in different ways (the idea being that collecting the data is ?expensive? ? but once you?ve got it it?s cheap to analyze it in different ways). But the one I?ve been using for this project just breaks down the number of files and directories by size and age and produces a table. Rather than try to describe this, here?s sample output: For input file: gpfsFileInfo_20160915.txt <1 day | <1 wk | <1 mo | <2 mo | <3 mo | <4 mo | <5 mo | <6 mo | <1 yr | >1 year | Total Files <1 KB 29538 111364 458260 634398 150199 305715 4388443 93733 966618 3499535 10637803 <2 KB 9875 20580 119414 295167 35961 67761 80462 33688 269595 851641 1784144 <4 KB 9212 45282 168678 496796 27771 23887 105135 23161 259242 1163327 2322491 <8 KB 4992 29284 105836 303349 28341 20346 246430 28394 216061 1148459 2131492 <16 KB 3987 18391 92492 218639 20513 19698 675097 30976 190691 851533 2122017 <32 KB 4844 12479 50235 265222 24830 18789 1058433 18030 196729 1066287 2715878 <64 KB 6358 24259 29474 222134 17493 10744 1381445 11358 240528 1123540 3067333 <128 KB 6531 59107 206269 186213 71823 114235 1008724 36722 186357 845921 2721902 <256 KB 1995 17638 19355 436611 8505 7554 3582738 7519 249510 744885 5076310 <512 KB 20645 12401 24700 111463 5659 22132 1121269 10774 273010 725155 2327208 <1 MB 2681 6482 37447 58459 6998 14945 305108 5857 160360 386152 984489 <4 MB 4554 84551 23320 100407 6818 32833 129758 22774 210935 528458 1144408 <1 GB 56652 33538 99667 87778 24313 68372 118928 42554 251528 916493 1699823 <10 GB 1245 2482 4524 3184 1043 1794 2733 1694 8731 20462 47892 <100 GB 47 230 470 265 92 198 172 122 1276 2061 4933 >100 GB 2 3 12 1 14 4 5 1 37 165 244 Total TB: 6.49 13.22 30.56 18.00 10.22 15.69 19.87 12.48 73.47 187.44 Grand Total: 387.46 TB Everything other than the total space lines at the bottom are counts of number of files meeting that criteria. I?ve got another variation on the same script that we used when we were trying to determine how many old files we have and therefore how much data was an excellent candidate for moving to a slower, less expensive ?capacity? pool. I?m not sure how useful my tools would be to others ? I?m certainly not a professional programmer by any stretch of the imagination (and yes, I do hear those of you who are saying, ?Yeah, he?s barely a professional SysAdmin!? ). But others of you have been so helpful to me ? I?d like to try in some small way to help someone else. Kevin On Sep 28, 2016, at 5:56 AM, Oesterlin, Robert > wrote: /usr/lpp/mmfs/samples/debugtools/filehist Look at the README in that directory. 
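The same raw size/age data can also be gathered with the policy engine in minutes rather than hours, which is the LIST ... SHOW approach suggested further down the thread. A sketch, assuming a filesystem device named gpfs0; the list-file column layout varies a little between releases, so spot-check a line of the output before trusting the awk:

cat <<'EOF' > /tmp/sizes.pol
/* one record per file: size in bytes and age in days; run with -I defer so nothing is moved */
RULE 'allfiles' LIST 'sz'
  SHOW( VARCHAR(FILE_SIZE) || ' ' || VARCHAR(DAYS(CURRENT_TIMESTAMP) - DAYS(ACCESS_TIME)) )
EOF
mmapplypolicy gpfs0 -P /tmp/sizes.pol -I defer -f /tmp/scan

# records look roughly like: inode gen snapid size age -- /path/to/file; adjust $4 if yours differ
awk '{ sz = $4 + 0; b = 1024; while (sz > b) b *= 2; cnt[b]++ }
     END { for (x in cnt) print x, cnt[x] }' /tmp/scan.list.sz | sort -n

The awk at the end just buckets sizes into powers of two; adding the age dimension to reproduce the full table above is a straightforward extension.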
Bob Oesterlin Sr Storage Engineer, Nuance HPC Grid From: > on behalf of "Greg.Lehmann at csiro.au" > Reply-To: gpfsug main discussion list > Date: Wednesday, September 28, 2016 at 2:40 AM To: "gpfsug-discuss at spectrumscale.org" > Subject: [EXTERNAL] Re: [gpfsug-discuss] Blocksize I am wondering what people use to produce a file size distribution report for their filesystems. Has everyone rolled their own or is there some goto app to use. Cheers, Greg From: gpfsug-discuss-bounces at spectrumscale.org [mailto:gpfsug-discuss-bounces at spectrumscale.org] On Behalf Of Buterbaugh, Kevin L Sent: Wednesday, 28 September 2016 7:21 AM To: gpfsug main discussion list > Subject: Re: [gpfsug-discuss] Blocksize Hi All, Again, my thanks to all who responded to my last post. Let me begin by stating something I unintentionally omitted in my last post ? we already use SSDs for our metadata. Which leads me to yet another question ? of my three filesystems, two (/home and /scratch) are much older (created in 2010) and therefore currently have a 512 byte inode size. /data is newer and has a 4K inode size. Now if I combine /scratch and /data into one filesystem with a 4K inode size, the amount of space used by all the inodes coming over from /scratch is going to grow by a factor of eight unless I?m horribly confused. And I would assume I need to count the amount of space taken up by allocated inodes, not just used inodes. Therefore ? how much space my metadata takes up just grew significantly in importance since: 1) metadata is on very expensive enterprise class, vendor certified SSDs, 2) we use RAID 1 mirrors of those SSDs, and 3) we have metadata replication set to two. Some of the information presented by Sven and Yuri seems to contradict each other in regards to how much space inodes take up ? or I?m misunderstanding one or both of them! Leaving aside replication, if I use a 256K block size for my metadata and I use 4K inodes, are those inodes going to take up 4K each or are they going to take up 8K each (1/32nd of a 256K block)? By the way, I do have a file size / file age spreadsheet for each of my filesystems (which I would be willing to share with interested parties) and while I was not surprised to learn that I have over 10 million sub-1K files on /home, I was a bit surprised to find that I have almost as many sub-1K files on /scratch (and a few million more on /data). So there?s a huge potential win in having those files in the inode on SSD as opposed to on spinning disk, but there?s also a huge potential $$$ cost. Thanks again ? I hope others are gaining useful information from this thread. I sure am! Kevin On Sep 27, 2016, at 1:26 PM, Yuri L Volobuev > wrote: > 1) Let?s assume that our overarching goal in configuring the block > size for metadata is performance from the user perspective ? i.e. > how fast is an ?ls -l? on my directory? Space savings aren?t > important, and how long policy scans or other ?administrative? type > tasks take is not nearly as important as that directory listing. > Does that change the recommended metadata block size? The performance challenges for the "ls -l" scenario are quite different from the policy scan scenario, so the same rules do not necessarily apply. During "ls -l" the code has to read inodes one by one (there's some prefetching going on, to take the edge off for the actual 'ls' thread, but prefetching is still one inode at a time). 
> > The file system is not that large (400 TB) and will hold backup data > from CommVault. > > > Stef > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > > > Ellei edell? ole toisin mainittu: / Unless stated otherwise above: > Oy IBM Finland Ab > PL 265, 00101 Helsinki, Finland > Business ID, Y-tunnus: 0195876-3 > Registered in Finland > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss -------------- next part -------------- An HTML attachment was scrubbed... URL: From makaplan at us.ibm.com Wed Sep 28 15:34:05 2016 From: makaplan at us.ibm.com (Marc A Kaplan) Date: Wed, 28 Sep 2016 10:34:05 -0400 Subject: [gpfsug-discuss] Blocksize - file size distribution In-Reply-To: References: Message-ID: Consider using samples/ilm/mmfind (or mmapplypolicy with a LIST ... SHOW rule) to gather the stats much faster. Should be minutes, not hours. -------------- next part -------------- An HTML attachment was scrubbed... URL: From makaplan at us.ibm.com Wed Sep 28 16:23:12 2016 From: makaplan at us.ibm.com (Marc A Kaplan) Date: Wed, 28 Sep 2016 11:23:12 -0400 Subject: [gpfsug-discuss] Blocksize In-Reply-To: <4BDC0E5A-176A-48CC-8DE6-93C7B5A3F138@vanderbilt.edu> References: <17781503-26B3-4448-B7B9-1EE27ABE6D1F@ulmer.org><13272A1B-425D-4DDD-A931-490604F92D61@ulmer.org><6C51B04A-9097-4598-8B4A-C484A3D98EE2@vanderbilt.edu> <4BDC0E5A-176A-48CC-8DE6-93C7B5A3F138@vanderbilt.edu> Message-ID: OKAY, I'll say it again. inodes are PACKED into a single inode file. So a 4KB inode takes 4KB, REGARDLESS of metadata blocksize. There is no wasted space. (Of course if you have metadata replication = 2, then yes, double that. And yes, there overhead for indirect blocks (indices), allocation maps, etc, etc.) And your choice is not just 512 or 4096. 
Maybe 1KB or 2KB is a good choice for your data distribution, to optimize packing of data and/or directories into inodes... Hmmm... I don't know why the doc leaves out 2048, perhaps a typo... mmcrfs x2K -i 2048 [root at n2 charts]# mmlsfs x2K -i flag value description ------------------- ------------------------ ----------------------------------- -i 2048 Inode size in bytes Works for me! -------------- next part -------------- An HTML attachment was scrubbed... URL: From makaplan at us.ibm.com Wed Sep 28 16:33:29 2016 From: makaplan at us.ibm.com (Marc A Kaplan) Date: Wed, 28 Sep 2016 11:33:29 -0400 Subject: [gpfsug-discuss] Proposal for dynamic heat assisted file tiering In-Reply-To: References: Message-ID: Suppose, we could "dynamically" change the pool assignment of a file. How/when would you have us do that? When will that generate unnecessary, "wasteful" IOPs? How do we know if/when/how often you will access a file in the future? This is similar to other classical caching policies, but there the choice is usually just which pages to flush from the cache when we need space ... The usual compromise is "LRU" but maybe some systems allow hints. When there are multiple pools, it seems more complicated, more degrees of freedom ... Would you be willing and able to write some new policy rules to provide directions to Spectrum Scale for dynamic tiering? What would that look like? Would it be worth the time and effort over what we have now? -------------- next part -------------- An HTML attachment was scrubbed... URL: From Robert.Oesterlin at nuance.com Wed Sep 28 19:13:35 2016 From: Robert.Oesterlin at nuance.com (Oesterlin, Robert) Date: Wed, 28 Sep 2016 18:13:35 +0000 Subject: [gpfsug-discuss] Biggest file that will fit inside an inode? Message-ID: <0D55AB1D-DB9D-45CF-AB27-157CDA1172D9@nuance.com> What the largest file that will fit inside a 1K, 2K, or 4K inode? Bob Oesterlin Sr Storage Engineer, Nuance HPC Grid -------------- next part -------------- An HTML attachment was scrubbed... URL: From ewahl at osc.edu Wed Sep 28 21:18:55 2016 From: ewahl at osc.edu (Edward Wahl) Date: Wed, 28 Sep 2016 16:18:55 -0400 Subject: [gpfsug-discuss] Blocksize - file size distribution In-Reply-To: References: Message-ID: <20160928161855.1df32434@osc.edu> On Wed, 28 Sep 2016 10:34:05 -0400 Marc A Kaplan wrote: > Consider using samples/ilm/mmfind (or mmapplypolicy with a LIST ... > SHOW rule) to gather the stats much faster. Should be minutes, not > hours. > I'll agree with the policy engine. Runs like a beast if you tune it a little for nodes and threads. Only takes a couple of minutes to collect info on over a hundred million files. Show where the data is now by pool and sort it by age with queries? quick hack up example. you could sort the mess on the front end fairly quickly. (use fileset or pool, etc as your storage needs) RULE '2yrold_files' LIST '2yrold_filelist.txt' SHOW (varchar(file_size) || ' ' || varchar(USER_ID) || ' ' || varchar(POOL_NAME)) WHERE DAYS(CURRENT_TIMESTAMP) - DAYS(ACCESS_TIME) >= 730 AND DAYS(CURRENT_TIMESTAMP) - DAYS(ACCESS_TIME) < 1095 don't forget to run the engine with the -I defer for this kind of list/show policy. 
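On the dynamic-tiering question raised above, the closest thing available today is a pair of MIGRATE rules driven by FILE_HEAT and run periodically. A sketch only: the pool names 'fast' and 'capacity', the thresholds, and the heat-tracking values are placeholders, not recommendations:

# file heat accounting must be enabled before FILE_HEAT means anything (values here are guesses)
mmchconfig fileHeatPeriodMinutes=1440,fileHeatLossPercent=10

cat <<'EOF' > /tmp/heat-tier.pol
/* pull the hottest files up until the fast pool is about 90% full ... */
RULE 'promote_hot' MIGRATE FROM POOL 'capacity' WEIGHT(FILE_HEAT) TO POOL 'fast' LIMIT(90)
/* ... and push the least recently used back down once it crosses 90%, stopping at 75% */
RULE 'demote_cold' MIGRATE FROM POOL 'fast' THRESHOLD(90,75)
     WEIGHT(DAYS(CURRENT_TIMESTAMP) - DAYS(ACCESS_TIME)) TO POOL 'capacity'
EOF

# run from cron as often as the metadata scan cost allows, e.g. nightly
mmapplypolicy gpfs0 -P /tmp/heat-tier.pol -I yes

This is still batch repacking rather than true dynamic tiering, exactly the limitation Eric described earlier in the thread, but run frequently enough it gets reasonably close.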
Ed -- Ed Wahl Ohio Supercomputer Center 614-292-9302 From christof.schmitt at us.ibm.com Wed Sep 28 21:33:45 2016 From: christof.schmitt at us.ibm.com (Christof Schmitt) Date: Wed, 28 Sep 2016 13:33:45 -0700 Subject: [gpfsug-discuss] Samba via CES In-Reply-To: <20160927214257.z4ezmssnpwhmm4rk@ics.muni.cz> References: <20160927193815.kpppiy76wpudg6cj@ics.muni.cz> <20160927214257.z4ezmssnpwhmm4rk@ics.muni.cz> Message-ID: The client has to reconnect, open the file again and reissue request that have not been completed. Without persistent handles, the main risk is that another client can step in and access the same file in the meantime. With persistent handles, access from other clients would be prevented for a defined amount of time. Christof Schmitt || IBM || Spectrum Scale Development || Tucson, AZ christof.schmitt at us.ibm.com || +1-520-799-2469 (T/L: 321-2469) From: Lukas Hejtmanek To: gpfsug main discussion list Date: 09/27/2016 02:43 PM Subject: Re: [gpfsug-discuss] Samba via CES Sent by: gpfsug-discuss-bounces at spectrumscale.org On Tue, Sep 27, 2016 at 02:36:37PM -0700, Christof Schmitt wrote: > When a CES node fails, protocol clients have to reconnect to one of the > remaining nodes. > > Samba in CES does not support persistent handles. This is indicated in the > documentation: > > http://www.ibm.com/support/knowledgecenter/en/STXKQY_4.2.1/com.ibm.spectrum.scale.v4r21.doc/bl1adm_smbexportlimits.htm#bl1adm_smbexportlimits > > "Only mandatory SMB3 protocol features are supported. " well, but in this case, HA feature is a bit pointless as node fail results in a client failure as well as reconnect does not seem to be automatic if there is on going traffic.. more precisely reconnect is automatic but without persistent handles, the client receives write protect error immediately. -- Luk?? Hejtm?nek _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss From bbanister at jumptrading.com Wed Sep 28 21:56:47 2016 From: bbanister at jumptrading.com (Bryan Banister) Date: Wed, 28 Sep 2016 20:56:47 +0000 Subject: [gpfsug-discuss] Biggest file that will fit inside an inode? In-Reply-To: <0D55AB1D-DB9D-45CF-AB27-157CDA1172D9@nuance.com> References: <0D55AB1D-DB9D-45CF-AB27-157CDA1172D9@nuance.com> Message-ID: <21BC488F0AEA2245B2C3E83FC0B33DBB0633CA80@CHI-EXCHANGEW1.w2k.jumptrading.com> I think the guideline for 4K inodes is roughly 3.5KB depending on use of extended attributes, -Bryan From: gpfsug-discuss-bounces at spectrumscale.org [mailto:gpfsug-discuss-bounces at spectrumscale.org] On Behalf Of Oesterlin, Robert Sent: Wednesday, September 28, 2016 1:14 PM To: gpfsug main discussion list Subject: [gpfsug-discuss] Biggest file that will fit inside an inode? What the largest file that will fit inside a 1K, 2K, or 4K inode? Bob Oesterlin Sr Storage Engineer, Nuance HPC Grid ________________________________ Note: This email is for the confidential use of the named addressee(s) only and may contain proprietary, confidential or privileged information. If you are not the intended recipient, you are hereby notified that any review, dissemination or copying of this email is strictly prohibited, and to please notify the sender immediately and destroy this email and any attachments. Email transmission cannot be guaranteed to be secure or error-free. The Company, therefore, does not make any guarantees as to the completeness or accuracy of this email or any attachments. 
This email is for informational purposes only and does not constitute a recommendation, offer, request or solicitation of any kind to buy, sell, subscribe, redeem or perform any type of transaction of a financial product. -------------- next part -------------- An HTML attachment was scrubbed... URL: From xhejtman at ics.muni.cz Wed Sep 28 23:03:36 2016 From: xhejtman at ics.muni.cz (Lukas Hejtmanek) Date: Thu, 29 Sep 2016 00:03:36 +0200 Subject: [gpfsug-discuss] Samba via CES In-Reply-To: References: <20160927193815.kpppiy76wpudg6cj@ics.muni.cz> <20160927214257.z4ezmssnpwhmm4rk@ics.muni.cz> Message-ID: <20160928220336.d5vdwp7fejbj2bzf@ics.muni.cz> On Wed, Sep 28, 2016 at 01:33:45PM -0700, Christof Schmitt wrote: > The client has to reconnect, open the file again and reissue request that > have not been completed. Without persistent handles, the main risk is that > another client can step in and access the same file in the meantime. With > persistent handles, access from other clients would be prevented for a > defined amount of time. well, I guess I cannot reconfigure the client so that reissuing request is done by OS and not rised up to the user? E.g., if user runs video encoding directly to Samba share and encoding runs for several hours, reissuing request, i.e., restart encoding, is not exactly what user accepts. -- Luk?? Hejtm?nek From abeattie at au1.ibm.com Wed Sep 28 23:25:01 2016 From: abeattie at au1.ibm.com (Andrew Beattie) Date: Wed, 28 Sep 2016 22:25:01 +0000 Subject: [gpfsug-discuss] Samba via CES In-Reply-To: <20160928220336.d5vdwp7fejbj2bzf@ics.muni.cz> References: <20160928220336.d5vdwp7fejbj2bzf@ics.muni.cz>, <20160927193815.kpppiy76wpudg6cj@ics.muni.cz><20160927214257.z4ezmssnpwhmm4rk@ics.muni.cz> Message-ID: An HTML attachment was scrubbed... URL: From Greg.Lehmann at csiro.au Wed Sep 28 23:49:31 2016 From: Greg.Lehmann at csiro.au (Greg.Lehmann at csiro.au) Date: Wed, 28 Sep 2016 22:49:31 +0000 Subject: [gpfsug-discuss] Blocksize - file size distribution In-Reply-To: References: Message-ID: <2ed56fe8c9c34eb5a1da25800b2951e0@exch1-cdc.nexus.csiro.au> Kevin, Thanks for the offer of help. I am capable of writing my own, but it looks like the best approach is to use mmapplypolicy, something I had not thought of. This is precisely the reason I asked what looks like a silly question. You don?t know what you don?t know! The quality of content on this list has been exceptional of late! Cheers, Greg From: gpfsug-discuss-bounces at spectrumscale.org [mailto:gpfsug-discuss-bounces at spectrumscale.org] On Behalf Of Buterbaugh, Kevin L Sent: Wednesday, 28 September 2016 11:45 PM To: gpfsug main discussion list Subject: Re: [gpfsug-discuss] Blocksize - file size distribution Greg, Not saying this is the right way to go, but I rolled my own. I wrote a very simple Perl script that essentially does the Perl equivalent of a find on my GPFS filesystems, then stat?s files and directories and writes the output to a text file. I run that one overnight or on the weekends. Takes about 6 hours to run across our 3 GPFS filesystems with metadata on SSDs. Then I?ve written a couple of different Perl scripts to analyze that data in different ways (the idea being that collecting the data is ?expensive? ? but once you?ve got it it?s cheap to analyze it in different ways). But the one I?ve been using for this project just breaks down the number of files and directories by size and age and produces a table. 
Rather than try to describe this, here?s sample output: For input file: gpfsFileInfo_20160915.txt <1 day | <1 wk | <1 mo | <2 mo | <3 mo | <4 mo | <5 mo | <6 mo | <1 yr | >1 year | Total Files <1 KB 29538 111364 458260 634398 150199 305715 4388443 93733 966618 3499535 10637803 <2 KB 9875 20580 119414 295167 35961 67761 80462 33688 269595 851641 1784144 <4 KB 9212 45282 168678 496796 27771 23887 105135 23161 259242 1163327 2322491 <8 KB 4992 29284 105836 303349 28341 20346 246430 28394 216061 1148459 2131492 <16 KB 3987 18391 92492 218639 20513 19698 675097 30976 190691 851533 2122017 <32 KB 4844 12479 50235 265222 24830 18789 1058433 18030 196729 1066287 2715878 <64 KB 6358 24259 29474 222134 17493 10744 1381445 11358 240528 1123540 3067333 <128 KB 6531 59107 206269 186213 71823 114235 1008724 36722 186357 845921 2721902 <256 KB 1995 17638 19355 436611 8505 7554 3582738 7519 249510 744885 5076310 <512 KB 20645 12401 24700 111463 5659 22132 1121269 10774 273010 725155 2327208 <1 MB 2681 6482 37447 58459 6998 14945 305108 5857 160360 386152 984489 <4 MB 4554 84551 23320 100407 6818 32833 129758 22774 210935 528458 1144408 <1 GB 56652 33538 99667 87778 24313 68372 118928 42554 251528 916493 1699823 <10 GB 1245 2482 4524 3184 1043 1794 2733 1694 8731 20462 47892 <100 GB 47 230 470 265 92 198 172 122 1276 2061 4933 >100 GB 2 3 12 1 14 4 5 1 37 165 244 Total TB: 6.49 13.22 30.56 18.00 10.22 15.69 19.87 12.48 73.47 187.44 Grand Total: 387.46 TB Everything other than the total space lines at the bottom are counts of number of files meeting that criteria. I?ve got another variation on the same script that we used when we were trying to determine how many old files we have and therefore how much data was an excellent candidate for moving to a slower, less expensive ?capacity? pool. I?m not sure how useful my tools would be to others ? I?m certainly not a professional programmer by any stretch of the imagination (and yes, I do hear those of you who are saying, ?Yeah, he?s barely a professional SysAdmin!? ). But others of you have been so helpful to me ? I?d like to try in some small way to help someone else. Kevin On Sep 28, 2016, at 5:56 AM, Oesterlin, Robert > wrote: /usr/lpp/mmfs/samples/debugtools/filehist Look at the README in that directory. Bob Oesterlin Sr Storage Engineer, Nuance HPC Grid From: > on behalf of "Greg.Lehmann at csiro.au" > Reply-To: gpfsug main discussion list > Date: Wednesday, September 28, 2016 at 2:40 AM To: "gpfsug-discuss at spectrumscale.org" > Subject: [EXTERNAL] Re: [gpfsug-discuss] Blocksize I am wondering what people use to produce a file size distribution report for their filesystems. Has everyone rolled their own or is there some goto app to use. Cheers, Greg From: gpfsug-discuss-bounces at spectrumscale.org [mailto:gpfsug-discuss-bounces at spectrumscale.org] On Behalf Of Buterbaugh, Kevin L Sent: Wednesday, 28 September 2016 7:21 AM To: gpfsug main discussion list > Subject: Re: [gpfsug-discuss] Blocksize Hi All, Again, my thanks to all who responded to my last post. Let me begin by stating something I unintentionally omitted in my last post ? we already use SSDs for our metadata. Which leads me to yet another question ? of my three filesystems, two (/home and /scratch) are much older (created in 2010) and therefore currently have a 512 byte inode size. /data is newer and has a 4K inode size. 
Now if I combine /scratch and /data into one filesystem with a 4K inode size, the amount of space used by all the inodes coming over from /scratch is going to grow by a factor of eight unless I?m horribly confused. And I would assume I need to count the amount of space taken up by allocated inodes, not just used inodes. Therefore ? how much space my metadata takes up just grew significantly in importance since: 1) metadata is on very expensive enterprise class, vendor certified SSDs, 2) we use RAID 1 mirrors of those SSDs, and 3) we have metadata replication set to two. Some of the information presented by Sven and Yuri seems to contradict each other in regards to how much space inodes take up ? or I?m misunderstanding one or both of them! Leaving aside replication, if I use a 256K block size for my metadata and I use 4K inodes, are those inodes going to take up 4K each or are they going to take up 8K each (1/32nd of a 256K block)? By the way, I do have a file size / file age spreadsheet for each of my filesystems (which I would be willing to share with interested parties) and while I was not surprised to learn that I have over 10 million sub-1K files on /home, I was a bit surprised to find that I have almost as many sub-1K files on /scratch (and a few million more on /data). So there?s a huge potential win in having those files in the inode on SSD as opposed to on spinning disk, but there?s also a huge potential $$$ cost. Thanks again ? I hope others are gaining useful information from this thread. I sure am! Kevin On Sep 27, 2016, at 1:26 PM, Yuri L Volobuev > wrote: > 1) Let?s assume that our overarching goal in configuring the block > size for metadata is performance from the user perspective ? i.e. > how fast is an ?ls -l? on my directory? Space savings aren?t > important, and how long policy scans or other ?administrative? type > tasks take is not nearly as important as that directory listing. > Does that change the recommended metadata block size? The performance challenges for the "ls -l" scenario are quite different from the policy scan scenario, so the same rules do not necessarily apply. During "ls -l" the code has to read inodes one by one (there's some prefetching going on, to take the edge off for the actual 'ls' thread, but prefetching is still one inode at a time). Metadata block size doesn't really come into the picture in this case, but inode size can be important -- depending on the storage performance characteristics. Does the storage you use for metadata exhibit a meaningfully different latency for 4K random reads vs 512 byte random reads? In my personal experience, on any modern storage device the difference is non-existent; in fact many devices (like all flash-based storage) use 4K native physical block size, and merely emulate 512 byte "sectors", so there's no way to read less than 4K. So from the inode read latency point of view 4K vs 512B is most likely a wash, but then 4K inodes can help improve performance of other operations, e.g. readdir of a small directory which fits entirely into the inode. If you use xattrs (e.g. as a side effect of using HSM), 4K inodes definitely help, but allowing xattrs to be stored in the inode. Policy scans reads inodes in full blocks, and there both metadata block size and inode size matter. Larger blocks could improve the inode read performance, while larger inodes mean that the same number of blocks hold fewer inodes and thus more blocks need to be read. 
So the policy inode scan performance is benefited by larger metadata block size and smaller inodes. However, policy scans also have to perform a directory traversal step, and that step tends to dominate the runtime of the overall run, and using larger inodes actually helps to speed up traversal of smaller directories. So whether larger inodes help or hurt the policy scan performance depends, yet again, on your file system composition. Overall, I believe that with all angles considered, larger inodes help with performance, and that was one of the considerations for making 4K inodes the default in V4.2+ versions. > 2) Let?s assume we have 3 filesystems, /home, /scratch (traditional > HPC use for those two) and /data (project space). Our storage > arrays are 24-bay units with two 8+2P RAID 6 LUNs, one RAID 1 > mirror, and two hot spare drives. The RAID 1 mirrors are for /home, > the RAID 6 LUNs are for /scratch or /data. /home has tons of small > files - so small that a 64K block size is currently used. /scratch > and /data have a mixture, but a 1 MB block size is the ?sweet spot? there. > > If you could ?start all over? with the same hardware being the only > restriction, would you: > > a) merge /scratch and /data into one filesystem but keep /home > separate since the LUN sizes are so very different, or > b) merge all three into one filesystem and use storage pools so that > /home is just a separate pool within the one filesystem? And if you > chose this option would you assign different block sizes to the pools? It's not possible to have different block sizes for different data pools. We are very aware that many people would like to be able to do just that, but this is counter to where the product is going. Supporting different block sizes for different pools is actually pretty hard: it's tricky to describe a large file that has some blocks in poolA and some in poolB where poolB has a different block size (perhaps during a migration) with the existing inode/indirect block format where each disk address pointer addresses a block of fixed size. With some effort, and some changes to how block addressing works, it would be possible to implement the support for this. However, as I mentioned in another post in this thread, we don't really want to glorify manual block size selection any further, we want to move away from it, by addressing the reasons that drive different block size selection today (like disk space utilization and performance). I'd recommend calculating a file size distribution histogram for your file systems. You may, for example, discover that 80% of the small files you have in /home would fit into 4K inodes, and then the storage space efficiency gains for the remaining 20% don't justify the complexity of managing an extra file system with a small block size. We don't recommend using block sizes smaller than 256K, because smaller block size is not good for disk space allocation code efficiency. It's a quadratic dependency: with a smaller block size, one block worth of the block allocation map covers that much less disk space, because each bit in the map covers fewer disk sectors, and fewer bits fit into a block. This means having to create a lot more block allocation map segments than what is needed for an ample level of parallelism. This hurts performance of many block allocation-related operations. I don't see a reason for /scratch and /data to be separate file systems, aside from perhaps failure domain considerations. 
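A rough way to get that small-file fraction is to reuse the mmfind -ls approach Luis posted earlier in this thread (in that output the file size is column 7; 3584 bytes is the usual rule of thumb for what still fits in a 4K inode once some room is left for extended attributes; the paths below are just placeholders, and mmfind is assumed to have been built as described in Luis's note):

  cd /usr/lpp/mmfs/samples/ilm
  ./mmfind /home -type f -ls > /tmp/home_ls.out
  awk '{ if ($7 <= 3584) small++; total++ } END { printf("%d of %d files (%.1f%%) would fit in a 4K inode\n", small, total, 100*small/total) }' /tmp/home_ls.out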
yuri > On Sep 26, 2016, at 2:29 PM, Yuri L Volobuev > wrote: > > I would put the net summary this way: in GPFS, the "Goldilocks zone" > for metadata block size is 256K - 1M. If one plans to create a new > file system using GPFS V4.2+, 1M is a sound choice. > > In an ideal world, block size choice shouldn't really be a choice. > It's a low-level implementation detail that one day should go the > way of the manual ignition timing adjustment -- something that used > to be necessary in the olden days, and something that select > enthusiasts like to tweak to this day, but something that's > irrelevant for the overwhelming majority of the folks who just want > the engine to run. There's work being done in that general direction > in GPFS, but we aren't there yet. > > yuri > > Stephen Ulmer ---09/26/2016 12:02:25 PM---Now I?ve got > anther question? which I?ll let bake for a while. Okay, to (poorly) summarize: > > From: Stephen Ulmer > > To: gpfsug main discussion list >, > Date: 09/26/2016 12:02 PM > Subject: Re: [gpfsug-discuss] Blocksize > Sent by: gpfsug-discuss-bounces at spectrumscale.org > > > > Now I?ve got anther question? which I?ll let bake for a while. > > Okay, to (poorly) summarize: > There are items OTHER THAN INODES stored as metadata in GPFS. > These items have a VARIETY OF SIZES, but are packed in such a way > that we should just not worry about wasted space unless we pick a > LARGE metadata block size ? or if we don?t pick a ?reasonable? > metadata block size after picking a ?large? file system block size > that applies to both. > Performance is hard, and the gain from calculating exactly the best > metadata block size is much smaller than performance gains attained > through code optimization. > If we were to try and calculate the appropriate metadata block size > we would likely be wrong anyway, since none of us get our data at > the idealized physics shop that sells massless rulers and > frictionless pulleys. > We should probably all use a metadata block size around 1MB. Nobody > has said this outright, but it?s been the example as the ?good? size > at least three times in this thread. > Under no circumstances should we do what many of us would have done > and pick 128K, which made sense based on all of our previous > education that is no longer applicable. > > Did I miss anything? :) > > Liberty, > > -- > Stephen > > On Sep 26, 2016, at 2:18 PM, Yuri L Volobuev > wrote: > It's important to understand the differences between different > metadata types, in particular where it comes to space allocation. > > System metadata files (inode file, inode and block allocation maps, > ACL file, fileset metadata file, EA file in older versions) are > allocated at well-defined moments (file system format, new storage > pool creation in the case of block allocation map, etc), and those > contain multiple records packed into a single block. From the block > allocator point of view, the individual metadata record size is > invisible, only larger blocks get actually allocated, and space > usage efficiency generally isn't an issue. > > For user metadata (indirect blocks, directory blocks, EA overflow > blocks) the situation is different. Those get allocated as the need > arises, generally one at a time. So the size of an individual > metadata structure matters, a lot. The smallest unit of allocation > in GPFS is a subblock (1/32nd of a block). If an IB or a directory > block is smaller than a subblock, the unused space in the subblock > is wasted. 
So if one chooses to use, say, 16 MiB block size for > metadata, the smallest unit of space that can be allocated is 512 > KiB. If one chooses 1 MiB block size, the smallest allocation unit > is 32 KiB. IBs are generally 16 KiB or 32 KiB in size (32 KiB with > any reasonable data block size); directory blocks used to be limited > to 32 KiB, but in the current code can be as large as 256 KiB. As > one can observe, using 16 MiB metadata block size would lead to a > considerable amount of wasted space for IBs and large directories > (small directories can live in inodes). On the other hand, with 1 > MiB block size, there'll be no wasted metadata space. Does any of > this actually make a practical difference? That depends on the file > system composition, namely the number of IBs (which is a function of > the number of large files) and larger directories. Calculating this > scientifically can be pretty involved, and really should be the job > of a tool that ought to exist, but doesn't (yet). A more practical > approach is doing a ballpark estimate using local file counts and > typical fractions of large files and directories, using statistics > available from published papers. > > The performance implications of a given metadata block size choice > is a subject of nearly infinite depth, and this question ultimately > can only be answered by doing experiments with a specific workload > on specific hardware. The metadata space utilization efficiency is > something that can be answered conclusively though. > > yuri > > "Buterbaugh, Kevin L" ---09/24/2016 07:19:09 AM---Hi > Sven, I am confused by your statement that the metadata block size > should be 1 MB and am very int > > From: "Buterbaugh, Kevin L" > > To: gpfsug main discussion list >, > Date: 09/24/2016 07:19 AM > Subject: Re: [gpfsug-discuss] Blocksize > Sent by: gpfsug-discuss-bounces at spectrumscale.org > > > > > Hi Sven, > > I am confused by your statement that the metadata block size should > be 1 MB and am very interested in learning the rationale behind this > as I am currently looking at all aspects of our current GPFS > configuration and the possibility of making major changes. > > If you have a filesystem with only metadataOnly disks in the system > pool and the default size of an inode is 4K (which we would do, > since we have recently discovered that even on our scratch > filesystem we have a bazillion files that are 4K or smaller and > could therefore have their data stored in the inode, right?), then > why would you set the metadata block size to anything larger than > 128K when a sub-block is 1/32nd of a block? I.e., with a 1 MB block > size for metadata wouldn?t you be wasting a massive amount of space? > > What am I missing / confused about there? > > Oh, and here?s a related question ? let?s just say I have the above > configuration ? my system pool is metadata only and is on SSD?s. > Then I have two other dataOnly pools that are spinning disk. One is > for ?regular? access and the other is the ?capacity? pool ? i.e. a > pool of slower storage where we move files with large access times. > I have a policy that says something like ?move all files with an > access time > 6 months to the capacity pool.? Of those bazillion > files less than 4K in size that are fitting in the inode currently, > probably half a bazillion () of them would be subject to that > rule. Will they get moved to the spinning disk capacity pool or will > they stay in the inode?? > > Thanks! 
This is a very timely and interesting discussion for me as well... > > Kevin > On Sep 23, 2016, at 4:35 PM, Sven Oehme > wrote: > your metadata block size these days should be 1 MB and there are > only very few workloads for which you should run with a filesystem > blocksize below 1 MB. so if you don't know exactly what to pick, 1 > MB is a good starting point. > the general rule still applies that your filesystem blocksize > (metadata or data pool) should match your raid controller (or GNR > vdisk) stripe size of the particular pool. > > so if you use a 128k strip size(defaut in many midrange storage > controllers) in a 8+2p raid array, your stripe or track size is 1 MB > and therefore the blocksize of this pool should be 1 MB. i see many > customers in the field using 1MB or even smaller blocksize on RAID > stripes of 2 MB or above and your performance will be significant > impacted by that. > > Sven > > ------------------------------------------ > Sven Oehme > Scalable Storage Research > email: oehmes at us.ibm.com > Phone: +1 (408) 824-8904 > IBM Almaden Research Lab > ------------------------------------------ > > Stephen Ulmer ---09/23/2016 12:16:34 PM---Not to be too > pedantic, but I believe the the subblock size is 1/32 of the block > size (which strengt > > From: Stephen Ulmer > > To: gpfsug main discussion list > > Date: 09/23/2016 12:16 PM > Subject: Re: [gpfsug-discuss] Blocksize > Sent by: gpfsug-discuss-bounces at spectrumscale.org > > > > > Not to be too pedantic, but I believe the the subblock size is 1/32 > of the block size (which strengthens Luis?s arguments below). > > I thought the the original question was NOT about inode size, but > about metadata block size. You can specify that the system pool have > a different block size from the rest of the filesystem, providing > that it ONLY holds metadata (?metadata-block-size option to mmcrfs). > > So with 4K inodes (which should be used for all new filesystems > without some counter-indication), I would think that we?d want to > use a metadata block size of 4K*32=128K. This is independent of the > regular block size, which you can calculate based on the workload if > you?re lucky. > > There could be a great reason NOT to use 128K metadata block size, > but I don?t know what it is. I?d be happy to be corrected about this > if it?s out of whack. > > -- > Stephen > On Sep 22, 2016, at 3:37 PM, Luis Bolinches > wrote: > > Hi > > My 2 cents. > > Leave at least 4K inodes, then you get massive improvement on small > files (less 3.5K minus whatever you use on xattr) > > About blocksize for data, unless you have actual data that suggest > that you will actually benefit from smaller than 1MB block, leave > there. GPFS uses sublocks where 1/16th of the BS can be allocated to > different files, so the "waste" is much less than you think on 1MB > and you get the throughput and less structures of much more data blocks. > > No warranty at all but I try to do this when the BS talk comes in: > (might need some clean up it could not be last note but you get the idea) > > POSIX > find . 
-type f -name '*' -exec ls -l {} \; > find_ls_files.out > GPFS > cd /usr/lpp/mmfs/samples/ilm > gcc mmfindUtil_processOutputFile.c -o mmfindUtil_processOutputFile > ./mmfind /gpfs/shared -ls -type f > find_ls_files.out > CONVERT to CSV > > POSIX > cat find_ls_files.out | awk '{print $5","}' > find_ls_files.out.csv > GPFS > cat find_ls_files.out | awk '{print $7","}' > find_ls_files.out.csv > LOAD in octave > > FILESIZE = int32 (dlmread ("find_ls_files.out.csv", ",")); > Clean the second column (OPTIONAL as the next clean up will do the same) > > FILESIZE(:,[2]) = []; > If we are on 4K aligment we need to clean the files that go to > inodes (WELL not exactly ... extended attributes! so maybe use a > lower number!) > > FILESIZE(FILESIZE<=3584) =[]; > If we are not we need to clean the 0 size files > > FILESIZE(FILESIZE==0) =[]; > Median > > FILESIZEMEDIAN = int32 (median (FILESIZE)) > Mean > > FILESIZEMEAN = int32 (mean (FILESIZE)) > Variance > > int32 (var (FILESIZE)) > iqr interquartile range, i.e., the difference between the upper and > lower quartile, of the input data. > > int32 (iqr (FILESIZE)) > Standard deviation > > > For some FS with lots of files you might need a rather powerful > machine to run the calculations on octave, I never hit anything > could not manage on a 64GB RAM Power box. Most of the times it is > enough with my laptop. > > > > -- > Yst?v?llisin terveisin / Kind regards / Saludos cordiales / Salutations > > Luis Bolinches > Lab Services > http://www-03.ibm.com/systems/services/labservices/ > > IBM Laajalahdentie 23 (main Entrance) Helsinki, 00330 Finland > Phone: +358 503112585 > > "If you continually give you will continually have." Anonymous > > > ----- Original message ----- > From: Stef Coene > > Sent by: gpfsug-discuss-bounces at spectrumscale.org > To: gpfsug main discussion list > > Cc: > Subject: Re: [gpfsug-discuss] Blocksize > Date: Thu, Sep 22, 2016 10:30 PM > > On 09/22/2016 09:07 PM, J. Eric Wonderley wrote: > > It defaults to 4k: > > mmlsfs testbs8M -i > > flag value description > > ------------------- ------------------------ > > ----------------------------------- > > -i 4096 Inode size in bytes > > > > I think you can make as small as 512b. Gpfs will store very small > > files in the inode. > > > > Typically you want your average file size to be your blocksize and your > > filesystem has one blocksize and one inodesize. > > The files are not small, but around 20 MB on average. > So I calculated with IBM that a 1 MB or 2 MB block size is best. > > But I'm not sure if it's better to use a smaller block size for the > metadata. > > The file system is not that large (400 TB) and will hold backup data > from CommVault. > > > Stef > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > > > Ellei edell? 
ole toisin mainittu: / Unless stated otherwise above: > Oy IBM Finland Ab > PL 265, 00101 Helsinki, Finland > Business ID, Y-tunnus: 0195876-3 > Registered in Finland

_______________________________________________
gpfsug-discuss mailing list
gpfsug-discuss at spectrumscale.org
http://gpfsug.org/mailman/listinfo/gpfsug-discuss

From Greg.Lehmann at csiro.au Wed Sep 28 23:54:36 2016
From: Greg.Lehmann at csiro.au (Greg.Lehmann at csiro.au)
Date: Wed, 28 Sep 2016 22:54:36 +0000
Subject: [gpfsug-discuss] Blocksize
In-Reply-To: References: <17781503-26B3-4448-B7B9-1EE27ABE6D1F@ulmer.org><13272A1B-425D-4DDD-A931-490604F92D61@ulmer.org><6C51B04A-9097-4598-8B4A-C484A3D98EE2@vanderbilt.edu> <4BDC0E5A-176A-48CC-8DE6-93C7B5A3F138@vanderbilt.edu>
Message-ID:

Are there any presentations available online that provide diagrams of the directory/file creation process and modifications, in terms of how the blocks/inodes and indirect blocks etc. are used? I would guess there are a few different cases that would need to be shown. This is the sort of thing that would be great in a decent text book on GPFS (which doesn't exist as far as I am aware).

Cheers, Greg

From: gpfsug-discuss-bounces at spectrumscale.org [mailto:gpfsug-discuss-bounces at spectrumscale.org] On Behalf Of Marc A Kaplan Sent: Thursday, 29 September 2016 1:23 AM To: gpfsug main discussion list Subject: Re: [gpfsug-discuss] Blocksize

OKAY, I'll say it again. inodes are PACKED into a single inode file. So a 4KB inode takes 4KB, REGARDLESS of metadata blocksize. There is no wasted space. (Of course if you have metadata replication = 2, then yes, double that. And yes, there overhead for indirect blocks (indices), allocation maps, etc, etc.) And your choice is not just 512 or 4096. Maybe 1KB or 2KB is a good choice for your data distribution, to optimize packing of data and/or directories into inodes... Hmmm... I don't know why the doc leaves out 2048, perhaps a typo...
mmcrfs x2K -i 2048 [root at n2 charts]# mmlsfs x2K -i flag value description ------------------- ------------------------ ----------------------------------- -i 2048 Inode size in bytes Works for me! -------------- next part -------------- An HTML attachment was scrubbed... URL: From xhejtman at ics.muni.cz Wed Sep 28 23:58:15 2016 From: xhejtman at ics.muni.cz (Lukas Hejtmanek) Date: Thu, 29 Sep 2016 00:58:15 +0200 Subject: [gpfsug-discuss] Samba via CES In-Reply-To: References: <20160928220336.d5vdwp7fejbj2bzf@ics.muni.cz> <20160927193815.kpppiy76wpudg6cj@ics.muni.cz> <20160927214257.z4ezmssnpwhmm4rk@ics.muni.cz> Message-ID: <20160928225815.rpzdzjevro37ur7b@ics.muni.cz> On Wed, Sep 28, 2016 at 10:25:01PM +0000, Andrew Beattie wrote: > In that scenario, would you not be better off using a native Spectrum > Scale client installed on the workstation that the video editor is using > with a local mapped drive, rather than a SMB share? > ? > This would prevent this the scenario you have proposed occurring. indeed, it would be better, but why one would have CES at all? I would like to use CES but it seems that it is not quite ready yet for such a scenario. -- Luk?? Hejtm?nek From christof.schmitt at us.ibm.com Thu Sep 29 00:06:59 2016 From: christof.schmitt at us.ibm.com (Christof Schmitt) Date: Wed, 28 Sep 2016 16:06:59 -0700 Subject: [gpfsug-discuss] Samba via CES In-Reply-To: <20160928220336.d5vdwp7fejbj2bzf@ics.muni.cz> References: <20160927193815.kpppiy76wpudg6cj@ics.muni.cz><20160927214257.z4ezmssnpwhmm4rk@ics.muni.cz> <20160928220336.d5vdwp7fejbj2bzf@ics.muni.cz> Message-ID: The exact behavior depends on the client and the application. I would suggest explicit testing of the protocol failover if that is a concern. Samba does not support persistent handles, so that would be a completely new feature. There is some support available for durable handles which have weaker guarantees, and which are also disabled in CES Samba due to known issues in large deployments. In cases where SMB protocol failover becomes an issue and durable handles might help, that might be an approach to improve the failover behavior. Christof Schmitt || IBM || Spectrum Scale Development || Tucson, AZ christof.schmitt at us.ibm.com || +1-520-799-2469 (T/L: 321-2469) From: Lukas Hejtmanek To: gpfsug main discussion list Date: 09/28/2016 03:04 PM Subject: Re: [gpfsug-discuss] Samba via CES Sent by: gpfsug-discuss-bounces at spectrumscale.org On Wed, Sep 28, 2016 at 01:33:45PM -0700, Christof Schmitt wrote: > The client has to reconnect, open the file again and reissue request that > have not been completed. Without persistent handles, the main risk is that > another client can step in and access the same file in the meantime. With > persistent handles, access from other clients would be prevented for a > defined amount of time. well, I guess I cannot reconfigure the client so that reissuing request is done by OS and not rised up to the user? E.g., if user runs video encoding directly to Samba share and encoding runs for several hours, reissuing request, i.e., restart encoding, is not exactly what user accepts. -- Luk?? 
Hejtm?nek _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss From abeattie at au1.ibm.com Thu Sep 29 00:37:25 2016 From: abeattie at au1.ibm.com (Andrew Beattie) Date: Wed, 28 Sep 2016 23:37:25 +0000 Subject: [gpfsug-discuss] Samba via CES In-Reply-To: <20160928225815.rpzdzjevro37ur7b@ics.muni.cz> References: <20160928225815.rpzdzjevro37ur7b@ics.muni.cz>, <20160928220336.d5vdwp7fejbj2bzf@ics.muni.cz><20160927193815.kpppiy76wpudg6cj@ics.muni.cz><20160927214257.z4ezmssnpwhmm4rk@ics.muni.cz> Message-ID: An HTML attachment was scrubbed... URL: From aaron.knister at gmail.com Thu Sep 29 02:43:52 2016 From: aaron.knister at gmail.com (Aaron Knister) Date: Wed, 28 Sep 2016 21:43:52 -0400 Subject: [gpfsug-discuss] gpfs native raid In-Reply-To: References: <96282850-6bfa-73ae-8502-9e8df3a56390@nasa.gov> Message-ID: Thanks Everyone for your replies! (Quick disclaimer, these opinions are my own, and not those of my employer or NASA). Not knowing what's coming at the NDA session, it seems to boil down to "it ain't gonna happen" because of: - Perceived difficulty in supporting whatever creative hardware solutions customers may throw at it. I understand the support concerns, but I naively thought that assuming the hardware meets a basic set of requirements (e.g. redundant sas paths, x type of drives) it would be fairly supportable with GNR. The DS3700 shelves are re-branded NetApp E-series shelves and pretty vanilla I thought. - IBM would like to monetize the product and compete with the likes of DDN/Seagate This is admittedly a little disappointing. GPFS as long as I've known it has been largely hardware vendor agnostic. To see even a slight shift towards hardware vendor lockin and certain features only being supported and available on IBM hardware is concerning. It's not like the software itself is free. Perhaps GNR could be a paid add-on license for non-IBM hardware? Just thinking out-loud. The big things I was looking to GNR for are: - end-to-end checksums - implementing a software RAID layer on (in my case enterprise class) JBODs I can find a way to do the second thing, but the former I cannot. Requiring IBM hardware to get end-to-end checksums is a huge red flag for me. That's something Lustre will do today with ZFS on any hardware ZFS will run on (and for free, I might add). I would think GNR being openly available to customers would be important for GPFS to compete with Lustre. Furthermore, I had opened an RFE (#84523) a while back to implement checksumming of data for non-GNR environments. The RFE was declined because essentially it would be too hard and it already exists for GNR. Well, considering I don't have a GNR environment, and hardware vendor lock in is something many sites are not interested in, that's somewhat of a problem. I really hope IBM reconsiders their stance on opening up GNR. The current direction, while somewhat understandable, leaves a really bad taste in my mouth and is one of the (very few, in my opinion) features Lustre has over GPFS. -Aaron On 9/1/16 9:59 AM, Marc A Kaplan wrote: > I've been told that it is a big leap to go from supporting GSS and ESS > to allowing and supporting native raid for customers who may throw > together "any" combination of hardware they might choose. > > In particular the GNR "disk hospital" functions... 
> https://www.ibm.com/support/knowledgecenter/SSFKCN_3.5.0/com.ibm.cluster.gpfs.v3r5.gpfs200.doc/bl1adv_introdiskhospital.htm > will be tricky to support on umpteen different vendor boxes -- and keep > in mind, those will be from IBM competitors! > > That said, ESS and GSS show that IBM has some good tech in this area and > IBM has shown with the Spectrum Scale product (sans GNR) it can support > just about any semi-reasonable hardware configuration and a good slew of > OS versions and architectures... Heck I have a demo/test version of GPFS > running on a 5 year old Thinkpad laptop.... And we have some GSSs in the > lab... Not to mention Power hardware and mainframe System Z (think 360, > 370, 290, Z) > > > > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > From oehmes at us.ibm.com Thu Sep 29 03:28:03 2016 From: oehmes at us.ibm.com (Sven Oehme) Date: Wed, 28 Sep 2016 19:28:03 -0700 Subject: [gpfsug-discuss] gpfs native raid In-Reply-To: References: <96282850-6bfa-73ae-8502-9e8df3a56390@nasa.gov> Message-ID: Hi Aaron, the best way to express this 'need' is to vote and leave comments in the RFE's : this is an RFE for GNR as SW : http://www.ibm.com/developerworks/rfe/execute?use_case=viewRfe&CR_ID=95090 everybody who wants this to be one should vote for it and leave comments on what they expect. Sven From: Aaron Knister To: gpfsug-discuss at spectrumscale.org Date: 09/28/2016 06:44 PM Subject: Re: [gpfsug-discuss] gpfs native raid Sent by: gpfsug-discuss-bounces at spectrumscale.org Thanks Everyone for your replies! (Quick disclaimer, these opinions are my own, and not those of my employer or NASA). Not knowing what's coming at the NDA session, it seems to boil down to "it ain't gonna happen" because of: - Perceived difficulty in supporting whatever creative hardware solutions customers may throw at it. I understand the support concerns, but I naively thought that assuming the hardware meets a basic set of requirements (e.g. redundant sas paths, x type of drives) it would be fairly supportable with GNR. The DS3700 shelves are re-branded NetApp E-series shelves and pretty vanilla I thought. - IBM would like to monetize the product and compete with the likes of DDN/Seagate This is admittedly a little disappointing. GPFS as long as I've known it has been largely hardware vendor agnostic. To see even a slight shift towards hardware vendor lockin and certain features only being supported and available on IBM hardware is concerning. It's not like the software itself is free. Perhaps GNR could be a paid add-on license for non-IBM hardware? Just thinking out-loud. The big things I was looking to GNR for are: - end-to-end checksums - implementing a software RAID layer on (in my case enterprise class) JBODs I can find a way to do the second thing, but the former I cannot. Requiring IBM hardware to get end-to-end checksums is a huge red flag for me. That's something Lustre will do today with ZFS on any hardware ZFS will run on (and for free, I might add). I would think GNR being openly available to customers would be important for GPFS to compete with Lustre. Furthermore, I had opened an RFE (#84523) a while back to implement checksumming of data for non-GNR environments. The RFE was declined because essentially it would be too hard and it already exists for GNR. 
Well, considering I don't have a GNR environment, and hardware vendor lock in is something many sites are not interested in, that's somewhat of a problem. I really hope IBM reconsiders their stance on opening up GNR. The current direction, while somewhat understandable, leaves a really bad taste in my mouth and is one of the (very few, in my opinion) features Lustre has over GPFS. -Aaron On 9/1/16 9:59 AM, Marc A Kaplan wrote: > I've been told that it is a big leap to go from supporting GSS and ESS > to allowing and supporting native raid for customers who may throw > together "any" combination of hardware they might choose. > > In particular the GNR "disk hospital" functions... > https://www.ibm.com/support/knowledgecenter/SSFKCN_3.5.0/com.ibm.cluster.gpfs.v3r5.gpfs200.doc/bl1adv_introdiskhospital.htm > will be tricky to support on umpteen different vendor boxes -- and keep > in mind, those will be from IBM competitors! > > That said, ESS and GSS show that IBM has some good tech in this area and > IBM has shown with the Spectrum Scale product (sans GNR) it can support > just about any semi-reasonable hardware configuration and a good slew of > OS versions and architectures... Heck I have a demo/test version of GPFS > running on a 5 year old Thinkpad laptop.... And we have some GSSs in the > lab... Not to mention Power hardware and mainframe System Z (think 360, > 370, 290, Z) > > > > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: graycol.gif Type: image/gif Size: 105 bytes Desc: not available URL: From daniel.kidger at uk.ibm.com Thu Sep 29 10:04:03 2016 From: daniel.kidger at uk.ibm.com (Daniel Kidger) Date: Thu, 29 Sep 2016 09:04:03 +0000 Subject: [gpfsug-discuss] gpfs native raid In-Reply-To: Message-ID: An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: ATT1-graycol.gif Type: image/gif Size: 105 bytes Desc: not available URL: From daniel.kidger at uk.ibm.com Thu Sep 29 10:25:59 2016 From: daniel.kidger at uk.ibm.com (Daniel Kidger) Date: Thu, 29 Sep 2016 09:25:59 +0000 Subject: [gpfsug-discuss] AFM cacheset mounting from the same GPFS cluster ? Message-ID: An HTML attachment was scrubbed... URL: From Kevin.Buterbaugh at Vanderbilt.Edu Thu Sep 29 16:03:08 2016 From: Kevin.Buterbaugh at Vanderbilt.Edu (Buterbaugh, Kevin L) Date: Thu, 29 Sep 2016 15:03:08 +0000 Subject: [gpfsug-discuss] Fwd: Blocksize References: Message-ID: <423B687F-0B03-4C6F-9F16-E05F68491D67@vanderbilt.edu> Resending from the right e-mail address... Begin forwarded message: From: gpfsug-discuss-owner at spectrumscale.org Subject: Re: [gpfsug-discuss] Blocksize Date: September 29, 2016 at 10:00:36 AM CDT To: klb at accre.vanderbilt.edu You are not allowed to post to this mailing list, and your message has been automatically rejected. If you think that your messages are being rejected in error, contact the mailing list owner at gpfsug-discuss-owner at spectrumscale.org. From: "Kevin L. 
Buterbaugh" > Subject: Re: [gpfsug-discuss] Blocksize Date: September 29, 2016 at 10:00:29 AM CDT To: gpfsug main discussion list > Hi Marc and others, I understand ? I guess I did a poor job of wording my question, so I?ll try again. The IBM recommendation for metadata block size seems to be somewhere between 256K - 1 MB, depending on who responds to the question. If I were to hypothetically use a 256K metadata block size, does the ?1/32nd of a block? come into play like it does for ?not metadata?? I.e. 256 / 32 = 8K, so am I reading / writing *2* inodes (assuming 4K inode size) minimum? And here?s a really off the wall question ? yesterday we were discussing the fact that there is now a single inode file. Historically, we have always used RAID 1 mirrors (first with spinning disk, as of last fall now on SSD) for metadata and then use GPFS replication on top of that. But given that there is a single inode file is that ?old way? of doing things still the right way? In other words, could we potentially be better off by using a couple of 8+2P RAID 6 LUNs? One potential downside of that would be that we would then only have two NSD servers serving up metadata, so we discussed the idea of taking each RAID 6 LUN and splitting it up into multiple logical volumes (all that done on the storage array, of course) and then presenting those to GPFS as NSDs??? Or have I gone from merely asking stupid questions to Trump-level craziness???? ;-) Kevin On Sep 28, 2016, at 10:23 AM, Marc A Kaplan > wrote: OKAY, I'll say it again. inodes are PACKED into a single inode file. So a 4KB inode takes 4KB, REGARDLESS of metadata blocksize. There is no wasted space. (Of course if you have metadata replication = 2, then yes, double that. And yes, there overhead for indirect blocks (indices), allocation maps, etc, etc.) And your choice is not just 512 or 4096. Maybe 1KB or 2KB is a good choice for your data distribution, to optimize packing of data and/or directories into inodes... Hmmm... I don't know why the doc leaves out 2048, perhaps a typo... mmcrfs x2K -i 2048 [root at n2 charts]# mmlsfs x2K -i flag value description ------------------- ------------------------ ----------------------------------- -i 2048 Inode size in bytes Works for me! _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss -------------- next part -------------- An HTML attachment was scrubbed... URL: From makaplan at us.ibm.com Thu Sep 29 16:32:47 2016 From: makaplan at us.ibm.com (Marc A Kaplan) Date: Thu, 29 Sep 2016 11:32:47 -0400 Subject: [gpfsug-discuss] Fwd: Blocksize In-Reply-To: <423B687F-0B03-4C6F-9F16-E05F68491D67@vanderbilt.edu> References: <423B687F-0B03-4C6F-9F16-E05F68491D67@vanderbilt.edu> Message-ID: Frankly, I just don't "get" what it is you seem not to be "getting" - perhaps someone else who does "get" it can rephrase: FORGET about Subblocks when thinking about inodes being packed into the file of all inodes. Additional facts that may address some of the other concerns: I started working on GPFS at version 3.1 or so. AFAIK GPFS always had and has one file of inodes, "packed", with no wasted space between inodes. Period. Full Stop. RAID! Now we come to a mistake that I've seen made by more than a handful of customers! It is generally a mistake to use RAID with parity (such as classic RAID5) to store metadata. Why? 
Because metadata is often updated with "small writes" - for example suppose we have to update some fields in an inode, or an indirect block, or append a log record... For RAID with parity and large stripe sizes -- this means that updating just one disk sector can cost a full stripe read + writing the changed data and parity sectors. SO, if you want protection against storage failures for your metadata, use either RAID mirroring/replication and/or GPFS metadata replication. (belt and/or suspenders) (Arguments against relying solely on RAID mirroring: single enclosure/box failure (fire!), single hardware design (bugs or defects), single firmware/microcode(bugs.)) Yes, GPFS is part of "the cyber." We're making it stronger everyday. But it already is great. --marc From: "Buterbaugh, Kevin L" To: gpfsug main discussion list Date: 09/29/2016 11:03 AM Subject: [gpfsug-discuss] Fwd: Blocksize Sent by: gpfsug-discuss-bounces at spectrumscale.org Resending from the right e-mail address... Begin forwarded message: From: gpfsug-discuss-owner at spectrumscale.org Subject: Re: [gpfsug-discuss] Blocksize Date: September 29, 2016 at 10:00:36 AM CDT To: klb at accre.vanderbilt.edu You are not allowed to post to this mailing list, and your message has been automatically rejected. If you think that your messages are being rejected in error, contact the mailing list owner at gpfsug-discuss-owner at spectrumscale.org. From: "Kevin L. Buterbaugh" Subject: Re: [gpfsug-discuss] Blocksize Date: September 29, 2016 at 10:00:29 AM CDT To: gpfsug main discussion list Hi Marc and others, I understand ? I guess I did a poor job of wording my question, so I?ll try again. The IBM recommendation for metadata block size seems to be somewhere between 256K - 1 MB, depending on who responds to the question. If I were to hypothetically use a 256K metadata block size, does the ?1/32nd of a block? come into play like it does for ?not metadata?? I.e. 256 / 32 = 8K, so am I reading / writing *2* inodes (assuming 4K inode size) minimum? And here?s a really off the wall question ? yesterday we were discussing the fact that there is now a single inode file. Historically, we have always used RAID 1 mirrors (first with spinning disk, as of last fall now on SSD) for metadata and then use GPFS replication on top of that. But given that there is a single inode file is that ?old way? of doing things still the right way? In other words, could we potentially be better off by using a couple of 8+2P RAID 6 LUNs? One potential downside of that would be that we would then only have two NSD servers serving up metadata, so we discussed the idea of taking each RAID 6 LUN and splitting it up into multiple logical volumes (all that done on the storage array, of course) and then presenting those to GPFS as NSDs??? Or have I gone from merely asking stupid questions to Trump-level craziness???? ;-) Kevin On Sep 28, 2016, at 10:23 AM, Marc A Kaplan wrote: OKAY, I'll say it again. inodes are PACKED into a single inode file. So a 4KB inode takes 4KB, REGARDLESS of metadata blocksize. There is no wasted space. (Of course if you have metadata replication = 2, then yes, double that. And yes, there overhead for indirect blocks (indices), allocation maps, etc, etc.) And your choice is not just 512 or 4096. Maybe 1KB or 2KB is a good choice for your data distribution, to optimize packing of data and/or directories into inodes... Hmmm... I don't know why the doc leaves out 2048, perhaps a typo... 
mmcrfs x2K -i 2048 [root at n2 charts]# mmlsfs x2K -i flag value description ------------------- ------------------------ ----------------------------------- -i 2048 Inode size in bytes Works for me! _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss -------------- next part -------------- An HTML attachment was scrubbed... URL: From olaf.weiser at de.ibm.com Thu Sep 29 16:38:56 2016 From: olaf.weiser at de.ibm.com (Olaf Weiser) Date: Thu, 29 Sep 2016 17:38:56 +0200 Subject: [gpfsug-discuss] Fwd: Blocksize In-Reply-To: <423B687F-0B03-4C6F-9F16-E05F68491D67@vanderbilt.edu> References: <423B687F-0B03-4C6F-9F16-E05F68491D67@vanderbilt.edu> Message-ID: An HTML attachment was scrubbed... URL: From volobuev at us.ibm.com Thu Sep 29 19:00:40 2016 From: volobuev at us.ibm.com (Yuri L Volobuev) Date: Thu, 29 Sep 2016 11:00:40 -0700 Subject: [gpfsug-discuss] Fwd: Blocksize In-Reply-To: <423B687F-0B03-4C6F-9F16-E05F68491D67@vanderbilt.edu> References: <423B687F-0B03-4C6F-9F16-E05F68491D67@vanderbilt.edu> Message-ID: > to the question. If I were to hypothetically use a 256K metadata > block size, does the ?1/32nd of a block? come into play like it does > for ?not metadata?? I.e. 256 / 32 = 8K, so am I reading / writing > *2* inodes (assuming 4K inode size) minimum? I think the point of confusion here is minimum allocation size vs minimum IO size -- those two are not one and the same. In fact in GPFS those are largely unrelated values. For low-level metadata files where multiple records are packed into the same block, it is possible to read/write either an individual record (such as an inode), or an entire block of records (which is what happens, for example, during inode copy-on-write). The minimum IO size in GPFS is 512 bytes. On a "4K-aligned" file system, GPFS vows to only do IOs in multiples of 4KiB. For data, GPFS tracks what portion of a given block is valid/dirty using an in-memory bitmap, and if 4K in the middle of a 16M block are modified, only 4K get written, not 16M (although this is more complicated for sparse file writes and appends, when some areas need to be zeroed out). For metadata writes, entire metadata objects are written, using the actual object size, rounded up to the nearest 512B or 4K boundary, as needed. So a single modified inode results in a single inode write, regardless of the metadata block size. If you have snapshots, and the inode being modified needs to be copied to the previous snapshot, and happens to be the first inode in the block that needs a COW, an entire block of inodes is copied to the latest snapshot, as an optimization. > And here?s a really off the wall question ? yesterday we were > discussing the fact that there is now a single inode file. > Historically, we have always used RAID 1 mirrors (first with > spinning disk, as of last fall now on SSD) for metadata and then use > GPFS replication on top of that. But given that there is a single > inode file is that ?old way? of doing things still the right way? > In other words, could we potentially be better off by using a couple > of 8+2P RAID 6 LUNs? The old way is also the modern way in this case. Using RAID1 LUNs for GPFS metadata is still the right approach. 
You don't want to use RAID erasure codes that trigger read-modify-write for small IOs, which are typical for metadata (unless your RAID array has so much cache as to make RMW a moot point). > One potential downside of that would be that we would then only have > two NSD servers serving up metadata, so we discussed the idea of > taking each RAID 6 LUN and splitting it up into multiple logical > volumes (all that done on the storage array, of course) and then > presenting those to GPFS as NSDs??? Like most performance questions, this one can ultimately only be answered definitively by running tests, but offhand I would suspect that the performance impact of RAID6, combined with extra contention for physical disks, is going to more than offset the benefits of using more NSD servers. Keep in mind that you aren't limited to 2 NSD servers per LUN. If you actually have the connectivity for more than 2 nodes on your RAID controller, GPFS allows up to 8 simultaneously active NSD servers per NSD. yuri > On Sep 28, 2016, at 10:23 AM, Marc A Kaplan wrote: > > OKAY, I'll say it again. inodes are PACKED into a single inode > file. So a 4KB inode takes 4KB, REGARDLESS of metadata blocksize. > There is no wasted space. > > (Of course if you have metadata replication = 2, then yes, double > that. And yes, there overhead for indirect blocks (indices), > allocation maps, etc, etc.) > > And your choice is not just 512 or 4096. Maybe 1KB or 2KB is a good > choice for your data distribution, to optimize packing of data and/ > or directories into inodes... > > Hmmm... I don't know why the doc leaves out 2048, perhaps a typo... > > mmcrfs x2K -i 2048 > > [root at n2 charts]# mmlsfs x2K -i > flag value description > ------------------- ------------------------ > ----------------------------------- > -i 2048 Inode size in bytes > > Works for me! > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss -------------- next part -------------- An HTML attachment was scrubbed... URL: From volobuev at us.ibm.com Fri Sep 30 06:43:53 2016 From: volobuev at us.ibm.com (Yuri L Volobuev) Date: Thu, 29 Sep 2016 22:43:53 -0700 Subject: [gpfsug-discuss] gpfs native raid In-Reply-To: References: <96282850-6bfa-73ae-8502-9e8df3a56390@nasa.gov> Message-ID: The issue of "GNR as software" is a pretty convoluted mixture of technical, business, and resource constraints issues. While some of the technical issues can be discussed here, obviously the other considerations cannot be discussed in a public forum. So you won't be able to get a complete understanding of the situation by discussing it here. > I understand the support concerns, but I naively thought that assuming > the hardware meets a basic set of requirements (e.g. redundant sas > paths, x type of drives) it would be fairly supportable with GNR. The > DS3700 shelves are re-branded NetApp E-series shelves and pretty vanilla > I thought. Setting business issues aside, this is more complicated on the technical level than one may think. At present, GNR requires a set of twin-tailed external disk enclosures. This is not a particularly exotic kind of hardware, but it turns out that this corner of the storage world is quite insular. 
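For anyone who has not worked with this class of hardware, "twin-tailed" just means each enclosure (and therefore each drive) is SAS-attached to two server nodes, so either node can reach every disk. A rough way to see what that looks like from the host side on Linux is sketched below; the device names are made up, sg_ses comes from the sg3_utils package, and the exact output varies by distribution and enclosure:

lsscsi -g                 # list SCSI disks plus enclosure (SES) devices and their /dev/sg nodes
ls /sys/class/enclosure/  # enclosures the kernel's ses module has bound
multipath -ll             # with dm-multipath, each external LUN should show two live paths
sg_ses /dev/sg24          # query the enclosure's SES diagnostic pages (slots, LEDs, sensors)

None of this is GNR-specific; it is simply the plumbing that GNR, or any software RAID layer, has to reason about, which is where the per-enclosure quirks described below come in.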
GNR has a very close relationship with physical disk devices, much more so than regular GPFS. In an ideal world, SCSI and SES standards are supposed to provide a framework which would allow software like GNR to operate on an arbitrary disk enclosure. In the real world, the actual SES implementations on various enclosures that we've been dealing with are, well, peculiar. Apparently SES is one of those standards where vendors feel a lot of freedom in "re-interpreting" the standard, and since typically enclosures talk to a small set of RAID controllers, there aren't bad enough consequences to force vendors to be religious about SES standard compliance. Furthermore, the SAS fabric topology in configurations with an external disk enclosures is surprisingly complex, and that complexity predictably leads to complex failures which don't exist in simpler configurations. Thus far, every single one of the five enclosures we've had a chance to run GNR on required some adjustments, workarounds, hacks, etc. And the consequences of a misbehaving SAS fabric can be quite dire. There are various approaches to dealing with those complications, from running a massive 3rd party hardware qualification program to basically declaring any complications from an unknown enclosure to be someone else's problem (how would ZFS deal with a SCSI reset storm due to a bad SAS expander?), but there's much debate on what is the right path to take. Customer input/feedback is obviously very valuable in tilting such discussions in the right direction. yuri From: Aaron Knister To: gpfsug-discuss at spectrumscale.org, Date: 09/28/2016 06:44 PM Subject: Re: [gpfsug-discuss] gpfs native raid Sent by: gpfsug-discuss-bounces at spectrumscale.org Thanks Everyone for your replies! (Quick disclaimer, these opinions are my own, and not those of my employer or NASA). Not knowing what's coming at the NDA session, it seems to boil down to "it ain't gonna happen" because of: - Perceived difficulty in supporting whatever creative hardware solutions customers may throw at it. I understand the support concerns, but I naively thought that assuming the hardware meets a basic set of requirements (e.g. redundant sas paths, x type of drives) it would be fairly supportable with GNR. The DS3700 shelves are re-branded NetApp E-series shelves and pretty vanilla I thought. - IBM would like to monetize the product and compete with the likes of DDN/Seagate This is admittedly a little disappointing. GPFS as long as I've known it has been largely hardware vendor agnostic. To see even a slight shift towards hardware vendor lockin and certain features only being supported and available on IBM hardware is concerning. It's not like the software itself is free. Perhaps GNR could be a paid add-on license for non-IBM hardware? Just thinking out-loud. The big things I was looking to GNR for are: - end-to-end checksums - implementing a software RAID layer on (in my case enterprise class) JBODs I can find a way to do the second thing, but the former I cannot. Requiring IBM hardware to get end-to-end checksums is a huge red flag for me. That's something Lustre will do today with ZFS on any hardware ZFS will run on (and for free, I might add). I would think GNR being openly available to customers would be important for GPFS to compete with Lustre. Furthermore, I had opened an RFE (#84523) a while back to implement checksumming of data for non-GNR environments. The RFE was declined because essentially it would be too hard and it already exists for GNR. 
Well, considering I don't have a GNR environment, and hardware vendor lock in is something many sites are not interested in, that's somewhat of a problem. I really hope IBM reconsiders their stance on opening up GNR. The current direction, while somewhat understandable, leaves a really bad taste in my mouth and is one of the (very few, in my opinion) features Lustre has over GPFS. -Aaron On 9/1/16 9:59 AM, Marc A Kaplan wrote: > I've been told that it is a big leap to go from supporting GSS and ESS > to allowing and supporting native raid for customers who may throw > together "any" combination of hardware they might choose. > > In particular the GNR "disk hospital" functions... > https://www.ibm.com/support/knowledgecenter/SSFKCN_3.5.0/com.ibm.cluster.gpfs.v3r5.gpfs200.doc/bl1adv_introdiskhospital.htm > will be tricky to support on umpteen different vendor boxes -- and keep > in mind, those will be from IBM competitors! > > That said, ESS and GSS show that IBM has some good tech in this area and > IBM has shown with the Spectrum Scale product (sans GNR) it can support > just about any semi-reasonable hardware configuration and a good slew of > OS versions and architectures... Heck I have a demo/test version of GPFS > running on a 5 year old Thinkpad laptop.... And we have some GSSs in the > lab... Not to mention Power hardware and mainframe System Z (think 360, > 370, 290, Z) > > > > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: graycol.gif Type: image/gif Size: 105 bytes Desc: not available URL: From stef.coene at docum.org Fri Sep 30 14:03:01 2016 From: stef.coene at docum.org (Stef Coene) Date: Fri, 30 Sep 2016 15:03:01 +0200 Subject: [gpfsug-discuss] Toolkit Message-ID: Hi, When using the toolkit, all config data is stored in clusterdefinition.txt When you modify the cluster with mm* commands, the toolkit is unaware of these changes. Is it possible to recreate the clusterdefinition.txt based on the current configuration? Stef From matthew at ellexus.com Fri Sep 30 16:30:11 2016 From: matthew at ellexus.com (Matthew Harris) Date: Fri, 30 Sep 2016 16:30:11 +0100 Subject: [gpfsug-discuss] Introduction from Ellexus Message-ID: Hello everyone, Ellexus is the IO profiling company with tools for load balancing shared storage, solving IO performance issues and detecting rogue jobs that have bad IO patterns. We have a good number of customers who use Spectrum Scale so we do a lot of work to support it. We have clients and partners working across the HPC space including semiconductor, life sciences, oil and gas, automotive and finance. We're based in Cambridge, England. Some of you will have already met our CEO, Rosemary Francis. Looking forward to meeting you at SC16. Matthew Harris Account Manager & Business Development - Ellexus Ltd *www.ellexus.com * *Ellexus Ltd is a limited company registered in England & Wales* *Company registration no. 
07166034* *Registered address: 198 High Street, Tonbridge, Kent TN9 1BE, UK* *Operating address: St John's Innovation Centre, Cowley Road, Cambridge CB4 0WS, UK* -------------- next part -------------- An HTML attachment was scrubbed... URL: From aaron.s.knister at nasa.gov Fri Sep 30 21:56:29 2016 From: aaron.s.knister at nasa.gov (Aaron Knister) Date: Fri, 30 Sep 2016 16:56:29 -0400 Subject: [gpfsug-discuss] gpfs native raid In-Reply-To: References: <96282850-6bfa-73ae-8502-9e8df3a56390@nasa.gov> Message-ID: <2f59d32a-fc0f-3f03-dd95-3465611dc841@nasa.gov> Thanks, Yuri. Your replies are always quite enjoyable to read. I didn't realize SES was such a loosely interpreted standard, I just assumed it was fairly straightforward. We've got a number of JBODs here we manage via SES using the Linux enclosure module (e.g. /sys/class/enclosure) and they seem to "just work" but we're not doing anything terribly advanced, mostly just turning on/off various status LEDs. I should clarify: the newer SAS enclosures I've encountered seem quite good; some of the older enclosures (in particular the Xyratex enclosure used by DDN in its S2A units) were a real treat to interact with and didn't seem to follow the SES standard in spirit. I can certainly accept the complexity argument here. I think for our purposes a "reasonable level" of support would be all we're after. I'm not sure how ZFS would deal with a SCSI reset storm; I suspect the pool would just offline itself if enough paths seemed to disappear or time out. If I could make GPFS work well with ZFS as the underlying storage target I would be quite happy. So far I have struggled to make it performant. GPFS seems to assume that once a block device accepts a write, it is committed to stable storage. With ZFS ZVOLs this isn't the case by default. Making it the case (setting the sync=always parameter) causes a *massive* degradation in performance. If GPFS were to issue sync commands at appropriate intervals then I think we could make this work well. I'm not sure how to go about this, though, and given frequent enough SCSI sync commands to a given LUN, its performance would likely degrade to the current state of ZFS with sync=always. At any rate, we'll see how things go. Thanks again. -Aaron On 9/30/16 1:43 AM, Yuri L Volobuev wrote: > The issue of "GNR as software" is a pretty convoluted mixture of > technical, business, and resource constraints issues. While some of the > technical issues can be discussed here, obviously the other > considerations cannot be discussed in a public forum. So you won't be > able to get a complete understanding of the situation by discussing it here. > >> I understand the support concerns, but I naively thought that assuming >> the hardware meets a basic set of requirements (e.g. redundant sas >> paths, x type of drives) it would be fairly supportable with GNR. The >> DS3700 shelves are re-branded NetApp E-series shelves and pretty vanilla >> I thought. > > Setting business issues aside, this is more complicated on the technical > level than one may think. At present, GNR requires a set of twin-tailed > external disk enclosures. This is not a particularly exotic kind of > hardware, but it turns out that this corner of the storage world is > quite insular. GNR has a very close relationship with physical disk > devices, much more so than regular GPFS. In an ideal world, SCSI and > SES standards are supposed to provide a framework which would allow > software like GNR to operate on an arbitrary disk enclosure.
In the > real world, the actual SES implementations on various enclosures that > we've been dealing with are, well, peculiar. Apparently SES is one of > those standards where vendors feel a lot of freedom in "re-interpreting" > the standard, and since typically enclosures talk to a small set of RAID > controllers, there aren't bad enough consequences to force vendors to be > religious about SES standard compliance. Furthermore, the SAS fabric > topology in configurations with an external disk enclosures is > surprisingly complex, and that complexity predictably leads to complex > failures which don't exist in simpler configurations. Thus far, every > single one of the five enclosures we've had a chance to run GNR on > required some adjustments, workarounds, hacks, etc. And the > consequences of a misbehaving SAS fabric can be quite dire. There are > various approaches to dealing with those complications, from running a > massive 3rd party hardware qualification program to basically declaring > any complications from an unknown enclosure to be someone else's problem > (how would ZFS deal with a SCSI reset storm due to a bad SAS expander?), > but there's much debate on what is the right path to take. Customer > input/feedback is obviously very valuable in tilting such discussions in > the right direction. > > yuri > > Inactive hide details for Aaron Knister ---09/28/2016 06:44:23 > PM---Thanks Everyone for your replies! (Quick disclaimer, these Aaron > Knister ---09/28/2016 06:44:23 PM---Thanks Everyone for your replies! > (Quick disclaimer, these opinions are my own, and not those of my > > From: Aaron Knister > To: gpfsug-discuss at spectrumscale.org, > Date: 09/28/2016 06:44 PM > Subject: Re: [gpfsug-discuss] gpfs native raid > Sent by: gpfsug-discuss-bounces at spectrumscale.org > > ------------------------------------------------------------------------ > > > > Thanks Everyone for your replies! (Quick disclaimer, these opinions are > my own, and not those of my employer or NASA). > > Not knowing what's coming at the NDA session, it seems to boil down to > "it ain't gonna happen" because of: > > - Perceived difficulty in supporting whatever creative hardware > solutions customers may throw at it. > > I understand the support concerns, but I naively thought that assuming > the hardware meets a basic set of requirements (e.g. redundant sas > paths, x type of drives) it would be fairly supportable with GNR. The > DS3700 shelves are re-branded NetApp E-series shelves and pretty vanilla > I thought. > > - IBM would like to monetize the product and compete with the likes of > DDN/Seagate > > This is admittedly a little disappointing. GPFS as long as I've known it > has been largely hardware vendor agnostic. To see even a slight shift > towards hardware vendor lockin and certain features only being supported > and available on IBM hardware is concerning. It's not like the software > itself is free. Perhaps GNR could be a paid add-on license for non-IBM > hardware? Just thinking out-loud. > > The big things I was looking to GNR for are: > > - end-to-end checksums > - implementing a software RAID layer on (in my case enterprise class) JBODs > > I can find a way to do the second thing, but the former I cannot. > Requiring IBM hardware to get end-to-end checksums is a huge red flag > for me. That's something Lustre will do today with ZFS on any hardware > ZFS will run on (and for free, I might add). 
I would think GNR being > openly available to customers would be important for GPFS to compete > with Lustre. Furthermore, I had opened an RFE (#84523) a while back to > implement checksumming of data for non-GNR environments. The RFE was > declined because essentially it would be too hard and it already exists > for GNR. Well, considering I don't have a GNR environment, and hardware > vendor lock in is something many sites are not interested in, that's > somewhat of a problem. > > I really hope IBM reconsiders their stance on opening up GNR. The > current direction, while somewhat understandable, leaves a really bad > taste in my mouth and is one of the (very few, in my opinion) features > Lustre has over GPFS. > > -Aaron > > > On 9/1/16 9:59 AM, Marc A Kaplan wrote: >> I've been told that it is a big leap to go from supporting GSS and ESS >> to allowing and supporting native raid for customers who may throw >> together "any" combination of hardware they might choose. >> >> In particular the GNR "disk hospital" functions... >> https://www.ibm.com/support/knowledgecenter/SSFKCN_3.5.0/com.ibm.cluster.gpfs.v3r5.gpfs200.doc/bl1adv_introdiskhospital.htm >> will be tricky to support on umpteen different vendor boxes -- and keep >> in mind, those will be from IBM competitors! >> >> That said, ESS and GSS show that IBM has some good tech in this area and >> IBM has shown with the Spectrum Scale product (sans GNR) it can support >> just about any semi-reasonable hardware configuration and a good slew of >> OS versions and architectures... Heck I have a demo/test version of GPFS >> running on a 5 year old Thinkpad laptop.... And we have some GSSs in the >> lab... Not to mention Power hardware and mainframe System Z (think 360, >> 370, 290, Z) >> >> >> >> >> _______________________________________________ >> gpfsug-discuss mailing list >> gpfsug-discuss at spectrumscale.org >> http://gpfsug.org/mailman/listinfo/gpfsug-discuss >> > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > > > > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > -- Aaron Knister NASA Center for Climate Simulation (Code 606.2) Goddard Space Flight Center (301) 286-2776
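A short footnote on the sync=always point raised earlier in this message: the behaviour Aaron describes is controlled by ZFS's per-dataset sync property, so the trade-off can be reproduced with something like the following sketch (pool and volume names are made up, and this is an illustration, not a recommended configuration):

zfs create -V 1T tank/gpfs_nsd1     # create a zvol; it shows up as /dev/zvol/tank/gpfs_nsd1
zfs get sync tank/gpfs_nsd1         # default is sync=standard: only explicit flushes are honoured
zfs set sync=always tank/gpfs_nsd1  # commit every write to stable storage before acknowledging it

sync=always gives the durability GPFS expects from a block device, at the performance cost described above; sync=standard is fast but only safe if the consumer issues explicit flushes at the right times, which is exactly the gap being discussed.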