From henrik.cednert at filmlance.se Tue Feb 5 09:47:39 2019
From: henrik.cednert at filmlance.se (Henrik Cednert (Filmlance))
Date: Tue, 5 Feb 2019 09:47:39 +0000
Subject: [gpfsug-discuss] Mapping GPFS v5 Windows client to same UID...?
Message-ID:

Hello

Odd. Apparently I have issues posting to the list again. Sorry if this comes in double.

I've read a bit about this on the net but can't wrap my tiny little brain around it.

We have a bunch of Windows 7 and 2012R2 clients that ran v4 previously. After a system/server upgrade of the DDN MediaScaler to 4.2.3.12 we had to upgrade those clients to v5 so that they're compatible. In addition to that, a new v5 Windows 10 client was deployed.

The old Windows 7 and 2012R2 clients write to the system with the same UID, 15000000, but a unique GID. The new client has its UID set to 12000270 and a unique GID. This causes all sorts of painful verbal and non-verbal symptoms. The funny thing is that a newly added 2012R2 Windows client has UID 15000000, so it's just the Windows 10 client that messes with me. All have been installed with the same installers and the same procedures.

Since all the others can write with the same UID, this new Windows 10 one surely has to be able to do it as well. Or? Can someone please point me in the right direction here? Yes, I know an AD is best practice, but that is not possible at the moment, so I'd just like to restore the same functionality that we had before the upgrade.

Cheers and thanks.

--
Henrik Cednert / +46 704 71 89 54 / CTO / Filmlance

Disclaimer, the hideous bs disclaimer at the bottom is forced, sorry. ¯\_(ツ)_/¯

Disclaimer: The information contained in this communication from the sender is confidential. It is intended solely for use by the recipient and others authorized to receive it. If you are not the recipient, you are hereby notified that any disclosure, copying, distribution or taking action in relation to the contents of this information is strictly prohibited and may be unlawful.
From scale at us.ibm.com Tue Feb 5 20:09:07 2019
From: scale at us.ibm.com (IBM Spectrum Scale)
Date: Tue, 5 Feb 2019 12:09:07 -0800
Subject: Re: [gpfsug-discuss] Mapping GPFS v5 Windows client to same UID...?
Message-ID:

Hello Henrik,

What you are seeing has to do with whether UAC (User Account Control) is enabled or disabled on Windows.

On Windows 7 and 2012R2 etc., my guess is that you have disabled UAC (since that is what GPFS required in the past). When UAC is disabled, the default owner of a local file/dir created by a user that is a member of the Administrators group is set to Administrators (SID = S-1-5-32-544). That is mapped to the autogenerated ID 15,000,000 in your case.

On Windows 10 (where UAC MUST stay enabled), the behavior changes. When UAC is not disabled (and you are NOT running elevated), the default owner of a local file/dir created by a user that is a member of the Administrators group is set to that user's SID. Hence, it is not S-1-5-32-544, but rather a unique SID for that local user. In the absence of an AD setup and RFC 2307 mappings, GPFS is auto-mapping that user SID to 15,000,270 in your case.

As you see, the state of UAC results in different owners. You simply cannot disable UAC on Windows 10 (and newer versions), since it breaks certain OS components! Hence, to get consistent behavior (the latter semantics, where file owner = user SID), you could re-enable UAC to its default on Windows 7/2012R2 (instead of disabling it). GPFS 4.2.3.12 works with UAC enabled. Remember, though, that the old 15,000,000 is in the on-disk ACL structures, so you will have to explicitly set/change the owner to yourself (to update to 15,000,270) for existing files. Any new files/dirs should default to 15,000,270. You could also add an ACL entry for the Administrators group or for individual users granting the desired access, instead of relying on file ownership for access rights.

Regards, The Spectrum Scale (GPFS) team

------------------------------------------------------------------------------------------------------------------
If you feel that your question can benefit other users of Spectrum Scale (GPFS), then please post it to the public IBM developerWorks Forum at https://www.ibm.com/developerworks/community/forums/html/forum?id=11111111-0000-0000-0000-000000000479. If your query concerns a potential software error in Spectrum Scale (GPFS) and you have an IBM software maintenance contract please contact 1-800-237-5511 in the United States or your local IBM Service Center in other countries. The forum is informally monitored as time permits and should not be used for priority messages to the Spectrum Scale (GPFS) team.
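As a rough sketch of that last suggestion (not from the original thread): from an elevated command prompt on one of the affected Windows clients, the stock Windows tools can re-own or open up an existing tree. The path E:\projects below is only a placeholder for a directory inside the GPFS file system.

    rem take ownership of the existing tree for the current (admin) user
    takeown /F E:\projects /R /D Y

    rem or, instead of changing owners, add an explicit ACL entry for Administrators recursively
    icacls E:\projects /grant "Administrators:(OI)(CI)F" /T

takeown sets the owner to the current user, while the icacls variant leaves ownership alone and simply grants the Administrators group access, matching the two options described above.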
From: "Henrik Cednert (Filmlance)"
To: gpfsug main discussion list
Date: 02/05/2019 11:21 AM
Subject: [gpfsug-discuss] Mapping GPFS v5 Windows client to same UID...?
Sent by: gpfsug-discuss-bounces at spectrumscale.org

From chair at spectrumscale.org Thu Feb 7 16:09:17 2019
From: chair at spectrumscale.org (Simon Thompson (Spectrum Scale User Group Chair))
Date: Thu, 07 Feb 2019 16:09:17 +0000
Subject: [gpfsug-discuss] UK Spectrum Scale User Group Sponsorship packages
Message-ID:

We're currently in the process of planning for the 2019 UK Spectrum Scale User Group meeting, to be held in London on 8th/9th May, and will again be looking for commercial sponsorship to support the event.

I'll be sending a message out to companies who have previously sponsored us with details soon; however, if you would like to be contacted about the sponsorship packages, please drop me an email and I'll include your company when we send out the details.

Thanks

Simon

From techie879 at gmail.com Sat Feb 9 01:42:13 2019
From: techie879 at gmail.com (Imam Toufique)
Date: Fri, 8 Feb 2019 17:42:13 -0800
Subject: [gpfsug-discuss] question on fileset / quotas
Message-ID:

Hi Everyone,

I am very new to GPFS and just got the system up and running; now I am starting to set up filesets and quotas. I have a question - maybe it has been answered already somewhere in this forum, my apologies if this is a repeat.

My question is: let's say I have an independent fileset called '/mmfs1/crsp_test', and I set its quota to 2GB (quota type FILESET).
STAGING-itoufiqu at crspoitdc-mgmt-001:/mmfs1/crsp_test/itoufiqu$ df -h /mmfs1/crsp_test/
Filesystem      Size  Used Avail Use% Mounted on
mmfs1           2.0G     0  2.0G   0% /mmfs1

Now I go and create a 'dependent' fileset called 'itoufiqu' under 'crsp_test', sharing its inode space, and I was able to set its quota to 4GB.

STAGING-root at crspoitdc-mgmt-001:/mmfs1/crsp_test$ df -h /mmfs1/crsp_test/itoufiqu
Filesystem      Size  Used Avail Use% Mounted on
mmfs1           4.0G  4.0G     0 100% /mmfs1

Now, I assume that setting a quota of 4GB (whereas the independent fileset quota is 2GB) for the above dependent fileset ('itoufiqu') is being allowed because the dependent fileset is sharing inode space from the independent fileset.

Is there a way to set up an independent fileset so that its dependent filesets cannot exceed its quota limit? In other words, if my independent fileset quota is 2GB, I should not be allowed to set quotas for its dependent filesets of more than 2GB (for the dependent filesets created, in aggregate)?

Thanks for your help!

From valdis.kletnieks at vt.edu Sat Feb 9 04:02:49 2019
From: valdis.kletnieks at vt.edu (valdis.kletnieks at vt.edu)
Date: Fri, 08 Feb 2019 23:02:49 -0500
Subject: [gpfsug-discuss] question on fileset / quotas
Message-ID: <32045.1549684969@turing-police.cc.vt.edu>

On Fri, 08 Feb 2019 17:42:13 -0800, Imam Toufique said:
> Is there a way to setup an independent fileset so that it's dependent
> filesets cannot exceed its quota limit? Another words, if my independent
> fileset quota is 2GB, I should not be allowed to set quotas for it's
> dependent filesets more then 2GB ( for the dependent filesets created in
> aggregate ) ?

Well... to set the quota on the dependent fileset, you have to be root. And the general Unix/Linux philosophy is to not prevent the root user from doing things unless there's a good technical reason (*).

There's a lot of "here be dragons" corner cases - for instance, if I create /gpfs/parent and give it 10T of space, does that mean that *each* dependent fileset is limited to 10T, or that the *sum* has to remain under 10T? (In other words, is overcommit allowed?) There are other problems, like "Give the parent 8T, give two children 4T each, let each one store 3T, and then reduce the parent quota to 2T" - what should happen then?

And quite frankly, the fact that mmrepquota has an entire column of output for "uncertain" when only dealing with *one* fileset tells me there's no sane way to avoid race conditions when dealing with two filesets without some truly performance-ruining levels of filesystem locking.

So I'd say that probably it's more reasonable to do this outside GPFS - anything from telling everybody who knows the root password not to do it, to teaching whatever automation/provisioning system you have (Ansible, etc.) to enforce it.

Having said that, if you can nail down the semantics and then make a good business case that it should be done inside of GPFS rather than at the sysadmin level, I'm sure IBM would be willing to at least listen to an RFE....
(*) I remember one Unix variant (Gould's UTX/32) that was perfectly willing to let C code running as root do an unlink(".") rather than return EISDIR, even though it meant you just bought yourself a shiny new fsck - don't ask how I found out :)

From rohwedder at de.ibm.com Mon Feb 11 09:36:06 2019
From: rohwedder at de.ibm.com (Markus Rohwedder)
Date: Mon, 11 Feb 2019 10:36:06 +0100
Subject: Re: [gpfsug-discuss] question on fileset / quotas
Message-ID:

Hello,

There is no hierarchy between fileset quotas; the fileset quota limits are completely independent of each other. The independent fileset, as you mentioned, provides the common inode space and ties the parent and child together with regard to using inodes from their common inode space, and for example with regard to snapshots and other features that act on independent filesets.

There are, however, many degrees of freedom in setting up quota configurations, for example user and group quotas and the per-fileset and per-filesystem quota options. So there may be other ways you could create rules that model your environment and which could provide a means to create limits across several filesets.

For example (this will probably not match your setup exactly, it is just to illustrate): You have a group of applications. Each application stores data in one dependent fileset. The filesystem where these exist uses per-filesystem quota accounting. All these filesets are children of an independent fileset; this allows you to create snapshots of all applications together. All applications store data under the same group. You can limit each application's space via its fileset quota, and you can limit the whole application group via a group quota.

Mit freundlichen Grüßen / Kind regards

Dr. Markus Rohwedder
Spectrum Scale GUI Development
Phone: +49 7034 6430190
IBM Deutschland Research & Development
E-Mail: rohwedder at de.ibm.com
Am Weiher 24
65451 Kelsterbach
Germany
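For reference, a rough sketch of what that layout could look like on the command line. This is not from the original thread: the fileset and group names are placeholders, mmfs1 is the file system from the df output above, quota enforcement must already be enabled on it, and the exact options are worth checking against the mmcrfileset/mmsetquota man pages for the release in use.

    mmcrfileset mmfs1 apps --inode-space new      # independent fileset with its own inode space
    mmcrfileset mmfs1 app1 --inode-space apps     # dependent filesets sharing that inode space
    mmcrfileset mmfs1 app2 --inode-space apps
    mmlinkfileset mmfs1 apps -J /mmfs1/apps
    mmlinkfileset mmfs1 app1 -J /mmfs1/apps/app1
    mmlinkfileset mmfs1 app2 -J /mmfs1/apps/app2
    mmsetquota mmfs1:app1 --block 2G:2G           # per-application fileset limits
    mmsetquota mmfs1:app2 --block 2G:2G
    mmsetquota mmfs1 --group appgrp --block 3G:3G # overall cap via the shared group

The group quota is what provides the "aggregate" limit across the dependent filesets; the individual fileset quotas remain independent of each other, as described above.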
From: Imam Toufique
To: gpfsug-discuss at spectrumscale.org
Date: 09.02.2019 02:42
Subject: [gpfsug-discuss] question on fileset / quotas
Sent by: gpfsug-discuss-bounces at spectrumscale.org

From heiner.billich at psi.ch Tue Feb 12 17:45:25 2019
From: heiner.billich at psi.ch (Billich Heinrich Rainer (PSI))
Date: Tue, 12 Feb 2019 17:45:25 +0000
Subject: [gpfsug-discuss] ces - change preferred host for an IP without actually moving the IP
Message-ID: <3E654746-B4CA-4317-A048-01B343838A54@psi.ch>

Hello,

Can I change the preferred server for a CES address without actually moving the IP? In my case the IP already moved to the new server due to a failure on a second server. Now I would like the IP to stay even if the other server gets active again; I first want to move a test address only. But 'mmces address move' refuses to run, as the address already is on the server I want to make the preferred one.

I also didn't find where this address assignment is stored; I searched in the files available from CCR.

Thank you,

Heiner
--
Paul Scherrer Institut
Heiner Billich
System Engineer Scientific Computing
Science IT / High Performance Computing
WHGA/106
Forschungsstrasse 111
5232 Villigen PSI
Switzerland
Phone +41 56 310 36 02
heiner.billich at psi.ch
https://www.psi.ch

From sannaik2 at in.ibm.com Tue Feb 12 19:50:52 2019
From: sannaik2 at in.ibm.com (Sandeep Naik1)
Date: Wed, 13 Feb 2019 01:20:52 +0530
Subject: Re: [gpfsug-discuss] Unbalanced pdisk free space
Message-ID:

Hi Alvise,

Here are the responses to your questions, inline below.

Q - Can anybody tell me if it is normal that all the pdisks of both my recovery groups, residing on the same physical enclosure, have free space equal to (more or less) 1/3 of the free space of the pdisks residing on the other physical enclosure (see attached text files for the command line output)?

Yes, it is normal to see variation in free space between pdisks. The variation should be seen in the context of used space, not free space. GNR tries to balance space equally across enclosures (failure groups). One enclosure has one SSD (per RG), so it has 41 disks in DA1 while the other one has 42. The enclosure with 42 disks shows 360 GiB free space per pdisk, while the one with 41 disks shows 120 GiB. If you look at the used capacity and distribute it equally between the two enclosures, you will notice that the used capacity is almost the same on both:

42 * (10240 - 360) ≈ 41 * (10240 - 120)
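Spelling that arithmetic out: 42 * (10240 - 360) = 42 * 9880 = 414,960 GiB used on the first enclosure, versus 41 * (10240 - 120) = 41 * 10,120 = 414,920 GiB on the second - a difference of only about 40 GiB out of roughly 415,000 GiB, so the two enclosures really are filled almost identically.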
Q - I guess when the least free disks are fully occupied (while the others are still partially free) write performance will drop by a factor of two. Correct? Is there a way (considering that the system is in production) to fix (rebalance) this free space among all pdisks of both enclosures?

You should look at this in the context of the size of the pdisks, which in your case is 10 TB. The disk showing 120 GB free is 98% full, while the one showing 360 GB free is 96% full. This free space is available for creating vdisks and should not be confused with free space available in the filesystem. Your pdisks are by and large equally filled, so there will be no impact on write performance because of this small variation in free space.

Hope this helps.

Thanks,
Sandeep Naik
Elastic Storage Server / GPFS Test
ETZ-B, Hinjewadi Pune India
(+91) 8600994314

From: "Dorigo Alvise (PSI)"
To: gpfsug main discussion list
Date: 31/01/2019 04:07 PM
Subject: Re: [gpfsug-discuss] Unbalanced pdisk free space
Sent by: gpfsug-discuss-bounces at spectrumscale.org

They're attached. Thanks!

Alvise

From: gpfsug-discuss-bounces at spectrumscale.org [gpfsug-discuss-bounces at spectrumscale.org] on behalf of IBM Spectrum Scale [scale at us.ibm.com]
Sent: Wednesday, January 30, 2019 9:25 PM
To: gpfsug main discussion list
Subject: Re: [gpfsug-discuss] Unbalanced pdisk free space

Alvise,

Could you send us the output of the following commands from both server nodes?

mmfsadm dump nspdclient > /tmp/dump_nspdclient.
mmfsadm dump pdisk > /tmp/dump_pdisk.

Regards, The Spectrum Scale (GPFS) team

From: "Dorigo Alvise (PSI)"
To: "gpfsug-discuss at spectrumscale.org"
Date: 01/30/2019 08:24 AM
Subject: [gpfsug-discuss] Unbalanced pdisk free space
Sent by: gpfsug-discuss-bounces at spectrumscale.org
Many thanks, Alvise [attachment "rg1" deleted by Brian Herr/Poughkeepsie/IBM] [attachment "rg2" deleted by Brian Herr/Poughkeepsie/IBM] _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss [attachment "dump_nspdclient.sf-dssio-1" deleted by Sandeep Naik1/India/IBM] [attachment "dump_nspdclient.sf-dssio-2" deleted by Sandeep Naik1/India/IBM] [attachment "dump_pdisk.sf-dssio-1" deleted by Sandeep Naik1/India/IBM] [attachment "dump_pdisk.sf-dssio-2" deleted by Sandeep Naik1/India/IBM] _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org https://urldefense.proofpoint.com/v2/url?u=http-3A__gpfsug.org_mailman_listinfo_gpfsug-2Ddiscuss&d=DwICAg&c=jf_iaSHvJObTbx-siA1ZOg&r=DXkezTwrVXsEOfvoqY7_DLS86P5FtQszjm9zok6upRU&m=rrqeq4UVHOFW9aaiAj-N7Lu6Z7UKBo4-0e3yINS47W0&s=n2t4qaUh-0mamutSSx0E-5j09DbZImKsbDoiM0enBcg&e= -------------- next part -------------- An HTML attachment was scrubbed... URL: From alvise.dorigo at psi.ch Wed Feb 13 08:30:47 2019 From: alvise.dorigo at psi.ch (Dorigo Alvise (PSI)) Date: Wed, 13 Feb 2019 08:30:47 +0000 Subject: [gpfsug-discuss] Unbalanced pdisk free space In-Reply-To: References: <83A6EEB0EC738F459A39439733AE8045267DF159@MBX114.d.ethz.ch>, <83A6EEB0EC738F459A39439733AE8045267E32C0@MBX114.d.ethz.ch>, Message-ID: <83A6EEB0EC738F459A39439733AE8045267E81EC@MBX114.d.ethz.ch> Thank you, I've understood the math and the focus from free space to used one. The only thing the remain strange for me is that I've not seen something like this in other systems (IBM ESS GL2 and another Lenovo G240 and G260), but I guess that the reason could be that they have much less used space, and allocated vdisks. thanks, Alvise ________________________________ From: Sandeep Naik1 [sannaik2 at in.ibm.com] Sent: Tuesday, February 12, 2019 8:50 PM To: gpfsug main discussion list; Dorigo Alvise (PSI) Subject: Re: [gpfsug-discuss] Unbalanced pdisk free space Hi Alvise, Here is response to your question in blue. Q - Can anybody tell me if it is normal that all the pdisks of both my recovery groups, residing on the same physical enclosure have free space equal to (more or less) 1/3 of the free space of the pdisks residing on the other physical enclosure (see attached text files for the command line output) ? Yes it is normal to see variation in free space between pdisks. The variation should be seen in the context of used space and not free space. GNR try to balance space equally across enclosures (failure groups). One enclosure has one SSD (per RG) so it has 41 disk in DA1 while the other one has 42. Enclosure with 42 disk show 360 GiB free space while one with 41 disk show 120 GiB. If you look at used capacity and distribute it equally between two enclosures you will notice that used capacity is almost same between two enclosure. 42 * (10240 - 360) ? 41 * (10240 - 120) I guess when the least free disks are fully occupied (while the others are still partially free) write performance will drop by a factor of two. Correct ? Is there a way (considering that the system is in production) to fix (rebalance) this free space among all pdisk of both enclosures ? You should see in context of size of pdisk, which in your case in 10TB. The disk showing 120GB free is 98% full while the one showing 360GB free is 96% full. This free space is available for creating vdisks and should not be confused with free space available in filesystem. 
From nico.faerber at id.unibe.ch Fri Feb 15 11:59:43 2019
From: nico.faerber at id.unibe.ch (nico.faerber at id.unibe.ch)
Date: Fri, 15 Feb 2019 11:59:43 +0000
Subject: [gpfsug-discuss] Clear old/stale events in GUI
Message-ID:

Dear all,

We see some outdated events ("longwaiters_warn" and "gpfs_warn" for component GPFS, with an age of 2 weeks) for some nodes that have been resolved in the meantime ("mmhealth node show" on the affected nodes reports HEALTHY for component GPFS). How can I remove those outdated event logs from the GUI? Is there a button/command, or do I have to manually delete some records in the database? If yes, what is the recommended procedure?

We are running:
Cluster minimum release level: 4.2.3.0
GUI release level: 5.0.2-1

Thank you.

Best,
Nico

Universitaet Bern
Abt. Informatikdienste
Nico Färber
High Performance Computing
Gesellschaftsstrasse 6
CH-3012 Bern
Raum 104
Tel. +41 (0)31 631 51 89

From Matthias.Ritter at de.ibm.com Fri Feb 15 12:22:31 2019
From: Matthias.Ritter at de.ibm.com (Matthias Ritter)
Date: Fri, 15 Feb 2019 12:22:31 +0000
Subject: Re: [gpfsug-discuss] Clear old/stale events in GUI
Message-ID:

An HTML attachment was scrubbed...

From nico.faerber at id.unibe.ch Fri Feb 15 14:05:01 2019
From: nico.faerber at id.unibe.ch (nico.faerber at id.unibe.ch)
Date: Fri, 15 Feb 2019 14:05:01 +0000
Subject: Re: [gpfsug-discuss] Clear old/stale events in GUI
Message-ID: <46BF617B-D84D-4D80-8C97-506243DFCAF5@id.unibe.ch>

Dear Mr. Ritter,

It worked. The stale events are gone. Thank you very much.

Best,
Nico

---
Universität Bern
Informatikdienste
Gruppe Systemdienste
Nico Färber
Systemadministrator HPC
Hochschulstrasse 6
CH-3012 Bern
Tel. +41 (0)31 631 51 89
mailto: grid-support at id.unibe.ch
http://www.id.unibe.ch/

From: on behalf of Matthias Ritter
Reply-To: gpfsug main discussion list
Date: Friday, 15 February 2019, 13:22
To: "gpfsug-discuss at spectrumscale.org"
Cc: "gpfsug-discuss at spectrumscale.org"
Subject: Re: [gpfsug-discuss] Clear old/stale events in GUI
Hello Mr. Färber,

please run the following command on each GUI node you have:

/usr/lpp/mmfs/gui/cli/lshealth --reset

This should help clear these stale events that are not shown by mmhealth.

Mit freundlichen Grüßen / Kind regards

Matthias Ritter
Spectrum Scale GUI Development
Department M069 / Spectrum Scale Software Development
+49-7034-2744-1977
Matthias.Ritter at de.ibm.com

IBM Deutschland Research & Development GmbH
Vorsitzender des Aufsichtsrats: Matthias Hartmann
Geschäftsführung: Dirk Wittkopp
Sitz der Gesellschaft: Böblingen / Registergericht: Amtsgericht Stuttgart, HRB 243294

----- Original message -----
From:
Sent by: gpfsug-discuss-bounces at spectrumscale.org
Subject: [gpfsug-discuss] Clear old/stale events in GUI
Date: Fri, 15 Feb 2019 13:14

From Kevin.Buterbaugh at Vanderbilt.Edu Fri Feb 15 15:10:57 2019
From: Kevin.Buterbaugh at Vanderbilt.Edu (Buterbaugh, Kevin L)
Date: Fri, 15 Feb 2019 15:10:57 +0000
Subject: [gpfsug-discuss] Clarification of mmdiag --iohist output
Message-ID:

Hi All,

Been reading man pages, docs, and Googling, and haven't found a definitive answer to this question, so I knew exactly where to turn... ;-)

I'm dealing with some slow I/Os to certain storage arrays in our environments - like really, really slow I/Os - here's just one example from one of my NSD servers of a 10 second I/O:

08:49:34.943186  W  data  30:41615622144  2048  10115.192  srv  dm-92

So here's my question - when mmdiag --iohist tells me that that I/O took slightly over 10 seconds, is that:

1. The time from when the NSD server received the I/O request from the client until it shipped the data back onto the wire towards the client?
2. The time from when the client issued the I/O request until it received the data back from the NSD server?
3. Something else?

I'm thinking it's #1, but want to confirm. Which one it is has very obvious implications for our troubleshooting steps. Thanks in advance...

Kevin

--
Kevin Buterbaugh - Senior System Administrator
Vanderbilt University - Advanced Computing Center for Research and Education
Kevin.Buterbaugh at vanderbilt.edu - (615)875-9633
From aaron.knister at gmail.com Sun Feb 17 14:26:23 2019
From: aaron.knister at gmail.com (Aaron Knister)
Date: Sun, 17 Feb 2019 09:26:23 -0500
Subject: Re: [gpfsug-discuss] Clarification of mmdiag --iohist output
Message-ID:

Hi Kevin,

It's funny you bring this up because I was looking at this yesterday. My belief is that it's the time from when the I/O request was queued internally by the client to when the I/O response was received from the NSD server, which means it absolutely includes the network RTT. It would be great to get formal confirmation of this from someone who knows the code.

Here's some trace data showing a single 4K read from an NSD client. I've stripped out a bunch of uninteresting stuff. It's my belief that the TRACE_IO indicates the point at which the "i/o timer" reported on by mmdiag --iohist begins ticking. The testing data seems to support this. If I'm correct, the testing data shows that the RDMA I/O to the NSD server occurs within the TRACE_IO timing window.

The other thing that makes me believe this: in my testing, mmdiag --iohist on the client shows an average latency of ~230us for a 4K read, whereas mmdiag --iohist on the NSD server appears to show an average latency of ~170us when servicing those 4K reads from the back-end disk (a DDN SFA14KX).

0.000218276 37005 TRACE_DISK: doReplicatedRead: da 34:490710888
0.000218424 37005 TRACE_IO: QIO: read data tag 743942 108137 ioVecSize 1 1st buf 0x122F000 nsdName S01_DMD_NSD_034 da 34:490710888 nSectors 8 align 0 by iocMBHandler (DioHandlerThread)
0.000218566 37005 TRACE_DLEASE: checkLeaseForIO: rc 0
0.000218628 37005 TRACE_IO: SIO: ioVecSize 1 1st buf 0x122F000 nsdId 0A011103:5C59DBAC da 34:490710888 nSectors 8 doioP 0x1806994D300 isDaemonAddr 0
0.000218672 37005 TRACE_FS: verify4KIO exit: code 4 err 0
0.000219106 37005 TRACE_NSD: nsdDoIO enter: read ioVecSize 1 1st bufAddr 0x122F000 nsdId 0A011103:5C59DBAC da 34:490710888 nBytes 4096 isDaemonAddr 0
0.000219408 37005 TRACE_GCRYPTO: EncBufPool::getTmpBuf(): bsize=4096 code=205 err=0 bufP=0x180656C1058 outBufP=0x180656C1058 index=0
0.000220105 37005 TRACE_TS: sendMessage msg_id 22993695: dest 10.3.17.3 sto03
0.000220436 37005 TRACE_RDMA: verbs::verbsSend_i: enter: index 11 cookie 12 node msg_id 22993695 len 92
0.000221111 37005 TRACE_RDMA: verbsConn::postSend: currRpcHeadP 0x7F6D7FA106E8 sendType SEND_BUFFER_RPC index 11 cookie 12 threadId 37005 bufferId.index 11
0.000221662 37005 TRACE_MUTEX: Waiting on fast condvar for signal 0x7F6D7FA106F8 RdmaSend_NSD_SVC
0.000426716 16691 TRACE_RDMA: handleRecvComp: success 1 of 1 nWrSuccess 1 index 11 cookie 12 wr_id 0xB0E6E00000000 bufferP 0x7F6D7EEE1700 byte_len 4144
0.000432140 37005 TRACE_NSD: nsdDoIO_ReadAndCheck: read complete, len 0 status 6 err 0 bufP 0x180656C1058 dioIsOverRdma 1 ioDataP 0x200000BC000 ckSumType NsdCksum_None
0.000432163 37005 TRACE_NSD: nsdDoIO_ReadAndCheck: exit err 0
0.000433707 37005 TRACE_GCRYPTO: EncBufPool::releaseTmpBuf(): exit bsize=8192 err=0 inBufP=0x180656C1058 bufP=0x180656C1058 index=0
0.000433777 37005 TRACE_NSD: nsdDoIO exit: err 0 0
0.000433844 37005 TRACE_IO: FIO: read data tag 743942 108137 ioVecSize 1 1st buf 0x122F000 nsdId 0A011103:5C59DBAC da 34:490710888 nSectors 8 err 0
0.000434236 37005 TRACE_DISK: postIO: qosid A00D91E read data disk FFFFFFFF ioVecSize 1 1st buf 0x122F000 err 0 duration 0.000215000 by iocMBHandler (DioHandlerThread)

I'd suggest looking at "mmdiag --iohist" on the NSD server itself and see if/how that differs from the client.
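For what it's worth, a simple way to make that comparison (the file names here are just placeholders) is to capture the history on both ends at roughly the same time and put the outputs side by side:

    # on the NSD client
    mmdiag --iohist > /tmp/iohist.client
    # on the NSD server
    mmdiag --iohist > /tmp/iohist.server

If the client-side times for the same I/Os are much larger than the server-side times, the extra time is being spent on the network and/or in queueing on the server rather than on the back-end disk.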
The other thing you could do is see if your NSD server queues are backing up (e.g. "mmfsadm saferdump nsd" and look for "requests pending" on queues where the "active" field is > 0). That doesn't necessarily mean you need to tune your queues, but I'd suggest that if the disk I/O on your NSD server looks healthy (e.g. low latency, not overly taxed) you could benefit from queue tuning.

-Aaron

On Sat, Feb 16, 2019 at 9:47 AM Buterbaugh, Kevin L <Kevin.Buterbaugh at vanderbilt.edu> wrote:

From scale at us.ibm.com Sun Feb 17 18:13:17 2019
From: scale at us.ibm.com (IBM Spectrum Scale)
Date: Sun, 17 Feb 2019 23:43:17 +0530
Subject: Re: [gpfsug-discuss] ces - change preferred host for an IP without actually moving the IP
Message-ID:

@Frank, can you please help with the below query.

Regards, The Spectrum Scale (GPFS) team

From: "Billich Heinrich Rainer (PSI)"
To: gpfsug main discussion list
Date: 02/12/2019 11:18 PM
Subject: [gpfsug-discuss] ces - change preferred host for an IP without actually moving the IP
Sent by: gpfsug-discuss-bounces at spectrumscale.org
From scale at us.ibm.com Sun Feb 17 19:01:24 2019
From: scale at us.ibm.com (IBM Spectrum Scale)
Date: Mon, 18 Feb 2019 00:31:24 +0530
Subject: Re: [gpfsug-discuss] Clarification of mmdiag --iohist output
Message-ID:

Hi Kevin,

The I/O history shown by the command mmdiag --iohist actually depends on the node you are running the command from. If you are running it on an NSD server node, then it will show the time taken to complete/serve the read or write I/O operation sent from the client node. And if you are running it on a client (non-NSD-server) node, then it will show the complete time taken by the read or write I/O operation requested by the client node.

So, in a nutshell: for the NSD server case it is just the latency of the I/O done on disk by the server, whereas for the NSD client case it is the latency of the send and receive of the I/O request to the NSD server plus the latency of the I/O done on disk by the NSD server.

I hope this answers your query.

Regards, The Spectrum Scale (GPFS) team

From: "Buterbaugh, Kevin L"
To: gpfsug main discussion list
Date: 02/16/2019 08:18 PM
Subject: [gpfsug-discuss] Clarification of mmdiag --iohist output
Sent by: gpfsug-discuss-bounces at spectrumscale.org
From oehmes at gmail.com Sun Feb 17 18:59:37 2019
From: oehmes at gmail.com (Sven Oehme)
Date: Sun, 17 Feb 2019 10:59:37 -0800
Subject: Re: [gpfsug-discuss] Clarification of mmdiag --iohist output
Message-ID: <447A32D4-5B23-47F3-B55C-6B51D411BD67@gmail.com>

If you run it on the client, it includes local queuing, the network, NSD server processing and the actual device I/O time. If it is issued on the NSD server, it contains processing and I/O time; the processing shouldn't really add any overhead, but in some cases I have seen it contributing.

If you correlate the client and server iohist outputs, you can find the server entry based on the tags in the iohist output. This allows you to see exactly how much time was spent on the network vs. on the server, to rule out the network as the problem.

Sven
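A rough sketch of that correlation, assuming the history has been dumped on both nodes with the mmfsadm form of the command (which, as shown in a later message, includes the tag1/tag2 columns); the tag value below is only a placeholder, in practice you would grep for the tag values shown for the slow entry on the client:

    /usr/lpp/mmfs/bin/mmfsadm dump iohist > /tmp/iohist.client   # on the client
    /usr/lpp/mmfs/bin/mmfsadm dump iohist > /tmp/iohist.server   # on the NSD server
    grep 258249737 /tmp/iohist.client /tmp/iohist.server

The difference between the client-side and server-side times for the matching entry is the network plus queueing overhead for that I/O.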
From: on behalf of Aaron Knister
Reply-To: gpfsug main discussion list
Date: Sunday, February 17, 2019 at 6:26 AM
To: gpfsug main discussion list
Subject: Re: [gpfsug-discuss] Clarification of mmdiag --iohist output
From scrusan at ddn.com Mon Feb 18 02:48:22 2019
From: scrusan at ddn.com (Steve Crusan)
Date: Mon, 18 Feb 2019 02:48:22 +0000
Subject: Re: [gpfsug-discuss] Clarification of mmdiag --iohist output
Message-ID:

Context is key here. Where you run mmdiag --iohist matters: client side or NSD server side.

From what I have seen and understand, on the client the time field indicates when an I/O was fully serviced (arrived into the VFS layer, to be sent to the application), including the RTT from the servers/disk. If you run the same command server side, my understanding is that the time field indicates how long it took for the server to write/read data to or from disk.

For example, a few years ago I had to fix a system which was described as unacceptably slow after an upgrade from GPFS 3.3 to 3.4 or 3.5 (don't fully remember). iohist client side was showing many I/Os waiting for 10 all the way up to 50 SECONDS, not ms. Server side, I/O was being serviced via iohist in less than 5 ms. Also verified with iostat: basically doing a paltry 25MB/s per NSD server.

What happened is that the small vs. large queueing system changed in that version of GPFS, so there were hundreds of large I/Os queued (found via mmfsadm dump nsd server side) due to the limited number of large queues and threads server side. A quick mmchconfig fixed the problem, but if I had only looked at the servers, it would've appeared things were fine, because the I/O backend was sitting around twirling its thumbs.

I don't have access to the code, but all of the behavior I have seen leads me to believe client side iohist includes network RTT.

-Steve
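For anyone who runs into the same queueing behavior: the knobs involved are the NSD server queue/thread settings, adjusted with mmchconfig. The following is only a rough sketch - the values are purely illustrative, the right numbers depend on the server hardware, and both parameter names and values should be checked against the NSD server tuning guidance for the release in use (the change generally needs a GPFS restart on the NSD servers to take effect):

    mmchconfig nsdMaxWorkerThreads=640,nsdThreadsPerQueue=12,nsdSmallThreadRatio=1 -N nsdservers

Here nsdservers is assumed to be a node class containing the NSD servers, and nsdSmallThreadRatio shifts the balance between the small and large I/O queues that Steve describes.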
The other thing that makes me believe this, is in my testing the mmdiag --iohist on the client shows an average latency of ~230us for a 4K read whereas mmdiag --iohist on the NSD server appears to show an average latency of ~170us when servicing those 4K reads from the back-end disk (a DDN SFA14KX). 0.000218276 37005 TRACE_DISK: doReplicatedRead: da 34:490710888 0.000218424 37005 TRACE_IO: QIO: read data tag 743942 108137 ioVecSize 1 1st buf 0x122F000 nsdName S01_DMD_NSD_034 da 34:490710888 nSectors 8 align 0 by iocMBHandler (DioHandlerThread) 0.000218566 37005 TRACE_DLEASE: checkLeaseForIO: rc 0 0.000218628 37005 TRACE_IO: SIO: ioVecSize 1 1st buf 0x122F000 nsdId 0A011103:5C59DBAC da 34:490710888 nSectors 8 doioP 0x1806994D300 isDaemonAddr 0 0.000218672 37005 TRACE_FS: verify4KIO exit: code 4 err 0 0.000219106 37005 TRACE_NSD: nsdDoIO enter: read ioVecSize 1 1st bufAddr 0x122F000 nsdId 0A011103:5C59DBAC da 34:490710888 nBytes 4096 isDaemonAddr 0 0.000219408 37005 TRACE_GCRYPTO: EncBufPool::getTmpBuf(): bsize=4096 code=205 err=0 bufP=0x180656C1058 outBufP=0x180656C1058 index=0 0.000220105 37005 TRACE_TS: sendMessage msg_id 22993695: dest 10.3.17.3 sto03 0.000220436 37005 TRACE_RDMA: verbs::verbsSend_i: enter: index 11 cookie 12 node msg_id 22993695 len 92 0.000221111 37005 TRACE_RDMA: verbsConn::postSend: currRpcHeadP 0x7F6D7FA106E8 sendType SEND_BUFFER_RPC index 11 cookie 12 threadId 37005 bufferId.index 11 0.000221662 37005 TRACE_MUTEX: Waiting on fast condvar for signal 0x7F6D7FA106F8 RdmaSend_NSD_SVC 0.000426716 16691 TRACE_RDMA: handleRecvComp: success 1 of 1 nWrSuccess 1 index 11 cookie 12 wr_id 0xB0E6E00000000 bufferP 0x7F6D7EEE1700 byte_len 4144 0.000432140 37005 TRACE_NSD: nsdDoIO_ReadAndCheck: read complete, len 0 status 6 err 0 bufP 0x180656C1058 dioIsOverRdma 1 ioDataP 0x200000BC000 ckSumType NsdCksum_None 0.000432163 37005 TRACE_NSD: nsdDoIO_ReadAndCheck: exit err 0 0.000433707 37005 TRACE_GCRYPTO: EncBufPool::releaseTmpBuf(): exit bsize=8192 err=0 inBufP=0x180656C1058 bufP=0x180656C1058 index=0 0.000433777 37005 TRACE_NSD: nsdDoIO exit: err 0 0 0.000433844 37005 TRACE_IO: FIO: read data tag 743942 108137 ioVecSize 1 1st buf 0x122F000 nsdId 0A011103:5C59DBAC da 34:490710888 nSectors 8 err 0 0.000434236 37005 TRACE_DISK: postIO: qosid A00D91E read data disk FFFFFFFF ioVecSize 1 1st buf 0x122F000 err 0 duration 0.000215000 by iocMBHandler (DioHandlerThread) I'd suggest looking at "mmdiag --iohist" on the NSD server itself and see if/how that differs from the client. The other thing you could do is see if your NSD server queues are backing up (e.g. "mmfsadm saferdump nsd" and look for "requests pending" on queues where the "active" field is > 0). That doesn't necessarily mean you need to tune your queues but I'd suggest that if the disk I/O on your NSD server looks healthy (e.g. low latency, not overly-taxed) that you could benefit from queue tuning. -Aaron On Sat, Feb 16, 2019 at 9:47 AM Buterbaugh, Kevin L > wrote: Hi All, Been reading man pages, docs, and Googling, and haven?t found a definitive answer to this question, so I knew exactly where to turn? ;-) I?m dealing with some slow I/O?s to certain storage arrays in our environments ? like really, really slow I/O?s ? here?s just one example from one of my NSD servers of a 10 second I/O: 08:49:34.943186 W data 30:41615622144 2048 10115.192 srv dm-92 So here?s my question ? when mmdiag ?iohist tells me that that I/O took slightly over 10 seconds, is that: 1. 
The time from when the NSD server received the I/O request from the client until it shipped the data back onto the wire towards the client? 2. The time from when the client issued the I/O request until it received the data back from the NSD server? 3. Something else? I?m thinking it?s #1, but want to confirm. Which one it is has very obvious implications for our troubleshooting steps. Thanks in advance? Kevin ? Kevin Buterbaugh - Senior System Administrator Vanderbilt University - Advanced Computing Center for Research and Education Kevin.Buterbaugh at vanderbilt.edu - (615)875-9633 _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss -------------- next part -------------- An HTML attachment was scrubbed... URL: From oehmes at gmail.com Tue Feb 19 19:46:36 2019 From: oehmes at gmail.com (Sven Oehme) Date: Tue, 19 Feb 2019 11:46:36 -0800 Subject: [gpfsug-discuss] Clarification of mmdiag --iohist output In-Reply-To: References: Message-ID: <6884F598-9039-4163-BD56-7D9E0C815044@gmail.com> Just to add a bit more details to that, If you want to track down an individual i/o or all i/o to a particular file you can do this with mmfsadm dump iohist (mmdiag doesn?t give you all you need) : so run /usr/lpp/mmfs/bin/mmfsadm dump iohist >iohist on server as well as client : I/O history: I/O start time RW??? Buf type disk:sectorNum???? nSec? time ms????? tag1???????? tag2?????????? Disk UID typ??????? NSD node?? context thread?????????????????????????? comment --------------- -- ----------- ----------------- -----? ------- --------- ------------ ------------------ --- --------------- --------- -------------------------------- ------- 12:22:41.880663? W??????? data??? 1:5602050048?? 32768? 927.272 258249737????????? 900? 0A2405A0:5C6BD9B0 cli? 172.16.254.160 Prefetch? WritebehindWorkerThread 12:22:42.038653? W??????? data??? 4:5815107584?? 32768? 803.106 258249737????????? 903? 0A2405A0:5C6BD9B6 cli? 172.16.254.163 Prefetch? WritebehindWorkerThread 12:22:42.504966? W??????? data??? 3:695664640??? 32768? 375.272 258249737????????? 918? 0A2405A0:5C6BD9B2 cli? 172.16.254.162 Prefetch? WritebehindWorkerThread 12:22:42.592712? W??????? data??? 1:1121779712?? 32768? 311.026 258249737????????? 920? 0A2405A0:5C6BD9B0 cli? 172.16.254.160 Prefetch? WritebehindWorkerThread 12:22:42.641689? W??????? data??? 2:1334837248?? 32768? 350.373 258249737????????? 921? 0A2405A0:5C6BD9B4 cli? 172.16.254.161 Prefetch? WritebehindWorkerThread 12:22:42.301120? W??????? data??? 1:6667337728?? 32768? 758.629 258249737????????? 912? 0A2405A0:5C6BD9B0 cli? 172.16.254.160 Prefetch? WritebehindWorkerThread 12:22:42.176365? W??????? data??? 1:6241222656?? 32768? 895.423 258249737??????? ??908? 0A2405A0:5C6BD9B0 cli? 172.16.254.160 Prefetch? WritebehindWorkerThread 12:22:42.283152? W??????? data??? 4:6454280192?? 32768? 840.528 258249737????????? 911? 0A2405A0:5C6BD9B6 cli? 172.16.254.163 Prefetch? WritebehindWorkerThread 12:22:42.149964? W??????? data??? 4:6028165120?? 32768? 981.661 258249737????????? 907? 0A2405A0:5C6BD9B6 cli? 172.16.254.163 Prefetch? WritebehindWorkerThread 12:22:42.130402? W??????? data??? 3:6028165120?? 32768 1021.175 258249737????????? 906? 0A2405A0:5C6BD9B2 cli? 172.16.254.162 Prefetch? WritebehindWorkerThread 12:22:42.838850? W??????? data??? 2:1867481088?? 32768? 343.912 258249737????????? 925? 0A2405A0:5C6BD9B4 cli? 172.16.254.161 Prefetch? WritebehindWorkerThread 12:22:42.841800? W??????? data??? 
3:1867481088?? 32768? 397.089 258249737????????? 926? 0A2405A0:5C6BD9B2 cli? 172.16.254.162 Prefetch? WritebehindWorkerThread 12:22:42.652912? W??????? data??? 3:1334837248?? 32768? 637.628 258249737????????? 922? 0A2405A0:5C6BD9B2 cli? 172.16.254.162 Prefetch? WritebehindWorkerThread 12:22:42.883946? W??????? data??? 1:1974009856?? 32768? 442.953 258249737????????? 928? 0A2405A0:5C6BD9B0 cli? 172.16.254.160 Prefetch? WritebehindWorkerThread 12:22:42.903782? W??????? data??? 3:1974009856?? 32768? 424.285 258249737??????? ??930? 0A2405A0:5C6BD9B2 cli? 172.16.254.162 Prefetch? WritebehindWorkerThread 12:22:42.329905? W??????? data??? 4:269549568??? 32768 1061.313 258249737????????? 915? 0A2405A0:5C6BD9B6 cli? 172.16.254.163 Prefetch? WritebehindWorkerThread 12:22:42.392467? W??????? data??? 1:376078336??? 32768? 998.770 258249737????????? 916? 0A2405A0:5C6BD9B0 cli? 172.16.254.160 Prefetch? WritebehindWorkerThread in this example I only care about one file with inode number 258249737 (which is stored in tag1) : Now simply on the server run : grep '258249737' iohist? 19:22:42.533259? W??????? data??? 1:5602050048?? 32768? 283.016 258249737????????? 900? 0A2405A0:5C6BD9B0 srv? 172.16.245.117 NSDWorker NSDThread 19:22:42.604062? W??????? data??? 1:1121779712?? 32768? 308.015 258249737????????? 920? 0A2405A0:5C6BD9B0 srv? 172.16.245.117 NSDWorker NSDThread 19:22:42.751549? W??????? data??? 1:6667337728?? 32768? 316.536 258249737??? ??????912? 0A2405A0:5C6BD9B0 srv? 172.16.245.117 NSDWorker NSDThread 19:22:42.722716? W??????? data??? 1:6241222656?? 32768? 357.409 258249737????????? 908? 0A2405A0:5C6BD9B0 srv? 172.16.245.117 NSDWorker NSDThread 19:22:43.030353? W??????? data??? 1:1974009856?? 32768? 304.887 258249737????????? 928? 0A2405A0:5C6BD9B0 srv? 172.16.245.117 NSDWorker NSDThread 19:22:43.103745? W??????? data??? 1:376078336??? 32768? 295.835 258249737????????? 916? 0A2405A0:5C6BD9B0 srv? 172.16.245.117 NSDWorker NSDThread So you can now see all the blocks of that file (tag 2) that went to this particular nsd server and how much time they took to issue against the media . so for each tag1:tag2 pair on the client you find the corresponding on the server. If you subtract time of server from time of client for each line you get network/client delays . Sven From: on behalf of Steve Crusan Reply-To: gpfsug main discussion list Date: Tuesday, February 19, 2019 at 12:29 AM To: gpfsug main discussion list Subject: Re: [gpfsug-discuss] Clarification of mmdiag --iohist output Context is key here. Where you run mmdiag?iohist matters, clientside or nsd server side. >From what I have seen and possibly understand, from the client, the time field indicates when an I/O was fully serviced (arrived into VFS layer, to be sent to the application), including RTT from the servers/disk. If you run the same command server side, my understanding is that the time field indicates how long it took for the server to write/read data to or from disk. For example, a few years ago I had to fix a system which was described as unacceptably slow after an upgrade from gpfs 3.3 to 3.4 or 3.5 (don?t fully remember). Iohist client side was showing many IOs waiting for 10 all the way up and to 50 SECONDS, not ms. Server side, I/O was being serviced via iohist within less than 5 ms. Also verified with iostat, basically doing a paltry 25MB/s per NSD server. 
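For anyone who wants to script the tag1/tag2 pairing Sven describes above, here is a minimal sketch. It assumes the field layout of the sample dump in this message (completion time in ms is the 6th column, tag1 and tag2 the 7th and 8th) and uses throwaway file names, so check the header of your own output before trusting the column numbers.

# Capture the raw I/O history on the client and on one NSD server
# (run each command on the respective node):
/usr/lpp/mmfs/bin/mmfsadm dump iohist > /tmp/iohist.client
/usr/lpp/mmfs/bin/mmfsadm dump iohist > /tmp/iohist.server

# Keep only data lines (2nd field is R or W), key them on tag1:tag2,
# then join the two captures and print client time, server time and
# the difference, which approximates network plus client-side delay:
awk '$2=="R" || $2=="W" {print $7":"$8, $6}' /tmp/iohist.client | sort > /tmp/cli.txt
awk '$2=="R" || $2=="W" {print $7":"$8, $6}' /tmp/iohist.server | sort > /tmp/srv.txt
join /tmp/cli.txt /tmp/srv.txt | \
  awk '{printf "%s client=%sms server=%sms delta=%.3fms\n", $1, $2, $3, $2 - $3}'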
What happened is that the small vs large queueing system changed in that version of GPFS, so there were hundreds of large IOS queued (found via Mmfsadm dump nsd server side) due to the limited number of large queues and threads server side. A quick mmchconfig fixed the problem, but if I only looked at the servers, it would?ve appeared things were fine, because the IO backend was sitting around twirling its thumbs. I don?t have access to the code, but all of behavior I have seen leads me to believe client side iohist includes network RTT. -Steve From: gpfsug-discuss-bounces at spectrumscale.org on behalf of Aaron Knister Sent: Sunday, February 17, 2019 8:26:23 AM To: gpfsug main discussion list Subject: Re: [gpfsug-discuss] Clarification of mmdiag --iohist output Hi Kevin, It's funny you bring this up because I was looking at this yesterday. My belief is that it's the time the from when the I/O request was queued internally by the client to when the I/O response was received from the NSD server which means it absolutely includes the network RTT. It would be great to get formal confirmation of this from someone who knows the code. Here's some trace data showing a single 4K read from an NSD client. I've stripped out a bunch of uninteresting stuff. It's my belief that the TRACE_IO indicates the point at which the "i/o timer" reported on by mmdiag --iohist begins ticking. The testing data seems to support this. If I'm correct, the testing data shows that the RDMA I/O to the NSD server occurs within the TRACE_IO timing window. The other thing that makes me believe this, is in my testing the mmdiag --iohist on the client shows an average latency of ~230us for a 4K read whereas mmdiag --iohist on the NSD server appears to show an average latency of ~170us when servicing those 4K reads from the back-end disk (a DDN SFA14KX). 
0.000218276 37005 TRACE_DISK: doReplicatedRead: da 34:490710888 0.000218424 37005 TRACE_IO: QIO: read data tag 743942 108137 ioVecSize 1 1st buf 0x122F000 nsdName S01_DMD_NSD_034 da 34:490710888 nSectors 8 align 0 by iocMBHandler (DioHandlerThread) 0.000218566 37005 TRACE_DLEASE: checkLeaseForIO: rc 0 0.000218628 37005 TRACE_IO: SIO: ioVecSize 1 1st buf 0x122F000 nsdId 0A011103:5C59DBAC da 34:490710888 nSectors 8 doioP 0x1806994D300 isDaemonAddr 0 0.000218672 37005 TRACE_FS: verify4KIO exit: code 4 err 0 0.000219106 37005 TRACE_NSD: nsdDoIO enter: read ioVecSize 1 1st bufAddr 0x122F000 nsdId 0A011103:5C59DBAC da 34:490710888 nBytes 4096 isDaemonAddr 0 0.000219408 37005 TRACE_GCRYPTO: EncBufPool::getTmpBuf(): bsize=4096 code=205 err=0 bufP=0x180656C1058 outBufP=0x180656C1058 index=0 0.000220105 37005 TRACE_TS: sendMessage msg_id 22993695: dest 10.3.17.3 sto03 0.000220436 37005 TRACE_RDMA: verbs::verbsSend_i: enter: index 11 cookie 12 node msg_id 22993695 len 92 0.000221111 37005 TRACE_RDMA: verbsConn::postSend: currRpcHeadP 0x7F6D7FA106E8 sendType SEND_BUFFER_RPC index 11 cookie 12 threadId 37005 bufferId.index 11 0.000221662 37005 TRACE_MUTEX: Waiting on fast condvar for signal 0x7F6D7FA106F8 RdmaSend_NSD_SVC 0.000426716 16691 TRACE_RDMA: handleRecvComp: success 1 of 1 nWrSuccess 1 index 11 cookie 12 wr_id 0xB0E6E00000000 bufferP 0x7F6D7EEE1700 byte_len 4144 0.000432140 37005 TRACE_NSD: nsdDoIO_ReadAndCheck: read complete, len 0 status 6 err 0 bufP 0x180656C1058 dioIsOverRdma 1 ioDataP 0x200000BC000 ckSumType NsdCksum_None 0.000432163 37005 TRACE_NSD: nsdDoIO_ReadAndCheck: exit err 0 0.000433707 37005 TRACE_GCRYPTO: EncBufPool::releaseTmpBuf(): exit bsize=8192 err=0 inBufP=0x180656C1058 bufP=0x180656C1058 index=0 0.000433777 37005 TRACE_NSD: nsdDoIO exit: err 0 0 0.000433844 37005 TRACE_IO: FIO: read data tag 743942 108137 ioVecSize 1 1st buf 0x122F000 nsdId 0A011103:5C59DBAC da 34:490710888 nSectors 8 err 0 0.000434236 37005 TRACE_DISK: postIO: qosid A00D91E read data disk FFFFFFFF ioVecSize 1 1st buf 0x122F000 err 0 duration 0.000215000 by iocMBHandler (DioHandlerThread) I'd suggest looking at "mmdiag --iohist" on the NSD server itself and see if/how that differs from the client. The other thing you could do is see if your NSD server queues are backing up (e.g. "mmfsadm saferdump nsd" and look for "requests pending" on queues where the "active" field is > 0). That doesn't necessarily mean you need to tune your queues but I'd suggest that if the disk I/O on your NSD server looks healthy (e.g. low latency, not overly-taxed) that you could benefit from queue tuning. -Aaron On Sat, Feb 16, 2019 at 9:47 AM Buterbaugh, Kevin L wrote: Hi All, Been reading man pages, docs, and Googling, and haven?t found a definitive answer to this question, so I knew exactly where to turn? ;-) I?m dealing with some slow I/O?s to certain storage arrays in our environments ? like really, really slow I/O?s ? here?s just one example from one of my NSD servers of a 10 second I/O: 08:49:34.943186 W data 30:41615622144 2048 10115.192 srv dm-92 So here?s my question ? when mmdiag ?iohist tells me that that I/O took slightly over 10 seconds, is that: 1. The time from when the NSD server received the I/O request from the client until it shipped the data back onto the wire towards the client? 2. The time from when the client issued the I/O request until it received the data back from the NSD server? 3. Something else? I?m thinking it?s #1, but want to confirm. 
Which one it is has very obvious implications for our troubleshooting steps. Thanks in advance? Kevin ? Kevin Buterbaugh - Senior System Administrator Vanderbilt University - Advanced Computing Center for Research and Education Kevin.Buterbaugh at vanderbilt.edu - (615)875-9633 _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss -------------- next part -------------- An HTML attachment was scrubbed... URL: From jkavitsky at 23andme.com Tue Feb 19 21:02:51 2019 From: jkavitsky at 23andme.com (Jim Kavitsky) Date: Tue, 19 Feb 2019 13:02:51 -0800 Subject: [gpfsug-discuss] Clarification of mmdiag --iohist output In-Reply-To: References: Message-ID: <3B5D4F25-0AD9-4C97-ADB4-CD999309F38E@23andme.com> Steve, Would the small vs large queuing setting be visible in mmlsconfig output somewhere? Which settings would control that? -jimk > On Feb 17, 2019, at 6:48 PM, Steve Crusan wrote: > > Context is key here. Where you run mmdiag?iohist matters, clientside or nsd server side. > > From what I have seen and possibly understand, from the client, the time field indicates when an I/O was fully serviced (arrived into VFS layer, to be sent to the application), including RTT from the servers/disk. If you run the same command server side, my understanding is that the time field indicates how long it took for the server to write/read data to or from disk. > > For example, a few years ago I had to fix a system which was described as unacceptably slow after an upgrade from gpfs 3.3 to 3.4 or 3.5 (don?t fully remember). > > Iohist client side was showing many IOs waiting for 10 all the way up and to 50 SECONDS, not ms. Server side, I/O was being serviced via iohist within less than 5 ms. Also verified with iostat, basically doing a paltry 25MB/s per NSD server. > > What happened is that the small vs large queueing system changed in that version of GPFS, so there were hundreds of large IOS queued (found via Mmfsadm dump nsd server side) due to the limited number of large queues and threads server side. A quick mmchconfig fixed the problem, but if I only looked at the servers, it would?ve appeared things were fine, because the IO backend was sitting around twirling its thumbs. > > I don?t have access to the code, but all of behavior I have seen leads me to believe client side iohist includes network RTT. > > -Steve > From: gpfsug-discuss-bounces at spectrumscale.org on behalf of Aaron Knister > Sent: Sunday, February 17, 2019 8:26:23 AM > To: gpfsug main discussion list > Subject: Re: [gpfsug-discuss] Clarification of mmdiag --iohist output > > Hi Kevin, > > It's funny you bring this up because I was looking at this yesterday. My belief is that it's the time the from when the I/O request was queued internally by the client to when the I/O response was received from the NSD server which means it absolutely includes the network RTT. It would be great to get formal confirmation of this from someone who knows the code. > > Here's some trace data showing a single 4K read from an NSD client. I've stripped out a bunch of uninteresting stuff. It's my belief that the TRACE_IO indicates the point at which the "i/o timer" reported on by mmdiag --iohist begins ticking. The testing data seems to support this. 
If I'm correct, the testing data shows that the RDMA I/O to the NSD server occurs within the TRACE_IO timing window. The other thing that makes me believe this, is in my testing the mmdiag --iohist on the client shows an average latency of ~230us for a 4K read whereas mmdiag --iohist on the NSD server appears to show an average latency of ~170us when servicing those 4K reads from the back-end disk (a DDN SFA14KX). > > 0.000218276 37005 TRACE_DISK: doReplicatedRead: da 34:490710888 > 0.000218424 37005 TRACE_IO: QIO: read data tag 743942 108137 ioVecSize 1 1st buf 0x122F000 nsdName S01_DMD_NSD_034 da 34:490710888 nSectors 8 align 0 by iocMBHandler (DioHandlerThread) > 0.000218566 37005 TRACE_DLEASE: checkLeaseForIO: rc 0 > > 0.000218628 37005 TRACE_IO: SIO: ioVecSize 1 1st buf 0x122F000 nsdId 0A011103:5C59DBAC da 34:490710888 nSectors 8 doioP 0x1806994D300 isDaemonAddr 0 > 0.000218672 37005 TRACE_FS: verify4KIO exit: code 4 err 0 > 0.000219106 37005 TRACE_NSD: nsdDoIO enter: read ioVecSize 1 1st bufAddr 0x122F000 nsdId 0A011103:5C59DBAC da 34:490710888 nBytes 4096 isDaemonAddr 0 > 0.000219408 37005 TRACE_GCRYPTO: EncBufPool::getTmpBuf(): bsize=4096 code=205 err=0 bufP=0x180656C1058 outBufP=0x180656C1058 index=0 > 0.000220105 37005 TRACE_TS: sendMessage msg_id 22993695: dest 10.3.17.3 sto03 > 0.000220436 37005 TRACE_RDMA: verbs::verbsSend_i: enter: index 11 cookie 12 node msg_id 22993695 len 92 > 0.000221111 37005 TRACE_RDMA: verbsConn::postSend: currRpcHeadP 0x7F6D7FA106E8 sendType SEND_BUFFER_RPC index 11 cookie 12 threadId 37005 bufferId.index 11 > 0.000221662 37005 TRACE_MUTEX: Waiting on fast condvar for signal 0x7F6D7FA106F8 RdmaSend_NSD_SVC > > 0.000426716 16691 TRACE_RDMA: handleRecvComp: success 1 of 1 nWrSuccess 1 index 11 cookie 12 wr_id 0xB0E6E00000000 bufferP 0x7F6D7EEE1700 byte_len 4144 > > 0.000432140 37005 TRACE_NSD: nsdDoIO_ReadAndCheck: read complete, len 0 status 6 err 0 bufP 0x180656C1058 dioIsOverRdma 1 ioDataP 0x200000BC000 ckSumType NsdCksum_None > 0.000432163 37005 TRACE_NSD: nsdDoIO_ReadAndCheck: exit err 0 > 0.000433707 37005 TRACE_GCRYPTO: EncBufPool::releaseTmpBuf(): exit bsize=8192 err=0 inBufP=0x180656C1058 bufP=0x180656C1058 index=0 > 0.000433777 37005 TRACE_NSD: nsdDoIO exit: err 0 0 > > 0.000433844 37005 TRACE_IO: FIO: read data tag 743942 108137 ioVecSize 1 1st buf 0x122F000 nsdId 0A011103:5C59DBAC da 34:490710888 nSectors 8 err 0 > 0.000434236 37005 TRACE_DISK: postIO: qosid A00D91E read data disk FFFFFFFF ioVecSize 1 1st buf 0x122F000 err 0 duration 0.000215000 by iocMBHandler (DioHandlerThread) > > I'd suggest looking at "mmdiag --iohist" on the NSD server itself and see if/how that differs from the client. The other thing you could do is see if your NSD server queues are backing up (e.g. "mmfsadm saferdump nsd" and look for "requests pending" on queues where the "active" field is > 0). That doesn't necessarily mean you need to tune your queues but I'd suggest that if the disk I/O on your NSD server looks healthy (e.g. low latency, not overly-taxed) that you could benefit from queue tuning. > > -Aaron > > On Sat, Feb 16, 2019 at 9:47 AM Buterbaugh, Kevin L > wrote: > Hi All, > > Been reading man pages, docs, and Googling, and haven?t found a definitive answer to this question, so I knew exactly where to turn? ;-) > > I?m dealing with some slow I/O?s to certain storage arrays in our environments ? like really, really slow I/O?s ? 
here?s just one example from one of my NSD servers of a 10 second I/O: > > 08:49:34.943186 W data 30:41615622144 2048 10115.192 srv dm-92 > > So here?s my question ? when mmdiag ?iohist tells me that that I/O took slightly over 10 seconds, is that: > > 1. The time from when the NSD server received the I/O request from the client until it shipped the data back onto the wire towards the client? > 2. The time from when the client issued the I/O request until it received the data back from the NSD server? > 3. Something else? > > I?m thinking it?s #1, but want to confirm. Which one it is has very obvious implications for our troubleshooting steps. Thanks in advance? > > Kevin > ? > Kevin Buterbaugh - Senior System Administrator > Vanderbilt University - Advanced Computing Center for Research and Education > Kevin.Buterbaugh at vanderbilt.edu - (615)875-9633 > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss -------------- next part -------------- An HTML attachment was scrubbed... URL: From Kevin.Buterbaugh at Vanderbilt.Edu Wed Feb 20 16:52:28 2019 From: Kevin.Buterbaugh at Vanderbilt.Edu (Buterbaugh, Kevin L) Date: Wed, 20 Feb 2019 16:52:28 +0000 Subject: [gpfsug-discuss] Clarification of mmdiag --iohist output In-Reply-To: <3B5D4F25-0AD9-4C97-ADB4-CD999309F38E@23andme.com> References: <3B5D4F25-0AD9-4C97-ADB4-CD999309F38E@23andme.com> Message-ID: Hi Jim, Please see: https://www.ibm.com/developerworks/community/wikis/home?lang=en#!/wiki/General%20Parallel%20File%20System%20(GPFS)/page/NSD%20Server%20Tuning Yes, those tuning parameters will show up in the mmlsconfig / mmdiag ?config output. HTH? Kevin ? Kevin Buterbaugh - Senior System Administrator Vanderbilt University - Advanced Computing Center for Research and Education Kevin.Buterbaugh at vanderbilt.edu - (615)875-9633 On Feb 19, 2019, at 3:02 PM, Jim Kavitsky > wrote: Steve, Would the small vs large queuing setting be visible in mmlsconfig output somewhere? Which settings would control that? -jimk On Feb 17, 2019, at 6:48 PM, Steve Crusan > wrote: Context is key here. Where you run mmdiag?iohist matters, clientside or nsd server side. From what I have seen and possibly understand, from the client, the time field indicates when an I/O was fully serviced (arrived into VFS layer, to be sent to the application), including RTT from the servers/disk. If you run the same command server side, my understanding is that the time field indicates how long it took for the server to write/read data to or from disk. For example, a few years ago I had to fix a system which was described as unacceptably slow after an upgrade from gpfs 3.3 to 3.4 or 3.5 (don?t fully remember). Iohist client side was showing many IOs waiting for 10 all the way up and to 50 SECONDS, not ms. Server side, I/O was being serviced via iohist within less than 5 ms. Also verified with iostat, basically doing a paltry 25MB/s per NSD server. What happened is that the small vs large queueing system changed in that version of GPFS, so there were hundreds of large IOS queued (found via Mmfsadm dump nsd server side) due to the limited number of large queues and threads server side. 
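For reference, since Jim asked which settings control this: the small versus large queue split is governed by the NSD queue and thread parameters that also show up later in this thread. A hedged sketch of inspecting and changing them follows; the values are simply the ones quoted elsewhere in this thread, not a recommendation, and "nsdnodes" stands in for whatever node class your NSD servers actually belong to.

# Values currently in effect on this node:
mmdiag --config | grep -E 'nsdSmallThreadRatio|nsdThreadsPerQueue|nsdMaxWorkerThreads|nsdMultiQueue'

# Example change applied only to the NSD servers; some of these may
# require a daemon restart on those nodes before they take effect:
mmchconfig nsdSmallThreadRatio=1,nsdThreadsPerQueue=12,nsdMaxWorkerThreads=1024 -N nsdnodes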
A quick mmchconfig fixed the problem, but if I only looked at the servers, it would?ve appeared things were fine, because the IO backend was sitting around twirling its thumbs. I don?t have access to the code, but all of behavior I have seen leads me to believe client side iohist includes network RTT. -Steve ________________________________ From: gpfsug-discuss-bounces at spectrumscale.org > on behalf of Aaron Knister > Sent: Sunday, February 17, 2019 8:26:23 AM To: gpfsug main discussion list Subject: Re: [gpfsug-discuss] Clarification of mmdiag --iohist output Hi Kevin, It's funny you bring this up because I was looking at this yesterday. My belief is that it's the time the from when the I/O request was queued internally by the client to when the I/O response was received from the NSD server which means it absolutely includes the network RTT. It would be great to get formal confirmation of this from someone who knows the code. Here's some trace data showing a single 4K read from an NSD client. I've stripped out a bunch of uninteresting stuff. It's my belief that the TRACE_IO indicates the point at which the "i/o timer" reported on by mmdiag --iohist begins ticking. The testing data seems to support this. If I'm correct, the testing data shows that the RDMA I/O to the NSD server occurs within the TRACE_IO timing window. The other thing that makes me believe this, is in my testing the mmdiag --iohist on the client shows an average latency of ~230us for a 4K read whereas mmdiag --iohist on the NSD server appears to show an average latency of ~170us when servicing those 4K reads from the back-end disk (a DDN SFA14KX). 0.000218276 37005 TRACE_DISK: doReplicatedRead: da 34:490710888 0.000218424 37005 TRACE_IO: QIO: read data tag 743942 108137 ioVecSize 1 1st buf 0x122F000 nsdName S01_DMD_NSD_034 da 34:490710888 nSectors 8 align 0 by iocMBHandler (DioHandlerThread) 0.000218566 37005 TRACE_DLEASE: checkLeaseForIO: rc 0 0.000218628 37005 TRACE_IO: SIO: ioVecSize 1 1st buf 0x122F000 nsdId 0A011103:5C59DBAC da 34:490710888 nSectors 8 doioP 0x1806994D300 isDaemonAddr 0 0.000218672 37005 TRACE_FS: verify4KIO exit: code 4 err 0 0.000219106 37005 TRACE_NSD: nsdDoIO enter: read ioVecSize 1 1st bufAddr 0x122F000 nsdId 0A011103:5C59DBAC da 34:490710888 nBytes 4096 isDaemonAddr 0 0.000219408 37005 TRACE_GCRYPTO: EncBufPool::getTmpBuf(): bsize=4096 code=205 err=0 bufP=0x180656C1058 outBufP=0x180656C1058 index=0 0.000220105 37005 TRACE_TS: sendMessage msg_id 22993695: dest 10.3.17.3 sto03 0.000220436 37005 TRACE_RDMA: verbs::verbsSend_i: enter: index 11 cookie 12 node msg_id 22993695 len 92 0.000221111 37005 TRACE_RDMA: verbsConn::postSend: currRpcHeadP 0x7F6D7FA106E8 sendType SEND_BUFFER_RPC index 11 cookie 12 threadId 37005 bufferId.index 11 0.000221662 37005 TRACE_MUTEX: Waiting on fast condvar for signal 0x7F6D7FA106F8 RdmaSend_NSD_SVC 0.000426716 16691 TRACE_RDMA: handleRecvComp: success 1 of 1 nWrSuccess 1 index 11 cookie 12 wr_id 0xB0E6E00000000 bufferP 0x7F6D7EEE1700 byte_len 4144 0.000432140 37005 TRACE_NSD: nsdDoIO_ReadAndCheck: read complete, len 0 status 6 err 0 bufP 0x180656C1058 dioIsOverRdma 1 ioDataP 0x200000BC000 ckSumType NsdCksum_None 0.000432163 37005 TRACE_NSD: nsdDoIO_ReadAndCheck: exit err 0 0.000433707 37005 TRACE_GCRYPTO: EncBufPool::releaseTmpBuf(): exit bsize=8192 err=0 inBufP=0x180656C1058 bufP=0x180656C1058 index=0 0.000433777 37005 TRACE_NSD: nsdDoIO exit: err 0 0 0.000433844 37005 TRACE_IO: FIO: read data tag 743942 108137 ioVecSize 1 1st buf 0x122F000 nsdId 0A011103:5C59DBAC 
da 34:490710888 nSectors 8 err 0 0.000434236 37005 TRACE_DISK: postIO: qosid A00D91E read data disk FFFFFFFF ioVecSize 1 1st buf 0x122F000 err 0 duration 0.000215000 by iocMBHandler (DioHandlerThread) I'd suggest looking at "mmdiag --iohist" on the NSD server itself and see if/how that differs from the client. The other thing you could do is see if your NSD server queues are backing up (e.g. "mmfsadm saferdump nsd" and look for "requests pending" on queues where the "active" field is > 0). That doesn't necessarily mean you need to tune your queues but I'd suggest that if the disk I/O on your NSD server looks healthy (e.g. low latency, not overly-taxed) that you could benefit from queue tuning. -Aaron On Sat, Feb 16, 2019 at 9:47 AM Buterbaugh, Kevin L > wrote: Hi All, Been reading man pages, docs, and Googling, and haven?t found a definitive answer to this question, so I knew exactly where to turn? ;-) I?m dealing with some slow I/O?s to certain storage arrays in our environments ? like really, really slow I/O?s ? here?s just one example from one of my NSD servers of a 10 second I/O: 08:49:34.943186 W data 30:41615622144 2048 10115.192 srv dm-92 So here?s my question ? when mmdiag ?iohist tells me that that I/O took slightly over 10 seconds, is that: 1. The time from when the NSD server received the I/O request from the client until it shipped the data back onto the wire towards the client? 2. The time from when the client issued the I/O request until it received the data back from the NSD server? 3. Something else? I?m thinking it?s #1, but want to confirm. Which one it is has very obvious implications for our troubleshooting steps. Thanks in advance? Kevin ? Kevin Buterbaugh - Senior System Administrator Vanderbilt University - Advanced Computing Center for Research and Education Kevin.Buterbaugh at vanderbilt.edu - (615)875-9633 _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org https://nam04.safelinks.protection.outlook.com/?url=http%3A%2F%2Fgpfsug.org%2Fmailman%2Flistinfo%2Fgpfsug-discuss&data=02%7C01%7CKevin.Buterbaugh%40vanderbilt.edu%7Cd18386b226474395328208d696ada1a9%7Cba5a7f39e3be4ab3b45067fa80faecad%7C0%7C0%7C636862069849536067&sdata=O5p52oLmSxQMWo2wwkVx8Z%2FapYpsAU9lAJ2cKvB095c%3D&reserved=0 -------------- next part -------------- An HTML attachment was scrubbed... URL: From Kevin.Buterbaugh at Vanderbilt.Edu Tue Feb 19 20:26:31 2019 From: Kevin.Buterbaugh at Vanderbilt.Edu (Buterbaugh, Kevin L) Date: Tue, 19 Feb 2019 20:26:31 +0000 Subject: [gpfsug-discuss] Clarification of mmdiag --iohist output In-Reply-To: References: Message-ID: <9338621C-3F85-48DF-AE42-64998680E14C@vanderbilt.edu> Hi All, My thanks to Aaron, Sven, Steve, and whoever responded for the GPFS team. You confirmed what I suspected ? my example 10 second I/O was _from an NSD server_ ? and since we?re in a 8 Gb FC SAN environment, it therefore means - correct me if I?m wrong about this someone - that I?ve got a problem somewhere in one (or more) of the following 3 components: 1) the NSD servers 2) the SAN fabric 3) the storage arrays I?ve been looking at all of the above and none of them are showing any obvious problems. 
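One low-level cross-check that sometimes helps narrow this down is to watch the block device named in the slow iohist entry (dm-92 in the example above) on the NSD server while the problem is happening, and compare what the OS sees with what GPFS sees. A rough sketch, with the device name obviously specific to that one example:

# On the NSD server, while slow I/Os are being reported:
iostat -xm 5 3 | grep -E 'Device|dm-92'   # high await here points at the HBA/SAN/array layer
mmdiag --iohist | grep ' dm-92'           # the GPFS view of the same device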
I?ve actually got a techie from the storage array vendor stopping by on Thursday, so I?ll see if he can spot anything there. Our FC switches are QLogic?s, so I?m kinda screwed there in terms of getting any help. But I don?t see any errors in the switch logs and ?show perf? on the switches is showing I/O rates of 50-100 MB/sec on the in use ports, so I don?t _think_ that?s the issue. And this is the GPFS mailing list, after all ? so let?s talk about the NSD servers. Neither memory (64 GB) nor CPU (2 x quad-core Intel Xeon E5620?s) appear to be an issue. But I have been looking at the output of ?mmfsadm saferdump nsd? based on what Aaron and then Steve said. Here?s some fairly typical output from one of the SMALL queues (I?ve checked several of my 8 NSD servers and they?re all showing similar output): Queue NSD type NsdQueueTraditional [244]: SMALL, threads started 12, active 3, highest 12, deferred 0, chgSize 0, draining 0, is_chg 0 requests pending 0, highest pending 73, total processed 4859732 mutex 0x7F3E449B8F10, reqCond 0x7F3E449B8F58, thCond 0x7F3E449B8F98, queue 0x7F3E449B8EF0, nFreeNsdRequests 29 And for a LARGE queue: Queue NSD type NsdQueueTraditional [8]: LARGE, threads started 12, active 1, highest 12, deferred 0, chgSize 0, draining 0, is_chg 0 requests pending 0, highest pending 71, total processed 2332966 mutex 0x7F3E441F3890, reqCond 0x7F3E441F38D8, thCond 0x7F3E441F3918, queue 0x7F3E441F3870, nFreeNsdRequests 31 So my large queues seem to be slightly less utilized than my small queues overall ? i.e. I see more inactive large queues and they generally have a smaller ?highest pending? value. Question: are those non-zero ?highest pending? values something to be concerned about? I have the following thread-related parameters set: [common] maxReceiverThreads 12 nsdMaxWorkerThreads 640 nsdThreadsPerQueue 4 nsdSmallThreadRatio 3 workerThreads 128 [serverLicense] nsdMaxWorkerThreads 1024 nsdThreadsPerQueue 12 nsdSmallThreadRatio 1 pitWorkerThreadsPerNode 3 workerThreads 1024 Also, at the top of the ?mmfsadm saferdump nsd? output I see: Total server worker threads: running 1008, desired 147, forNSD 147, forGNR 0, nsdBigBufferSize 16777216 nsdMultiQueue: 256, nsdMultiQueueType: 1, nsdMinWorkerThreads: 16, nsdMaxWorkerThreads: 1024 Question: is the fact that 1008 is pretty close to 1024 a concern? Anything jump out at anybody? I don?t mind sharing full output, but it is rather lengthy. Is this worthy of a PMR? Thanks! -- Kevin Buterbaugh - Senior System Administrator Vanderbilt University - Advanced Computing Center for Research and Education Kevin.Buterbaugh at vanderbilt.edu - (615)875-9633 On Feb 17, 2019, at 1:01 PM, IBM Spectrum Scale > wrote: Hi Kevin, The I/O hist shown by the command mmdiag --iohist actually depends on the node on which you are running this command from. If you are running this on a NSD server node then it will show the time taken to complete/serve the read or write I/O operation sent from the client node. And if you are running this on a client (or non NSD server) node then it will show the complete time taken by the read or write I/O operation requested by the client node to complete. So in a nut shell for the NSD server case it is just the latency of the I/O done on disk by the server whereas for the NSD client case it also the latency of send and receive of I/O request to the NSD server along with the latency of I/O done on disk by the NSD server. I hope this answers your query. 
Regards, The Spectrum Scale (GPFS) team ------------------------------------------------------------------------------------------------------------------ If you feel that your question can benefit other users of Spectrum Scale (GPFS), then please post it to the public IBM developerWroks Forum at https://www.ibm.com/developerworks/community/forums/html/forum?id=11111111-0000-0000-0000-000000000479. If your query concerns a potential software error in Spectrum Scale (GPFS) and you have an IBM software maintenance contract please contact 1-800-237-5511 in the United States or your local IBM Service Center in other countries. The forum is informally monitored as time permits and should not be used for priority messages to the Spectrum Scale (GPFS) team. From: "Buterbaugh, Kevin L" > To: gpfsug main discussion list > Date: 02/16/2019 08:18 PM Subject: [gpfsug-discuss] Clarification of mmdiag --iohist output Sent by: gpfsug-discuss-bounces at spectrumscale.org ________________________________ Hi All, Been reading man pages, docs, and Googling, and haven?t found a definitive answer to this question, so I knew exactly where to turn? ;-) I?m dealing with some slow I/O?s to certain storage arrays in our environments ? like really, really slow I/O?s ? here?s just one example from one of my NSD servers of a 10 second I/O: 08:49:34.943186 W data 30:41615622144 2048 10115.192 srv dm-92 So here?s my question ? when mmdiag ?iohist tells me that that I/O took slightly over 10 seconds, is that: 1. The time from when the NSD server received the I/O request from the client until it shipped the data back onto the wire towards the client? 2. The time from when the client issued the I/O request until it received the data back from the NSD server? 3. Something else? I?m thinking it?s #1, but want to confirm. Which one it is has very obvious implications for our troubleshooting steps. Thanks in advance? Kevin ? Kevin Buterbaugh - Senior System Administrator Vanderbilt University - Advanced Computing Center for Research and Education Kevin.Buterbaugh at vanderbilt.edu- (615)875-9633 _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org https://nam04.safelinks.protection.outlook.com/?url=http%3A%2F%2Fgpfsug.org%2Fmailman%2Flistinfo%2Fgpfsug-discuss&data=02%7C01%7CKevin.Buterbaugh%40vanderbilt.edu%7C2bfb2e8e30e64fa06c0f08d6959b2d38%7Cba5a7f39e3be4ab3b45067fa80faecad%7C0%7C0%7C636860891056297114&sdata=5pL67mhVyScJovkRHRqZog9bM5BZG8F2q972czIYAbA%3D&reserved=0 -------------- next part -------------- An HTML attachment was scrubbed... URL:
From olaf.weiser at de.ibm.com Thu Feb 21 12:10:41 2019 From: olaf.weiser at de.ibm.com (Olaf Weiser) Date: Thu, 21 Feb 2019 14:10:41 +0200 Subject: [gpfsug-discuss] Clarification of mmdiag --iohist output In-Reply-To: <9338621C-3F85-48DF-AE42-64998680E14C@vanderbilt.edu> References: <9338621C-3F85-48DF-AE42-64998680E14C@vanderbilt.edu> Message-ID: An HTML attachment was scrubbed...
URL: From stockf at us.ibm.com Thu Feb 21 12:23:32 2019 From: stockf at us.ibm.com (Frederick Stock) Date: Thu, 21 Feb 2019 12:23:32 +0000 Subject: [gpfsug-discuss] Clarification of mmdiag --iohist output In-Reply-To: <9338621C-3F85-48DF-AE42-64998680E14C@vanderbilt.edu> References: <9338621C-3F85-48DF-AE42-64998680E14C@vanderbilt.edu>, Message-ID: An HTML attachment was scrubbed... URL: From jjdoherty at yahoo.com Thu Feb 21 12:54:20 2019 From: jjdoherty at yahoo.com (Jim Doherty) Date: Thu, 21 Feb 2019 12:54:20 +0000 (UTC) Subject: [gpfsug-discuss] Clarification of mmdiag --iohist output In-Reply-To: References: <9338621C-3F85-48DF-AE42-64998680E14C@vanderbilt.edu> <9338621C-3F85-48DF-AE42-64998680E14C@vanderbilt.edu> Message-ID: <1280046520.2777074.1550753660364@mail.yahoo.com> Are all of the slow IOs from the same NSD volumes???? You could run an mmtrace and take an internaldump and open a ticket to the Spectrum Scale queue.? You may want to limit the run to just your nsd servers and not all nodes like I use in my example.???? Or one of the tools we use to review a trace is available in /usr/lpp/mmfs/samples/debugtools/trsum.awk?? and you can run it passing in the uncompressed trace file and redirect standard out to a file.???? If you search for ' total '? in the trace you will find the different sections,? or you can just grep ' total IO ' trsum.out? | grep duration? to get a quick look per LUN. mmtracectl --set --trace=def --tracedev-write-mode=overwrite --tracedev-overwrite-buffer-size=500M -N all mmtracectl --start -N all ; sleep 30 ; mmtracectl --stop -N all? ; mmtracectl --off -N all mmdsh -N all "/usr/lpp/mmfs/bin/mmfsadm dump all >/tmp/mmfs/service.dumpall.\$(hostname)" Jim On Thursday, February 21, 2019, 7:23:46 AM EST, Frederick Stock wrote: Kevin I'm assuming you have seen the article on IBM developerWorks about the GPFS NSD queues.? It provides useful background for analyzing the dump nsd information.? Here I'll list some thoughts for items that you can investigate/consider.?If your NSD servers are doing both large (greater than 64K) and small (64K or less) IOs then you want to have the nsdSmallThreadRatio set to 1 as it seems you do for the NSD servers.? This provides an equal number of SMALL and LARGE NSD queues.? You can also increase the total number of queues (currently 256) but I cannot determine if that is necessary from the data you provided.? Only on rare occasions have I seen a need to increase the number of queues.?The fact that you have 71 highest pending on your LARGE queues and 73 highest pending on your SMALL queues would imply your IOs are queueing for a good while either waiting for resources in GPFS or waiting for IOs to complete.? Your maximum buffer size is 16M which is defined to be the largest IO that can be requested by GPFS.? This is the buffer size that GPFS will use for LARGE IOs.? You indicated you had sufficient memory on the NSD servers but what is the value for the pagepool on those servers, and what is the value of the nsdBufSpace parameter??? If the NSD server is just that then usually nsdBufSpace is set to 70.? The IO buffers used by the NSD server come from the pagepool so you need sufficient space there for the maximum number of LARGE IO buffers that would be used concurrently by GPFS or threads will need to wait for those buffers to become available.? 
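A quick, hedged sketch of the two checks Fred mentions above: the buffer-related settings currently in effect, and whether any NSD queues are holding pending requests right now. The text matching follows the saferdump output quoted earlier in this thread and is not a supported interface, so treat it as illustrative only.

# Buffer-related settings on this NSD server:
mmdiag --config | grep -E 'pagepool|nsdBufSpace|nsdMaxWorkerThreads'

# Flag any NSD queue that currently has requests pending:
mmfsadm saferdump nsd | awk '
  /Queue NSD type/   { q = $0 }
  /requests pending/ { if ($3 + 0 > 0) { print q; print "    " $0 } }'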
Essentially you want to ensure you have sufficient memory for the maximum number of IOs all doing a large IO and that value being less than 70% of the pagepool size.?You could look at the settings for the FC cards to ensure they are configured to do the largest IOs possible.? I forget the actual values (have not done this for awhile) but there are settings for the adapters that control the maximum IO size that will be sent.? I think you want this to be as large as the adapter can handle to reduce the number of messages needed to complete the large IOs done by GPFS.??Fred __________________________________________________ Fred Stock | IBM Pittsburgh Lab | 720-430-8821 stockf at us.ibm.com?? ----- Original message ----- From: "Buterbaugh, Kevin L" Sent by: gpfsug-discuss-bounces at spectrumscale.org To: gpfsug main discussion list Cc: Subject: Re: [gpfsug-discuss] Clarification of mmdiag --iohist output Date: Thu, Feb 21, 2019 6:39 AM ? Hi All,?My thanks to Aaron, Sven, Steve, and whoever responded for the GPFS team. ?You confirmed what I suspected ? my example 10 second I/O was _from an NSD server_ ? and since we?re in a 8 Gb FC SAN environment, it therefore means - correct me if I?m wrong about this someone - that I?ve got a problem somewhere in one (or more) of the following 3 components:?1) the NSD servers2) the SAN fabric3) the storage arrays?I?ve been looking at all of the above and none of them are showing any obvious problems. ?I?ve actually got a techie from the storage array vendor stopping by on Thursday, so I?ll see if he can spot anything there. ?Our FC switches are QLogic?s, so I?m kinda screwed there in terms of getting any help. ?But I don?t see any errors in the switch logs and ?show perf? on the switches is showing I/O rates of 50-100 MB/sec on the in use ports, so I don?t _think_ that?s the issue.?And this is the GPFS mailing list, after all ? so let?s talk about the NSD servers. ?Neither memory (64 GB) nor CPU (2 x quad-core Intel Xeon E5620?s) appear to be an issue. ?But I have been looking at the output of ?mmfsadm saferdump nsd? based on what Aaron and then Steve said. ?Here?s some fairly typical output from one of the SMALL queues (I?ve checked several of my 8 NSD servers and they?re all showing similar output):?? ? Queue NSD type NsdQueueTraditional [244]: SMALL, threads started 12, active 3, highest 12, deferred 0, chgSize 0, draining 0, is_chg 0? ? ?requests pending 0, highest pending 73, total processed 4859732? ? ?mutex 0x7F3E449B8F10, reqCond 0x7F3E449B8F58, thCond 0x7F3E449B8F98, queue 0x7F3E449B8EF0, nFreeNsdRequests 29?And for a LARGE queue:?? ? Queue NSD type NsdQueueTraditional [8]: LARGE, threads started 12, active 1, highest 12, deferred 0, chgSize 0, draining 0, is_chg 0? ? ?requests pending 0, highest pending 71, total processed 2332966? ? ?mutex 0x7F3E441F3890, reqCond 0x7F3E441F38D8, thCond 0x7F3E441F3918, queue 0x7F3E441F3870, nFreeNsdRequests 31?So my large queues seem to be slightly less utilized than my small queues overall ? i.e. I see more inactive large queues and they generally have a smaller ?highest pending? value.?Question: ?are those non-zero ?highest pending? values something to be concerned about??I have the following thread-related parameters set:?[common]maxReceiverThreads 12nsdMaxWorkerThreads 640nsdThreadsPerQueue 4nsdSmallThreadRatio 3workerThreads 128?[serverLicense]nsdMaxWorkerThreads 1024nsdThreadsPerQueue 12nsdSmallThreadRatio 1pitWorkerThreadsPerNode 3workerThreads 1024?Also, at the top of the ?mmfsadm saferdump nsd? 
output I see:?Total server worker threads: running 1008, desired 147, forNSD 147, forGNR 0, nsdBigBufferSize 16777216nsdMultiQueue: 256, nsdMultiQueueType: 1, nsdMinWorkerThreads: 16, nsdMaxWorkerThreads: 1024?Question: ?is the fact that 1008 is pretty close to 1024 a concern??Anything jump out at anybody? ?I don?t mind sharing full output, but it is rather lengthy. ?Is this worthy of a PMR??Thanks!?--Kevin Buterbaugh - Senior System AdministratorVanderbilt University - Advanced Computing Center for Research and EducationKevin.Buterbaugh at vanderbilt.edu?- (615)875-9633? On Feb 17, 2019, at 1:01 PM, IBM Spectrum Scale wrote:?Hi Kevin, The I/O hist shown by the command mmdiag --iohist actually depends on the node on which you are running this command from. If you are running this on a NSD server node then it will show the time taken to complete/serve the read or write I/O operation sent from the client node. And if you are running this on a client (or non NSD server) node then it will show the complete time taken by the read or write I/O operation requested by the client node to complete. So in a nut shell for the NSD server case it is just the latency of the I/O done on disk by the server whereas for the NSD client case it also the latency of send and receive of I/O request to the NSD server along with the latency of I/O done on disk by the NSD server. I hope this answers your query. Regards, The Spectrum Scale (GPFS) team ------------------------------------------------------------------------------------------------------------------ If you feel that your question can benefit other users of ?Spectrum Scale (GPFS), then please post it to the public IBM developerWroks Forum at https://www.ibm.com/developerworks/community/forums/html/forum?id=11111111-0000-0000-0000-000000000479. If your query concerns a potential software error in Spectrum Scale (GPFS) and you have an IBM software maintenance contract please contact ?1-800-237-5511 in the United States or your local IBM Service Center in other countries. The forum is informally monitored as time permits and should not be used for priority messages to the Spectrum Scale (GPFS) team. From: ? ? ? ?"Buterbaugh, Kevin L" To: ? ? ? ?gpfsug main discussion list Date: ? ? ? ?02/16/2019 08:18 PM Subject: ? ? ? ?[gpfsug-discuss] Clarification of mmdiag --iohist output Sent by: ? ? ? ?gpfsug-discuss-bounces at spectrumscale.org Hi All, Been reading man pages, docs, and Googling, and haven?t found a definitive answer to this question, so I knew exactly where to turn? ;-) I?m dealing with some slow I/O?s to certain storage arrays in our environments ? like really, really slow I/O?s ? here?s just one example from one of my NSD servers of a 10 second I/O: 08:49:34.943186 ?W ? ? ? ?data ? 30:41615622144 ? 2048 10115.192 ?srv ? dm-92 ? ? ? ? ? ? ? ? ? So here?s my question ? when mmdiag ?iohist tells me that that I/O took slightly over 10 seconds, is that: 1. ?The time from when the NSD server received the I/O request from the client until it shipped the data back onto the wire towards the client? 2. ?The time from when the client issued the I/O request until it received the data back from the NSD server? 3. ?Something else? I?m thinking it?s #1, but want to confirm. ?Which one it is has very obvious implications for our troubleshooting steps. ?Thanks in advance? Kevin ? 
Kevin Buterbaugh - Senior System Administrator Vanderbilt University - Advanced Computing Center for Research and Education Kevin.Buterbaugh at vanderbilt.edu- (615)875-9633 _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org https://nam04.safelinks.protection.outlook.com/?url=http%3A%2F%2Fgpfsug.org%2Fmailman%2Flistinfo%2Fgpfsug-discuss&data=02%7C01%7CKevin.Buterbaugh%40vanderbilt.edu%7C2bfb2e8e30e64fa06c0f08d6959b2d38%7Cba5a7f39e3be4ab3b45067fa80faecad%7C0%7C0%7C636860891056297114&sdata=5pL67mhVyScJovkRHRqZog9bM5BZG8F2q972czIYAbA%3D&reserved=0 _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss ? _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss -------------- next part -------------- An HTML attachment was scrubbed... URL: From Robert.Oesterlin at nuance.com Tue Feb 26 12:38:11 2019 From: Robert.Oesterlin at nuance.com (Oesterlin, Robert) Date: Tue, 26 Feb 2019 12:38:11 +0000 Subject: [gpfsug-discuss] Save the date: US User Group meeting April 16-17th, NCAR Boulder CO Message-ID: It?s coming up fast - mark your calendar if plan on attending. We?ll be publishing detailed agenda information and registration soon. If you?d like to present, please drop me a note. We have a limited number of slots available. Bob Oesterlin Sr Principal Storage Engineer, Nuance -------------- next part -------------- An HTML attachment was scrubbed... URL: From michael.holliday at crick.ac.uk Tue Feb 26 10:45:32 2019 From: michael.holliday at crick.ac.uk (Michael Holliday) Date: Tue, 26 Feb 2019 10:45:32 +0000 Subject: [gpfsug-discuss] relion software using GPFS storage Message-ID: Hi All, We've recently had an issue where a job on our client GPFS cluster caused out main storage to go extremely slowly. The job was running relion using MPI (https://www2.mrc-lmb.cam.ac.uk/relion/index.php?title=Main_Page) It caused waiters across the cluster, and caused the load to spike on NSDS on at a time. When the spike ended on one NSD, it immediately started on another. There were no obvious errors in the logs and the issues cleared immediately after the job was cancelled. Has anyone else see any issues with relion using GPFS storage? Michael Michael Holliday RITTech MBCS Senior HPC & Research Data Systems Engineer | eMedLab Operations Team Scientific Computing STP | The Francis Crick Institute 1, Midland Road | London | NW1 1AT | United Kingdom Tel: 0203 796 3167 The Francis Crick Institute Limited is a registered charity in England and Wales no. 1140062 and a company registered in England and Wales no. 06885462, with its registered office at 1 Midland Road London NW1 1AT -------------- next part -------------- An HTML attachment was scrubbed... URL: From robert at strubi.ox.ac.uk Wed Feb 27 12:49:38 2019 From: robert at strubi.ox.ac.uk (Robert Esnouf) Date: Wed, 27 Feb 2019 12:49:38 +0000 Subject: [gpfsug-discuss] relion software using GPFS storage In-Reply-To: References: Message-ID: <9aee9d18ed77ad61c1b44859703f2284@strubi.ox.ac.uk> Dear Michael, There are settings within relion for parallel file systems, you should check they are enabled if you have SS underneath. 
Otherwise, check which version of relion and then try to understand the problem that is being analysed a little more. If the box size is very small and the internal symmetry low then the user may read 100,000s of small "picked particle" files for each iteration opening and closing the files each time. I believe that relion3 has some facility for extracting these small particles from the larger raw images and that is more SS-friendly. Alternatively, the size of the set of picked particles is often only in 50GB range and so staging to one or more local machines is quite feasible... Hope one of those suggestions helps. Regards, Robert -- Dr Robert Esnouf University Research Lecturer, Director of Research Computing BDI, Head of Research Computing Core WHG, NDM Research Computing Strategy Officer Main office: Room 10/028, Wellcome Centre for Human Genetics, Old Road Campus, Roosevelt Drive, Oxford OX3 7BN, UK Emails: robert at strubi.ox.ac.uk / robert at well.ox.ac.uk / robert.esnouf at bdi.ox.ac.uk Tel: (+44)-1865-287783 (WHG); (+44)-1865-743689 (BDI) ? -----Original Message----- From: "Michael Holliday" To: gpfsug-discuss at spectrumscale.org Date: 27/02/19 12:21 Subject: [gpfsug-discuss] relion software using GPFS storage Hi All, ? We?ve recently had an issue where a job on our client GPFS cluster caused out main storage to go extremely slowly.? ?The job was running relion using MPI (https://www2.mrc-lmb.cam.ac.uk/relion/index.php?title=Main_Page) ? It caused waiters across the cluster, and caused the load to spike on NSDS on at a time.? When the spike ended on one NSD, it immediately started on another.? ? There were no obvious errors in the logs and the issues cleared immediately after the job was cancelled.? ? Has anyone else see any issues with relion using GPFS storage? ? Michael ? Michael Holliday RITTech MBCS Senior HPC & Research Data Systems Engineer | eMedLab Operations Team Scientific Computing STP | The Francis Crick Institute 1, Midland Road | London | NW1 1AT | United Kingdom Tel: 0203 796 3167 ? The Francis Crick Institute Limited is a registered charity in England and Wales no. 1140062 and a company registered in England and Wales no. 06885462, with its registered office at 1 Midland Road London NW1 1AT _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss -------------- next part -------------- An HTML attachment was scrubbed... URL: From Robert.Oesterlin at nuance.com Wed Feb 27 20:12:54 2019 From: Robert.Oesterlin at nuance.com (Oesterlin, Robert) Date: Wed, 27 Feb 2019 20:12:54 +0000 Subject: [gpfsug-discuss] Registration now open! - US User Group Meeting, April 16-17th, NCAR Boulder Message-ID: <671D229B-C7A1-459D-A42B-DB93502F59FA@nuance.com> Registration is now open: https://www.eventbrite.com/e/spectrum-scale-gpfs-user-group-us-spring-2019-meeting-tickets-57035376346 Please note that agenda details are not set yet but these will be finalized in the next few weeks - when they are I will post to the registration page and the mailing list. - April 15th: Informal social gather on Monday for those arriving early (location TBD) - April 16th: Full day of talks from IBM and the user community, Social and Networking Event (details TBD) - April 17th: Talks and breakout sessions (If you have any topics for the breakout sessions, let us know) Looking forward to seeing everyone in Boulder! 
Bob Oesterlin/Kristy Kallback-Rose
-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From stefan.dietrich at desy.de Thu Feb 28 07:56:56 2019
From: stefan.dietrich at desy.de (Dietrich, Stefan)
Date: Thu, 28 Feb 2019 08:56:56 +0100 (CET)
Subject: [gpfsug-discuss] CES Ganesha netgroup caching?
Message-ID: <2121724779.6221169.1551340616921.JavaMail.zimbra@desy.de>

Hi,

I am currently playing around with LDAP netgroups for NFS exports via CES. However, I could not figure out how long Ganesha is caching the netgroup entries? There is definitely some caching, as adding a host to the netgroup does not immediately grant access to the share. A "getent netgroup " on the CES node returns the correct result, so this is not some other caching effect. Resetting the cache via "ganesha_mgr purge netgroup" works, but is probably not officially supported. The CES nodes are running with GPFS 5.0.2.3 and gpfs.nfs-ganesha-2.5.3-ibm030.01.el7. CES authentication is set to user-defined, the nodes just use SSSD with a rfc2307bis LDAP server.

Regards, Stefan

--
------------------------------------------------------------------------
Stefan Dietrich Deutsches Elektronen-Synchrotron (IT-Systems) Ein Forschungszentrum der Helmholtz-Gemeinschaft Notkestr. 85 phone: +49-40-8998-4696 22607 Hamburg e-mail: stefan.dietrich at desy.de Germany
------------------------------------------------------------------------

From mnaineni at in.ibm.com Thu Feb 28 12:33:50 2019
From: mnaineni at in.ibm.com (Malahal R Naineni)
Date: Thu, 28 Feb 2019 12:33:50 +0000
Subject: [gpfsug-discuss] CES Ganesha netgroup caching?
In-Reply-To: <2121724779.6221169.1551340616921.JavaMail.zimbra@desy.de>
References: <2121724779.6221169.1551340616921.JavaMail.zimbra@desy.de>
Message-ID:

An HTML attachment was scrubbed...
URL:
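For reference, the checks discussed in this thread boil down to something like the following (the netgroup name is a placeholder, and the purge command is the same unofficial workaround mentioned above, so treat it as unsupported):

    getent netgroup mynetgroup    # what SSSD/LDAP currently resolves on the CES node
    ganesha_mgr purge netgroup    # flush Ganesha's cached netgroup entries (unofficial workaround)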
_______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org https://urldefense.proofpoint.com/v2/url?u=http-3A__gpfsug.org_mailman_listinfo_gpfsug-2Ddiscuss&d=DwICAg&c=jf_iaSHvJObTbx-siA1ZOg&r=IbxtjdkPAM2Sbon4Lbbi4w&m=WEtGqEikAHptrhNUxYjEd8vfm1bPVcbCgEcMH4rp-UM&s=MeyrAfodvNKjIFQuVsfXbLlTAQvTBnUVgvNJqv901RA&e= -------------- next part -------------- An HTML attachment was scrubbed... URL: From chair at spectrumscale.org Thu Feb 7 16:09:17 2019 From: chair at spectrumscale.org (Simon Thompson (Spectrum Scale User Group Chair)) Date: Thu, 07 Feb 2019 16:09:17 +0000 Subject: [gpfsug-discuss] UK Spectrum Scale User Group Sponsorship packages Message-ID: We're currently in the process of planning for the 2019 UK Spectrum Scale User Group meeting, to be held in London on 8th/9th May and will again be looking for commercial sponsorship to support the event. I'll be sending a message out to companies who have previously sponsored us with details soon, however if you would like to be contacted about the sponsorship packages, please drop me an email and I'll include your company when we send out the details. Thanks Simon From techie879 at gmail.com Sat Feb 9 01:42:13 2019 From: techie879 at gmail.com (Imam Toufique) Date: Fri, 8 Feb 2019 17:42:13 -0800 Subject: [gpfsug-discuss] question on fileset / quotas Message-ID: Hi Everyone, I am very new to GPFS, just got the system up and running, now starting to get myself setting up filesets and quotas. I have a question, may be in has been answered already, somewhere in this forum, my apologies for that if this is a repeat. My question is: lets say I have an independent fileset called '/mmfs1/crsp_test' , and I set it's quota to 2GB ( quota type FILESET ). STAGING-itoufiqu at crspoitdc-mgmt-001:/mmfs1/crsp_test/itoufiqu$ df -h /mmfs1/crsp_test/ Filesystem Size Used Avail Use% Mounted on mmfs1 2.0G 0 2.0G 0% /mmfs1 Now, I go create a 'dependent' fileset calld 'itoufiqu' under 'crsp_test' , sharing it's inode space, and i was able to set it's quota to 4GB. STAGING-root at crspoitdc-mgmt-001:/mmfs1/crsp_test$ df -h /mmfs1/crsp_test/itoufiqu Filesystem Size Used Avail Use% Mounted on mmfs1 4.0G 4.0G 0 100% /mmfs1 Now, i assume that setting quota of 4GB ( whereas the independent fileset quota is 2GB ) for the above dependent fileset ( 'itoufiqu' ) is being allowed as dependent fileset is sharing inode space from the independent fileset. Is there a way to setup an independent fileset so that it's dependent filesets cannot exceed its quota limit? Another words, if my independent fileset quota is 2GB, I should not be allowed to set quotas for it's dependent filesets more then 2GB ( for the dependent filesets created in aggregate ) ? Thanks for your help! -------------- next part -------------- An HTML attachment was scrubbed... URL: From valdis.kletnieks at vt.edu Sat Feb 9 04:02:49 2019 From: valdis.kletnieks at vt.edu (valdis.kletnieks at vt.edu) Date: Fri, 08 Feb 2019 23:02:49 -0500 Subject: [gpfsug-discuss] question on fileset / quotas In-Reply-To: References: Message-ID: <32045.1549684969@turing-police.cc.vt.edu> On Fri, 08 Feb 2019 17:42:13 -0800, Imam Toufique said: > Is there a way to setup an independent fileset so that it's dependent > filesets cannot exceed its quota limit? Another words, if my independent > fileset quota is 2GB, I should not be allowed to set quotas for it's > dependent filesets more then 2GB ( for the dependent filesets created in > aggregate ) ? Well.. 
to set the quota on the dependent fileset, you have to be root. And the general Unix/Linux philosophy is to not prevent the root user from doing things unless there's a good technical reason(*). There's a lot of "here be dragons" corner cases - for instance, if I create /gpfs/parent and give it 10T of space, does that mean that *each* dependent fileset is limited to 10T, or the *sum* has to remain under 10T? (In other words, is overcommit allowed?). There are other problems, like "Give the parent 8T, give two children 4T each, let each one store 3T, and then reduce the parent quota to 2T" - what should happen then? And quite frankly, the fact that mmrepquota has an entire column of output for "uncertain" when only dealing with *one* fileset tells me there's no sane way to avoid race conditions when dealing with two filesets without some truly performance-ruining levels of filesystem locking.

So I'd say that probably, it's more reasonable to do this outside GPFS - anything from telling everybody who knows the root password not to do it, to teaching whatever automation/provisioning system you have (Ansible, etc) to enforce it. Having said that, if you can nail down the semantics and then make a good business case that it should be done inside of GPFS rather than at the sysadmin level, I'm sure IBM would be willing to at least listen to an RFE....

(*) I remember one Unix variant (Gould's UTX/32) that was perfectly willing to let C code running as root do an unlink(".") rather than return EISDIR, even though it meant you just bought yourself a shiny new fsck - don't ask how I found out :)

From rohwedder at de.ibm.com Mon Feb 11 09:36:06 2019
From: rohwedder at de.ibm.com (Markus Rohwedder)
Date: Mon, 11 Feb 2019 10:36:06 +0100
Subject: [gpfsug-discuss] question on fileset / quotas
In-Reply-To: References: Message-ID:

Hello,

There is no hierarchy between fileset quotas; the fileset quota limits are completely independent of each other. The independent fileset, as you mentioned, provides the common inode space and ties the parent and child together in regards to using inodes from their common inode space and, for example, in regards to snapshots and other features that act on independent filesets.

There are however many degrees of freedom in setting up quota configurations, for example user and group quotas and the per-fileset and per-filesystem quota options. So there may be other ways you could create rules that model your environment and which could provide a means to create limits across several filesets.

For example (this will probably not match your setup, but just to illustrate): You have a group of applications. Each application stores data in one dependent fileset. The filesystem where these exist uses per-filesystem quota accounting. All these filesets are children of an independent fileset; this allows you to create snapshots of all applications together. All applications store data under the same group. You can limit each application's space via fileset quota, and you can limit the whole application group via group quota.
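As a concrete illustration of the independence described above, using the fileset and filesystem names from the original question (syntax per the mmsetquota/mmlsquota/mmrepquota man pages; verify the exact form against your release):

    mmsetquota mmfs1:crsp_test --block 2G:2G    # fileset quota on the independent fileset
    mmsetquota mmfs1:itoufiqu --block 4G:4G     # a larger, fully independent limit on the dependent fileset
    mmlsquota -j crsp_test mmfs1                # show one fileset's quota
    mmrepquota -j mmfs1                         # report all fileset quotas for the filesystem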
Mit freundlichen Grüßen / Kind regards

Dr. Markus Rohwedder
Spectrum Scale GUI Development
Phone: +49 7034 6430190 IBM Deutschland Research & Development
E-Mail: rohwedder at de.ibm.com
Am Weiher 24
65451 Kelsterbach
Germany

From: Imam Toufique
To: gpfsug-discuss at spectrumscale.org
Date: 09.02.2019 02:42
Subject: [gpfsug-discuss] question on fileset / quotas
Sent by: gpfsug-discuss-bounces at spectrumscale.org

Hi Everyone, I am very new to GPFS, just got the system up and running, and am now starting to set up filesets and quotas. I have a question; maybe it has been answered already somewhere in this forum, my apologies if this is a repeat. My question is: let's say I have an independent fileset called '/mmfs1/crsp_test', and I set its quota to 2GB (quota type FILESET).

STAGING-itoufiqu at crspoitdc-mgmt-001:/mmfs1/crsp_test/itoufiqu$ df -h /mmfs1/crsp_test/
Filesystem      Size  Used Avail Use% Mounted on
mmfs1           2.0G     0  2.0G   0% /mmfs1

Now, I go create a 'dependent' fileset called 'itoufiqu' under 'crsp_test', sharing its inode space, and I was able to set its quota to 4GB.

STAGING-root at crspoitdc-mgmt-001:/mmfs1/crsp_test$ df -h /mmfs1/crsp_test/itoufiqu
Filesystem      Size  Used Avail Use% Mounted on
mmfs1           4.0G  4.0G     0 100% /mmfs1

Now, I assume that setting a quota of 4GB (whereas the independent fileset quota is 2GB) for the above dependent fileset ('itoufiqu') is being allowed as the dependent fileset is sharing inode space from the independent fileset. Is there a way to set up an independent fileset so that its dependent filesets cannot exceed its quota limit? In other words, if my independent fileset quota is 2GB, I should not be allowed to set quotas for its dependent filesets of more than 2GB (for the dependent filesets created, in aggregate)? Thanks for your help!

_______________________________________________
gpfsug-discuss mailing list
gpfsug-discuss at spectrumscale.org
https://urldefense.proofpoint.com/v2/url?u=http-3A__gpfsug.org_mailman_listinfo_gpfsug-2Ddiscuss&d=DwICAg&c=jf_iaSHvJObTbx-siA1ZOg&r=IbxtjdkPAM2Sbon4Lbbi4w&m=zufu2dI7tmWy-NT3JtWeBLKdOh7kh4HI2I8z4NyIRkc&s=IrYnLhlxx4D2HcHgbdFkE1S4Rmo3mFX9Q0TmnBd6iYg&e=
-------------- next part --------------
An HTML attachment was scrubbed...
URL:
-------------- next part --------------
A non-text attachment was scrubbed...
Name: ecblank.gif Type: image/gif Size: 45 bytes Desc: not available URL:
-------------- next part --------------
A non-text attachment was scrubbed...
Name: 19326261.gif Type: image/gif Size: 4659 bytes Desc: not available URL:
-------------- next part --------------
A non-text attachment was scrubbed...
Name: graycol.gif Type: image/gif Size: 105 bytes Desc: not available URL:

From heiner.billich at psi.ch Tue Feb 12 17:45:25 2019
From: heiner.billich at psi.ch (Billich Heinrich Rainer (PSI))
Date: Tue, 12 Feb 2019 17:45:25 +0000
Subject: [gpfsug-discuss] ces - change preferred host for an IP without actually moving the IP
Message-ID: <3E654746-B4CA-4317-A048-01B343838A54@psi.ch>

Hello,

Can I change the preferred server for a CES address without actually moving the IP? In my case the IP already moved to the new server due to a failure on a second server. Now I would like the IP to stay even if the other server gets active again; I first want to move a test address only. But "mmces address move" refuses to run as the address already is on the server I want to make the preferred one. I also didn't find where this address assignment is stored; I searched in the files available from ccr.
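For reference, the commands involved look roughly like this (IP and node name are placeholders; flags per the mmces documentation, so verify against your level):

    mmces address list                                          # show current address-to-node assignments
    mmces address move --ces-ip 192.0.2.10 --ces-node ces-01    # refused if the IP already sits on ces-01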
Thank you, Heiner -- Paul Scherrer Institut Heiner Billich System Engineer Scientific Computing Science IT / High Performance Computing WHGA/106 Forschungsstrasse 111 5232 Villigen PSI Switzerland Phone +41 56 310 36 02 heiner.billich at psi.ch https://www.psi.ch -------------- next part -------------- An HTML attachment was scrubbed... URL: From sannaik2 at in.ibm.com Tue Feb 12 19:50:52 2019 From: sannaik2 at in.ibm.com (Sandeep Naik1) Date: Wed, 13 Feb 2019 01:20:52 +0530 Subject: [gpfsug-discuss] Unbalanced pdisk free space In-Reply-To: <83A6EEB0EC738F459A39439733AE8045267E32C0@MBX114.d.ethz.ch> References: <83A6EEB0EC738F459A39439733AE8045267DF159@MBX114.d.ethz.ch>, <83A6EEB0EC738F459A39439733AE8045267E32C0@MBX114.d.ethz.ch> Message-ID: Hi Alvise, Here is response to your question in blue. Q - Can anybody tell me if it is normal that all the pdisks of both my recovery groups, residing on the same physical enclosure have free space equal to (more or less) 1/3 of the free space of the pdisks residing on the other physical enclosure (see attached text files for the command line output) ? Yes it is normal to see variation in free space between pdisks. The variation should be seen in the context of used space and not free space. GNR try to balance space equally across enclosures (failure groups). One enclosure has one SSD (per RG) so it has 41 disk in DA1 while the other one has 42. Enclosure with 42 disk show 360 GiB free space while one with 41 disk show 120 GiB. If you look at used capacity and distribute it equally between two enclosures you will notice that used capacity is almost same between two enclosure. 42 * (10240 - 360) ? 41 * (10240 - 120) I guess when the least free disks are fully occupied (while the others are still partially free) write performance will drop by a factor of two. Correct ? Is there a way (considering that the system is in production) to fix (rebalance) this free space among all pdisk of both enclosures ? You should see in context of size of pdisk, which in your case in 10TB. The disk showing 120GB free is 98% full while the one showing 360GB free is 96% full. This free space is available for creating vdisks and should not be confused with free space available in filesystem. Your pdisk are by and large equally filled so there will be no impact on write performance because of small variation in free space. Hope this helps Thanks, Sandeep Naik Elastic Storage server / GPFS Test ETZ-B, Hinjewadi Pune India (+91) 8600994314 From: "Dorigo Alvise (PSI)" To: gpfsug main discussion list Date: 31/01/2019 04:07 PM Subject: Re: [gpfsug-discuss] Unbalanced pdisk free space Sent by: gpfsug-discuss-bounces at spectrumscale.org They're attached. Thanks! Alvise From: gpfsug-discuss-bounces at spectrumscale.org [gpfsug-discuss-bounces at spectrumscale.org] on behalf of IBM Spectrum Scale [scale at us.ibm.com] Sent: Wednesday, January 30, 2019 9:25 PM To: gpfsug main discussion list Subject: Re: [gpfsug-discuss] Unbalanced pdisk free space Alvise, Could you send us the output of the following commands from both server nodes. mmfsadm dump nspdclient > /tmp/dump_nspdclient. mmfsadm dump pdisk > /tmp/dump_pdisk. 
Regards, The Spectrum Scale (GPFS) team ------------------------------------------------------------------------------------------------------------------ If you feel that your question can benefit other users of Spectrum Scale (GPFS), then please post it to the public IBM developerWroks Forum at https://www.ibm.com/developerworks/community/forums/html/forum?id=11111111-0000-0000-0000-000000000479 . If your query concerns a potential software error in Spectrum Scale (GPFS) and you have an IBM software maintenance contract please contact 1-800-237-5511 in the United States or your local IBM Service Center in other countries. The forum is informally monitored as time permits and should not be used for priority messages to the Spectrum Scale (GPFS) team. From: "Dorigo Alvise (PSI)" To: "gpfsug-discuss at spectrumscale.org" Date: 01/30/2019 08:24 AM Subject: [gpfsug-discuss] Unbalanced pdisk free space Sent by: gpfsug-discuss-bounces at spectrumscale.org Hello, I've a Lenovo Spectrum Scale system DSS-G220 (software dss-g-2.0a) composed of 2x x3560 M5 IO server nodes 1x x3550 M5 client/support node 2x disk enclosures D3284 GPFS/GNR 4.2.3-7 Can anybody tell me if it is normal that all the pdisks of both my recovery groups, residing on the same physical enclosure have free space equal to (more or less) 1/3 of the free space of the pdisks residing on the other physical enclosure (see attached text files for the command line output) ? I guess when the least free disks are fully occupied (while the others are still partially free) write performance will drop by a factor of two. Correct ? Is there a way (considering that the system is in production) to fix (rebalance) this free space among all pdisk of both enclosures ? Should I open a PMR to IBM ? Many thanks, Alvise [attachment "rg1" deleted by Brian Herr/Poughkeepsie/IBM] [attachment "rg2" deleted by Brian Herr/Poughkeepsie/IBM] _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss [attachment "dump_nspdclient.sf-dssio-1" deleted by Sandeep Naik1/India/IBM] [attachment "dump_nspdclient.sf-dssio-2" deleted by Sandeep Naik1/India/IBM] [attachment "dump_pdisk.sf-dssio-1" deleted by Sandeep Naik1/India/IBM] [attachment "dump_pdisk.sf-dssio-2" deleted by Sandeep Naik1/India/IBM] _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org https://urldefense.proofpoint.com/v2/url?u=http-3A__gpfsug.org_mailman_listinfo_gpfsug-2Ddiscuss&d=DwICAg&c=jf_iaSHvJObTbx-siA1ZOg&r=DXkezTwrVXsEOfvoqY7_DLS86P5FtQszjm9zok6upRU&m=rrqeq4UVHOFW9aaiAj-N7Lu6Z7UKBo4-0e3yINS47W0&s=n2t4qaUh-0mamutSSx0E-5j09DbZImKsbDoiM0enBcg&e= -------------- next part -------------- An HTML attachment was scrubbed... URL: From alvise.dorigo at psi.ch Wed Feb 13 08:30:47 2019 From: alvise.dorigo at psi.ch (Dorigo Alvise (PSI)) Date: Wed, 13 Feb 2019 08:30:47 +0000 Subject: [gpfsug-discuss] Unbalanced pdisk free space In-Reply-To: References: <83A6EEB0EC738F459A39439733AE8045267DF159@MBX114.d.ethz.ch>, <83A6EEB0EC738F459A39439733AE8045267E32C0@MBX114.d.ethz.ch>, Message-ID: <83A6EEB0EC738F459A39439733AE8045267E81EC@MBX114.d.ethz.ch> Thank you, I've understood the math and the focus from free space to used one. 
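The comparison referred to here, spelled out as a quick shell check with the numbers quoted in the previous reply (values in GiB):

    echo $(( 42 * (10240 - 360) ))   # enclosure with 42 disks in DA1: 414960 GiB used
    echo $(( 41 * (10240 - 120) ))   # enclosure with 41 disks in DA1: 414920 GiB used

The per-enclosure used capacity comes out essentially equal, which is the point being made above.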
The only thing the remain strange for me is that I've not seen something like this in other systems (IBM ESS GL2 and another Lenovo G240 and G260), but I guess that the reason could be that they have much less used space, and allocated vdisks. thanks, Alvise ________________________________ From: Sandeep Naik1 [sannaik2 at in.ibm.com] Sent: Tuesday, February 12, 2019 8:50 PM To: gpfsug main discussion list; Dorigo Alvise (PSI) Subject: Re: [gpfsug-discuss] Unbalanced pdisk free space Hi Alvise, Here is response to your question in blue. Q - Can anybody tell me if it is normal that all the pdisks of both my recovery groups, residing on the same physical enclosure have free space equal to (more or less) 1/3 of the free space of the pdisks residing on the other physical enclosure (see attached text files for the command line output) ? Yes it is normal to see variation in free space between pdisks. The variation should be seen in the context of used space and not free space. GNR try to balance space equally across enclosures (failure groups). One enclosure has one SSD (per RG) so it has 41 disk in DA1 while the other one has 42. Enclosure with 42 disk show 360 GiB free space while one with 41 disk show 120 GiB. If you look at used capacity and distribute it equally between two enclosures you will notice that used capacity is almost same between two enclosure. 42 * (10240 - 360) ? 41 * (10240 - 120) I guess when the least free disks are fully occupied (while the others are still partially free) write performance will drop by a factor of two. Correct ? Is there a way (considering that the system is in production) to fix (rebalance) this free space among all pdisk of both enclosures ? You should see in context of size of pdisk, which in your case in 10TB. The disk showing 120GB free is 98% full while the one showing 360GB free is 96% full. This free space is available for creating vdisks and should not be confused with free space available in filesystem. Your pdisk are by and large equally filled so there will be no impact on write performance because of small variation in free space. Hope this helps Thanks, Sandeep Naik Elastic Storage server / GPFS Test ETZ-B, Hinjewadi Pune India (+91) 8600994314 From: "Dorigo Alvise (PSI)" To: gpfsug main discussion list Date: 31/01/2019 04:07 PM Subject: Re: [gpfsug-discuss] Unbalanced pdisk free space Sent by: gpfsug-discuss-bounces at spectrumscale.org ________________________________ They're attached. Thanks! Alvise ________________________________ From: gpfsug-discuss-bounces at spectrumscale.org [gpfsug-discuss-bounces at spectrumscale.org] on behalf of IBM Spectrum Scale [scale at us.ibm.com] Sent: Wednesday, January 30, 2019 9:25 PM To: gpfsug main discussion list Subject: Re: [gpfsug-discuss] Unbalanced pdisk free space Alvise, Could you send us the output of the following commands from both server nodes. * mmfsadm dump nspdclient > /tmp/dump_nspdclient. * mmfsadm dump pdisk > /tmp/dump_pdisk. Regards, The Spectrum Scale (GPFS) team ------------------------------------------------------------------------------------------------------------------ If you feel that your question can benefit other users of Spectrum Scale (GPFS), then please post it to the public IBM developerWroks Forum at https://www.ibm.com/developerworks/community/forums/html/forum?id=11111111-0000-0000-0000-000000000479. 
If your query concerns a potential software error in Spectrum Scale (GPFS) and you have an IBM software maintenance contract please contact 1-800-237-5511 in the United States or your local IBM Service Center in other countries. The forum is informally monitored as time permits and should not be used for priority messages to the Spectrum Scale (GPFS) team. From: "Dorigo Alvise (PSI)" To: "gpfsug-discuss at spectrumscale.org" Date: 01/30/2019 08:24 AM Subject: [gpfsug-discuss] Unbalanced pdisk free space Sent by: gpfsug-discuss-bounces at spectrumscale.org ________________________________ Hello, I've a Lenovo Spectrum Scale system DSS-G220 (software dss-g-2.0a) composed of 2x x3560 M5 IO server nodes 1x x3550 M5 client/support node 2x disk enclosures D3284 GPFS/GNR 4.2.3-7 Can anybody tell me if it is normal that all the pdisks of both my recovery groups, residing on the same physical enclosure have free space equal to (more or less) 1/3 of the free space of the pdisks residing on the other physical enclosure (see attached text files for the command line output) ? I guess when the least free disks are fully occupied (while the others are still partially free) write performance will drop by a factor of two. Correct ? Is there a way (considering that the system is in production) to fix (rebalance) this free space among all pdisk of both enclosures ? Should I open a PMR to IBM ? Many thanks, Alvise [attachment "rg1" deleted by Brian Herr/Poughkeepsie/IBM] [attachment "rg2" deleted by Brian Herr/Poughkeepsie/IBM] _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss [attachment "dump_nspdclient.sf-dssio-1" deleted by Sandeep Naik1/India/IBM] [attachment "dump_nspdclient.sf-dssio-2" deleted by Sandeep Naik1/India/IBM] [attachment "dump_pdisk.sf-dssio-1" deleted by Sandeep Naik1/India/IBM] [attachment "dump_pdisk.sf-dssio-2" deleted by Sandeep Naik1/India/IBM] _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss -------------- next part -------------- An HTML attachment was scrubbed... URL: From nico.faerber at id.unibe.ch Fri Feb 15 11:59:43 2019 From: nico.faerber at id.unibe.ch (nico.faerber at id.unibe.ch) Date: Fri, 15 Feb 2019 11:59:43 +0000 Subject: [gpfsug-discuss] Clear old/stale events in GUI Message-ID: Dear all, We see some outdated events ("longwaiters_warn" and "gpfs_warn" for component GPFS with an age of 2 weeks) for some nodes that are resolved in in the meantime ("mmhealth node show" on the affected nodes report HEALTHY for component GPFS). How can I remove those outdated event logs from the GUI? Is there a button/command or do I have to manually delete some records in the database? If yes, what is the recommended procedure? We are running: Cluster minimum release level: 4.2.3.0 GUI release level: 5.0.2-1 Thank you. Best, Nico Universitaet Bern Abt. Informatikdienste Nico F?rber High Performance Computing Gesellschaftsstrasse 6 CH-3012 Bern Raum 104 Tel. +41 (0)31 631 51 89 -------------- next part -------------- An HTML attachment was scrubbed... URL: From Matthias.Ritter at de.ibm.com Fri Feb 15 12:22:31 2019 From: Matthias.Ritter at de.ibm.com (Matthias Ritter) Date: Fri, 15 Feb 2019 12:22:31 +0000 Subject: [gpfsug-discuss] Clear old/stale events in GUI In-Reply-To: References: Message-ID: An HTML attachment was scrubbed... 
URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: Image.155021689579920.png Type: image/png Size: 1167 bytes Desc: not available URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: Image.155021689579921.png Type: image/png Size: 6645 bytes Desc: not available URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: Image.155021689579922.png Type: image/png Size: 1167 bytes Desc: not available URL: From nico.faerber at id.unibe.ch Fri Feb 15 14:05:01 2019 From: nico.faerber at id.unibe.ch (nico.faerber at id.unibe.ch) Date: Fri, 15 Feb 2019 14:05:01 +0000 Subject: [gpfsug-discuss] Clear old/stale events in GUI In-Reply-To: References: Message-ID: <46BF617B-D84D-4D80-8C97-506243DFCAF5@id.unibe.ch> Dear Mr. Ritter, It worked. The stale events are gone. Thank you very much. Best, Nico --- Universit?t Bern Informatikdienste Gruppe Systemdienste Nico F?rber Systemadministrator HPC Hochschulstrasse 6 CH-3012 Bern Tel. +41 (0)31 631 51 89 mailto: grid-support at id.unibe.ch http://www.id.unibe.ch/ Von: im Auftrag von Matthias Ritter Antworten an: gpfsug main discussion list Datum: Freitag, 15. Februar 2019 um 13:22 An: "gpfsug-discuss at spectrumscale.org" Cc: "gpfsug-discuss at spectrumscale.org" Betreff: Re: [gpfsug-discuss] Clear old/stale events in GUI Hello Mr. F?rber, please run on each GUI node you have the following command: /usr/lpp/mmfs/gui/cli/lshealth --reset This should help clearing this stale events not shown by mmhealth. Mit freundlichen Gr??en / Kind regards [cid:155021689579920] [IBM Spectrum Scale] * Matthias Ritter Spectrum Scale GUI Development Department M069 / Spectrum Scale Software Development +49-7034-2744-1977 Matthias.Ritter at de.ibm.com [cid:155021689579922] IBM Deutschland Research & Development GmbH Vorsitzender des Aufsichtsrats: Matthias Hartmann Gesch?ftsf?hrung: Dirk Wittkopp Sitz der Gesellschaft: B?blingen / Registergericht: Amtsgericht Stuttgart, HRB 243294 ----- Urspr?ngliche Nachricht ----- Von: Gesendet von: gpfsug-discuss-bounces at spectrumscale.org An: CC: Betreff: [gpfsug-discuss] Clear old/stale events in GUI Datum: Fr, 15. Feb 2019 13:14 Dear all, We see some outdated events ("longwaiters_warn" and "gpfs_warn" for component GPFS with an age of 2 weeks) for some nodes that are resolved in in the meantime ("mmhealth node show" on the affected nodes report HEALTHY for component GPFS). How can I remove those outdated event logs from the GUI? Is there a button/command or do I have to manually delete some records in the database? If yes, what is the recommended procedure? We are running: Cluster minimum release level: 4.2.3.0 GUI release level: 5.0.2-1 Thank you. Best, Nico Universitaet Bern Abt. Informatikdienste Nico F?rber High Performance Computing Gesellschaftsstrasse 6 CH-3012 Bern Raum 104 Tel. +41 (0)31 631 51 89 _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: image001.png Type: image/png Size: 1168 bytes Desc: image001.png URL: -------------- next part -------------- A non-text attachment was scrubbed... 
Name: image002.png Type: image/png Size: 6646 bytes Desc: image002.png URL: From Kevin.Buterbaugh at Vanderbilt.Edu Fri Feb 15 15:10:57 2019 From: Kevin.Buterbaugh at Vanderbilt.Edu (Buterbaugh, Kevin L) Date: Fri, 15 Feb 2019 15:10:57 +0000 Subject: [gpfsug-discuss] Clarification of mmdiag --iohist output Message-ID: Hi All, Been reading man pages, docs, and Googling, and haven?t found a definitive answer to this question, so I knew exactly where to turn? ;-) I?m dealing with some slow I/O?s to certain storage arrays in our environments ? like really, really slow I/O?s ? here?s just one example from one of my NSD servers of a 10 second I/O: 08:49:34.943186 W data 30:41615622144 2048 10115.192 srv dm-92 So here?s my question ? when mmdiag ?iohist tells me that that I/O took slightly over 10 seconds, is that: 1. The time from when the NSD server received the I/O request from the client until it shipped the data back onto the wire towards the client? 2. The time from when the client issued the I/O request until it received the data back from the NSD server? 3. Something else? I?m thinking it?s #1, but want to confirm. Which one it is has very obvious implications for our troubleshooting steps. Thanks in advance? Kevin ? Kevin Buterbaugh - Senior System Administrator Vanderbilt University - Advanced Computing Center for Research and Education Kevin.Buterbaugh at vanderbilt.edu - (615)875-9633 -------------- next part -------------- An HTML attachment was scrubbed... URL: From aaron.knister at gmail.com Sun Feb 17 14:26:23 2019 From: aaron.knister at gmail.com (Aaron Knister) Date: Sun, 17 Feb 2019 09:26:23 -0500 Subject: [gpfsug-discuss] Clarification of mmdiag --iohist output In-Reply-To: References: Message-ID: Hi Kevin, It's funny you bring this up because I was looking at this yesterday. My belief is that it's the time the from when the I/O request was queued internally by the client to when the I/O response was received from the NSD server which means it absolutely includes the network RTT. It would be great to get formal confirmation of this from someone who knows the code. Here's some trace data showing a single 4K read from an NSD client. I've stripped out a bunch of uninteresting stuff. It's my belief that the TRACE_IO indicates the point at which the "i/o timer" reported on by mmdiag --iohist begins ticking. The testing data seems to support this. If I'm correct, the testing data shows that the RDMA I/O to the NSD server occurs within the TRACE_IO timing window. The other thing that makes me believe this, is in my testing the mmdiag --iohist on the client shows an average latency of ~230us for a 4K read whereas mmdiag --iohist on the NSD server appears to show an average latency of ~170us when servicing those 4K reads from the back-end disk (a DDN SFA14KX). 
0.000218276 37005 TRACE_DISK: doReplicatedRead: da 34:490710888 0.000218424 37005 TRACE_IO: QIO: read data tag 743942 108137 ioVecSize 1 1st buf 0x122F000 nsdName S01_DMD_NSD_034 da 34:490710888 nSectors 8 align 0 by iocMBHandler (DioHandlerThread) 0.000218566 37005 TRACE_DLEASE: checkLeaseForIO: rc 0 0.000218628 37005 TRACE_IO: SIO: ioVecSize 1 1st buf 0x122F000 nsdId 0A011103:5C59DBAC da 34:490710888 nSectors 8 doioP 0x1806994D300 isDaemonAddr 0 0.000218672 37005 TRACE_FS: verify4KIO exit: code 4 err 0 0.000219106 37005 TRACE_NSD: nsdDoIO enter: read ioVecSize 1 1st bufAddr 0x122F000 nsdId 0A011103:5C59DBAC da 34:490710888 nBytes 4096 isDaemonAddr 0 0.000219408 37005 TRACE_GCRYPTO: EncBufPool::getTmpBuf(): bsize=4096 code=205 err=0 bufP=0x180656C1058 outBufP=0x180656C1058 index=0 0.000220105 37005 TRACE_TS: sendMessage msg_id 22993695: dest 10.3.17.3 sto03 0.000220436 37005 TRACE_RDMA: verbs::verbsSend_i: enter: index 11 cookie 12 node msg_id 22993695 len 92 0.000221111 37005 TRACE_RDMA: verbsConn::postSend: currRpcHeadP 0x7F6D7FA106E8 sendType SEND_BUFFER_RPC index 11 cookie 12 threadId 37005 bufferId.index 11 0.000221662 37005 TRACE_MUTEX: Waiting on fast condvar for signal 0x7F6D7FA106F8 RdmaSend_NSD_SVC 0.000426716 16691 TRACE_RDMA: handleRecvComp: success 1 of 1 nWrSuccess 1 index 11 cookie 12 wr_id 0xB0E6E00000000 bufferP 0x7F6D7EEE1700 byte_len 4144 0.000432140 37005 TRACE_NSD: nsdDoIO_ReadAndCheck: read complete, len 0 status 6 err 0 bufP 0x180656C1058 dioIsOverRdma 1 ioDataP 0x200000BC000 ckSumType NsdCksum_None 0.000432163 37005 TRACE_NSD: nsdDoIO_ReadAndCheck: exit err 0 0.000433707 37005 TRACE_GCRYPTO: EncBufPool::releaseTmpBuf(): exit bsize=8192 err=0 inBufP=0x180656C1058 bufP=0x180656C1058 index=0 0.000433777 37005 TRACE_NSD: nsdDoIO exit: err 0 0 0.000433844 37005 TRACE_IO: FIO: read data tag 743942 108137 ioVecSize 1 1st buf 0x122F000 nsdId 0A011103:5C59DBAC da 34:490710888 nSectors 8 err 0 0.000434236 37005 TRACE_DISK: postIO: qosid A00D91E read data disk FFFFFFFF ioVecSize 1 1st buf 0x122F000 err 0 duration 0.000215000 by iocMBHandler (DioHandlerThread) I'd suggest looking at "mmdiag --iohist" on the NSD server itself and see if/how that differs from the client. The other thing you could do is see if your NSD server queues are backing up (e.g. "mmfsadm saferdump nsd" and look for "requests pending" on queues where the "active" field is > 0). That doesn't necessarily mean you need to tune your queues but I'd suggest that if the disk I/O on your NSD server looks healthy (e.g. low latency, not overly-taxed) that you could benefit from queue tuning. -Aaron On Sat, Feb 16, 2019 at 9:47 AM Buterbaugh, Kevin L < Kevin.Buterbaugh at vanderbilt.edu> wrote: > Hi All, > > Been reading man pages, docs, and Googling, and haven?t found a definitive > answer to this question, so I knew exactly where to turn? ;-) > > I?m dealing with some slow I/O?s to certain storage arrays in our > environments ? like really, really slow I/O?s ? here?s just one example > from one of my NSD servers of a 10 second I/O: > > 08:49:34.943186 W data 30:41615622144 2048 10115.192 srv > dm-92 > > So here?s my question ? when mmdiag ?iohist tells me that that I/O took > slightly over 10 seconds, is that: > > 1. The time from when the NSD server received the I/O request from the > client until it shipped the data back onto the wire towards the client? > 2. The time from when the client issued the I/O request until it received > the data back from the NSD server? > 3. Something else? 
> > I?m thinking it?s #1, but want to confirm. Which one it is has very > obvious implications for our troubleshooting steps. Thanks in advance? > > Kevin > ? > Kevin Buterbaugh - Senior System Administrator > Vanderbilt University - Advanced Computing Center for Research and > Education > Kevin.Buterbaugh at vanderbilt.edu - (615)875-9633 > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > -------------- next part -------------- An HTML attachment was scrubbed... URL: From scale at us.ibm.com Sun Feb 17 18:13:17 2019 From: scale at us.ibm.com (IBM Spectrum Scale) Date: Sun, 17 Feb 2019 23:43:17 +0530 Subject: [gpfsug-discuss] ces - change preferred host for an IP without actually moving the IP In-Reply-To: <3E654746-B4CA-4317-A048-01B343838A54@psi.ch> References: <3E654746-B4CA-4317-A048-01B343838A54@psi.ch> Message-ID: @Frank, Can you please help with the below query. Regards, The Spectrum Scale (GPFS) team ------------------------------------------------------------------------------------------------------------------ If you feel that your question can benefit other users of Spectrum Scale (GPFS), then please post it to the public IBM developerWroks Forum at https://www.ibm.com/developerworks/community/forums/html/forum?id=11111111-0000-0000-0000-000000000479 . If your query concerns a potential software error in Spectrum Scale (GPFS) and you have an IBM software maintenance contract please contact 1-800-237-5511 in the United States or your local IBM Service Center in other countries. The forum is informally monitored as time permits and should not be used for priority messages to the Spectrum Scale (GPFS) team. From: "Billich Heinrich Rainer (PSI)" To: gpfsug main discussion list Date: 02/12/2019 11:18 PM Subject: [gpfsug-discuss] ces - change preferred host for an IP without actually moving the IP Sent by: gpfsug-discuss-bounces at spectrumscale.org Hello, Can I change the preferred server for a ces address without actually moving the IP? In my case the IP already moved to the new server due to a failure on a second server. Now I would like the IP to stay even if the other server gets active again: I first want to move a test address only. But ?mmces address move? denies to run as the address already is on the server I want to make the preferred one. I also didn?t find where this address assignment is stored, I searched in the files available from ccr. Thank you, Heiner -- Paul Scherrer Institut Heiner Billich System Engineer Scientific Computing Science IT / High Performance Computing WHGA/106 Forschungsstrasse 111 5232 Villigen PSI Switzerland Phone +41 56 310 36 02 heiner.billich at psi.ch https://www.psi.ch _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org https://urldefense.proofpoint.com/v2/url?u=http-3A__gpfsug.org_mailman_listinfo_gpfsug-2Ddiscuss&d=DwICAg&c=jf_iaSHvJObTbx-siA1ZOg&r=IbxtjdkPAM2Sbon4Lbbi4w&m=AfpcM3p1Ru44FyaSyGfml_GFX4T4mQGuaGNURp8MUSI&s=CaYKqK4hj0eunF_WiOWve6Iq3C4aqqSIV0xxDEM8zAQ&e= -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From scale at us.ibm.com Sun Feb 17 19:01:24 2019 From: scale at us.ibm.com (IBM Spectrum Scale) Date: Mon, 18 Feb 2019 00:31:24 +0530 Subject: [gpfsug-discuss] Clarification of mmdiag --iohist output In-Reply-To: References: Message-ID: Hi Kevin, The I/O hist shown by the command mmdiag --iohist actually depends on the node on which you are running this command from. If you are running this on a NSD server node then it will show the time taken to complete/serve the read or write I/O operation sent from the client node. And if you are running this on a client (or non NSD server) node then it will show the complete time taken by the read or write I/O operation requested by the client node to complete. So in a nut shell for the NSD server case it is just the latency of the I/O done on disk by the server whereas for the NSD client case it also the latency of send and receive of I/O request to the NSD server along with the latency of I/O done on disk by the NSD server. I hope this answers your query. Regards, The Spectrum Scale (GPFS) team ------------------------------------------------------------------------------------------------------------------ If you feel that your question can benefit other users of Spectrum Scale (GPFS), then please post it to the public IBM developerWroks Forum at https://www.ibm.com/developerworks/community/forums/html/forum?id=11111111-0000-0000-0000-000000000479 . If your query concerns a potential software error in Spectrum Scale (GPFS) and you have an IBM software maintenance contract please contact 1-800-237-5511 in the United States or your local IBM Service Center in other countries. The forum is informally monitored as time permits and should not be used for priority messages to the Spectrum Scale (GPFS) team. From: "Buterbaugh, Kevin L" To: gpfsug main discussion list Date: 02/16/2019 08:18 PM Subject: [gpfsug-discuss] Clarification of mmdiag --iohist output Sent by: gpfsug-discuss-bounces at spectrumscale.org Hi All, Been reading man pages, docs, and Googling, and haven?t found a definitive answer to this question, so I knew exactly where to turn? ;-) I?m dealing with some slow I/O?s to certain storage arrays in our environments ? like really, really slow I/O?s ? here?s just one example from one of my NSD servers of a 10 second I/O: 08:49:34.943186 W data 30:41615622144 2048 10115.192 srv dm-92 So here?s my question ? when mmdiag ?iohist tells me that that I/O took slightly over 10 seconds, is that: 1. The time from when the NSD server received the I/O request from the client until it shipped the data back onto the wire towards the client? 2. The time from when the client issued the I/O request until it received the data back from the NSD server? 3. Something else? I?m thinking it?s #1, but want to confirm. Which one it is has very obvious implications for our troubleshooting steps. Thanks in advance? Kevin ? Kevin Buterbaugh - Senior System Administrator Vanderbilt University - Advanced Computing Center for Research and Education Kevin.Buterbaugh at vanderbilt.edu - (615)875-9633 _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org https://urldefense.proofpoint.com/v2/url?u=http-3A__gpfsug.org_mailman_listinfo_gpfsug-2Ddiscuss&d=DwICAg&c=jf_iaSHvJObTbx-siA1ZOg&r=IbxtjdkPAM2Sbon4Lbbi4w&m=zgeKdB1auU2SQrpQXrxc88rzoAWczKl_H7fqsgwqpv0&s=vbOLNkf-Y_NBNABzd8Enw14ykpYN2q5SoQLkAKiGIrU&e= -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From oehmes at gmail.com Sun Feb 17 18:59:37 2019 From: oehmes at gmail.com (Sven Oehme) Date: Sun, 17 Feb 2019 10:59:37 -0800 Subject: [gpfsug-discuss] Clarification of mmdiag --iohist output In-Reply-To: References: Message-ID: <447A32D4-5B23-47F3-B55C-6B51D411BD67@gmail.com> If you run it on the client, it includes local queuing, network as well as NSD Server processing and the actual device I/O time. if issued on the NSD Server it contains processing and I/O time, the processing shouldn?t really add any overhead but in some cases I have seen it contributing. If you corelate the client and server iohist outputs you can find the server entry based on the tags in the iohist output, this allows you to see exactly how much time was spend on network vs on the server to rule out network as the problem. Sven From: on behalf of Aaron Knister Reply-To: gpfsug main discussion list Date: Sunday, February 17, 2019 at 6:26 AM To: gpfsug main discussion list Subject: Re: [gpfsug-discuss] Clarification of mmdiag --iohist output Hi Kevin, It's funny you bring this up because I was looking at this yesterday. My belief is that it's the time the from when the I/O request was queued internally by the client to when the I/O response was received from the NSD server which means it absolutely includes the network RTT. It would be great to get formal confirmation of this from someone who knows the code. Here's some trace data showing a single 4K read from an NSD client. I've stripped out a bunch of uninteresting stuff. It's my belief that the TRACE_IO indicates the point at which the "i/o timer" reported on by mmdiag --iohist begins ticking. The testing data seems to support this. If I'm correct, the testing data shows that the RDMA I/O to the NSD server occurs within the TRACE_IO timing window. The other thing that makes me believe this, is in my testing the mmdiag --iohist on the client shows an average latency of ~230us for a 4K read whereas mmdiag --iohist on the NSD server appears to show an average latency of ~170us when servicing those 4K reads from the back-end disk (a DDN SFA14KX). 
0.000218276 37005 TRACE_DISK: doReplicatedRead: da 34:490710888 0.000218424 37005 TRACE_IO: QIO: read data tag 743942 108137 ioVecSize 1 1st buf 0x122F000 nsdName S01_DMD_NSD_034 da 34:490710888 nSectors 8 align 0 by iocMBHandler (DioHandlerThread) 0.000218566 37005 TRACE_DLEASE: checkLeaseForIO: rc 0 0.000218628 37005 TRACE_IO: SIO: ioVecSize 1 1st buf 0x122F000 nsdId 0A011103:5C59DBAC da 34:490710888 nSectors 8 doioP 0x1806994D300 isDaemonAddr 0 0.000218672 37005 TRACE_FS: verify4KIO exit: code 4 err 0 0.000219106 37005 TRACE_NSD: nsdDoIO enter: read ioVecSize 1 1st bufAddr 0x122F000 nsdId 0A011103:5C59DBAC da 34:490710888 nBytes 4096 isDaemonAddr 0 0.000219408 37005 TRACE_GCRYPTO: EncBufPool::getTmpBuf(): bsize=4096 code=205 err=0 bufP=0x180656C1058 outBufP=0x180656C1058 index=0 0.000220105 37005 TRACE_TS: sendMessage msg_id 22993695: dest 10.3.17.3 sto03 0.000220436 37005 TRACE_RDMA: verbs::verbsSend_i: enter: index 11 cookie 12 node msg_id 22993695 len 92 0.000221111 37005 TRACE_RDMA: verbsConn::postSend: currRpcHeadP 0x7F6D7FA106E8 sendType SEND_BUFFER_RPC index 11 cookie 12 threadId 37005 bufferId.index 11 0.000221662 37005 TRACE_MUTEX: Waiting on fast condvar for signal 0x7F6D7FA106F8 RdmaSend_NSD_SVC 0.000426716 16691 TRACE_RDMA: handleRecvComp: success 1 of 1 nWrSuccess 1 index 11 cookie 12 wr_id 0xB0E6E00000000 bufferP 0x7F6D7EEE1700 byte_len 4144 0.000432140 37005 TRACE_NSD: nsdDoIO_ReadAndCheck: read complete, len 0 status 6 err 0 bufP 0x180656C1058 dioIsOverRdma 1 ioDataP 0x200000BC000 ckSumType NsdCksum_None 0.000432163 37005 TRACE_NSD: nsdDoIO_ReadAndCheck: exit err 0 0.000433707 37005 TRACE_GCRYPTO: EncBufPool::releaseTmpBuf(): exit bsize=8192 err=0 inBufP=0x180656C1058 bufP=0x180656C1058 index=0 0.000433777 37005 TRACE_NSD: nsdDoIO exit: err 0 0 0.000433844 37005 TRACE_IO: FIO: read data tag 743942 108137 ioVecSize 1 1st buf 0x122F000 nsdId 0A011103:5C59DBAC da 34:490710888 nSectors 8 err 0 0.000434236 37005 TRACE_DISK: postIO: qosid A00D91E read data disk FFFFFFFF ioVecSize 1 1st buf 0x122F000 err 0 duration 0.000215000 by iocMBHandler (DioHandlerThread) I'd suggest looking at "mmdiag --iohist" on the NSD server itself and see if/how that differs from the client. The other thing you could do is see if your NSD server queues are backing up (e.g. "mmfsadm saferdump nsd" and look for "requests pending" on queues where the "active" field is > 0). That doesn't necessarily mean you need to tune your queues but I'd suggest that if the disk I/O on your NSD server looks healthy (e.g. low latency, not overly-taxed) that you could benefit from queue tuning. -Aaron On Sat, Feb 16, 2019 at 9:47 AM Buterbaugh, Kevin L wrote: Hi All, Been reading man pages, docs, and Googling, and haven?t found a definitive answer to this question, so I knew exactly where to turn? ;-) I?m dealing with some slow I/O?s to certain storage arrays in our environments ? like really, really slow I/O?s ? here?s just one example from one of my NSD servers of a 10 second I/O: 08:49:34.943186 W data 30:41615622144 2048 10115.192 srv dm-92 So here?s my question ? when mmdiag ?iohist tells me that that I/O took slightly over 10 seconds, is that: 1. The time from when the NSD server received the I/O request from the client until it shipped the data back onto the wire towards the client? 2. The time from when the client issued the I/O request until it received the data back from the NSD server? 3. Something else? I?m thinking it?s #1, but want to confirm. 
Which one it is has very obvious implications for our troubleshooting steps. Thanks in advance? Kevin ? Kevin Buterbaugh - Senior System Administrator Vanderbilt University - Advanced Computing Center for Research and Education Kevin.Buterbaugh at vanderbilt.edu - (615)875-9633 _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss -------------- next part -------------- An HTML attachment was scrubbed... URL: From scrusan at ddn.com Mon Feb 18 02:48:22 2019 From: scrusan at ddn.com (Steve Crusan) Date: Mon, 18 Feb 2019 02:48:22 +0000 Subject: [gpfsug-discuss] Clarification of mmdiag --iohist output In-Reply-To: References: , Message-ID: Context is key here. Where you run mmdiag?iohist matters, clientside or nsd server side. >From what I have seen and possibly understand, from the client, the time field indicates when an I/O was fully serviced (arrived into VFS layer, to be sent to the application), including RTT from the servers/disk. If you run the same command server side, my understanding is that the time field indicates how long it took for the server to write/read data to or from disk. For example, a few years ago I had to fix a system which was described as unacceptably slow after an upgrade from gpfs 3.3 to 3.4 or 3.5 (don?t fully remember). Iohist client side was showing many IOs waiting for 10 all the way up and to 50 SECONDS, not ms. Server side, I/O was being serviced via iohist within less than 5 ms. Also verified with iostat, basically doing a paltry 25MB/s per NSD server. What happened is that the small vs large queueing system changed in that version of GPFS, so there were hundreds of large IOS queued (found via Mmfsadm dump nsd server side) due to the limited number of large queues and threads server side. A quick mmchconfig fixed the problem, but if I only looked at the servers, it would?ve appeared things were fine, because the IO backend was sitting around twirling its thumbs. I don?t have access to the code, but all of behavior I have seen leads me to believe client side iohist includes network RTT. -Steve ________________________________ From: gpfsug-discuss-bounces at spectrumscale.org on behalf of Aaron Knister Sent: Sunday, February 17, 2019 8:26:23 AM To: gpfsug main discussion list Subject: Re: [gpfsug-discuss] Clarification of mmdiag --iohist output Hi Kevin, It's funny you bring this up because I was looking at this yesterday. My belief is that it's the time the from when the I/O request was queued internally by the client to when the I/O response was received from the NSD server which means it absolutely includes the network RTT. It would be great to get formal confirmation of this from someone who knows the code. Here's some trace data showing a single 4K read from an NSD client. I've stripped out a bunch of uninteresting stuff. It's my belief that the TRACE_IO indicates the point at which the "i/o timer" reported on by mmdiag --iohist begins ticking. The testing data seems to support this. If I'm correct, the testing data shows that the RDMA I/O to the NSD server occurs within the TRACE_IO timing window. 
The other thing that makes me believe this, is in my testing the mmdiag --iohist on the client shows an average latency of ~230us for a 4K read whereas mmdiag --iohist on the NSD server appears to show an average latency of ~170us when servicing those 4K reads from the back-end disk (a DDN SFA14KX). 0.000218276 37005 TRACE_DISK: doReplicatedRead: da 34:490710888 0.000218424 37005 TRACE_IO: QIO: read data tag 743942 108137 ioVecSize 1 1st buf 0x122F000 nsdName S01_DMD_NSD_034 da 34:490710888 nSectors 8 align 0 by iocMBHandler (DioHandlerThread) 0.000218566 37005 TRACE_DLEASE: checkLeaseForIO: rc 0 0.000218628 37005 TRACE_IO: SIO: ioVecSize 1 1st buf 0x122F000 nsdId 0A011103:5C59DBAC da 34:490710888 nSectors 8 doioP 0x1806994D300 isDaemonAddr 0 0.000218672 37005 TRACE_FS: verify4KIO exit: code 4 err 0 0.000219106 37005 TRACE_NSD: nsdDoIO enter: read ioVecSize 1 1st bufAddr 0x122F000 nsdId 0A011103:5C59DBAC da 34:490710888 nBytes 4096 isDaemonAddr 0 0.000219408 37005 TRACE_GCRYPTO: EncBufPool::getTmpBuf(): bsize=4096 code=205 err=0 bufP=0x180656C1058 outBufP=0x180656C1058 index=0 0.000220105 37005 TRACE_TS: sendMessage msg_id 22993695: dest 10.3.17.3 sto03 0.000220436 37005 TRACE_RDMA: verbs::verbsSend_i: enter: index 11 cookie 12 node msg_id 22993695 len 92 0.000221111 37005 TRACE_RDMA: verbsConn::postSend: currRpcHeadP 0x7F6D7FA106E8 sendType SEND_BUFFER_RPC index 11 cookie 12 threadId 37005 bufferId.index 11 0.000221662 37005 TRACE_MUTEX: Waiting on fast condvar for signal 0x7F6D7FA106F8 RdmaSend_NSD_SVC 0.000426716 16691 TRACE_RDMA: handleRecvComp: success 1 of 1 nWrSuccess 1 index 11 cookie 12 wr_id 0xB0E6E00000000 bufferP 0x7F6D7EEE1700 byte_len 4144 0.000432140 37005 TRACE_NSD: nsdDoIO_ReadAndCheck: read complete, len 0 status 6 err 0 bufP 0x180656C1058 dioIsOverRdma 1 ioDataP 0x200000BC000 ckSumType NsdCksum_None 0.000432163 37005 TRACE_NSD: nsdDoIO_ReadAndCheck: exit err 0 0.000433707 37005 TRACE_GCRYPTO: EncBufPool::releaseTmpBuf(): exit bsize=8192 err=0 inBufP=0x180656C1058 bufP=0x180656C1058 index=0 0.000433777 37005 TRACE_NSD: nsdDoIO exit: err 0 0 0.000433844 37005 TRACE_IO: FIO: read data tag 743942 108137 ioVecSize 1 1st buf 0x122F000 nsdId 0A011103:5C59DBAC da 34:490710888 nSectors 8 err 0 0.000434236 37005 TRACE_DISK: postIO: qosid A00D91E read data disk FFFFFFFF ioVecSize 1 1st buf 0x122F000 err 0 duration 0.000215000 by iocMBHandler (DioHandlerThread) I'd suggest looking at "mmdiag --iohist" on the NSD server itself and see if/how that differs from the client. The other thing you could do is see if your NSD server queues are backing up (e.g. "mmfsadm saferdump nsd" and look for "requests pending" on queues where the "active" field is > 0). That doesn't necessarily mean you need to tune your queues but I'd suggest that if the disk I/O on your NSD server looks healthy (e.g. low latency, not overly-taxed) that you could benefit from queue tuning. -Aaron On Sat, Feb 16, 2019 at 9:47 AM Buterbaugh, Kevin L > wrote: Hi All, Been reading man pages, docs, and Googling, and haven?t found a definitive answer to this question, so I knew exactly where to turn? ;-) I?m dealing with some slow I/O?s to certain storage arrays in our environments ? like really, really slow I/O?s ? here?s just one example from one of my NSD servers of a 10 second I/O: 08:49:34.943186 W data 30:41615622144 2048 10115.192 srv dm-92 So here?s my question ? when mmdiag ?iohist tells me that that I/O took slightly over 10 seconds, is that: 1. 
The time from when the NSD server received the I/O request from the client until it shipped the data back onto the wire towards the client? 2. The time from when the client issued the I/O request until it received the data back from the NSD server? 3. Something else? I?m thinking it?s #1, but want to confirm. Which one it is has very obvious implications for our troubleshooting steps. Thanks in advance? Kevin ? Kevin Buterbaugh - Senior System Administrator Vanderbilt University - Advanced Computing Center for Research and Education Kevin.Buterbaugh at vanderbilt.edu - (615)875-9633 _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss -------------- next part -------------- An HTML attachment was scrubbed... URL: From oehmes at gmail.com Tue Feb 19 19:46:36 2019 From: oehmes at gmail.com (Sven Oehme) Date: Tue, 19 Feb 2019 11:46:36 -0800 Subject: [gpfsug-discuss] Clarification of mmdiag --iohist output In-Reply-To: References: Message-ID: <6884F598-9039-4163-BD56-7D9E0C815044@gmail.com> Just to add a bit more details to that, If you want to track down an individual i/o or all i/o to a particular file you can do this with mmfsadm dump iohist (mmdiag doesn?t give you all you need) : so run /usr/lpp/mmfs/bin/mmfsadm dump iohist >iohist on server as well as client : I/O history: I/O start time RW??? Buf type disk:sectorNum???? nSec? time ms????? tag1???????? tag2?????????? Disk UID typ??????? NSD node?? context thread?????????????????????????? comment --------------- -- ----------- ----------------- -----? ------- --------- ------------ ------------------ --- --------------- --------- -------------------------------- ------- 12:22:41.880663? W??????? data??? 1:5602050048?? 32768? 927.272 258249737????????? 900? 0A2405A0:5C6BD9B0 cli? 172.16.254.160 Prefetch? WritebehindWorkerThread 12:22:42.038653? W??????? data??? 4:5815107584?? 32768? 803.106 258249737????????? 903? 0A2405A0:5C6BD9B6 cli? 172.16.254.163 Prefetch? WritebehindWorkerThread 12:22:42.504966? W??????? data??? 3:695664640??? 32768? 375.272 258249737????????? 918? 0A2405A0:5C6BD9B2 cli? 172.16.254.162 Prefetch? WritebehindWorkerThread 12:22:42.592712? W??????? data??? 1:1121779712?? 32768? 311.026 258249737????????? 920? 0A2405A0:5C6BD9B0 cli? 172.16.254.160 Prefetch? WritebehindWorkerThread 12:22:42.641689? W??????? data??? 2:1334837248?? 32768? 350.373 258249737????????? 921? 0A2405A0:5C6BD9B4 cli? 172.16.254.161 Prefetch? WritebehindWorkerThread 12:22:42.301120? W??????? data??? 1:6667337728?? 32768? 758.629 258249737????????? 912? 0A2405A0:5C6BD9B0 cli? 172.16.254.160 Prefetch? WritebehindWorkerThread 12:22:42.176365? W??????? data??? 1:6241222656?? 32768? 895.423 258249737??????? ??908? 0A2405A0:5C6BD9B0 cli? 172.16.254.160 Prefetch? WritebehindWorkerThread 12:22:42.283152? W??????? data??? 4:6454280192?? 32768? 840.528 258249737????????? 911? 0A2405A0:5C6BD9B6 cli? 172.16.254.163 Prefetch? WritebehindWorkerThread 12:22:42.149964? W??????? data??? 4:6028165120?? 32768? 981.661 258249737????????? 907? 0A2405A0:5C6BD9B6 cli? 172.16.254.163 Prefetch? WritebehindWorkerThread 12:22:42.130402? W??????? data??? 3:6028165120?? 32768 1021.175 258249737????????? 906? 0A2405A0:5C6BD9B2 cli? 172.16.254.162 Prefetch? WritebehindWorkerThread 12:22:42.838850? W??????? data??? 2:1867481088?? 32768? 343.912 258249737????????? 925? 0A2405A0:5C6BD9B4 cli? 172.16.254.161 Prefetch? WritebehindWorkerThread 12:22:42.841800? W??????? data??? 
3:1867481088?? 32768? 397.089 258249737????????? 926? 0A2405A0:5C6BD9B2 cli? 172.16.254.162 Prefetch? WritebehindWorkerThread 12:22:42.652912? W??????? data??? 3:1334837248?? 32768? 637.628 258249737????????? 922? 0A2405A0:5C6BD9B2 cli? 172.16.254.162 Prefetch? WritebehindWorkerThread 12:22:42.883946? W??????? data??? 1:1974009856?? 32768? 442.953 258249737????????? 928? 0A2405A0:5C6BD9B0 cli? 172.16.254.160 Prefetch? WritebehindWorkerThread 12:22:42.903782? W??????? data??? 3:1974009856?? 32768? 424.285 258249737??????? ??930? 0A2405A0:5C6BD9B2 cli? 172.16.254.162 Prefetch? WritebehindWorkerThread 12:22:42.329905? W??????? data??? 4:269549568??? 32768 1061.313 258249737????????? 915? 0A2405A0:5C6BD9B6 cli? 172.16.254.163 Prefetch? WritebehindWorkerThread 12:22:42.392467? W??????? data??? 1:376078336??? 32768? 998.770 258249737????????? 916? 0A2405A0:5C6BD9B0 cli? 172.16.254.160 Prefetch? WritebehindWorkerThread in this example I only care about one file with inode number 258249737 (which is stored in tag1) : Now simply on the server run : grep '258249737' iohist? 19:22:42.533259? W??????? data??? 1:5602050048?? 32768? 283.016 258249737????????? 900? 0A2405A0:5C6BD9B0 srv? 172.16.245.117 NSDWorker NSDThread 19:22:42.604062? W??????? data??? 1:1121779712?? 32768? 308.015 258249737????????? 920? 0A2405A0:5C6BD9B0 srv? 172.16.245.117 NSDWorker NSDThread 19:22:42.751549? W??????? data??? 1:6667337728?? 32768? 316.536 258249737??? ??????912? 0A2405A0:5C6BD9B0 srv? 172.16.245.117 NSDWorker NSDThread 19:22:42.722716? W??????? data??? 1:6241222656?? 32768? 357.409 258249737????????? 908? 0A2405A0:5C6BD9B0 srv? 172.16.245.117 NSDWorker NSDThread 19:22:43.030353? W??????? data??? 1:1974009856?? 32768? 304.887 258249737????????? 928? 0A2405A0:5C6BD9B0 srv? 172.16.245.117 NSDWorker NSDThread 19:22:43.103745? W??????? data??? 1:376078336??? 32768? 295.835 258249737????????? 916? 0A2405A0:5C6BD9B0 srv? 172.16.245.117 NSDWorker NSDThread So you can now see all the blocks of that file (tag 2) that went to this particular nsd server and how much time they took to issue against the media . so for each tag1:tag2 pair on the client you find the corresponding on the server. If you subtract time of server from time of client for each line you get network/client delays . Sven From: on behalf of Steve Crusan Reply-To: gpfsug main discussion list Date: Tuesday, February 19, 2019 at 12:29 AM To: gpfsug main discussion list Subject: Re: [gpfsug-discuss] Clarification of mmdiag --iohist output Context is key here. Where you run mmdiag?iohist matters, clientside or nsd server side. >From what I have seen and possibly understand, from the client, the time field indicates when an I/O was fully serviced (arrived into VFS layer, to be sent to the application), including RTT from the servers/disk. If you run the same command server side, my understanding is that the time field indicates how long it took for the server to write/read data to or from disk. For example, a few years ago I had to fix a system which was described as unacceptably slow after an upgrade from gpfs 3.3 to 3.4 or 3.5 (don?t fully remember). Iohist client side was showing many IOs waiting for 10 all the way up and to 50 SECONDS, not ms. Server side, I/O was being serviced via iohist within less than 5 ms. Also verified with iostat, basically doing a paltry 25MB/s per NSD server. 
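For reference, the kind of check meant there is simply extended iostat on the NSD server while the slow I/O is happening, watching the per-device MB/s, await and %util columns for the dm-* devices that back the NSDs (the interval is arbitrary):

# extended, per-device statistics in MB, refreshed every 5 seconds
iostat -xm 5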
What happened is that the small vs large queueing system changed in that version of GPFS, so there were hundreds of large IOS queued (found via Mmfsadm dump nsd server side) due to the limited number of large queues and threads server side. A quick mmchconfig fixed the problem, but if I only looked at the servers, it would?ve appeared things were fine, because the IO backend was sitting around twirling its thumbs. I don?t have access to the code, but all of behavior I have seen leads me to believe client side iohist includes network RTT. -Steve From: gpfsug-discuss-bounces at spectrumscale.org on behalf of Aaron Knister Sent: Sunday, February 17, 2019 8:26:23 AM To: gpfsug main discussion list Subject: Re: [gpfsug-discuss] Clarification of mmdiag --iohist output Hi Kevin, It's funny you bring this up because I was looking at this yesterday. My belief is that it's the time the from when the I/O request was queued internally by the client to when the I/O response was received from the NSD server which means it absolutely includes the network RTT. It would be great to get formal confirmation of this from someone who knows the code. Here's some trace data showing a single 4K read from an NSD client. I've stripped out a bunch of uninteresting stuff. It's my belief that the TRACE_IO indicates the point at which the "i/o timer" reported on by mmdiag --iohist begins ticking. The testing data seems to support this. If I'm correct, the testing data shows that the RDMA I/O to the NSD server occurs within the TRACE_IO timing window. The other thing that makes me believe this, is in my testing the mmdiag --iohist on the client shows an average latency of ~230us for a 4K read whereas mmdiag --iohist on the NSD server appears to show an average latency of ~170us when servicing those 4K reads from the back-end disk (a DDN SFA14KX). 
0.000218276 37005 TRACE_DISK: doReplicatedRead: da 34:490710888 0.000218424 37005 TRACE_IO: QIO: read data tag 743942 108137 ioVecSize 1 1st buf 0x122F000 nsdName S01_DMD_NSD_034 da 34:490710888 nSectors 8 align 0 by iocMBHandler (DioHandlerThread) 0.000218566 37005 TRACE_DLEASE: checkLeaseForIO: rc 0 0.000218628 37005 TRACE_IO: SIO: ioVecSize 1 1st buf 0x122F000 nsdId 0A011103:5C59DBAC da 34:490710888 nSectors 8 doioP 0x1806994D300 isDaemonAddr 0 0.000218672 37005 TRACE_FS: verify4KIO exit: code 4 err 0 0.000219106 37005 TRACE_NSD: nsdDoIO enter: read ioVecSize 1 1st bufAddr 0x122F000 nsdId 0A011103:5C59DBAC da 34:490710888 nBytes 4096 isDaemonAddr 0 0.000219408 37005 TRACE_GCRYPTO: EncBufPool::getTmpBuf(): bsize=4096 code=205 err=0 bufP=0x180656C1058 outBufP=0x180656C1058 index=0 0.000220105 37005 TRACE_TS: sendMessage msg_id 22993695: dest 10.3.17.3 sto03 0.000220436 37005 TRACE_RDMA: verbs::verbsSend_i: enter: index 11 cookie 12 node msg_id 22993695 len 92 0.000221111 37005 TRACE_RDMA: verbsConn::postSend: currRpcHeadP 0x7F6D7FA106E8 sendType SEND_BUFFER_RPC index 11 cookie 12 threadId 37005 bufferId.index 11 0.000221662 37005 TRACE_MUTEX: Waiting on fast condvar for signal 0x7F6D7FA106F8 RdmaSend_NSD_SVC 0.000426716 16691 TRACE_RDMA: handleRecvComp: success 1 of 1 nWrSuccess 1 index 11 cookie 12 wr_id 0xB0E6E00000000 bufferP 0x7F6D7EEE1700 byte_len 4144 0.000432140 37005 TRACE_NSD: nsdDoIO_ReadAndCheck: read complete, len 0 status 6 err 0 bufP 0x180656C1058 dioIsOverRdma 1 ioDataP 0x200000BC000 ckSumType NsdCksum_None 0.000432163 37005 TRACE_NSD: nsdDoIO_ReadAndCheck: exit err 0 0.000433707 37005 TRACE_GCRYPTO: EncBufPool::releaseTmpBuf(): exit bsize=8192 err=0 inBufP=0x180656C1058 bufP=0x180656C1058 index=0 0.000433777 37005 TRACE_NSD: nsdDoIO exit: err 0 0 0.000433844 37005 TRACE_IO: FIO: read data tag 743942 108137 ioVecSize 1 1st buf 0x122F000 nsdId 0A011103:5C59DBAC da 34:490710888 nSectors 8 err 0 0.000434236 37005 TRACE_DISK: postIO: qosid A00D91E read data disk FFFFFFFF ioVecSize 1 1st buf 0x122F000 err 0 duration 0.000215000 by iocMBHandler (DioHandlerThread) I'd suggest looking at "mmdiag --iohist" on the NSD server itself and see if/how that differs from the client. The other thing you could do is see if your NSD server queues are backing up (e.g. "mmfsadm saferdump nsd" and look for "requests pending" on queues where the "active" field is > 0). That doesn't necessarily mean you need to tune your queues but I'd suggest that if the disk I/O on your NSD server looks healthy (e.g. low latency, not overly-taxed) that you could benefit from queue tuning. -Aaron On Sat, Feb 16, 2019 at 9:47 AM Buterbaugh, Kevin L wrote: Hi All, Been reading man pages, docs, and Googling, and haven?t found a definitive answer to this question, so I knew exactly where to turn? ;-) I?m dealing with some slow I/O?s to certain storage arrays in our environments ? like really, really slow I/O?s ? here?s just one example from one of my NSD servers of a 10 second I/O: 08:49:34.943186 W data 30:41615622144 2048 10115.192 srv dm-92 So here?s my question ? when mmdiag ?iohist tells me that that I/O took slightly over 10 seconds, is that: 1. The time from when the NSD server received the I/O request from the client until it shipped the data back onto the wire towards the client? 2. The time from when the client issued the I/O request until it received the data back from the NSD server? 3. Something else? I?m thinking it?s #1, but want to confirm. 
Which one it is has very obvious implications for our troubleshooting steps. Thanks in advance? Kevin ? Kevin Buterbaugh - Senior System Administrator Vanderbilt University - Advanced Computing Center for Research and Education Kevin.Buterbaugh at vanderbilt.edu - (615)875-9633 _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss -------------- next part -------------- An HTML attachment was scrubbed... URL: From jkavitsky at 23andme.com Tue Feb 19 21:02:51 2019 From: jkavitsky at 23andme.com (Jim Kavitsky) Date: Tue, 19 Feb 2019 13:02:51 -0800 Subject: [gpfsug-discuss] Clarification of mmdiag --iohist output In-Reply-To: References: Message-ID: <3B5D4F25-0AD9-4C97-ADB4-CD999309F38E@23andme.com> Steve, Would the small vs large queuing setting be visible in mmlsconfig output somewhere? Which settings would control that? -jimk > On Feb 17, 2019, at 6:48 PM, Steve Crusan wrote: > > Context is key here. Where you run mmdiag?iohist matters, clientside or nsd server side. > > From what I have seen and possibly understand, from the client, the time field indicates when an I/O was fully serviced (arrived into VFS layer, to be sent to the application), including RTT from the servers/disk. If you run the same command server side, my understanding is that the time field indicates how long it took for the server to write/read data to or from disk. > > For example, a few years ago I had to fix a system which was described as unacceptably slow after an upgrade from gpfs 3.3 to 3.4 or 3.5 (don?t fully remember). > > Iohist client side was showing many IOs waiting for 10 all the way up and to 50 SECONDS, not ms. Server side, I/O was being serviced via iohist within less than 5 ms. Also verified with iostat, basically doing a paltry 25MB/s per NSD server. > > What happened is that the small vs large queueing system changed in that version of GPFS, so there were hundreds of large IOS queued (found via Mmfsadm dump nsd server side) due to the limited number of large queues and threads server side. A quick mmchconfig fixed the problem, but if I only looked at the servers, it would?ve appeared things were fine, because the IO backend was sitting around twirling its thumbs. > > I don?t have access to the code, but all of behavior I have seen leads me to believe client side iohist includes network RTT. > > -Steve > From: gpfsug-discuss-bounces at spectrumscale.org on behalf of Aaron Knister > Sent: Sunday, February 17, 2019 8:26:23 AM > To: gpfsug main discussion list > Subject: Re: [gpfsug-discuss] Clarification of mmdiag --iohist output > > Hi Kevin, > > It's funny you bring this up because I was looking at this yesterday. My belief is that it's the time the from when the I/O request was queued internally by the client to when the I/O response was received from the NSD server which means it absolutely includes the network RTT. It would be great to get formal confirmation of this from someone who knows the code. > > Here's some trace data showing a single 4K read from an NSD client. I've stripped out a bunch of uninteresting stuff. It's my belief that the TRACE_IO indicates the point at which the "i/o timer" reported on by mmdiag --iohist begins ticking. The testing data seems to support this. 
If I'm correct, the testing data shows that the RDMA I/O to the NSD server occurs within the TRACE_IO timing window. The other thing that makes me believe this, is in my testing the mmdiag --iohist on the client shows an average latency of ~230us for a 4K read whereas mmdiag --iohist on the NSD server appears to show an average latency of ~170us when servicing those 4K reads from the back-end disk (a DDN SFA14KX). > > 0.000218276 37005 TRACE_DISK: doReplicatedRead: da 34:490710888 > 0.000218424 37005 TRACE_IO: QIO: read data tag 743942 108137 ioVecSize 1 1st buf 0x122F000 nsdName S01_DMD_NSD_034 da 34:490710888 nSectors 8 align 0 by iocMBHandler (DioHandlerThread) > 0.000218566 37005 TRACE_DLEASE: checkLeaseForIO: rc 0 > > 0.000218628 37005 TRACE_IO: SIO: ioVecSize 1 1st buf 0x122F000 nsdId 0A011103:5C59DBAC da 34:490710888 nSectors 8 doioP 0x1806994D300 isDaemonAddr 0 > 0.000218672 37005 TRACE_FS: verify4KIO exit: code 4 err 0 > 0.000219106 37005 TRACE_NSD: nsdDoIO enter: read ioVecSize 1 1st bufAddr 0x122F000 nsdId 0A011103:5C59DBAC da 34:490710888 nBytes 4096 isDaemonAddr 0 > 0.000219408 37005 TRACE_GCRYPTO: EncBufPool::getTmpBuf(): bsize=4096 code=205 err=0 bufP=0x180656C1058 outBufP=0x180656C1058 index=0 > 0.000220105 37005 TRACE_TS: sendMessage msg_id 22993695: dest 10.3.17.3 sto03 > 0.000220436 37005 TRACE_RDMA: verbs::verbsSend_i: enter: index 11 cookie 12 node msg_id 22993695 len 92 > 0.000221111 37005 TRACE_RDMA: verbsConn::postSend: currRpcHeadP 0x7F6D7FA106E8 sendType SEND_BUFFER_RPC index 11 cookie 12 threadId 37005 bufferId.index 11 > 0.000221662 37005 TRACE_MUTEX: Waiting on fast condvar for signal 0x7F6D7FA106F8 RdmaSend_NSD_SVC > > 0.000426716 16691 TRACE_RDMA: handleRecvComp: success 1 of 1 nWrSuccess 1 index 11 cookie 12 wr_id 0xB0E6E00000000 bufferP 0x7F6D7EEE1700 byte_len 4144 > > 0.000432140 37005 TRACE_NSD: nsdDoIO_ReadAndCheck: read complete, len 0 status 6 err 0 bufP 0x180656C1058 dioIsOverRdma 1 ioDataP 0x200000BC000 ckSumType NsdCksum_None > 0.000432163 37005 TRACE_NSD: nsdDoIO_ReadAndCheck: exit err 0 > 0.000433707 37005 TRACE_GCRYPTO: EncBufPool::releaseTmpBuf(): exit bsize=8192 err=0 inBufP=0x180656C1058 bufP=0x180656C1058 index=0 > 0.000433777 37005 TRACE_NSD: nsdDoIO exit: err 0 0 > > 0.000433844 37005 TRACE_IO: FIO: read data tag 743942 108137 ioVecSize 1 1st buf 0x122F000 nsdId 0A011103:5C59DBAC da 34:490710888 nSectors 8 err 0 > 0.000434236 37005 TRACE_DISK: postIO: qosid A00D91E read data disk FFFFFFFF ioVecSize 1 1st buf 0x122F000 err 0 duration 0.000215000 by iocMBHandler (DioHandlerThread) > > I'd suggest looking at "mmdiag --iohist" on the NSD server itself and see if/how that differs from the client. The other thing you could do is see if your NSD server queues are backing up (e.g. "mmfsadm saferdump nsd" and look for "requests pending" on queues where the "active" field is > 0). That doesn't necessarily mean you need to tune your queues but I'd suggest that if the disk I/O on your NSD server looks healthy (e.g. low latency, not overly-taxed) that you could benefit from queue tuning. > > -Aaron > > On Sat, Feb 16, 2019 at 9:47 AM Buterbaugh, Kevin L > wrote: > Hi All, > > Been reading man pages, docs, and Googling, and haven?t found a definitive answer to this question, so I knew exactly where to turn? ;-) > > I?m dealing with some slow I/O?s to certain storage arrays in our environments ? like really, really slow I/O?s ? 
here?s just one example from one of my NSD servers of a 10 second I/O: > > 08:49:34.943186 W data 30:41615622144 2048 10115.192 srv dm-92 > > So here?s my question ? when mmdiag ?iohist tells me that that I/O took slightly over 10 seconds, is that: > > 1. The time from when the NSD server received the I/O request from the client until it shipped the data back onto the wire towards the client? > 2. The time from when the client issued the I/O request until it received the data back from the NSD server? > 3. Something else? > > I?m thinking it?s #1, but want to confirm. Which one it is has very obvious implications for our troubleshooting steps. Thanks in advance? > > Kevin > ? > Kevin Buterbaugh - Senior System Administrator > Vanderbilt University - Advanced Computing Center for Research and Education > Kevin.Buterbaugh at vanderbilt.edu - (615)875-9633 > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss -------------- next part -------------- An HTML attachment was scrubbed... URL: From Kevin.Buterbaugh at Vanderbilt.Edu Wed Feb 20 16:52:28 2019 From: Kevin.Buterbaugh at Vanderbilt.Edu (Buterbaugh, Kevin L) Date: Wed, 20 Feb 2019 16:52:28 +0000 Subject: [gpfsug-discuss] Clarification of mmdiag --iohist output In-Reply-To: <3B5D4F25-0AD9-4C97-ADB4-CD999309F38E@23andme.com> References: <3B5D4F25-0AD9-4C97-ADB4-CD999309F38E@23andme.com> Message-ID: Hi Jim, Please see: https://www.ibm.com/developerworks/community/wikis/home?lang=en#!/wiki/General%20Parallel%20File%20System%20(GPFS)/page/NSD%20Server%20Tuning Yes, those tuning parameters will show up in the mmlsconfig / mmdiag ?config output. HTH? Kevin ? Kevin Buterbaugh - Senior System Administrator Vanderbilt University - Advanced Computing Center for Research and Education Kevin.Buterbaugh at vanderbilt.edu - (615)875-9633 On Feb 19, 2019, at 3:02 PM, Jim Kavitsky > wrote: Steve, Would the small vs large queuing setting be visible in mmlsconfig output somewhere? Which settings would control that? -jimk On Feb 17, 2019, at 6:48 PM, Steve Crusan > wrote: Context is key here. Where you run mmdiag?iohist matters, clientside or nsd server side. From what I have seen and possibly understand, from the client, the time field indicates when an I/O was fully serviced (arrived into VFS layer, to be sent to the application), including RTT from the servers/disk. If you run the same command server side, my understanding is that the time field indicates how long it took for the server to write/read data to or from disk. For example, a few years ago I had to fix a system which was described as unacceptably slow after an upgrade from gpfs 3.3 to 3.4 or 3.5 (don?t fully remember). Iohist client side was showing many IOs waiting for 10 all the way up and to 50 SECONDS, not ms. Server side, I/O was being serviced via iohist within less than 5 ms. Also verified with iostat, basically doing a paltry 25MB/s per NSD server. What happened is that the small vs large queueing system changed in that version of GPFS, so there were hundreds of large IOS queued (found via Mmfsadm dump nsd server side) due to the limited number of large queues and threads server side. 
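The sort of check that exposes that backlog is a quick grep over the NSD queue dump - the pattern below just matches the dump format shown elsewhere in this thread, and a queue whose "requests pending" stays high while "active" equals "threads started" is starved for threads:

mmfsadm saferdump nsd | egrep 'threads started|requests pending'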
A quick mmchconfig fixed the problem, but if I only looked at the servers, it would?ve appeared things were fine, because the IO backend was sitting around twirling its thumbs. I don?t have access to the code, but all of behavior I have seen leads me to believe client side iohist includes network RTT. -Steve ________________________________ From: gpfsug-discuss-bounces at spectrumscale.org > on behalf of Aaron Knister > Sent: Sunday, February 17, 2019 8:26:23 AM To: gpfsug main discussion list Subject: Re: [gpfsug-discuss] Clarification of mmdiag --iohist output Hi Kevin, It's funny you bring this up because I was looking at this yesterday. My belief is that it's the time the from when the I/O request was queued internally by the client to when the I/O response was received from the NSD server which means it absolutely includes the network RTT. It would be great to get formal confirmation of this from someone who knows the code. Here's some trace data showing a single 4K read from an NSD client. I've stripped out a bunch of uninteresting stuff. It's my belief that the TRACE_IO indicates the point at which the "i/o timer" reported on by mmdiag --iohist begins ticking. The testing data seems to support this. If I'm correct, the testing data shows that the RDMA I/O to the NSD server occurs within the TRACE_IO timing window. The other thing that makes me believe this, is in my testing the mmdiag --iohist on the client shows an average latency of ~230us for a 4K read whereas mmdiag --iohist on the NSD server appears to show an average latency of ~170us when servicing those 4K reads from the back-end disk (a DDN SFA14KX). 0.000218276 37005 TRACE_DISK: doReplicatedRead: da 34:490710888 0.000218424 37005 TRACE_IO: QIO: read data tag 743942 108137 ioVecSize 1 1st buf 0x122F000 nsdName S01_DMD_NSD_034 da 34:490710888 nSectors 8 align 0 by iocMBHandler (DioHandlerThread) 0.000218566 37005 TRACE_DLEASE: checkLeaseForIO: rc 0 0.000218628 37005 TRACE_IO: SIO: ioVecSize 1 1st buf 0x122F000 nsdId 0A011103:5C59DBAC da 34:490710888 nSectors 8 doioP 0x1806994D300 isDaemonAddr 0 0.000218672 37005 TRACE_FS: verify4KIO exit: code 4 err 0 0.000219106 37005 TRACE_NSD: nsdDoIO enter: read ioVecSize 1 1st bufAddr 0x122F000 nsdId 0A011103:5C59DBAC da 34:490710888 nBytes 4096 isDaemonAddr 0 0.000219408 37005 TRACE_GCRYPTO: EncBufPool::getTmpBuf(): bsize=4096 code=205 err=0 bufP=0x180656C1058 outBufP=0x180656C1058 index=0 0.000220105 37005 TRACE_TS: sendMessage msg_id 22993695: dest 10.3.17.3 sto03 0.000220436 37005 TRACE_RDMA: verbs::verbsSend_i: enter: index 11 cookie 12 node msg_id 22993695 len 92 0.000221111 37005 TRACE_RDMA: verbsConn::postSend: currRpcHeadP 0x7F6D7FA106E8 sendType SEND_BUFFER_RPC index 11 cookie 12 threadId 37005 bufferId.index 11 0.000221662 37005 TRACE_MUTEX: Waiting on fast condvar for signal 0x7F6D7FA106F8 RdmaSend_NSD_SVC 0.000426716 16691 TRACE_RDMA: handleRecvComp: success 1 of 1 nWrSuccess 1 index 11 cookie 12 wr_id 0xB0E6E00000000 bufferP 0x7F6D7EEE1700 byte_len 4144 0.000432140 37005 TRACE_NSD: nsdDoIO_ReadAndCheck: read complete, len 0 status 6 err 0 bufP 0x180656C1058 dioIsOverRdma 1 ioDataP 0x200000BC000 ckSumType NsdCksum_None 0.000432163 37005 TRACE_NSD: nsdDoIO_ReadAndCheck: exit err 0 0.000433707 37005 TRACE_GCRYPTO: EncBufPool::releaseTmpBuf(): exit bsize=8192 err=0 inBufP=0x180656C1058 bufP=0x180656C1058 index=0 0.000433777 37005 TRACE_NSD: nsdDoIO exit: err 0 0 0.000433844 37005 TRACE_IO: FIO: read data tag 743942 108137 ioVecSize 1 1st buf 0x122F000 nsdId 0A011103:5C59DBAC 
da 34:490710888 nSectors 8 err 0 0.000434236 37005 TRACE_DISK: postIO: qosid A00D91E read data disk FFFFFFFF ioVecSize 1 1st buf 0x122F000 err 0 duration 0.000215000 by iocMBHandler (DioHandlerThread) I'd suggest looking at "mmdiag --iohist" on the NSD server itself and see if/how that differs from the client. The other thing you could do is see if your NSD server queues are backing up (e.g. "mmfsadm saferdump nsd" and look for "requests pending" on queues where the "active" field is > 0). That doesn't necessarily mean you need to tune your queues but I'd suggest that if the disk I/O on your NSD server looks healthy (e.g. low latency, not overly-taxed) that you could benefit from queue tuning. -Aaron On Sat, Feb 16, 2019 at 9:47 AM Buterbaugh, Kevin L > wrote: Hi All, Been reading man pages, docs, and Googling, and haven?t found a definitive answer to this question, so I knew exactly where to turn? ;-) I?m dealing with some slow I/O?s to certain storage arrays in our environments ? like really, really slow I/O?s ? here?s just one example from one of my NSD servers of a 10 second I/O: 08:49:34.943186 W data 30:41615622144 2048 10115.192 srv dm-92 So here?s my question ? when mmdiag ?iohist tells me that that I/O took slightly over 10 seconds, is that: 1. The time from when the NSD server received the I/O request from the client until it shipped the data back onto the wire towards the client? 2. The time from when the client issued the I/O request until it received the data back from the NSD server? 3. Something else? I?m thinking it?s #1, but want to confirm. Which one it is has very obvious implications for our troubleshooting steps. Thanks in advance? Kevin ? Kevin Buterbaugh - Senior System Administrator Vanderbilt University - Advanced Computing Center for Research and Education Kevin.Buterbaugh at vanderbilt.edu - (615)875-9633 _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org https://nam04.safelinks.protection.outlook.com/?url=http%3A%2F%2Fgpfsug.org%2Fmailman%2Flistinfo%2Fgpfsug-discuss&data=02%7C01%7CKevin.Buterbaugh%40vanderbilt.edu%7Cd18386b226474395328208d696ada1a9%7Cba5a7f39e3be4ab3b45067fa80faecad%7C0%7C0%7C636862069849536067&sdata=O5p52oLmSxQMWo2wwkVx8Z%2FapYpsAU9lAJ2cKvB095c%3D&reserved=0 -------------- next part -------------- An HTML attachment was scrubbed... URL: From Kevin.Buterbaugh at Vanderbilt.Edu Tue Feb 19 20:26:31 2019 From: Kevin.Buterbaugh at Vanderbilt.Edu (Buterbaugh, Kevin L) Date: Tue, 19 Feb 2019 20:26:31 +0000 Subject: [gpfsug-discuss] Clarification of mmdiag --iohist output In-Reply-To: References: Message-ID: <9338621C-3F85-48DF-AE42-64998680E14C@vanderbilt.edu> Hi All, My thanks to Aaron, Sven, Steve, and whoever responded for the GPFS team. You confirmed what I suspected ? my example 10 second I/O was _from an NSD server_ ? and since we?re in a 8 Gb FC SAN environment, it therefore means - correct me if I?m wrong about this someone - that I?ve got a problem somewhere in one (or more) of the following 3 components: 1) the NSD servers 2) the SAN fabric 3) the storage arrays I?ve been looking at all of the above and none of them are showing any obvious problems. 
I?ve actually got a techie from the storage array vendor stopping by on Thursday, so I?ll see if he can spot anything there. Our FC switches are QLogic?s, so I?m kinda screwed there in terms of getting any help. But I don?t see any errors in the switch logs and ?show perf? on the switches is showing I/O rates of 50-100 MB/sec on the in use ports, so I don?t _think_ that?s the issue. And this is the GPFS mailing list, after all ? so let?s talk about the NSD servers. Neither memory (64 GB) nor CPU (2 x quad-core Intel Xeon E5620?s) appear to be an issue. But I have been looking at the output of ?mmfsadm saferdump nsd? based on what Aaron and then Steve said. Here?s some fairly typical output from one of the SMALL queues (I?ve checked several of my 8 NSD servers and they?re all showing similar output): Queue NSD type NsdQueueTraditional [244]: SMALL, threads started 12, active 3, highest 12, deferred 0, chgSize 0, draining 0, is_chg 0 requests pending 0, highest pending 73, total processed 4859732 mutex 0x7F3E449B8F10, reqCond 0x7F3E449B8F58, thCond 0x7F3E449B8F98, queue 0x7F3E449B8EF0, nFreeNsdRequests 29 And for a LARGE queue: Queue NSD type NsdQueueTraditional [8]: LARGE, threads started 12, active 1, highest 12, deferred 0, chgSize 0, draining 0, is_chg 0 requests pending 0, highest pending 71, total processed 2332966 mutex 0x7F3E441F3890, reqCond 0x7F3E441F38D8, thCond 0x7F3E441F3918, queue 0x7F3E441F3870, nFreeNsdRequests 31 So my large queues seem to be slightly less utilized than my small queues overall ? i.e. I see more inactive large queues and they generally have a smaller ?highest pending? value. Question: are those non-zero ?highest pending? values something to be concerned about? I have the following thread-related parameters set: [common] maxReceiverThreads 12 nsdMaxWorkerThreads 640 nsdThreadsPerQueue 4 nsdSmallThreadRatio 3 workerThreads 128 [serverLicense] nsdMaxWorkerThreads 1024 nsdThreadsPerQueue 12 nsdSmallThreadRatio 1 pitWorkerThreadsPerNode 3 workerThreads 1024 Also, at the top of the ?mmfsadm saferdump nsd? output I see: Total server worker threads: running 1008, desired 147, forNSD 147, forGNR 0, nsdBigBufferSize 16777216 nsdMultiQueue: 256, nsdMultiQueueType: 1, nsdMinWorkerThreads: 16, nsdMaxWorkerThreads: 1024 Question: is the fact that 1008 is pretty close to 1024 a concern? Anything jump out at anybody? I don?t mind sharing full output, but it is rather lengthy. Is this worthy of a PMR? Thanks! -- Kevin Buterbaugh - Senior System Administrator Vanderbilt University - Advanced Computing Center for Research and Education Kevin.Buterbaugh at vanderbilt.edu - (615)875-9633 On Feb 17, 2019, at 1:01 PM, IBM Spectrum Scale > wrote: Hi Kevin, The I/O hist shown by the command mmdiag --iohist actually depends on the node on which you are running this command from. If you are running this on a NSD server node then it will show the time taken to complete/serve the read or write I/O operation sent from the client node. And if you are running this on a client (or non NSD server) node then it will show the complete time taken by the read or write I/O operation requested by the client node to complete. So in a nut shell for the NSD server case it is just the latency of the I/O done on disk by the server whereas for the NSD client case it also the latency of send and receive of I/O request to the NSD server along with the latency of I/O done on disk by the NSD server. I hope this answers your query. 
Regards, The Spectrum Scale (GPFS) team ------------------------------------------------------------------------------------------------------------------ If you feel that your question can benefit other users of Spectrum Scale (GPFS), then please post it to the public IBM developerWorks Forum at https://www.ibm.com/developerworks/community/forums/html/forum?id=11111111-0000-0000-0000-000000000479. If your query concerns a potential software error in Spectrum Scale (GPFS) and you have an IBM software maintenance contract please contact 1-800-237-5511 in the United States or your local IBM Service Center in other countries. The forum is informally monitored as time permits and should not be used for priority messages to the Spectrum Scale (GPFS) team.
From: "Buterbaugh, Kevin L" > To: gpfsug main discussion list > Date: 02/16/2019 08:18 PM Subject: [gpfsug-discuss] Clarification of mmdiag --iohist output Sent by: gpfsug-discuss-bounces at spectrumscale.org ________________________________ Hi All, Been reading man pages, docs, and Googling, and haven't found a definitive answer to this question, so I knew exactly where to turn... ;-) I'm dealing with some slow I/O's to certain storage arrays in our environments - like really, really slow I/O's - here's just one example from one of my NSD servers of a 10 second I/O: 08:49:34.943186 W data 30:41615622144 2048 10115.192 srv dm-92 So here's my question - when mmdiag --iohist tells me that that I/O took slightly over 10 seconds, is that: 1. The time from when the NSD server received the I/O request from the client until it shipped the data back onto the wire towards the client? 2. The time from when the client issued the I/O request until it received the data back from the NSD server? 3. Something else? I'm thinking it's #1, but want to confirm. Which one it is has very obvious implications for our troubleshooting steps. Thanks in advance... Kevin -- Kevin Buterbaugh - Senior System Administrator Vanderbilt University - Advanced Computing Center for Research and Education Kevin.Buterbaugh at vanderbilt.edu - (615)875-9633 _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss -------------- next part -------------- An HTML attachment was scrubbed... URL: From olaf.weiser at de.ibm.com Thu Feb 21 12:10:41 2019 From: olaf.weiser at de.ibm.com (Olaf Weiser) Date: Thu, 21 Feb 2019 14:10:41 +0200 Subject: [gpfsug-discuss] Clarification of mmdiag --iohist output In-Reply-To: <9338621C-3F85-48DF-AE42-64998680E14C@vanderbilt.edu> References: <9338621C-3F85-48DF-AE42-64998680E14C@vanderbilt.edu> Message-ID: An HTML attachment was scrubbed...
URL: From stockf at us.ibm.com Thu Feb 21 12:23:32 2019 From: stockf at us.ibm.com (Frederick Stock) Date: Thu, 21 Feb 2019 12:23:32 +0000 Subject: [gpfsug-discuss] Clarification of mmdiag --iohist output In-Reply-To: <9338621C-3F85-48DF-AE42-64998680E14C@vanderbilt.edu> References: <9338621C-3F85-48DF-AE42-64998680E14C@vanderbilt.edu>, Message-ID: An HTML attachment was scrubbed... URL: From jjdoherty at yahoo.com Thu Feb 21 12:54:20 2019 From: jjdoherty at yahoo.com (Jim Doherty) Date: Thu, 21 Feb 2019 12:54:20 +0000 (UTC) Subject: [gpfsug-discuss] Clarification of mmdiag --iohist output In-Reply-To: References: <9338621C-3F85-48DF-AE42-64998680E14C@vanderbilt.edu> <9338621C-3F85-48DF-AE42-64998680E14C@vanderbilt.edu> Message-ID: <1280046520.2777074.1550753660364@mail.yahoo.com> Are all of the slow IOs from the same NSD volumes???? You could run an mmtrace and take an internaldump and open a ticket to the Spectrum Scale queue.? You may want to limit the run to just your nsd servers and not all nodes like I use in my example.???? Or one of the tools we use to review a trace is available in /usr/lpp/mmfs/samples/debugtools/trsum.awk?? and you can run it passing in the uncompressed trace file and redirect standard out to a file.???? If you search for ' total '? in the trace you will find the different sections,? or you can just grep ' total IO ' trsum.out? | grep duration? to get a quick look per LUN. mmtracectl --set --trace=def --tracedev-write-mode=overwrite --tracedev-overwrite-buffer-size=500M -N all mmtracectl --start -N all ; sleep 30 ; mmtracectl --stop -N all? ; mmtracectl --off -N all mmdsh -N all "/usr/lpp/mmfs/bin/mmfsadm dump all >/tmp/mmfs/service.dumpall.\$(hostname)" Jim On Thursday, February 21, 2019, 7:23:46 AM EST, Frederick Stock wrote: Kevin I'm assuming you have seen the article on IBM developerWorks about the GPFS NSD queues.? It provides useful background for analyzing the dump nsd information.? Here I'll list some thoughts for items that you can investigate/consider.?If your NSD servers are doing both large (greater than 64K) and small (64K or less) IOs then you want to have the nsdSmallThreadRatio set to 1 as it seems you do for the NSD servers.? This provides an equal number of SMALL and LARGE NSD queues.? You can also increase the total number of queues (currently 256) but I cannot determine if that is necessary from the data you provided.? Only on rare occasions have I seen a need to increase the number of queues.?The fact that you have 71 highest pending on your LARGE queues and 73 highest pending on your SMALL queues would imply your IOs are queueing for a good while either waiting for resources in GPFS or waiting for IOs to complete.? Your maximum buffer size is 16M which is defined to be the largest IO that can be requested by GPFS.? This is the buffer size that GPFS will use for LARGE IOs.? You indicated you had sufficient memory on the NSD servers but what is the value for the pagepool on those servers, and what is the value of the nsdBufSpace parameter??? If the NSD server is just that then usually nsdBufSpace is set to 70.? The IO buffers used by the NSD server come from the pagepool so you need sufficient space there for the maximum number of LARGE IO buffers that would be used concurrently by GPFS or threads will need to wait for those buffers to become available.? 
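As a rough back-of-the-envelope check using the numbers quoted earlier in this thread (nsdMaxWorkerThreads 1024 and a 16 MiB big buffer size), the worst case is on the order of 1024 x 16 MiB = 16 GiB of concurrent large buffers, and that has to fit within the nsdBufSpace percentage of the pagepool on the NSD server; the pagepool value itself was not quoted, but the relevant settings can be read with, for example:

# show the current values for the NSD servers
mmlsconfig pagepool
mmlsconfig nsdBufSpace
mmlsconfig nsdMaxWorkerThreads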
Essentially you want to ensure you have sufficient memory for the maximum number of IOs all doing a large IO, with that value being less than 70% of the pagepool size. You could look at the settings for the FC cards to ensure they are configured to do the largest IOs possible. I forget the actual values (have not done this for a while) but there are settings for the adapters that control the maximum IO size that will be sent. I think you want this to be as large as the adapter can handle to reduce the number of messages needed to complete the large IOs done by GPFS. Fred __________________________________________________ Fred Stock | IBM Pittsburgh Lab | 720-430-8821 stockf at us.ibm.com ----- Original message ----- From: "Buterbaugh, Kevin L" Sent by: gpfsug-discuss-bounces at spectrumscale.org To: gpfsug main discussion list Cc: Subject: Re: [gpfsug-discuss] Clarification of mmdiag --iohist output Date: Thu, Feb 21, 2019 6:39 AM

Hi All, My thanks to Aaron, Sven, Steve, and whoever responded for the GPFS team. You confirmed what I suspected - my example 10 second I/O was _from an NSD server_ - and since we're in an 8 Gb FC SAN environment, it therefore means - correct me if I'm wrong about this someone - that I've got a problem somewhere in one (or more) of the following 3 components:
1) the NSD servers
2) the SAN fabric
3) the storage arrays
I've been looking at all of the above and none of them are showing any obvious problems. I've actually got a techie from the storage array vendor stopping by on Thursday, so I'll see if he can spot anything there. Our FC switches are QLogic's, so I'm kinda screwed there in terms of getting any help. But I don't see any errors in the switch logs and "show perf" on the switches is showing I/O rates of 50-100 MB/sec on the in-use ports, so I don't _think_ that's the issue. And this is the GPFS mailing list, after all - so let's talk about the NSD servers. Neither memory (64 GB) nor CPU (2 x quad-core Intel Xeon E5620's) appear to be an issue. But I have been looking at the output of "mmfsadm saferdump nsd" based on what Aaron and then Steve said. Here's some fairly typical output from one of the SMALL queues (I've checked several of my 8 NSD servers and they're all showing similar output):

    Queue NSD type NsdQueueTraditional [244]: SMALL, threads started 12, active 3, highest 12, deferred 0, chgSize 0, draining 0, is_chg 0
    requests pending 0, highest pending 73, total processed 4859732
    mutex 0x7F3E449B8F10, reqCond 0x7F3E449B8F58, thCond 0x7F3E449B8F98, queue 0x7F3E449B8EF0, nFreeNsdRequests 29

And for a LARGE queue:

    Queue NSD type NsdQueueTraditional [8]: LARGE, threads started 12, active 1, highest 12, deferred 0, chgSize 0, draining 0, is_chg 0
    requests pending 0, highest pending 71, total processed 2332966
    mutex 0x7F3E441F3890, reqCond 0x7F3E441F38D8, thCond 0x7F3E441F3918, queue 0x7F3E441F3870, nFreeNsdRequests 31

So my large queues seem to be slightly less utilized than my small queues overall - i.e. I see more inactive large queues and they generally have a smaller "highest pending" value. Question: are those non-zero "highest pending" values something to be concerned about? I have the following thread-related parameters set:

[common]
maxReceiverThreads 12
nsdMaxWorkerThreads 640
nsdThreadsPerQueue 4
nsdSmallThreadRatio 3
workerThreads 128

[serverLicense]
nsdMaxWorkerThreads 1024
nsdThreadsPerQueue 12
nsdSmallThreadRatio 1
pitWorkerThreadsPerNode 3
workerThreads 1024

Also, at the top of the "mmfsadm saferdump nsd"
output I see:

    Total server worker threads: running 1008, desired 147, forNSD 147, forGNR 0, nsdBigBufferSize 16777216
    nsdMultiQueue: 256, nsdMultiQueueType: 1, nsdMinWorkerThreads: 16, nsdMaxWorkerThreads: 1024

Question: is the fact that 1008 is pretty close to 1024 a concern? Anything jump out at anybody? I don't mind sharing full output, but it is rather lengthy. Is this worthy of a PMR? Thanks! -- Kevin Buterbaugh - Senior System Administrator Vanderbilt University - Advanced Computing Center for Research and Education Kevin.Buterbaugh at vanderbilt.edu - (615)875-9633

On Feb 17, 2019, at 1:01 PM, IBM Spectrum Scale wrote: Hi Kevin, The I/O hist shown by the command mmdiag --iohist actually depends on the node on which you are running this command from. If you are running this on an NSD server node then it will show the time taken to complete/serve the read or write I/O operation sent from the client node. And if you are running this on a client (or non NSD server) node then it will show the complete time taken by the read or write I/O operation requested by the client node to complete. So in a nutshell, for the NSD server case it is just the latency of the I/O done on disk by the server, whereas for the NSD client case it is also the latency of the send and receive of the I/O request to the NSD server along with the latency of the I/O done on disk by the NSD server. I hope this answers your query. Regards, The Spectrum Scale (GPFS) team ------------------------------------------------------------------------------------------------------------------ If you feel that your question can benefit other users of Spectrum Scale (GPFS), then please post it to the public IBM developerWorks Forum at https://www.ibm.com/developerworks/community/forums/html/forum?id=11111111-0000-0000-0000-000000000479. If your query concerns a potential software error in Spectrum Scale (GPFS) and you have an IBM software maintenance contract please contact 1-800-237-5511 in the United States or your local IBM Service Center in other countries. The forum is informally monitored as time permits and should not be used for priority messages to the Spectrum Scale (GPFS) team. From: "Buterbaugh, Kevin L" To: gpfsug main discussion list Date: 02/16/2019 08:18 PM Subject: [gpfsug-discuss] Clarification of mmdiag --iohist output Sent by: gpfsug-discuss-bounces at spectrumscale.org Hi All, Been reading man pages, docs, and Googling, and haven't found a definitive answer to this question, so I knew exactly where to turn... ;-) I'm dealing with some slow I/O's to certain storage arrays in our environments - like really, really slow I/O's - here's just one example from one of my NSD servers of a 10 second I/O:

08:49:34.943186  W  data  30:41615622144  2048  10115.192  srv  dm-92

So here's my question - when mmdiag --iohist tells me that that I/O took slightly over 10 seconds, is that:
1. The time from when the NSD server received the I/O request from the client until it shipped the data back onto the wire towards the client?
2. The time from when the client issued the I/O request until it received the data back from the NSD server?
3. Something else?
I'm thinking it's #1, but want to confirm. Which one it is has very obvious implications for our troubleshooting steps. Thanks in advance... Kevin
Kevin Buterbaugh - Senior System Administrator Vanderbilt University - Advanced Computing Center for Research and Education Kevin.Buterbaugh at vanderbilt.edu - (615)875-9633 _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org https://nam04.safelinks.protection.outlook.com/?url=http%3A%2F%2Fgpfsug.org%2Fmailman%2Flistinfo%2Fgpfsug-discuss&data=02%7C01%7CKevin.Buterbaugh%40vanderbilt.edu%7C2bfb2e8e30e64fa06c0f08d6959b2d38%7Cba5a7f39e3be4ab3b45067fa80faecad%7C0%7C0%7C636860891056297114&sdata=5pL67mhVyScJovkRHRqZog9bM5BZG8F2q972czIYAbA%3D&reserved=0 _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss -------------- next part -------------- An HTML attachment was scrubbed... URL: From Robert.Oesterlin at nuance.com Tue Feb 26 12:38:11 2019 From: Robert.Oesterlin at nuance.com (Oesterlin, Robert) Date: Tue, 26 Feb 2019 12:38:11 +0000 Subject: [gpfsug-discuss] Save the date: US User Group meeting April 16-17th, NCAR Boulder CO Message-ID: It's coming up fast - mark your calendar if you plan on attending. We'll be publishing detailed agenda information and registration soon. If you'd like to present, please drop me a note. We have a limited number of slots available. Bob Oesterlin Sr Principal Storage Engineer, Nuance -------------- next part -------------- An HTML attachment was scrubbed... URL: From michael.holliday at crick.ac.uk Tue Feb 26 10:45:32 2019 From: michael.holliday at crick.ac.uk (Michael Holliday) Date: Tue, 26 Feb 2019 10:45:32 +0000 Subject: [gpfsug-discuss] relion software using GPFS storage Message-ID: Hi All, We've recently had an issue where a job on our client GPFS cluster caused our main storage to go extremely slowly. The job was running relion using MPI (https://www2.mrc-lmb.cam.ac.uk/relion/index.php?title=Main_Page). It caused waiters across the cluster, and caused the load to spike on NSDs one at a time. When the spike ended on one NSD, it immediately started on another. There were no obvious errors in the logs and the issues cleared immediately after the job was cancelled. Has anyone else seen any issues with relion using GPFS storage? Michael Michael Holliday RITTech MBCS Senior HPC & Research Data Systems Engineer | eMedLab Operations Team Scientific Computing STP | The Francis Crick Institute 1, Midland Road | London | NW1 1AT | United Kingdom Tel: 0203 796 3167 The Francis Crick Institute Limited is a registered charity in England and Wales no. 1140062 and a company registered in England and Wales no. 06885462, with its registered office at 1 Midland Road London NW1 1AT -------------- next part -------------- An HTML attachment was scrubbed... URL: From robert at strubi.ox.ac.uk Wed Feb 27 12:49:38 2019 From: robert at strubi.ox.ac.uk (Robert Esnouf) Date: Wed, 27 Feb 2019 12:49:38 +0000 Subject: [gpfsug-discuss] relion software using GPFS storage In-Reply-To: References: Message-ID: <9aee9d18ed77ad61c1b44859703f2284@strubi.ox.ac.uk> Dear Michael, There are settings within relion for parallel file systems; you should check that they are enabled if you have SS underneath.
Otherwise, check which version of relion and then try to understand the problem that is being analysed a little more. If the box size is very small and the internal symmetry low then the user may read 100,000s of small "picked particle" files for each iteration opening and closing the files each time. I believe that relion3 has some facility for extracting these small particles from the larger raw images and that is more SS-friendly. Alternatively, the size of the set of picked particles is often only in 50GB range and so staging to one or more local machines is quite feasible... Hope one of those suggestions helps. Regards, Robert -- Dr Robert Esnouf University Research Lecturer, Director of Research Computing BDI, Head of Research Computing Core WHG, NDM Research Computing Strategy Officer Main office: Room 10/028, Wellcome Centre for Human Genetics, Old Road Campus, Roosevelt Drive, Oxford OX3 7BN, UK Emails: robert at strubi.ox.ac.uk / robert at well.ox.ac.uk / robert.esnouf at bdi.ox.ac.uk Tel: (+44)-1865-287783 (WHG); (+44)-1865-743689 (BDI) ? -----Original Message----- From: "Michael Holliday" To: gpfsug-discuss at spectrumscale.org Date: 27/02/19 12:21 Subject: [gpfsug-discuss] relion software using GPFS storage Hi All, ? We?ve recently had an issue where a job on our client GPFS cluster caused out main storage to go extremely slowly.? ?The job was running relion using MPI (https://www2.mrc-lmb.cam.ac.uk/relion/index.php?title=Main_Page) ? It caused waiters across the cluster, and caused the load to spike on NSDS on at a time.? When the spike ended on one NSD, it immediately started on another.? ? There were no obvious errors in the logs and the issues cleared immediately after the job was cancelled.? ? Has anyone else see any issues with relion using GPFS storage? ? Michael ? Michael Holliday RITTech MBCS Senior HPC & Research Data Systems Engineer | eMedLab Operations Team Scientific Computing STP | The Francis Crick Institute 1, Midland Road | London | NW1 1AT | United Kingdom Tel: 0203 796 3167 ? The Francis Crick Institute Limited is a registered charity in England and Wales no. 1140062 and a company registered in England and Wales no. 06885462, with its registered office at 1 Midland Road London NW1 1AT _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss -------------- next part -------------- An HTML attachment was scrubbed... URL: From Robert.Oesterlin at nuance.com Wed Feb 27 20:12:54 2019 From: Robert.Oesterlin at nuance.com (Oesterlin, Robert) Date: Wed, 27 Feb 2019 20:12:54 +0000 Subject: [gpfsug-discuss] Registration now open! - US User Group Meeting, April 16-17th, NCAR Boulder Message-ID: <671D229B-C7A1-459D-A42B-DB93502F59FA@nuance.com> Registration is now open: https://www.eventbrite.com/e/spectrum-scale-gpfs-user-group-us-spring-2019-meeting-tickets-57035376346 Please note that agenda details are not set yet but these will be finalized in the next few weeks - when they are I will post to the registration page and the mailing list. - April 15th: Informal social gather on Monday for those arriving early (location TBD) - April 16th: Full day of talks from IBM and the user community, Social and Networking Event (details TBD) - April 17th: Talks and breakout sessions (If you have any topics for the breakout sessions, let us know) Looking forward to seeing everyone in Boulder! 
Bob Oesterlin/Kristy Kallback-Rose -------------- next part -------------- An HTML attachment was scrubbed... URL: From stefan.dietrich at desy.de Thu Feb 28 07:56:56 2019 From: stefan.dietrich at desy.de (Dietrich, Stefan) Date: Thu, 28 Feb 2019 08:56:56 +0100 (CET) Subject: [gpfsug-discuss] CES Ganesha netgroup caching? Message-ID: <2121724779.6221169.1551340616921.JavaMail.zimbra@desy.de> Hi, I am currently playing around with LDAP netgroups for NFS exports via CES. However, I could not figure out how long Ganesha is caching the netgroup entries? There is definitely some caching, as adding a host to the netgroup does not immediately grant access to the share. A "getent netgroup " on the CES node returns the correct result, so this is not some other caching effect. Resetting the cache via "ganesha_mgr purge netgroup" works, but is probably not officially supported. The CES nodes are running with GPFS 5.0.2.3 and gpfs.nfs-ganesha-2.5.3-ibm030.01.el7. CES authentication is set to user-defined, the nodes just use SSSD with a rfc2307bis LDAP server. Regards, Stefan -- ------------------------------------------------------------------------ Stefan Dietrich Deutsches Elektronen-Synchrotron (IT-Systems) Ein Forschungszentrum der Helmholtz-Gemeinschaft Notkestr. 85 phone: +49-40-8998-4696 22607 Hamburg e-mail: stefan.dietrich at desy.de Germany ------------------------------------------------------------------------ From mnaineni at in.ibm.com Thu Feb 28 12:33:50 2019 From: mnaineni at in.ibm.com (Malahal R Naineni) Date: Thu, 28 Feb 2019 12:33:50 +0000 Subject: [gpfsug-discuss] CES Ganesha netgroup caching? In-Reply-To: <2121724779.6221169.1551340616921.JavaMail.zimbra@desy.de> References: <2121724779.6221169.1551340616921.JavaMail.zimbra@desy.de> Message-ID: An HTML attachment was scrubbed... URL:
_______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org https://urldefense.proofpoint.com/v2/url?u=http-3A__gpfsug.org_mailman_listinfo_gpfsug-2Ddiscuss&d=DwICAg&c=jf_iaSHvJObTbx-siA1ZOg&r=IbxtjdkPAM2Sbon4Lbbi4w&m=WEtGqEikAHptrhNUxYjEd8vfm1bPVcbCgEcMH4rp-UM&s=MeyrAfodvNKjIFQuVsfXbLlTAQvTBnUVgvNJqv901RA&e= -------------- next part -------------- An HTML attachment was scrubbed... URL: From chair at spectrumscale.org Thu Feb 7 16:09:17 2019 From: chair at spectrumscale.org (Simon Thompson (Spectrum Scale User Group Chair)) Date: Thu, 07 Feb 2019 16:09:17 +0000 Subject: [gpfsug-discuss] UK Spectrum Scale User Group Sponsorship packages Message-ID: We're currently in the process of planning for the 2019 UK Spectrum Scale User Group meeting, to be held in London on 8th/9th May and will again be looking for commercial sponsorship to support the event. I'll be sending a message out to companies who have previously sponsored us with details soon, however if you would like to be contacted about the sponsorship packages, please drop me an email and I'll include your company when we send out the details. Thanks Simon From techie879 at gmail.com Sat Feb 9 01:42:13 2019 From: techie879 at gmail.com (Imam Toufique) Date: Fri, 8 Feb 2019 17:42:13 -0800 Subject: [gpfsug-discuss] question on fileset / quotas Message-ID: Hi Everyone, I am very new to GPFS, just got the system up and running, now starting to get myself setting up filesets and quotas. I have a question, may be in has been answered already, somewhere in this forum, my apologies for that if this is a repeat. My question is: lets say I have an independent fileset called '/mmfs1/crsp_test' , and I set it's quota to 2GB ( quota type FILESET ). STAGING-itoufiqu at crspoitdc-mgmt-001:/mmfs1/crsp_test/itoufiqu$ df -h /mmfs1/crsp_test/ Filesystem Size Used Avail Use% Mounted on mmfs1 2.0G 0 2.0G 0% /mmfs1 Now, I go create a 'dependent' fileset calld 'itoufiqu' under 'crsp_test' , sharing it's inode space, and i was able to set it's quota to 4GB. STAGING-root at crspoitdc-mgmt-001:/mmfs1/crsp_test$ df -h /mmfs1/crsp_test/itoufiqu Filesystem Size Used Avail Use% Mounted on mmfs1 4.0G 4.0G 0 100% /mmfs1 Now, i assume that setting quota of 4GB ( whereas the independent fileset quota is 2GB ) for the above dependent fileset ( 'itoufiqu' ) is being allowed as dependent fileset is sharing inode space from the independent fileset. Is there a way to setup an independent fileset so that it's dependent filesets cannot exceed its quota limit? Another words, if my independent fileset quota is 2GB, I should not be allowed to set quotas for it's dependent filesets more then 2GB ( for the dependent filesets created in aggregate ) ? Thanks for your help! -------------- next part -------------- An HTML attachment was scrubbed... URL: From valdis.kletnieks at vt.edu Sat Feb 9 04:02:49 2019 From: valdis.kletnieks at vt.edu (valdis.kletnieks at vt.edu) Date: Fri, 08 Feb 2019 23:02:49 -0500 Subject: [gpfsug-discuss] question on fileset / quotas In-Reply-To: References: Message-ID: <32045.1549684969@turing-police.cc.vt.edu> On Fri, 08 Feb 2019 17:42:13 -0800, Imam Toufique said: > Is there a way to setup an independent fileset so that it's dependent > filesets cannot exceed its quota limit? Another words, if my independent > fileset quota is 2GB, I should not be allowed to set quotas for it's > dependent filesets more then 2GB ( for the dependent filesets created in > aggregate ) ? Well.. 
to set the quota on the dependent fileset, you have to be root. And the general Unix/Linux philosophy is to not prevent the root user from doing things unless there's a good technical reason(*). There's a lot of "here be dragons" corner cases - for instance, if I create /gpfs/parent and give it 10T of space, does that mean that *each* dependent fileset is limited to 10T, or the *sum* has to remain under 10T? (In other words, is overcommit allowed?). There's other problems, like "Give the parent 8T, give two children 4T each, let each one store 3T, and then reduce the parent quota to 2T" - what should happen then? And quite frankly, the fact that mmrepquota has an entire column of output for "uncertain" when only dealing with *one* fileset tells me there's not sane way to avoid race conditions when dealing with two filesets without some truly performance-ruining levels of filesystem locking. So I'd say that probably, it's more reasonable to do this outside GPFS - anything from telling everybody who knows the root password not to do it, to teaching whatever automation/provisioning system you have (Ansible, etc) to enforce it. Having said that, if you can nail down the semantics and then make a good business case that it should be done inside of GPFS rather than at the sysadmin level, I'm sure IBM would be willing to at least listen to an RFE.... (*) I remember one Unix variant (Gould's UTX/32) that was perfectly willing to let C code running as root do an unlink(".") rather than return EISDIR even though it meant you just bought yourself a shiny new fsck - don't ask how I found out :) From rohwedder at de.ibm.com Mon Feb 11 09:36:06 2019 From: rohwedder at de.ibm.com (Markus Rohwedder) Date: Mon, 11 Feb 2019 10:36:06 +0100 Subject: [gpfsug-discuss] question on fileset / quotas In-Reply-To: References: Message-ID: Hello, There is no hierarchy between fileset quotas, the fileset quota limits are completely independent of each other. The independent fileset., as you mentioned, provides the common inode space and ties the parent and child together in regards to using inodes from their common inode space and for example in regards to snapshots and other features that act on independent filesets. There are however many degrees of freedom in setting up quota configurations, for example user and group quotas and the per-fileset and per-filesystem quota options. So there may be other ways how you could create rules that can model your environment and which could provide a means to create limits across several filesets. For example (will probably not match to you but just to illustrate):: You have a group of applications. Each application stores data in one dependent fileset. The filesystem where these exist uses per filesystem quota accounting.- All these filesets are children of an independent filesets. this allows you to create snapshots of all applications together. All applications store data under the same group. You can limit each applications space via fileset quota and you can limit the whole application group via group quota. Mit freundlichen Gr??en / Kind regards Dr. 
Markus Rohwedder Spectrum Scale GUI Development Phone: +49 7034 6430190 IBM Deutschland Research & Development E-Mail: rohwedder at de.ibm.com Am Weiher 24 65451 Kelsterbach Germany From: Imam Toufique To: gpfsug-discuss at spectrumscale.org Date: 09.02.2019 02:42 Subject: [gpfsug-discuss] question on fileset / quotas Sent by: gpfsug-discuss-bounces at spectrumscale.org Hi Everyone, I am very new to GPFS, just got the system up and running, now starting to get myself setting up filesets and quotas.? I have a question, may be in has been answered already, somewhere in this forum, my apologies for that if this is a repeat. My question is: lets say I have an independent fileset called '/mmfs1/crsp_test' , and I set it's quota to 2GB ( quota type FILESET ). STAGING-itoufiqu at crspoitdc-mgmt-001:/mmfs1/crsp_test/itoufiqu$ df -h /mmfs1/crsp_test/ Filesystem? ? ? Size? Used Avail Use% Mounted on mmfs1? ? ? ? ? ?2.0G? ? ?0? 2.0G? ?0% /mmfs1 Now, I go create a 'dependent' fileset calld 'itoufiqu' under 'crsp_test' , sharing it's inode space, and i was able to set it's quota to 4GB. STAGING-root at crspoitdc-mgmt-001:/mmfs1/crsp_test$ df -h /mmfs1/crsp_test/itoufiqu Filesystem? ? ? Size? Used Avail Use% Mounted on mmfs1? ? ? ? ? ?4.0G? 4.0G? ? ?0 100% /mmfs1 Now, i assume that setting quota of 4GB ( whereas the independent fileset quota is 2GB ) for the above dependent fileset ( 'itoufiqu' ) is being allowed as dependent fileset is sharing inode space from the independent fileset. Is there a way to setup an independent fileset so that it's dependent filesets cannot exceed its quota limit? Another words, if my independent fileset quota is 2GB, I should not be allowed to set quotas for it's dependent filesets more then 2GB? ( for the dependent filesets created in aggregate ) ? Thanks for your help! _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org https://urldefense.proofpoint.com/v2/url?u=http-3A__gpfsug.org_mailman_listinfo_gpfsug-2Ddiscuss&d=DwICAg&c=jf_iaSHvJObTbx-siA1ZOg&r=IbxtjdkPAM2Sbon4Lbbi4w&m=zufu2dI7tmWy-NT3JtWeBLKdOh7kh4HI2I8z4NyIRkc&s=IrYnLhlxx4D2HcHgbdFkE1S4Rmo3mFX9Q0TmnBd6iYg&e= -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: ecblank.gif Type: image/gif Size: 45 bytes Desc: not available URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: 19326261.gif Type: image/gif Size: 4659 bytes Desc: not available URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: graycol.gif Type: image/gif Size: 105 bytes Desc: not available URL: From heiner.billich at psi.ch Tue Feb 12 17:45:25 2019 From: heiner.billich at psi.ch (Billich Heinrich Rainer (PSI)) Date: Tue, 12 Feb 2019 17:45:25 +0000 Subject: [gpfsug-discuss] ces - change preferred host for an IP without actually moving the IP Message-ID: <3E654746-B4CA-4317-A048-01B343838A54@psi.ch> Hello, Can I change the preferred server for a ces address without actually moving the IP? In my case the IP already moved to the new server due to a failure on a second server. Now I would like the IP to stay even if the other server gets active again: I first want to move a test address only. But ?mmces address move? denies to run as the address already is on the server I want to make the preferred one. I also didn?t find where this address assignment is stored, I searched in the files available from ccr. 
Thank you, Heiner -- Paul Scherrer Institut Heiner Billich System Engineer Scientific Computing Science IT / High Performance Computing WHGA/106 Forschungsstrasse 111 5232 Villigen PSI Switzerland Phone +41 56 310 36 02 heiner.billich at psi.ch https://www.psi.ch -------------- next part -------------- An HTML attachment was scrubbed... URL: From sannaik2 at in.ibm.com Tue Feb 12 19:50:52 2019 From: sannaik2 at in.ibm.com (Sandeep Naik1) Date: Wed, 13 Feb 2019 01:20:52 +0530 Subject: [gpfsug-discuss] Unbalanced pdisk free space In-Reply-To: <83A6EEB0EC738F459A39439733AE8045267E32C0@MBX114.d.ethz.ch> References: <83A6EEB0EC738F459A39439733AE8045267DF159@MBX114.d.ethz.ch>, <83A6EEB0EC738F459A39439733AE8045267E32C0@MBX114.d.ethz.ch> Message-ID: Hi Alvise, Here are the responses to your questions (quoted below).

Q - Can anybody tell me if it is normal that all the pdisks of both my recovery groups, residing on the same physical enclosure have free space equal to (more or less) 1/3 of the free space of the pdisks residing on the other physical enclosure (see attached text files for the command line output) ?

Yes, it is normal to see variation in free space between pdisks. The variation should be seen in the context of used space and not free space. GNR tries to balance space equally across enclosures (failure groups). One enclosure has one SSD (per RG) so it has 41 disks in DA1 while the other one has 42. The enclosure with 42 disks shows 360 GiB free space while the one with 41 disks shows 120 GiB. If you look at used capacity and distribute it equally between the two enclosures, you will notice that the used capacity is almost the same on both: 42 * (10240 - 360) ≈ 41 * (10240 - 120)

Q - I guess when the least free disks are fully occupied (while the others are still partially free) write performance will drop by a factor of two. Correct ? Is there a way (considering that the system is in production) to fix (rebalance) this free space among all pdisk of both enclosures ?

You should see this in the context of the size of the pdisk, which in your case is 10 TB. The disk showing 120 GB free is 98% full while the one showing 360 GB free is 96% full. This free space is available for creating vdisks and should not be confused with free space available in the filesystem. Your pdisks are by and large equally filled, so there will be no impact on write performance from this small variation in free space. Hope this helps. Thanks, Sandeep Naik Elastic Storage server / GPFS Test ETZ-B, Hinjewadi Pune India (+91) 8600994314

From: "Dorigo Alvise (PSI)" To: gpfsug main discussion list Date: 31/01/2019 04:07 PM Subject: Re: [gpfsug-discuss] Unbalanced pdisk free space Sent by: gpfsug-discuss-bounces at spectrumscale.org They're attached. Thanks! Alvise From: gpfsug-discuss-bounces at spectrumscale.org [gpfsug-discuss-bounces at spectrumscale.org] on behalf of IBM Spectrum Scale [scale at us.ibm.com] Sent: Wednesday, January 30, 2019 9:25 PM To: gpfsug main discussion list Subject: Re: [gpfsug-discuss] Unbalanced pdisk free space Alvise, Could you send us the output of the following commands from both server nodes.
mmfsadm dump nspdclient > /tmp/dump_nspdclient.
mmfsadm dump pdisk > /tmp/dump_pdisk.
Regards, The Spectrum Scale (GPFS) team ------------------------------------------------------------------------------------------------------------------ If you feel that your question can benefit other users of Spectrum Scale (GPFS), then please post it to the public IBM developerWroks Forum at https://www.ibm.com/developerworks/community/forums/html/forum?id=11111111-0000-0000-0000-000000000479 . If your query concerns a potential software error in Spectrum Scale (GPFS) and you have an IBM software maintenance contract please contact 1-800-237-5511 in the United States or your local IBM Service Center in other countries. The forum is informally monitored as time permits and should not be used for priority messages to the Spectrum Scale (GPFS) team. From: "Dorigo Alvise (PSI)" To: "gpfsug-discuss at spectrumscale.org" Date: 01/30/2019 08:24 AM Subject: [gpfsug-discuss] Unbalanced pdisk free space Sent by: gpfsug-discuss-bounces at spectrumscale.org Hello, I've a Lenovo Spectrum Scale system DSS-G220 (software dss-g-2.0a) composed of 2x x3560 M5 IO server nodes 1x x3550 M5 client/support node 2x disk enclosures D3284 GPFS/GNR 4.2.3-7 Can anybody tell me if it is normal that all the pdisks of both my recovery groups, residing on the same physical enclosure have free space equal to (more or less) 1/3 of the free space of the pdisks residing on the other physical enclosure (see attached text files for the command line output) ? I guess when the least free disks are fully occupied (while the others are still partially free) write performance will drop by a factor of two. Correct ? Is there a way (considering that the system is in production) to fix (rebalance) this free space among all pdisk of both enclosures ? Should I open a PMR to IBM ? Many thanks, Alvise [attachment "rg1" deleted by Brian Herr/Poughkeepsie/IBM] [attachment "rg2" deleted by Brian Herr/Poughkeepsie/IBM] _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss [attachment "dump_nspdclient.sf-dssio-1" deleted by Sandeep Naik1/India/IBM] [attachment "dump_nspdclient.sf-dssio-2" deleted by Sandeep Naik1/India/IBM] [attachment "dump_pdisk.sf-dssio-1" deleted by Sandeep Naik1/India/IBM] [attachment "dump_pdisk.sf-dssio-2" deleted by Sandeep Naik1/India/IBM] _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org https://urldefense.proofpoint.com/v2/url?u=http-3A__gpfsug.org_mailman_listinfo_gpfsug-2Ddiscuss&d=DwICAg&c=jf_iaSHvJObTbx-siA1ZOg&r=DXkezTwrVXsEOfvoqY7_DLS86P5FtQszjm9zok6upRU&m=rrqeq4UVHOFW9aaiAj-N7Lu6Z7UKBo4-0e3yINS47W0&s=n2t4qaUh-0mamutSSx0E-5j09DbZImKsbDoiM0enBcg&e= -------------- next part -------------- An HTML attachment was scrubbed... URL: From alvise.dorigo at psi.ch Wed Feb 13 08:30:47 2019 From: alvise.dorigo at psi.ch (Dorigo Alvise (PSI)) Date: Wed, 13 Feb 2019 08:30:47 +0000 Subject: [gpfsug-discuss] Unbalanced pdisk free space In-Reply-To: References: <83A6EEB0EC738F459A39439733AE8045267DF159@MBX114.d.ethz.ch>, <83A6EEB0EC738F459A39439733AE8045267E32C0@MBX114.d.ethz.ch>, Message-ID: <83A6EEB0EC738F459A39439733AE8045267E81EC@MBX114.d.ethz.ch> Thank you, I've understood the math and the focus from free space to used one. 
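(Spelling that comparison out with the numbers quoted above, per enclosure and rounding to whole GiB of pdisk capacity:

    42 * (10240 - 360) GiB = 42 * 9880 GiB  = 414,960 GiB used
    41 * (10240 - 120) GiB = 41 * 10120 GiB = 414,920 GiB used

i.e. the two enclosures carry almost exactly the same used capacity, which is the balance GNR aims for.)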
The only thing that remains strange for me is that I've not seen something like this in other systems (IBM ESS GL2 and another Lenovo G240 and G260), but I guess that the reason could be that they have much less used space and fewer allocated vdisks. thanks, Alvise ________________________________ From: Sandeep Naik1 [sannaik2 at in.ibm.com] Sent: Tuesday, February 12, 2019 8:50 PM To: gpfsug main discussion list; Dorigo Alvise (PSI) Subject: Re: [gpfsug-discuss] Unbalanced pdisk free space Hi Alvise, Here are the responses to your questions (quoted below). Q - Can anybody tell me if it is normal that all the pdisks of both my recovery groups, residing on the same physical enclosure have free space equal to (more or less) 1/3 of the free space of the pdisks residing on the other physical enclosure (see attached text files for the command line output) ? Yes, it is normal to see variation in free space between pdisks. The variation should be seen in the context of used space and not free space. GNR tries to balance space equally across enclosures (failure groups). One enclosure has one SSD (per RG) so it has 41 disks in DA1 while the other one has 42. The enclosure with 42 disks shows 360 GiB free space while the one with 41 disks shows 120 GiB. If you look at used capacity and distribute it equally between the two enclosures, you will notice that the used capacity is almost the same on both: 42 * (10240 - 360) ≈ 41 * (10240 - 120) Q - I guess when the least free disks are fully occupied (while the others are still partially free) write performance will drop by a factor of two. Correct ? Is there a way (considering that the system is in production) to fix (rebalance) this free space among all pdisk of both enclosures ? You should see this in the context of the size of the pdisk, which in your case is 10 TB. The disk showing 120 GB free is 98% full while the one showing 360 GB free is 96% full. This free space is available for creating vdisks and should not be confused with free space available in the filesystem. Your pdisks are by and large equally filled, so there will be no impact on write performance from this small variation in free space. Hope this helps. Thanks, Sandeep Naik Elastic Storage server / GPFS Test ETZ-B, Hinjewadi Pune India (+91) 8600994314 From: "Dorigo Alvise (PSI)" To: gpfsug main discussion list Date: 31/01/2019 04:07 PM Subject: Re: [gpfsug-discuss] Unbalanced pdisk free space Sent by: gpfsug-discuss-bounces at spectrumscale.org ________________________________ They're attached. Thanks! Alvise ________________________________ From: gpfsug-discuss-bounces at spectrumscale.org [gpfsug-discuss-bounces at spectrumscale.org] on behalf of IBM Spectrum Scale [scale at us.ibm.com] Sent: Wednesday, January 30, 2019 9:25 PM To: gpfsug main discussion list Subject: Re: [gpfsug-discuss] Unbalanced pdisk free space Alvise, Could you send us the output of the following commands from both server nodes. * mmfsadm dump nspdclient > /tmp/dump_nspdclient. * mmfsadm dump pdisk > /tmp/dump_pdisk. Regards, The Spectrum Scale (GPFS) team ------------------------------------------------------------------------------------------------------------------ If you feel that your question can benefit other users of Spectrum Scale (GPFS), then please post it to the public IBM developerWorks Forum at https://www.ibm.com/developerworks/community/forums/html/forum?id=11111111-0000-0000-0000-000000000479.
If your query concerns a potential software error in Spectrum Scale (GPFS) and you have an IBM software maintenance contract please contact 1-800-237-5511 in the United States or your local IBM Service Center in other countries. The forum is informally monitored as time permits and should not be used for priority messages to the Spectrum Scale (GPFS) team. From: "Dorigo Alvise (PSI)" To: "gpfsug-discuss at spectrumscale.org" Date: 01/30/2019 08:24 AM Subject: [gpfsug-discuss] Unbalanced pdisk free space Sent by: gpfsug-discuss-bounces at spectrumscale.org ________________________________ Hello, I've a Lenovo Spectrum Scale system DSS-G220 (software dss-g-2.0a) composed of 2x x3560 M5 IO server nodes 1x x3550 M5 client/support node 2x disk enclosures D3284 GPFS/GNR 4.2.3-7 Can anybody tell me if it is normal that all the pdisks of both my recovery groups, residing on the same physical enclosure have free space equal to (more or less) 1/3 of the free space of the pdisks residing on the other physical enclosure (see attached text files for the command line output) ? I guess when the least free disks are fully occupied (while the others are still partially free) write performance will drop by a factor of two. Correct ? Is there a way (considering that the system is in production) to fix (rebalance) this free space among all pdisk of both enclosures ? Should I open a PMR to IBM ? Many thanks, Alvise [attachment "rg1" deleted by Brian Herr/Poughkeepsie/IBM] [attachment "rg2" deleted by Brian Herr/Poughkeepsie/IBM] _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss [attachment "dump_nspdclient.sf-dssio-1" deleted by Sandeep Naik1/India/IBM] [attachment "dump_nspdclient.sf-dssio-2" deleted by Sandeep Naik1/India/IBM] [attachment "dump_pdisk.sf-dssio-1" deleted by Sandeep Naik1/India/IBM] [attachment "dump_pdisk.sf-dssio-2" deleted by Sandeep Naik1/India/IBM] _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss -------------- next part -------------- An HTML attachment was scrubbed... URL: From nico.faerber at id.unibe.ch Fri Feb 15 11:59:43 2019 From: nico.faerber at id.unibe.ch (nico.faerber at id.unibe.ch) Date: Fri, 15 Feb 2019 11:59:43 +0000 Subject: [gpfsug-discuss] Clear old/stale events in GUI Message-ID: Dear all, We see some outdated events ("longwaiters_warn" and "gpfs_warn" for component GPFS with an age of 2 weeks) for some nodes that are resolved in in the meantime ("mmhealth node show" on the affected nodes report HEALTHY for component GPFS). How can I remove those outdated event logs from the GUI? Is there a button/command or do I have to manually delete some records in the database? If yes, what is the recommended procedure? We are running: Cluster minimum release level: 4.2.3.0 GUI release level: 5.0.2-1 Thank you. Best, Nico Universitaet Bern Abt. Informatikdienste Nico F?rber High Performance Computing Gesellschaftsstrasse 6 CH-3012 Bern Raum 104 Tel. +41 (0)31 631 51 89 -------------- next part -------------- An HTML attachment was scrubbed... URL: From Matthias.Ritter at de.ibm.com Fri Feb 15 12:22:31 2019 From: Matthias.Ritter at de.ibm.com (Matthias Ritter) Date: Fri, 15 Feb 2019 12:22:31 +0000 Subject: [gpfsug-discuss] Clear old/stale events in GUI In-Reply-To: References: Message-ID: An HTML attachment was scrubbed... 
URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: Image.155021689579920.png Type: image/png Size: 1167 bytes Desc: not available URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: Image.155021689579921.png Type: image/png Size: 6645 bytes Desc: not available URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: Image.155021689579922.png Type: image/png Size: 1167 bytes Desc: not available URL: From nico.faerber at id.unibe.ch Fri Feb 15 14:05:01 2019 From: nico.faerber at id.unibe.ch (nico.faerber at id.unibe.ch) Date: Fri, 15 Feb 2019 14:05:01 +0000 Subject: [gpfsug-discuss] Clear old/stale events in GUI In-Reply-To: References: Message-ID: <46BF617B-D84D-4D80-8C97-506243DFCAF5@id.unibe.ch> Dear Mr. Ritter, It worked. The stale events are gone. Thank you very much. Best, Nico --- Universit?t Bern Informatikdienste Gruppe Systemdienste Nico F?rber Systemadministrator HPC Hochschulstrasse 6 CH-3012 Bern Tel. +41 (0)31 631 51 89 mailto: grid-support at id.unibe.ch http://www.id.unibe.ch/ Von: im Auftrag von Matthias Ritter Antworten an: gpfsug main discussion list Datum: Freitag, 15. Februar 2019 um 13:22 An: "gpfsug-discuss at spectrumscale.org" Cc: "gpfsug-discuss at spectrumscale.org" Betreff: Re: [gpfsug-discuss] Clear old/stale events in GUI Hello Mr. F?rber, please run on each GUI node you have the following command: /usr/lpp/mmfs/gui/cli/lshealth --reset This should help clearing this stale events not shown by mmhealth. Mit freundlichen Gr??en / Kind regards [cid:155021689579920] [IBM Spectrum Scale] * Matthias Ritter Spectrum Scale GUI Development Department M069 / Spectrum Scale Software Development +49-7034-2744-1977 Matthias.Ritter at de.ibm.com [cid:155021689579922] IBM Deutschland Research & Development GmbH Vorsitzender des Aufsichtsrats: Matthias Hartmann Gesch?ftsf?hrung: Dirk Wittkopp Sitz der Gesellschaft: B?blingen / Registergericht: Amtsgericht Stuttgart, HRB 243294 ----- Urspr?ngliche Nachricht ----- Von: Gesendet von: gpfsug-discuss-bounces at spectrumscale.org An: CC: Betreff: [gpfsug-discuss] Clear old/stale events in GUI Datum: Fr, 15. Feb 2019 13:14 Dear all, We see some outdated events ("longwaiters_warn" and "gpfs_warn" for component GPFS with an age of 2 weeks) for some nodes that are resolved in in the meantime ("mmhealth node show" on the affected nodes report HEALTHY for component GPFS). How can I remove those outdated event logs from the GUI? Is there a button/command or do I have to manually delete some records in the database? If yes, what is the recommended procedure? We are running: Cluster minimum release level: 4.2.3.0 GUI release level: 5.0.2-1 Thank you. Best, Nico Universitaet Bern Abt. Informatikdienste Nico F?rber High Performance Computing Gesellschaftsstrasse 6 CH-3012 Bern Raum 104 Tel. +41 (0)31 631 51 89 _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: image001.png Type: image/png Size: 1168 bytes Desc: image001.png URL: -------------- next part -------------- A non-text attachment was scrubbed... 
Name: image002.png Type: image/png Size: 6646 bytes Desc: image002.png URL: From Kevin.Buterbaugh at Vanderbilt.Edu Fri Feb 15 15:10:57 2019 From: Kevin.Buterbaugh at Vanderbilt.Edu (Buterbaugh, Kevin L) Date: Fri, 15 Feb 2019 15:10:57 +0000 Subject: [gpfsug-discuss] Clarification of mmdiag --iohist output Message-ID: Hi All, Been reading man pages, docs, and Googling, and haven?t found a definitive answer to this question, so I knew exactly where to turn? ;-) I?m dealing with some slow I/O?s to certain storage arrays in our environments ? like really, really slow I/O?s ? here?s just one example from one of my NSD servers of a 10 second I/O: 08:49:34.943186 W data 30:41615622144 2048 10115.192 srv dm-92 So here?s my question ? when mmdiag ?iohist tells me that that I/O took slightly over 10 seconds, is that: 1. The time from when the NSD server received the I/O request from the client until it shipped the data back onto the wire towards the client? 2. The time from when the client issued the I/O request until it received the data back from the NSD server? 3. Something else? I?m thinking it?s #1, but want to confirm. Which one it is has very obvious implications for our troubleshooting steps. Thanks in advance? Kevin ? Kevin Buterbaugh - Senior System Administrator Vanderbilt University - Advanced Computing Center for Research and Education Kevin.Buterbaugh at vanderbilt.edu - (615)875-9633 -------------- next part -------------- An HTML attachment was scrubbed... URL: From aaron.knister at gmail.com Sun Feb 17 14:26:23 2019 From: aaron.knister at gmail.com (Aaron Knister) Date: Sun, 17 Feb 2019 09:26:23 -0500 Subject: [gpfsug-discuss] Clarification of mmdiag --iohist output In-Reply-To: References: Message-ID: Hi Kevin, It's funny you bring this up because I was looking at this yesterday. My belief is that it's the time the from when the I/O request was queued internally by the client to when the I/O response was received from the NSD server which means it absolutely includes the network RTT. It would be great to get formal confirmation of this from someone who knows the code. Here's some trace data showing a single 4K read from an NSD client. I've stripped out a bunch of uninteresting stuff. It's my belief that the TRACE_IO indicates the point at which the "i/o timer" reported on by mmdiag --iohist begins ticking. The testing data seems to support this. If I'm correct, the testing data shows that the RDMA I/O to the NSD server occurs within the TRACE_IO timing window. The other thing that makes me believe this, is in my testing the mmdiag --iohist on the client shows an average latency of ~230us for a 4K read whereas mmdiag --iohist on the NSD server appears to show an average latency of ~170us when servicing those 4K reads from the back-end disk (a DDN SFA14KX). 
0.000218276 37005 TRACE_DISK: doReplicatedRead: da 34:490710888 0.000218424 37005 TRACE_IO: QIO: read data tag 743942 108137 ioVecSize 1 1st buf 0x122F000 nsdName S01_DMD_NSD_034 da 34:490710888 nSectors 8 align 0 by iocMBHandler (DioHandlerThread) 0.000218566 37005 TRACE_DLEASE: checkLeaseForIO: rc 0 0.000218628 37005 TRACE_IO: SIO: ioVecSize 1 1st buf 0x122F000 nsdId 0A011103:5C59DBAC da 34:490710888 nSectors 8 doioP 0x1806994D300 isDaemonAddr 0 0.000218672 37005 TRACE_FS: verify4KIO exit: code 4 err 0 0.000219106 37005 TRACE_NSD: nsdDoIO enter: read ioVecSize 1 1st bufAddr 0x122F000 nsdId 0A011103:5C59DBAC da 34:490710888 nBytes 4096 isDaemonAddr 0 0.000219408 37005 TRACE_GCRYPTO: EncBufPool::getTmpBuf(): bsize=4096 code=205 err=0 bufP=0x180656C1058 outBufP=0x180656C1058 index=0 0.000220105 37005 TRACE_TS: sendMessage msg_id 22993695: dest 10.3.17.3 sto03 0.000220436 37005 TRACE_RDMA: verbs::verbsSend_i: enter: index 11 cookie 12 node msg_id 22993695 len 92 0.000221111 37005 TRACE_RDMA: verbsConn::postSend: currRpcHeadP 0x7F6D7FA106E8 sendType SEND_BUFFER_RPC index 11 cookie 12 threadId 37005 bufferId.index 11 0.000221662 37005 TRACE_MUTEX: Waiting on fast condvar for signal 0x7F6D7FA106F8 RdmaSend_NSD_SVC 0.000426716 16691 TRACE_RDMA: handleRecvComp: success 1 of 1 nWrSuccess 1 index 11 cookie 12 wr_id 0xB0E6E00000000 bufferP 0x7F6D7EEE1700 byte_len 4144 0.000432140 37005 TRACE_NSD: nsdDoIO_ReadAndCheck: read complete, len 0 status 6 err 0 bufP 0x180656C1058 dioIsOverRdma 1 ioDataP 0x200000BC000 ckSumType NsdCksum_None 0.000432163 37005 TRACE_NSD: nsdDoIO_ReadAndCheck: exit err 0 0.000433707 37005 TRACE_GCRYPTO: EncBufPool::releaseTmpBuf(): exit bsize=8192 err=0 inBufP=0x180656C1058 bufP=0x180656C1058 index=0 0.000433777 37005 TRACE_NSD: nsdDoIO exit: err 0 0 0.000433844 37005 TRACE_IO: FIO: read data tag 743942 108137 ioVecSize 1 1st buf 0x122F000 nsdId 0A011103:5C59DBAC da 34:490710888 nSectors 8 err 0 0.000434236 37005 TRACE_DISK: postIO: qosid A00D91E read data disk FFFFFFFF ioVecSize 1 1st buf 0x122F000 err 0 duration 0.000215000 by iocMBHandler (DioHandlerThread) I'd suggest looking at "mmdiag --iohist" on the NSD server itself and see if/how that differs from the client. The other thing you could do is see if your NSD server queues are backing up (e.g. "mmfsadm saferdump nsd" and look for "requests pending" on queues where the "active" field is > 0). That doesn't necessarily mean you need to tune your queues but I'd suggest that if the disk I/O on your NSD server looks healthy (e.g. low latency, not overly-taxed) that you could benefit from queue tuning. -Aaron On Sat, Feb 16, 2019 at 9:47 AM Buterbaugh, Kevin L < Kevin.Buterbaugh at vanderbilt.edu> wrote: > Hi All, > > Been reading man pages, docs, and Googling, and haven?t found a definitive > answer to this question, so I knew exactly where to turn? ;-) > > I?m dealing with some slow I/O?s to certain storage arrays in our > environments ? like really, really slow I/O?s ? here?s just one example > from one of my NSD servers of a 10 second I/O: > > 08:49:34.943186 W data 30:41615622144 2048 10115.192 srv > dm-92 > > So here?s my question ? when mmdiag ?iohist tells me that that I/O took > slightly over 10 seconds, is that: > > 1. The time from when the NSD server received the I/O request from the > client until it shipped the data back onto the wire towards the client? > 2. The time from when the client issued the I/O request until it received > the data back from the NSD server? > 3. Something else? 
> > I?m thinking it?s #1, but want to confirm. Which one it is has very > obvious implications for our troubleshooting steps. Thanks in advance? > > Kevin > ? > Kevin Buterbaugh - Senior System Administrator > Vanderbilt University - Advanced Computing Center for Research and > Education > Kevin.Buterbaugh at vanderbilt.edu - (615)875-9633 > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > -------------- next part -------------- An HTML attachment was scrubbed... URL: From scale at us.ibm.com Sun Feb 17 18:13:17 2019 From: scale at us.ibm.com (IBM Spectrum Scale) Date: Sun, 17 Feb 2019 23:43:17 +0530 Subject: [gpfsug-discuss] ces - change preferred host for an IP without actually moving the IP In-Reply-To: <3E654746-B4CA-4317-A048-01B343838A54@psi.ch> References: <3E654746-B4CA-4317-A048-01B343838A54@psi.ch> Message-ID: @Frank, Can you please help with the below query. Regards, The Spectrum Scale (GPFS) team ------------------------------------------------------------------------------------------------------------------ If you feel that your question can benefit other users of Spectrum Scale (GPFS), then please post it to the public IBM developerWroks Forum at https://www.ibm.com/developerworks/community/forums/html/forum?id=11111111-0000-0000-0000-000000000479 . If your query concerns a potential software error in Spectrum Scale (GPFS) and you have an IBM software maintenance contract please contact 1-800-237-5511 in the United States or your local IBM Service Center in other countries. The forum is informally monitored as time permits and should not be used for priority messages to the Spectrum Scale (GPFS) team. From: "Billich Heinrich Rainer (PSI)" To: gpfsug main discussion list Date: 02/12/2019 11:18 PM Subject: [gpfsug-discuss] ces - change preferred host for an IP without actually moving the IP Sent by: gpfsug-discuss-bounces at spectrumscale.org Hello, Can I change the preferred server for a ces address without actually moving the IP? In my case the IP already moved to the new server due to a failure on a second server. Now I would like the IP to stay even if the other server gets active again: I first want to move a test address only. But ?mmces address move? denies to run as the address already is on the server I want to make the preferred one. I also didn?t find where this address assignment is stored, I searched in the files available from ccr. Thank you, Heiner -- Paul Scherrer Institut Heiner Billich System Engineer Scientific Computing Science IT / High Performance Computing WHGA/106 Forschungsstrasse 111 5232 Villigen PSI Switzerland Phone +41 56 310 36 02 heiner.billich at psi.ch https://www.psi.ch _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org https://urldefense.proofpoint.com/v2/url?u=http-3A__gpfsug.org_mailman_listinfo_gpfsug-2Ddiscuss&d=DwICAg&c=jf_iaSHvJObTbx-siA1ZOg&r=IbxtjdkPAM2Sbon4Lbbi4w&m=AfpcM3p1Ru44FyaSyGfml_GFX4T4mQGuaGNURp8MUSI&s=CaYKqK4hj0eunF_WiOWve6Iq3C4aqqSIV0xxDEM8zAQ&e= -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From scale at us.ibm.com Sun Feb 17 19:01:24 2019 From: scale at us.ibm.com (IBM Spectrum Scale) Date: Mon, 18 Feb 2019 00:31:24 +0530 Subject: [gpfsug-discuss] Clarification of mmdiag --iohist output In-Reply-To: References: Message-ID: Hi Kevin, The I/O hist shown by the command mmdiag --iohist actually depends on the node on which you are running this command from. If you are running this on a NSD server node then it will show the time taken to complete/serve the read or write I/O operation sent from the client node. And if you are running this on a client (or non NSD server) node then it will show the complete time taken by the read or write I/O operation requested by the client node to complete. So in a nut shell for the NSD server case it is just the latency of the I/O done on disk by the server whereas for the NSD client case it also the latency of send and receive of I/O request to the NSD server along with the latency of I/O done on disk by the NSD server. I hope this answers your query. Regards, The Spectrum Scale (GPFS) team ------------------------------------------------------------------------------------------------------------------ If you feel that your question can benefit other users of Spectrum Scale (GPFS), then please post it to the public IBM developerWroks Forum at https://www.ibm.com/developerworks/community/forums/html/forum?id=11111111-0000-0000-0000-000000000479 . If your query concerns a potential software error in Spectrum Scale (GPFS) and you have an IBM software maintenance contract please contact 1-800-237-5511 in the United States or your local IBM Service Center in other countries. The forum is informally monitored as time permits and should not be used for priority messages to the Spectrum Scale (GPFS) team. From: "Buterbaugh, Kevin L" To: gpfsug main discussion list Date: 02/16/2019 08:18 PM Subject: [gpfsug-discuss] Clarification of mmdiag --iohist output Sent by: gpfsug-discuss-bounces at spectrumscale.org Hi All, Been reading man pages, docs, and Googling, and haven?t found a definitive answer to this question, so I knew exactly where to turn? ;-) I?m dealing with some slow I/O?s to certain storage arrays in our environments ? like really, really slow I/O?s ? here?s just one example from one of my NSD servers of a 10 second I/O: 08:49:34.943186 W data 30:41615622144 2048 10115.192 srv dm-92 So here?s my question ? when mmdiag ?iohist tells me that that I/O took slightly over 10 seconds, is that: 1. The time from when the NSD server received the I/O request from the client until it shipped the data back onto the wire towards the client? 2. The time from when the client issued the I/O request until it received the data back from the NSD server? 3. Something else? I?m thinking it?s #1, but want to confirm. Which one it is has very obvious implications for our troubleshooting steps. Thanks in advance? Kevin ? Kevin Buterbaugh - Senior System Administrator Vanderbilt University - Advanced Computing Center for Research and Education Kevin.Buterbaugh at vanderbilt.edu - (615)875-9633 _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org https://urldefense.proofpoint.com/v2/url?u=http-3A__gpfsug.org_mailman_listinfo_gpfsug-2Ddiscuss&d=DwICAg&c=jf_iaSHvJObTbx-siA1ZOg&r=IbxtjdkPAM2Sbon4Lbbi4w&m=zgeKdB1auU2SQrpQXrxc88rzoAWczKl_H7fqsgwqpv0&s=vbOLNkf-Y_NBNABzd8Enw14ykpYN2q5SoQLkAKiGIrU&e= -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From oehmes at gmail.com Sun Feb 17 18:59:37 2019 From: oehmes at gmail.com (Sven Oehme) Date: Sun, 17 Feb 2019 10:59:37 -0800 Subject: [gpfsug-discuss] Clarification of mmdiag --iohist output In-Reply-To: References: Message-ID: <447A32D4-5B23-47F3-B55C-6B51D411BD67@gmail.com> If you run it on the client, it includes local queuing, network as well as NSD Server processing and the actual device I/O time. if issued on the NSD Server it contains processing and I/O time, the processing shouldn?t really add any overhead but in some cases I have seen it contributing. If you corelate the client and server iohist outputs you can find the server entry based on the tags in the iohist output, this allows you to see exactly how much time was spend on network vs on the server to rule out network as the problem. Sven From: on behalf of Aaron Knister Reply-To: gpfsug main discussion list Date: Sunday, February 17, 2019 at 6:26 AM To: gpfsug main discussion list Subject: Re: [gpfsug-discuss] Clarification of mmdiag --iohist output Hi Kevin, It's funny you bring this up because I was looking at this yesterday. My belief is that it's the time the from when the I/O request was queued internally by the client to when the I/O response was received from the NSD server which means it absolutely includes the network RTT. It would be great to get formal confirmation of this from someone who knows the code. Here's some trace data showing a single 4K read from an NSD client. I've stripped out a bunch of uninteresting stuff. It's my belief that the TRACE_IO indicates the point at which the "i/o timer" reported on by mmdiag --iohist begins ticking. The testing data seems to support this. If I'm correct, the testing data shows that the RDMA I/O to the NSD server occurs within the TRACE_IO timing window. The other thing that makes me believe this, is in my testing the mmdiag --iohist on the client shows an average latency of ~230us for a 4K read whereas mmdiag --iohist on the NSD server appears to show an average latency of ~170us when servicing those 4K reads from the back-end disk (a DDN SFA14KX). 
0.000218276 37005 TRACE_DISK: doReplicatedRead: da 34:490710888 0.000218424 37005 TRACE_IO: QIO: read data tag 743942 108137 ioVecSize 1 1st buf 0x122F000 nsdName S01_DMD_NSD_034 da 34:490710888 nSectors 8 align 0 by iocMBHandler (DioHandlerThread) 0.000218566 37005 TRACE_DLEASE: checkLeaseForIO: rc 0 0.000218628 37005 TRACE_IO: SIO: ioVecSize 1 1st buf 0x122F000 nsdId 0A011103:5C59DBAC da 34:490710888 nSectors 8 doioP 0x1806994D300 isDaemonAddr 0 0.000218672 37005 TRACE_FS: verify4KIO exit: code 4 err 0 0.000219106 37005 TRACE_NSD: nsdDoIO enter: read ioVecSize 1 1st bufAddr 0x122F000 nsdId 0A011103:5C59DBAC da 34:490710888 nBytes 4096 isDaemonAddr 0 0.000219408 37005 TRACE_GCRYPTO: EncBufPool::getTmpBuf(): bsize=4096 code=205 err=0 bufP=0x180656C1058 outBufP=0x180656C1058 index=0 0.000220105 37005 TRACE_TS: sendMessage msg_id 22993695: dest 10.3.17.3 sto03 0.000220436 37005 TRACE_RDMA: verbs::verbsSend_i: enter: index 11 cookie 12 node msg_id 22993695 len 92 0.000221111 37005 TRACE_RDMA: verbsConn::postSend: currRpcHeadP 0x7F6D7FA106E8 sendType SEND_BUFFER_RPC index 11 cookie 12 threadId 37005 bufferId.index 11 0.000221662 37005 TRACE_MUTEX: Waiting on fast condvar for signal 0x7F6D7FA106F8 RdmaSend_NSD_SVC 0.000426716 16691 TRACE_RDMA: handleRecvComp: success 1 of 1 nWrSuccess 1 index 11 cookie 12 wr_id 0xB0E6E00000000 bufferP 0x7F6D7EEE1700 byte_len 4144 0.000432140 37005 TRACE_NSD: nsdDoIO_ReadAndCheck: read complete, len 0 status 6 err 0 bufP 0x180656C1058 dioIsOverRdma 1 ioDataP 0x200000BC000 ckSumType NsdCksum_None 0.000432163 37005 TRACE_NSD: nsdDoIO_ReadAndCheck: exit err 0 0.000433707 37005 TRACE_GCRYPTO: EncBufPool::releaseTmpBuf(): exit bsize=8192 err=0 inBufP=0x180656C1058 bufP=0x180656C1058 index=0 0.000433777 37005 TRACE_NSD: nsdDoIO exit: err 0 0 0.000433844 37005 TRACE_IO: FIO: read data tag 743942 108137 ioVecSize 1 1st buf 0x122F000 nsdId 0A011103:5C59DBAC da 34:490710888 nSectors 8 err 0 0.000434236 37005 TRACE_DISK: postIO: qosid A00D91E read data disk FFFFFFFF ioVecSize 1 1st buf 0x122F000 err 0 duration 0.000215000 by iocMBHandler (DioHandlerThread) I'd suggest looking at "mmdiag --iohist" on the NSD server itself and see if/how that differs from the client. The other thing you could do is see if your NSD server queues are backing up (e.g. "mmfsadm saferdump nsd" and look for "requests pending" on queues where the "active" field is > 0). That doesn't necessarily mean you need to tune your queues but I'd suggest that if the disk I/O on your NSD server looks healthy (e.g. low latency, not overly-taxed) that you could benefit from queue tuning. -Aaron On Sat, Feb 16, 2019 at 9:47 AM Buterbaugh, Kevin L wrote: Hi All, Been reading man pages, docs, and Googling, and haven?t found a definitive answer to this question, so I knew exactly where to turn? ;-) I?m dealing with some slow I/O?s to certain storage arrays in our environments ? like really, really slow I/O?s ? here?s just one example from one of my NSD servers of a 10 second I/O: 08:49:34.943186 W data 30:41615622144 2048 10115.192 srv dm-92 So here?s my question ? when mmdiag ?iohist tells me that that I/O took slightly over 10 seconds, is that: 1. The time from when the NSD server received the I/O request from the client until it shipped the data back onto the wire towards the client? 2. The time from when the client issued the I/O request until it received the data back from the NSD server? 3. Something else? I?m thinking it?s #1, but want to confirm. 
Which one it is has very obvious implications for our troubleshooting steps. Thanks in advance? Kevin ? Kevin Buterbaugh - Senior System Administrator Vanderbilt University - Advanced Computing Center for Research and Education Kevin.Buterbaugh at vanderbilt.edu - (615)875-9633 _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss -------------- next part -------------- An HTML attachment was scrubbed... URL: From scrusan at ddn.com Mon Feb 18 02:48:22 2019 From: scrusan at ddn.com (Steve Crusan) Date: Mon, 18 Feb 2019 02:48:22 +0000 Subject: [gpfsug-discuss] Clarification of mmdiag --iohist output In-Reply-To: References: , Message-ID: Context is key here. Where you run mmdiag?iohist matters, clientside or nsd server side. >From what I have seen and possibly understand, from the client, the time field indicates when an I/O was fully serviced (arrived into VFS layer, to be sent to the application), including RTT from the servers/disk. If you run the same command server side, my understanding is that the time field indicates how long it took for the server to write/read data to or from disk. For example, a few years ago I had to fix a system which was described as unacceptably slow after an upgrade from gpfs 3.3 to 3.4 or 3.5 (don?t fully remember). Iohist client side was showing many IOs waiting for 10 all the way up and to 50 SECONDS, not ms. Server side, I/O was being serviced via iohist within less than 5 ms. Also verified with iostat, basically doing a paltry 25MB/s per NSD server. What happened is that the small vs large queueing system changed in that version of GPFS, so there were hundreds of large IOS queued (found via Mmfsadm dump nsd server side) due to the limited number of large queues and threads server side. A quick mmchconfig fixed the problem, but if I only looked at the servers, it would?ve appeared things were fine, because the IO backend was sitting around twirling its thumbs. I don?t have access to the code, but all of behavior I have seen leads me to believe client side iohist includes network RTT. -Steve ________________________________ From: gpfsug-discuss-bounces at spectrumscale.org on behalf of Aaron Knister Sent: Sunday, February 17, 2019 8:26:23 AM To: gpfsug main discussion list Subject: Re: [gpfsug-discuss] Clarification of mmdiag --iohist output Hi Kevin, It's funny you bring this up because I was looking at this yesterday. My belief is that it's the time the from when the I/O request was queued internally by the client to when the I/O response was received from the NSD server which means it absolutely includes the network RTT. It would be great to get formal confirmation of this from someone who knows the code. Here's some trace data showing a single 4K read from an NSD client. I've stripped out a bunch of uninteresting stuff. It's my belief that the TRACE_IO indicates the point at which the "i/o timer" reported on by mmdiag --iohist begins ticking. The testing data seems to support this. If I'm correct, the testing data shows that the RDMA I/O to the NSD server occurs within the TRACE_IO timing window. 
The other thing that makes me believe this, is in my testing the mmdiag --iohist on the client shows an average latency of ~230us for a 4K read whereas mmdiag --iohist on the NSD server appears to show an average latency of ~170us when servicing those 4K reads from the back-end disk (a DDN SFA14KX). 0.000218276 37005 TRACE_DISK: doReplicatedRead: da 34:490710888 0.000218424 37005 TRACE_IO: QIO: read data tag 743942 108137 ioVecSize 1 1st buf 0x122F000 nsdName S01_DMD_NSD_034 da 34:490710888 nSectors 8 align 0 by iocMBHandler (DioHandlerThread) 0.000218566 37005 TRACE_DLEASE: checkLeaseForIO: rc 0 0.000218628 37005 TRACE_IO: SIO: ioVecSize 1 1st buf 0x122F000 nsdId 0A011103:5C59DBAC da 34:490710888 nSectors 8 doioP 0x1806994D300 isDaemonAddr 0 0.000218672 37005 TRACE_FS: verify4KIO exit: code 4 err 0 0.000219106 37005 TRACE_NSD: nsdDoIO enter: read ioVecSize 1 1st bufAddr 0x122F000 nsdId 0A011103:5C59DBAC da 34:490710888 nBytes 4096 isDaemonAddr 0 0.000219408 37005 TRACE_GCRYPTO: EncBufPool::getTmpBuf(): bsize=4096 code=205 err=0 bufP=0x180656C1058 outBufP=0x180656C1058 index=0 0.000220105 37005 TRACE_TS: sendMessage msg_id 22993695: dest 10.3.17.3 sto03 0.000220436 37005 TRACE_RDMA: verbs::verbsSend_i: enter: index 11 cookie 12 node msg_id 22993695 len 92 0.000221111 37005 TRACE_RDMA: verbsConn::postSend: currRpcHeadP 0x7F6D7FA106E8 sendType SEND_BUFFER_RPC index 11 cookie 12 threadId 37005 bufferId.index 11 0.000221662 37005 TRACE_MUTEX: Waiting on fast condvar for signal 0x7F6D7FA106F8 RdmaSend_NSD_SVC 0.000426716 16691 TRACE_RDMA: handleRecvComp: success 1 of 1 nWrSuccess 1 index 11 cookie 12 wr_id 0xB0E6E00000000 bufferP 0x7F6D7EEE1700 byte_len 4144 0.000432140 37005 TRACE_NSD: nsdDoIO_ReadAndCheck: read complete, len 0 status 6 err 0 bufP 0x180656C1058 dioIsOverRdma 1 ioDataP 0x200000BC000 ckSumType NsdCksum_None 0.000432163 37005 TRACE_NSD: nsdDoIO_ReadAndCheck: exit err 0 0.000433707 37005 TRACE_GCRYPTO: EncBufPool::releaseTmpBuf(): exit bsize=8192 err=0 inBufP=0x180656C1058 bufP=0x180656C1058 index=0 0.000433777 37005 TRACE_NSD: nsdDoIO exit: err 0 0 0.000433844 37005 TRACE_IO: FIO: read data tag 743942 108137 ioVecSize 1 1st buf 0x122F000 nsdId 0A011103:5C59DBAC da 34:490710888 nSectors 8 err 0 0.000434236 37005 TRACE_DISK: postIO: qosid A00D91E read data disk FFFFFFFF ioVecSize 1 1st buf 0x122F000 err 0 duration 0.000215000 by iocMBHandler (DioHandlerThread) I'd suggest looking at "mmdiag --iohist" on the NSD server itself and see if/how that differs from the client. The other thing you could do is see if your NSD server queues are backing up (e.g. "mmfsadm saferdump nsd" and look for "requests pending" on queues where the "active" field is > 0). That doesn't necessarily mean you need to tune your queues but I'd suggest that if the disk I/O on your NSD server looks healthy (e.g. low latency, not overly-taxed) that you could benefit from queue tuning. -Aaron On Sat, Feb 16, 2019 at 9:47 AM Buterbaugh, Kevin L > wrote: Hi All, Been reading man pages, docs, and Googling, and haven?t found a definitive answer to this question, so I knew exactly where to turn? ;-) I?m dealing with some slow I/O?s to certain storage arrays in our environments ? like really, really slow I/O?s ? here?s just one example from one of my NSD servers of a 10 second I/O: 08:49:34.943186 W data 30:41615622144 2048 10115.192 srv dm-92 So here?s my question ? when mmdiag ?iohist tells me that that I/O took slightly over 10 seconds, is that: 1. 
The time from when the NSD server received the I/O request from the client until it shipped the data back onto the wire towards the client? 2. The time from when the client issued the I/O request until it received the data back from the NSD server? 3. Something else? I?m thinking it?s #1, but want to confirm. Which one it is has very obvious implications for our troubleshooting steps. Thanks in advance? Kevin ? Kevin Buterbaugh - Senior System Administrator Vanderbilt University - Advanced Computing Center for Research and Education Kevin.Buterbaugh at vanderbilt.edu - (615)875-9633 _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss -------------- next part -------------- An HTML attachment was scrubbed... URL: From oehmes at gmail.com Tue Feb 19 19:46:36 2019 From: oehmes at gmail.com (Sven Oehme) Date: Tue, 19 Feb 2019 11:46:36 -0800 Subject: [gpfsug-discuss] Clarification of mmdiag --iohist output In-Reply-To: References: Message-ID: <6884F598-9039-4163-BD56-7D9E0C815044@gmail.com> Just to add a bit more details to that, If you want to track down an individual i/o or all i/o to a particular file you can do this with mmfsadm dump iohist (mmdiag doesn?t give you all you need) : so run /usr/lpp/mmfs/bin/mmfsadm dump iohist >iohist on server as well as client : I/O history: I/O start time RW??? Buf type disk:sectorNum???? nSec? time ms????? tag1???????? tag2?????????? Disk UID typ??????? NSD node?? context thread?????????????????????????? comment --------------- -- ----------- ----------------- -----? ------- --------- ------------ ------------------ --- --------------- --------- -------------------------------- ------- 12:22:41.880663? W??????? data??? 1:5602050048?? 32768? 927.272 258249737????????? 900? 0A2405A0:5C6BD9B0 cli? 172.16.254.160 Prefetch? WritebehindWorkerThread 12:22:42.038653? W??????? data??? 4:5815107584?? 32768? 803.106 258249737????????? 903? 0A2405A0:5C6BD9B6 cli? 172.16.254.163 Prefetch? WritebehindWorkerThread 12:22:42.504966? W??????? data??? 3:695664640??? 32768? 375.272 258249737????????? 918? 0A2405A0:5C6BD9B2 cli? 172.16.254.162 Prefetch? WritebehindWorkerThread 12:22:42.592712? W??????? data??? 1:1121779712?? 32768? 311.026 258249737????????? 920? 0A2405A0:5C6BD9B0 cli? 172.16.254.160 Prefetch? WritebehindWorkerThread 12:22:42.641689? W??????? data??? 2:1334837248?? 32768? 350.373 258249737????????? 921? 0A2405A0:5C6BD9B4 cli? 172.16.254.161 Prefetch? WritebehindWorkerThread 12:22:42.301120? W??????? data??? 1:6667337728?? 32768? 758.629 258249737????????? 912? 0A2405A0:5C6BD9B0 cli? 172.16.254.160 Prefetch? WritebehindWorkerThread 12:22:42.176365? W??????? data??? 1:6241222656?? 32768? 895.423 258249737??????? ??908? 0A2405A0:5C6BD9B0 cli? 172.16.254.160 Prefetch? WritebehindWorkerThread 12:22:42.283152? W??????? data??? 4:6454280192?? 32768? 840.528 258249737????????? 911? 0A2405A0:5C6BD9B6 cli? 172.16.254.163 Prefetch? WritebehindWorkerThread 12:22:42.149964? W??????? data??? 4:6028165120?? 32768? 981.661 258249737????????? 907? 0A2405A0:5C6BD9B6 cli? 172.16.254.163 Prefetch? WritebehindWorkerThread 12:22:42.130402? W??????? data??? 3:6028165120?? 32768 1021.175 258249737????????? 906? 0A2405A0:5C6BD9B2 cli? 172.16.254.162 Prefetch? WritebehindWorkerThread 12:22:42.838850? W??????? data??? 2:1867481088?? 32768? 343.912 258249737????????? 925? 0A2405A0:5C6BD9B4 cli? 172.16.254.161 Prefetch? WritebehindWorkerThread 12:22:42.841800? W??????? data??? 
3:1867481088?? 32768? 397.089 258249737????????? 926? 0A2405A0:5C6BD9B2 cli? 172.16.254.162 Prefetch? WritebehindWorkerThread 12:22:42.652912? W??????? data??? 3:1334837248?? 32768? 637.628 258249737????????? 922? 0A2405A0:5C6BD9B2 cli? 172.16.254.162 Prefetch? WritebehindWorkerThread 12:22:42.883946? W??????? data??? 1:1974009856?? 32768? 442.953 258249737????????? 928? 0A2405A0:5C6BD9B0 cli? 172.16.254.160 Prefetch? WritebehindWorkerThread 12:22:42.903782? W??????? data??? 3:1974009856?? 32768? 424.285 258249737??????? ??930? 0A2405A0:5C6BD9B2 cli? 172.16.254.162 Prefetch? WritebehindWorkerThread 12:22:42.329905? W??????? data??? 4:269549568??? 32768 1061.313 258249737????????? 915? 0A2405A0:5C6BD9B6 cli? 172.16.254.163 Prefetch? WritebehindWorkerThread 12:22:42.392467? W??????? data??? 1:376078336??? 32768? 998.770 258249737????????? 916? 0A2405A0:5C6BD9B0 cli? 172.16.254.160 Prefetch? WritebehindWorkerThread in this example I only care about one file with inode number 258249737 (which is stored in tag1) : Now simply on the server run : grep '258249737' iohist? 19:22:42.533259? W??????? data??? 1:5602050048?? 32768? 283.016 258249737????????? 900? 0A2405A0:5C6BD9B0 srv? 172.16.245.117 NSDWorker NSDThread 19:22:42.604062? W??????? data??? 1:1121779712?? 32768? 308.015 258249737????????? 920? 0A2405A0:5C6BD9B0 srv? 172.16.245.117 NSDWorker NSDThread 19:22:42.751549? W??????? data??? 1:6667337728?? 32768? 316.536 258249737??? ??????912? 0A2405A0:5C6BD9B0 srv? 172.16.245.117 NSDWorker NSDThread 19:22:42.722716? W??????? data??? 1:6241222656?? 32768? 357.409 258249737????????? 908? 0A2405A0:5C6BD9B0 srv? 172.16.245.117 NSDWorker NSDThread 19:22:43.030353? W??????? data??? 1:1974009856?? 32768? 304.887 258249737????????? 928? 0A2405A0:5C6BD9B0 srv? 172.16.245.117 NSDWorker NSDThread 19:22:43.103745? W??????? data??? 1:376078336??? 32768? 295.835 258249737????????? 916? 0A2405A0:5C6BD9B0 srv? 172.16.245.117 NSDWorker NSDThread So you can now see all the blocks of that file (tag 2) that went to this particular nsd server and how much time they took to issue against the media . so for each tag1:tag2 pair on the client you find the corresponding on the server. If you subtract time of server from time of client for each line you get network/client delays . Sven From: on behalf of Steve Crusan Reply-To: gpfsug main discussion list Date: Tuesday, February 19, 2019 at 12:29 AM To: gpfsug main discussion list Subject: Re: [gpfsug-discuss] Clarification of mmdiag --iohist output Context is key here. Where you run mmdiag?iohist matters, clientside or nsd server side. >From what I have seen and possibly understand, from the client, the time field indicates when an I/O was fully serviced (arrived into VFS layer, to be sent to the application), including RTT from the servers/disk. If you run the same command server side, my understanding is that the time field indicates how long it took for the server to write/read data to or from disk. For example, a few years ago I had to fix a system which was described as unacceptably slow after an upgrade from gpfs 3.3 to 3.4 or 3.5 (don?t fully remember). Iohist client side was showing many IOs waiting for 10 all the way up and to 50 SECONDS, not ms. Server side, I/O was being serviced via iohist within less than 5 ms. Also verified with iostat, basically doing a paltry 25MB/s per NSD server. 
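Sven's tag-correlation trick above lends itself to a small script. The sketch below is only a sketch: it assumes the two dumps were saved to hypothetical files named iohist.client and iohist.server, and that the whitespace-separated column layout matches the dump shown above (service time in ms in column 6, tag1/tag2 in columns 7 and 8); field positions can vary between releases, so check your own output first.

#!/bin/bash
# Sketch: correlate client- and server-side "mmfsadm dump iohist" output by the
# tag1:tag2 pair and print client minus server time per I/O -- a rough measure
# of network RTT plus client-side queueing.
CLIENT_DUMP=iohist.client     # hypothetical file names
SERVER_DUMP=iohist.server

awk '
  $1 ~ /^[0-9]+:[0-9]+:[0-9]+\./ {          # only data lines starting with a timestamp
    key = $7 ":" $8                         # tag1:tag2 identifies a block of a file
    if (FNR == NR) { srv[key] = $6; next }  # first file: server dump, column 6 = ms
    if (key in srv)
      printf "%-24s client %9.3f ms  server %9.3f ms  delta %9.3f ms\n",
             key, $6, srv[key], $6 - srv[key]
  }
' "$SERVER_DUMP" "$CLIENT_DUMP"

Large deltas with small server-side times point at the network or the client; large server-side times point at the disk path, which is what the queue data from saferdump nsd helps confirm.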
What happened is that the small vs large queueing system changed in that version of GPFS, so there were hundreds of large IOS queued (found via Mmfsadm dump nsd server side) due to the limited number of large queues and threads server side. A quick mmchconfig fixed the problem, but if I only looked at the servers, it would?ve appeared things were fine, because the IO backend was sitting around twirling its thumbs. I don?t have access to the code, but all of behavior I have seen leads me to believe client side iohist includes network RTT. -Steve From: gpfsug-discuss-bounces at spectrumscale.org on behalf of Aaron Knister Sent: Sunday, February 17, 2019 8:26:23 AM To: gpfsug main discussion list Subject: Re: [gpfsug-discuss] Clarification of mmdiag --iohist output Hi Kevin, It's funny you bring this up because I was looking at this yesterday. My belief is that it's the time the from when the I/O request was queued internally by the client to when the I/O response was received from the NSD server which means it absolutely includes the network RTT. It would be great to get formal confirmation of this from someone who knows the code. Here's some trace data showing a single 4K read from an NSD client. I've stripped out a bunch of uninteresting stuff. It's my belief that the TRACE_IO indicates the point at which the "i/o timer" reported on by mmdiag --iohist begins ticking. The testing data seems to support this. If I'm correct, the testing data shows that the RDMA I/O to the NSD server occurs within the TRACE_IO timing window. The other thing that makes me believe this, is in my testing the mmdiag --iohist on the client shows an average latency of ~230us for a 4K read whereas mmdiag --iohist on the NSD server appears to show an average latency of ~170us when servicing those 4K reads from the back-end disk (a DDN SFA14KX). 
0.000218276 37005 TRACE_DISK: doReplicatedRead: da 34:490710888 0.000218424 37005 TRACE_IO: QIO: read data tag 743942 108137 ioVecSize 1 1st buf 0x122F000 nsdName S01_DMD_NSD_034 da 34:490710888 nSectors 8 align 0 by iocMBHandler (DioHandlerThread) 0.000218566 37005 TRACE_DLEASE: checkLeaseForIO: rc 0 0.000218628 37005 TRACE_IO: SIO: ioVecSize 1 1st buf 0x122F000 nsdId 0A011103:5C59DBAC da 34:490710888 nSectors 8 doioP 0x1806994D300 isDaemonAddr 0 0.000218672 37005 TRACE_FS: verify4KIO exit: code 4 err 0 0.000219106 37005 TRACE_NSD: nsdDoIO enter: read ioVecSize 1 1st bufAddr 0x122F000 nsdId 0A011103:5C59DBAC da 34:490710888 nBytes 4096 isDaemonAddr 0 0.000219408 37005 TRACE_GCRYPTO: EncBufPool::getTmpBuf(): bsize=4096 code=205 err=0 bufP=0x180656C1058 outBufP=0x180656C1058 index=0 0.000220105 37005 TRACE_TS: sendMessage msg_id 22993695: dest 10.3.17.3 sto03 0.000220436 37005 TRACE_RDMA: verbs::verbsSend_i: enter: index 11 cookie 12 node msg_id 22993695 len 92 0.000221111 37005 TRACE_RDMA: verbsConn::postSend: currRpcHeadP 0x7F6D7FA106E8 sendType SEND_BUFFER_RPC index 11 cookie 12 threadId 37005 bufferId.index 11 0.000221662 37005 TRACE_MUTEX: Waiting on fast condvar for signal 0x7F6D7FA106F8 RdmaSend_NSD_SVC 0.000426716 16691 TRACE_RDMA: handleRecvComp: success 1 of 1 nWrSuccess 1 index 11 cookie 12 wr_id 0xB0E6E00000000 bufferP 0x7F6D7EEE1700 byte_len 4144 0.000432140 37005 TRACE_NSD: nsdDoIO_ReadAndCheck: read complete, len 0 status 6 err 0 bufP 0x180656C1058 dioIsOverRdma 1 ioDataP 0x200000BC000 ckSumType NsdCksum_None 0.000432163 37005 TRACE_NSD: nsdDoIO_ReadAndCheck: exit err 0 0.000433707 37005 TRACE_GCRYPTO: EncBufPool::releaseTmpBuf(): exit bsize=8192 err=0 inBufP=0x180656C1058 bufP=0x180656C1058 index=0 0.000433777 37005 TRACE_NSD: nsdDoIO exit: err 0 0 0.000433844 37005 TRACE_IO: FIO: read data tag 743942 108137 ioVecSize 1 1st buf 0x122F000 nsdId 0A011103:5C59DBAC da 34:490710888 nSectors 8 err 0 0.000434236 37005 TRACE_DISK: postIO: qosid A00D91E read data disk FFFFFFFF ioVecSize 1 1st buf 0x122F000 err 0 duration 0.000215000 by iocMBHandler (DioHandlerThread) I'd suggest looking at "mmdiag --iohist" on the NSD server itself and see if/how that differs from the client. The other thing you could do is see if your NSD server queues are backing up (e.g. "mmfsadm saferdump nsd" and look for "requests pending" on queues where the "active" field is > 0). That doesn't necessarily mean you need to tune your queues but I'd suggest that if the disk I/O on your NSD server looks healthy (e.g. low latency, not overly-taxed) that you could benefit from queue tuning. -Aaron On Sat, Feb 16, 2019 at 9:47 AM Buterbaugh, Kevin L wrote: Hi All, Been reading man pages, docs, and Googling, and haven?t found a definitive answer to this question, so I knew exactly where to turn? ;-) I?m dealing with some slow I/O?s to certain storage arrays in our environments ? like really, really slow I/O?s ? here?s just one example from one of my NSD servers of a 10 second I/O: 08:49:34.943186 W data 30:41615622144 2048 10115.192 srv dm-92 So here?s my question ? when mmdiag ?iohist tells me that that I/O took slightly over 10 seconds, is that: 1. The time from when the NSD server received the I/O request from the client until it shipped the data back onto the wire towards the client? 2. The time from when the client issued the I/O request until it received the data back from the NSD server? 3. Something else? I?m thinking it?s #1, but want to confirm. 
Which one it is has very obvious implications for our troubleshooting steps. Thanks in advance? Kevin ? Kevin Buterbaugh - Senior System Administrator Vanderbilt University - Advanced Computing Center for Research and Education Kevin.Buterbaugh at vanderbilt.edu - (615)875-9633 _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss -------------- next part -------------- An HTML attachment was scrubbed... URL: From jkavitsky at 23andme.com Tue Feb 19 21:02:51 2019 From: jkavitsky at 23andme.com (Jim Kavitsky) Date: Tue, 19 Feb 2019 13:02:51 -0800 Subject: [gpfsug-discuss] Clarification of mmdiag --iohist output In-Reply-To: References: Message-ID: <3B5D4F25-0AD9-4C97-ADB4-CD999309F38E@23andme.com> Steve, Would the small vs large queuing setting be visible in mmlsconfig output somewhere? Which settings would control that? -jimk > On Feb 17, 2019, at 6:48 PM, Steve Crusan wrote: > > Context is key here. Where you run mmdiag?iohist matters, clientside or nsd server side. > > From what I have seen and possibly understand, from the client, the time field indicates when an I/O was fully serviced (arrived into VFS layer, to be sent to the application), including RTT from the servers/disk. If you run the same command server side, my understanding is that the time field indicates how long it took for the server to write/read data to or from disk. > > For example, a few years ago I had to fix a system which was described as unacceptably slow after an upgrade from gpfs 3.3 to 3.4 or 3.5 (don?t fully remember). > > Iohist client side was showing many IOs waiting for 10 all the way up and to 50 SECONDS, not ms. Server side, I/O was being serviced via iohist within less than 5 ms. Also verified with iostat, basically doing a paltry 25MB/s per NSD server. > > What happened is that the small vs large queueing system changed in that version of GPFS, so there were hundreds of large IOS queued (found via Mmfsadm dump nsd server side) due to the limited number of large queues and threads server side. A quick mmchconfig fixed the problem, but if I only looked at the servers, it would?ve appeared things were fine, because the IO backend was sitting around twirling its thumbs. > > I don?t have access to the code, but all of behavior I have seen leads me to believe client side iohist includes network RTT. > > -Steve > From: gpfsug-discuss-bounces at spectrumscale.org on behalf of Aaron Knister > Sent: Sunday, February 17, 2019 8:26:23 AM > To: gpfsug main discussion list > Subject: Re: [gpfsug-discuss] Clarification of mmdiag --iohist output > > Hi Kevin, > > It's funny you bring this up because I was looking at this yesterday. My belief is that it's the time the from when the I/O request was queued internally by the client to when the I/O response was received from the NSD server which means it absolutely includes the network RTT. It would be great to get formal confirmation of this from someone who knows the code. > > Here's some trace data showing a single 4K read from an NSD client. I've stripped out a bunch of uninteresting stuff. It's my belief that the TRACE_IO indicates the point at which the "i/o timer" reported on by mmdiag --iohist begins ticking. The testing data seems to support this. 
If I'm correct, the testing data shows that the RDMA I/O to the NSD server occurs within the TRACE_IO timing window. The other thing that makes me believe this, is in my testing the mmdiag --iohist on the client shows an average latency of ~230us for a 4K read whereas mmdiag --iohist on the NSD server appears to show an average latency of ~170us when servicing those 4K reads from the back-end disk (a DDN SFA14KX). > > 0.000218276 37005 TRACE_DISK: doReplicatedRead: da 34:490710888 > 0.000218424 37005 TRACE_IO: QIO: read data tag 743942 108137 ioVecSize 1 1st buf 0x122F000 nsdName S01_DMD_NSD_034 da 34:490710888 nSectors 8 align 0 by iocMBHandler (DioHandlerThread) > 0.000218566 37005 TRACE_DLEASE: checkLeaseForIO: rc 0 > > 0.000218628 37005 TRACE_IO: SIO: ioVecSize 1 1st buf 0x122F000 nsdId 0A011103:5C59DBAC da 34:490710888 nSectors 8 doioP 0x1806994D300 isDaemonAddr 0 > 0.000218672 37005 TRACE_FS: verify4KIO exit: code 4 err 0 > 0.000219106 37005 TRACE_NSD: nsdDoIO enter: read ioVecSize 1 1st bufAddr 0x122F000 nsdId 0A011103:5C59DBAC da 34:490710888 nBytes 4096 isDaemonAddr 0 > 0.000219408 37005 TRACE_GCRYPTO: EncBufPool::getTmpBuf(): bsize=4096 code=205 err=0 bufP=0x180656C1058 outBufP=0x180656C1058 index=0 > 0.000220105 37005 TRACE_TS: sendMessage msg_id 22993695: dest 10.3.17.3 sto03 > 0.000220436 37005 TRACE_RDMA: verbs::verbsSend_i: enter: index 11 cookie 12 node msg_id 22993695 len 92 > 0.000221111 37005 TRACE_RDMA: verbsConn::postSend: currRpcHeadP 0x7F6D7FA106E8 sendType SEND_BUFFER_RPC index 11 cookie 12 threadId 37005 bufferId.index 11 > 0.000221662 37005 TRACE_MUTEX: Waiting on fast condvar for signal 0x7F6D7FA106F8 RdmaSend_NSD_SVC > > 0.000426716 16691 TRACE_RDMA: handleRecvComp: success 1 of 1 nWrSuccess 1 index 11 cookie 12 wr_id 0xB0E6E00000000 bufferP 0x7F6D7EEE1700 byte_len 4144 > > 0.000432140 37005 TRACE_NSD: nsdDoIO_ReadAndCheck: read complete, len 0 status 6 err 0 bufP 0x180656C1058 dioIsOverRdma 1 ioDataP 0x200000BC000 ckSumType NsdCksum_None > 0.000432163 37005 TRACE_NSD: nsdDoIO_ReadAndCheck: exit err 0 > 0.000433707 37005 TRACE_GCRYPTO: EncBufPool::releaseTmpBuf(): exit bsize=8192 err=0 inBufP=0x180656C1058 bufP=0x180656C1058 index=0 > 0.000433777 37005 TRACE_NSD: nsdDoIO exit: err 0 0 > > 0.000433844 37005 TRACE_IO: FIO: read data tag 743942 108137 ioVecSize 1 1st buf 0x122F000 nsdId 0A011103:5C59DBAC da 34:490710888 nSectors 8 err 0 > 0.000434236 37005 TRACE_DISK: postIO: qosid A00D91E read data disk FFFFFFFF ioVecSize 1 1st buf 0x122F000 err 0 duration 0.000215000 by iocMBHandler (DioHandlerThread) > > I'd suggest looking at "mmdiag --iohist" on the NSD server itself and see if/how that differs from the client. The other thing you could do is see if your NSD server queues are backing up (e.g. "mmfsadm saferdump nsd" and look for "requests pending" on queues where the "active" field is > 0). That doesn't necessarily mean you need to tune your queues but I'd suggest that if the disk I/O on your NSD server looks healthy (e.g. low latency, not overly-taxed) that you could benefit from queue tuning. > > -Aaron > > On Sat, Feb 16, 2019 at 9:47 AM Buterbaugh, Kevin L > wrote: > Hi All, > > Been reading man pages, docs, and Googling, and haven?t found a definitive answer to this question, so I knew exactly where to turn? ;-) > > I?m dealing with some slow I/O?s to certain storage arrays in our environments ? like really, really slow I/O?s ? 
here?s just one example from one of my NSD servers of a 10 second I/O: > > 08:49:34.943186 W data 30:41615622144 2048 10115.192 srv dm-92 > > So here?s my question ? when mmdiag ?iohist tells me that that I/O took slightly over 10 seconds, is that: > > 1. The time from when the NSD server received the I/O request from the client until it shipped the data back onto the wire towards the client? > 2. The time from when the client issued the I/O request until it received the data back from the NSD server? > 3. Something else? > > I?m thinking it?s #1, but want to confirm. Which one it is has very obvious implications for our troubleshooting steps. Thanks in advance? > > Kevin > ? > Kevin Buterbaugh - Senior System Administrator > Vanderbilt University - Advanced Computing Center for Research and Education > Kevin.Buterbaugh at vanderbilt.edu - (615)875-9633 > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss -------------- next part -------------- An HTML attachment was scrubbed... URL: From Kevin.Buterbaugh at Vanderbilt.Edu Wed Feb 20 16:52:28 2019 From: Kevin.Buterbaugh at Vanderbilt.Edu (Buterbaugh, Kevin L) Date: Wed, 20 Feb 2019 16:52:28 +0000 Subject: [gpfsug-discuss] Clarification of mmdiag --iohist output In-Reply-To: <3B5D4F25-0AD9-4C97-ADB4-CD999309F38E@23andme.com> References: <3B5D4F25-0AD9-4C97-ADB4-CD999309F38E@23andme.com> Message-ID: Hi Jim, Please see: https://www.ibm.com/developerworks/community/wikis/home?lang=en#!/wiki/General%20Parallel%20File%20System%20(GPFS)/page/NSD%20Server%20Tuning Yes, those tuning parameters will show up in the mmlsconfig / mmdiag ?config output. HTH? Kevin ? Kevin Buterbaugh - Senior System Administrator Vanderbilt University - Advanced Computing Center for Research and Education Kevin.Buterbaugh at vanderbilt.edu - (615)875-9633 On Feb 19, 2019, at 3:02 PM, Jim Kavitsky > wrote: Steve, Would the small vs large queuing setting be visible in mmlsconfig output somewhere? Which settings would control that? -jimk On Feb 17, 2019, at 6:48 PM, Steve Crusan > wrote: Context is key here. Where you run mmdiag?iohist matters, clientside or nsd server side. From what I have seen and possibly understand, from the client, the time field indicates when an I/O was fully serviced (arrived into VFS layer, to be sent to the application), including RTT from the servers/disk. If you run the same command server side, my understanding is that the time field indicates how long it took for the server to write/read data to or from disk. For example, a few years ago I had to fix a system which was described as unacceptably slow after an upgrade from gpfs 3.3 to 3.4 or 3.5 (don?t fully remember). Iohist client side was showing many IOs waiting for 10 all the way up and to 50 SECONDS, not ms. Server side, I/O was being serviced via iohist within less than 5 ms. Also verified with iostat, basically doing a paltry 25MB/s per NSD server. What happened is that the small vs large queueing system changed in that version of GPFS, so there were hundreds of large IOS queued (found via Mmfsadm dump nsd server side) due to the limited number of large queues and threads server side. 
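The queue and thread parameters mentioned in this thread (the ones the developerWorks article covers) can be checked on any node with something like the following. This is only a sketch; the grep pattern simply lists the names that have come up here:

# Sketch: show the NSD queue tuning values in effect on this node.
# "mmdiag --config" reports the live daemon configuration; mmlsconfig shows
# what is stored in the cluster configuration (including per-node overrides).
/usr/lpp/mmfs/bin/mmdiag --config | \
  grep -Ei 'workerThreads|nsdThreadsPerQueue|nsdSmallThreadRatio|nsdMultiQueue'

/usr/lpp/mmfs/bin/mmlsconfig | \
  grep -Ei 'workerThreads|nsdThreadsPerQueue|nsdSmallThreadRatio|nsdMultiQueue'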
A quick mmchconfig fixed the problem, but if I only looked at the servers, it would?ve appeared things were fine, because the IO backend was sitting around twirling its thumbs. I don?t have access to the code, but all of behavior I have seen leads me to believe client side iohist includes network RTT. -Steve ________________________________ From: gpfsug-discuss-bounces at spectrumscale.org > on behalf of Aaron Knister > Sent: Sunday, February 17, 2019 8:26:23 AM To: gpfsug main discussion list Subject: Re: [gpfsug-discuss] Clarification of mmdiag --iohist output Hi Kevin, It's funny you bring this up because I was looking at this yesterday. My belief is that it's the time the from when the I/O request was queued internally by the client to when the I/O response was received from the NSD server which means it absolutely includes the network RTT. It would be great to get formal confirmation of this from someone who knows the code. Here's some trace data showing a single 4K read from an NSD client. I've stripped out a bunch of uninteresting stuff. It's my belief that the TRACE_IO indicates the point at which the "i/o timer" reported on by mmdiag --iohist begins ticking. The testing data seems to support this. If I'm correct, the testing data shows that the RDMA I/O to the NSD server occurs within the TRACE_IO timing window. The other thing that makes me believe this, is in my testing the mmdiag --iohist on the client shows an average latency of ~230us for a 4K read whereas mmdiag --iohist on the NSD server appears to show an average latency of ~170us when servicing those 4K reads from the back-end disk (a DDN SFA14KX). 0.000218276 37005 TRACE_DISK: doReplicatedRead: da 34:490710888 0.000218424 37005 TRACE_IO: QIO: read data tag 743942 108137 ioVecSize 1 1st buf 0x122F000 nsdName S01_DMD_NSD_034 da 34:490710888 nSectors 8 align 0 by iocMBHandler (DioHandlerThread) 0.000218566 37005 TRACE_DLEASE: checkLeaseForIO: rc 0 0.000218628 37005 TRACE_IO: SIO: ioVecSize 1 1st buf 0x122F000 nsdId 0A011103:5C59DBAC da 34:490710888 nSectors 8 doioP 0x1806994D300 isDaemonAddr 0 0.000218672 37005 TRACE_FS: verify4KIO exit: code 4 err 0 0.000219106 37005 TRACE_NSD: nsdDoIO enter: read ioVecSize 1 1st bufAddr 0x122F000 nsdId 0A011103:5C59DBAC da 34:490710888 nBytes 4096 isDaemonAddr 0 0.000219408 37005 TRACE_GCRYPTO: EncBufPool::getTmpBuf(): bsize=4096 code=205 err=0 bufP=0x180656C1058 outBufP=0x180656C1058 index=0 0.000220105 37005 TRACE_TS: sendMessage msg_id 22993695: dest 10.3.17.3 sto03 0.000220436 37005 TRACE_RDMA: verbs::verbsSend_i: enter: index 11 cookie 12 node msg_id 22993695 len 92 0.000221111 37005 TRACE_RDMA: verbsConn::postSend: currRpcHeadP 0x7F6D7FA106E8 sendType SEND_BUFFER_RPC index 11 cookie 12 threadId 37005 bufferId.index 11 0.000221662 37005 TRACE_MUTEX: Waiting on fast condvar for signal 0x7F6D7FA106F8 RdmaSend_NSD_SVC 0.000426716 16691 TRACE_RDMA: handleRecvComp: success 1 of 1 nWrSuccess 1 index 11 cookie 12 wr_id 0xB0E6E00000000 bufferP 0x7F6D7EEE1700 byte_len 4144 0.000432140 37005 TRACE_NSD: nsdDoIO_ReadAndCheck: read complete, len 0 status 6 err 0 bufP 0x180656C1058 dioIsOverRdma 1 ioDataP 0x200000BC000 ckSumType NsdCksum_None 0.000432163 37005 TRACE_NSD: nsdDoIO_ReadAndCheck: exit err 0 0.000433707 37005 TRACE_GCRYPTO: EncBufPool::releaseTmpBuf(): exit bsize=8192 err=0 inBufP=0x180656C1058 bufP=0x180656C1058 index=0 0.000433777 37005 TRACE_NSD: nsdDoIO exit: err 0 0 0.000433844 37005 TRACE_IO: FIO: read data tag 743942 108137 ioVecSize 1 1st buf 0x122F000 nsdId 0A011103:5C59DBAC 
da 34:490710888 nSectors 8 err 0 0.000434236 37005 TRACE_DISK: postIO: qosid A00D91E read data disk FFFFFFFF ioVecSize 1 1st buf 0x122F000 err 0 duration 0.000215000 by iocMBHandler (DioHandlerThread) I'd suggest looking at "mmdiag --iohist" on the NSD server itself and see if/how that differs from the client. The other thing you could do is see if your NSD server queues are backing up (e.g. "mmfsadm saferdump nsd" and look for "requests pending" on queues where the "active" field is > 0). That doesn't necessarily mean you need to tune your queues but I'd suggest that if the disk I/O on your NSD server looks healthy (e.g. low latency, not overly-taxed) that you could benefit from queue tuning. -Aaron On Sat, Feb 16, 2019 at 9:47 AM Buterbaugh, Kevin L > wrote: Hi All, Been reading man pages, docs, and Googling, and haven?t found a definitive answer to this question, so I knew exactly where to turn? ;-) I?m dealing with some slow I/O?s to certain storage arrays in our environments ? like really, really slow I/O?s ? here?s just one example from one of my NSD servers of a 10 second I/O: 08:49:34.943186 W data 30:41615622144 2048 10115.192 srv dm-92 So here?s my question ? when mmdiag ?iohist tells me that that I/O took slightly over 10 seconds, is that: 1. The time from when the NSD server received the I/O request from the client until it shipped the data back onto the wire towards the client? 2. The time from when the client issued the I/O request until it received the data back from the NSD server? 3. Something else? I?m thinking it?s #1, but want to confirm. Which one it is has very obvious implications for our troubleshooting steps. Thanks in advance? Kevin ? Kevin Buterbaugh - Senior System Administrator Vanderbilt University - Advanced Computing Center for Research and Education Kevin.Buterbaugh at vanderbilt.edu - (615)875-9633 _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org https://nam04.safelinks.protection.outlook.com/?url=http%3A%2F%2Fgpfsug.org%2Fmailman%2Flistinfo%2Fgpfsug-discuss&data=02%7C01%7CKevin.Buterbaugh%40vanderbilt.edu%7Cd18386b226474395328208d696ada1a9%7Cba5a7f39e3be4ab3b45067fa80faecad%7C0%7C0%7C636862069849536067&sdata=O5p52oLmSxQMWo2wwkVx8Z%2FapYpsAU9lAJ2cKvB095c%3D&reserved=0 -------------- next part -------------- An HTML attachment was scrubbed... URL: From Kevin.Buterbaugh at Vanderbilt.Edu Tue Feb 19 20:26:31 2019 From: Kevin.Buterbaugh at Vanderbilt.Edu (Buterbaugh, Kevin L) Date: Tue, 19 Feb 2019 20:26:31 +0000 Subject: [gpfsug-discuss] Clarification of mmdiag --iohist output In-Reply-To: References: Message-ID: <9338621C-3F85-48DF-AE42-64998680E14C@vanderbilt.edu> Hi All, My thanks to Aaron, Sven, Steve, and whoever responded for the GPFS team. You confirmed what I suspected ? my example 10 second I/O was _from an NSD server_ ? and since we?re in a 8 Gb FC SAN environment, it therefore means - correct me if I?m wrong about this someone - that I?ve got a problem somewhere in one (or more) of the following 3 components: 1) the NSD servers 2) the SAN fabric 3) the storage arrays I?ve been looking at all of the above and none of them are showing any obvious problems. 
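Since nothing is jumping out at any one layer, it can help to check whether the slow I/Os cluster on particular devices. Below is a rough sketch that scans the server-side iohist for I/Os above a threshold and tallies them per device. The field positions ($4 disk:sector, $6 time in ms, last field device) match the example line quoted earlier in this thread but can differ between releases, and iohist only keeps a short rolling history, so treat this as a sampling tool:

# Sketch: on an NSD server, count I/Os slower than a threshold per device.
THRESHOLD_MS=500      # hypothetical cut-off, adjust to taste
/usr/lpp/mmfs/bin/mmdiag --iohist | awk -v t="$THRESHOLD_MS" '
  $1 ~ /^[0-9]+:[0-9]+:[0-9]+\./ && ($6 + 0) > t {
    split($4, d, ":")                        # NSD/disk number before the colon
    key = $NF " (disk " d[1] ")"
    count[key]++
    if (($6 + 0) > max[key]) max[key] = $6 + 0
  }
  END {
    for (k in count)
      printf "%6d slow I/Os, worst %10.1f ms  on %s\n", count[k], max[k], k
  }
'

If the slow ones keep landing on the same dm- devices or the same array, that points at the disk path rather than the NSD servers themselves.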
I?ve actually got a techie from the storage array vendor stopping by on Thursday, so I?ll see if he can spot anything there. Our FC switches are QLogic?s, so I?m kinda screwed there in terms of getting any help. But I don?t see any errors in the switch logs and ?show perf? on the switches is showing I/O rates of 50-100 MB/sec on the in use ports, so I don?t _think_ that?s the issue. And this is the GPFS mailing list, after all ? so let?s talk about the NSD servers. Neither memory (64 GB) nor CPU (2 x quad-core Intel Xeon E5620?s) appear to be an issue. But I have been looking at the output of ?mmfsadm saferdump nsd? based on what Aaron and then Steve said. Here?s some fairly typical output from one of the SMALL queues (I?ve checked several of my 8 NSD servers and they?re all showing similar output): Queue NSD type NsdQueueTraditional [244]: SMALL, threads started 12, active 3, highest 12, deferred 0, chgSize 0, draining 0, is_chg 0 requests pending 0, highest pending 73, total processed 4859732 mutex 0x7F3E449B8F10, reqCond 0x7F3E449B8F58, thCond 0x7F3E449B8F98, queue 0x7F3E449B8EF0, nFreeNsdRequests 29 And for a LARGE queue: Queue NSD type NsdQueueTraditional [8]: LARGE, threads started 12, active 1, highest 12, deferred 0, chgSize 0, draining 0, is_chg 0 requests pending 0, highest pending 71, total processed 2332966 mutex 0x7F3E441F3890, reqCond 0x7F3E441F38D8, thCond 0x7F3E441F3918, queue 0x7F3E441F3870, nFreeNsdRequests 31 So my large queues seem to be slightly less utilized than my small queues overall ? i.e. I see more inactive large queues and they generally have a smaller ?highest pending? value. Question: are those non-zero ?highest pending? values something to be concerned about? I have the following thread-related parameters set: [common] maxReceiverThreads 12 nsdMaxWorkerThreads 640 nsdThreadsPerQueue 4 nsdSmallThreadRatio 3 workerThreads 128 [serverLicense] nsdMaxWorkerThreads 1024 nsdThreadsPerQueue 12 nsdSmallThreadRatio 1 pitWorkerThreadsPerNode 3 workerThreads 1024 Also, at the top of the ?mmfsadm saferdump nsd? output I see: Total server worker threads: running 1008, desired 147, forNSD 147, forGNR 0, nsdBigBufferSize 16777216 nsdMultiQueue: 256, nsdMultiQueueType: 1, nsdMinWorkerThreads: 16, nsdMaxWorkerThreads: 1024 Question: is the fact that 1008 is pretty close to 1024 a concern? Anything jump out at anybody? I don?t mind sharing full output, but it is rather lengthy. Is this worthy of a PMR? Thanks! -- Kevin Buterbaugh - Senior System Administrator Vanderbilt University - Advanced Computing Center for Research and Education Kevin.Buterbaugh at vanderbilt.edu - (615)875-9633 On Feb 17, 2019, at 1:01 PM, IBM Spectrum Scale > wrote: Hi Kevin, The I/O hist shown by the command mmdiag --iohist actually depends on the node on which you are running this command from. If you are running this on a NSD server node then it will show the time taken to complete/serve the read or write I/O operation sent from the client node. And if you are running this on a client (or non NSD server) node then it will show the complete time taken by the read or write I/O operation requested by the client node to complete. So in a nut shell for the NSD server case it is just the latency of the I/O done on disk by the server whereas for the NSD client case it also the latency of send and receive of I/O request to the NSD server along with the latency of I/O done on disk by the NSD server. I hope this answers your query. 
Regards, The Spectrum Scale (GPFS) team ------------------------------------------------------------------------------------------------------------------ If you feel that your question can benefit other users of Spectrum Scale (GPFS), then please post it to the public IBM developerWroks Forum at https://www.ibm.com/developerworks/community/forums/html/forum?id=11111111-0000-0000-0000-000000000479. If your query concerns a potential software error in Spectrum Scale (GPFS) and you have an IBM software maintenance contract please contact 1-800-237-5511 in the United States or your local IBM Service Center in other countries. The forum is informally monitored as time permits and should not be used for priority messages to the Spectrum Scale (GPFS) team. From: "Buterbaugh, Kevin L" > To: gpfsug main discussion list > Date: 02/16/2019 08:18 PM Subject: [gpfsug-discuss] Clarification of mmdiag --iohist output Sent by: gpfsug-discuss-bounces at spectrumscale.org ________________________________ Hi All, Been reading man pages, docs, and Googling, and haven?t found a definitive answer to this question, so I knew exactly where to turn? ;-) I?m dealing with some slow I/O?s to certain storage arrays in our environments ? like really, really slow I/O?s ? here?s just one example from one of my NSD servers of a 10 second I/O: 08:49:34.943186 W data 30:41615622144 2048 10115.192 srv dm-92 So here?s my question ? when mmdiag ?iohist tells me that that I/O took slightly over 10 seconds, is that: 1. The time from when the NSD server received the I/O request from the client until it shipped the data back onto the wire towards the client? 2. The time from when the client issued the I/O request until it received the data back from the NSD server? 3. Something else? I?m thinking it?s #1, but want to confirm. Which one it is has very obvious implications for our troubleshooting steps. Thanks in advance? Kevin ? Kevin Buterbaugh - Senior System Administrator Vanderbilt University - Advanced Computing Center for Research and Education Kevin.Buterbaugh at vanderbilt.edu- (615)875-9633 _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org https://nam04.safelinks.protection.outlook.com/?url=http%3A%2F%2Fgpfsug.org%2Fmailman%2Flistinfo%2Fgpfsug-discuss&data=02%7C01%7CKevin.Buterbaugh%40vanderbilt.edu%7C2bfb2e8e30e64fa06c0f08d6959b2d38%7Cba5a7f39e3be4ab3b45067fa80faecad%7C0%7C0%7C636860891056297114&sdata=5pL67mhVyScJovkRHRqZog9bM5BZG8F2q972czIYAbA%3D&reserved=0 -------------- next part -------------- An HTML attachment was scrubbed... URL: From Kevin.Buterbaugh at Vanderbilt.Edu Tue Feb 19 20:26:31 2019 From: Kevin.Buterbaugh at Vanderbilt.Edu (Buterbaugh, Kevin L) Date: Tue, 19 Feb 2019 20:26:31 +0000 Subject: [gpfsug-discuss] Clarification of mmdiag --iohist output In-Reply-To: References: Message-ID: <9338621C-3F85-48DF-AE42-64998680E14C@vanderbilt.edu> Hi All, My thanks to Aaron, Sven, Steve, and whoever responded for the GPFS team. You confirmed what I suspected ? my example 10 second I/O was _from an NSD server_ ? 
and since we?re in a 8 Gb FC SAN environment, it therefore means - correct me if I?m wrong about this someone - that I?ve got a problem somewhere in one (or more) of the following 3 components: 1) the NSD servers 2) the SAN fabric 3) the storage arrays I?ve been looking at all of the above and none of them are showing any obvious problems. I?ve actually got a techie from the storage array vendor stopping by on Thursday, so I?ll see if he can spot anything there. Our FC switches are QLogic?s, so I?m kinda screwed there in terms of getting any help. But I don?t see any errors in the switch logs and ?show perf? on the switches is showing I/O rates of 50-100 MB/sec on the in use ports, so I don?t _think_ that?s the issue. And this is the GPFS mailing list, after all ? so let?s talk about the NSD servers. Neither memory (64 GB) nor CPU (2 x quad-core Intel Xeon E5620?s) appear to be an issue. But I have been looking at the output of ?mmfsadm saferdump nsd? based on what Aaron and then Steve said. Here?s some fairly typical output from one of the SMALL queues (I?ve checked several of my 8 NSD servers and they?re all showing similar output): Queue NSD type NsdQueueTraditional [244]: SMALL, threads started 12, active 3, highest 12, deferred 0, chgSize 0, draining 0, is_chg 0 requests pending 0, highest pending 73, total processed 4859732 mutex 0x7F3E449B8F10, reqCond 0x7F3E449B8F58, thCond 0x7F3E449B8F98, queue 0x7F3E449B8EF0, nFreeNsdRequests 29 And for a LARGE queue: Queue NSD type NsdQueueTraditional [8]: LARGE, threads started 12, active 1, highest 12, deferred 0, chgSize 0, draining 0, is_chg 0 requests pending 0, highest pending 71, total processed 2332966 mutex 0x7F3E441F3890, reqCond 0x7F3E441F38D8, thCond 0x7F3E441F3918, queue 0x7F3E441F3870, nFreeNsdRequests 31 So my large queues seem to be slightly less utilized than my small queues overall ? i.e. I see more inactive large queues and they generally have a smaller ?highest pending? value. Question: are those non-zero ?highest pending? values something to be concerned about? I have the following thread-related parameters set: [common] maxReceiverThreads 12 nsdMaxWorkerThreads 640 nsdThreadsPerQueue 4 nsdSmallThreadRatio 3 workerThreads 128 [serverLicense] nsdMaxWorkerThreads 1024 nsdThreadsPerQueue 12 nsdSmallThreadRatio 1 pitWorkerThreadsPerNode 3 workerThreads 1024 Also, at the top of the ?mmfsadm saferdump nsd? output I see: Total server worker threads: running 1008, desired 147, forNSD 147, forGNR 0, nsdBigBufferSize 16777216 nsdMultiQueue: 256, nsdMultiQueueType: 1, nsdMinWorkerThreads: 16, nsdMaxWorkerThreads: 1024 Question: is the fact that 1008 is pretty close to 1024 a concern? Anything jump out at anybody? I don?t mind sharing full output, but it is rather lengthy. Is this worthy of a PMR? Thanks! -- Kevin Buterbaugh - Senior System Administrator Vanderbilt University - Advanced Computing Center for Research and Education Kevin.Buterbaugh at vanderbilt.edu - (615)875-9633 On Feb 17, 2019, at 1:01 PM, IBM Spectrum Scale > wrote: Hi Kevin, The I/O hist shown by the command mmdiag --iohist actually depends on the node on which you are running this command from. If you are running this on a NSD server node then it will show the time taken to complete/serve the read or write I/O operation sent from the client node. And if you are running this on a client (or non NSD server) node then it will show the complete time taken by the read or write I/O operation requested by the client node to complete. 
_______________________________________________
gpfsug-discuss mailing list
gpfsug-discuss at spectrumscale.org
http://gpfsug.org/mailman/listinfo/gpfsug-discuss

-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From olaf.weiser at de.ibm.com Thu Feb 21 12:10:41 2019
From: olaf.weiser at de.ibm.com (Olaf Weiser)
Date: Thu, 21 Feb 2019 14:10:41 +0200
Subject: [gpfsug-discuss] Clarification of mmdiag --iohist output
In-Reply-To: <9338621C-3F85-48DF-AE42-64998680E14C@vanderbilt.edu>
References: <9338621C-3F85-48DF-AE42-64998680E14C@vanderbilt.edu>
Message-ID:

An HTML attachment was scrubbed...
URL:

From stockf at us.ibm.com Thu Feb 21 12:23:32 2019
From: stockf at us.ibm.com (Frederick Stock)
Date: Thu, 21 Feb 2019 12:23:32 +0000
Subject: [gpfsug-discuss] Clarification of mmdiag --iohist output
In-Reply-To: <9338621C-3F85-48DF-AE42-64998680E14C@vanderbilt.edu>
References: <9338621C-3F85-48DF-AE42-64998680E14C@vanderbilt.edu>
Message-ID:

An HTML attachment was scrubbed...
URL:

From jjdoherty at yahoo.com Thu Feb 21 12:54:20 2019
From: jjdoherty at yahoo.com (Jim Doherty)
Date: Thu, 21 Feb 2019 12:54:20 +0000 (UTC)
Subject: [gpfsug-discuss] Clarification of mmdiag --iohist output
In-Reply-To:
References: <9338621C-3F85-48DF-AE42-64998680E14C@vanderbilt.edu> <9338621C-3F85-48DF-AE42-64998680E14C@vanderbilt.edu>
Message-ID: <1280046520.2777074.1550753660364@mail.yahoo.com>

Are all of the slow IOs from the same NSD volumes? You could run an mmtrace, take an internaldump, and open a ticket to the Spectrum Scale queue. You may want to limit the run to just your NSD servers and not all nodes like I use in my example. Or, one of the tools we use to review a trace is available in /usr/lpp/mmfs/samples/debugtools/trsum.awk; you can run it, passing in the uncompressed trace file and redirecting standard out to a file. If you search for ' total ' in the trace you will find the different sections, or you can just grep ' total IO ' trsum.out | grep duration to get a quick look per LUN.

mmtracectl --set --trace=def --tracedev-write-mode=overwrite --tracedev-overwrite-buffer-size=500M -N all
mmtracectl --start -N all ; sleep 30 ; mmtracectl --stop -N all ; mmtracectl --off -N all
mmdsh -N all "/usr/lpp/mmfs/bin/mmfsadm dump all >/tmp/mmfs/service.dumpall.\$(hostname)"

Jim

On Thursday, February 21, 2019, 7:23:46 AM EST, Frederick Stock wrote:

Kevin, I'm assuming you have seen the article on IBM developerWorks about the GPFS NSD queues. It provides useful background for analyzing the dump nsd information. Here I'll list some thoughts on items that you can investigate/consider.

If your NSD servers are doing both large (greater than 64K) and small (64K or less) IOs then you want to have the nsdSmallThreadRatio set to 1, as it seems you do for the NSD servers. This provides an equal number of SMALL and LARGE NSD queues. You can also increase the total number of queues (currently 256), but I cannot determine if that is necessary from the data you provided. Only on rare occasions have I seen a need to increase the number of queues.

The fact that you have 71 highest pending on your LARGE queues and 73 highest pending on your SMALL queues would imply your IOs are queueing for a good while, either waiting for resources in GPFS or waiting for IOs to complete. Your maximum buffer size is 16M, which is defined to be the largest IO that can be requested by GPFS. This is the buffer size that GPFS will use for LARGE IOs. You indicated you had sufficient memory on the NSD servers, but what is the value for the pagepool on those servers, and what is the value of the nsdBufSpace parameter? If the NSD server is just that, then usually nsdBufSpace is set to 70. The IO buffers used by the NSD server come from the pagepool, so you need sufficient space there for the maximum number of LARGE IO buffers that would be used concurrently by GPFS, or threads will need to wait for those buffers to become available. Essentially you want to ensure you have sufficient memory for the maximum number of IOs all doing a large IO, and for that value to be less than 70% of the pagepool size.
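As a back-of-the-envelope check of that point, the worst-case LARGE buffer demand can be compared against the pagepool share that nsdBufSpace allows, e.g. (a sketch only - the pagepool size and thread count below are placeholders, and using the configured maximum thread count makes this an upper bound rather than an exact accounting; take the real values from mmlsconfig and the saferdump output):

# rough sketch: compare worst-case LARGE IO buffer demand with the pagepool
# share available for NSD buffers; replace the placeholder values with the
# real ones from mmlsconfig and "mmfsadm saferdump nsd"
pagepool_bytes=$((16 * 1024 * 1024 * 1024))   # placeholder: pagepool 16G
nsdbufspace_pct=70                            # nsdBufSpace (70 is typical for a pure NSD server)
large_threads=1024                            # placeholder: nsdMaxWorkerThreads on the server
bigbuf_bytes=16777216                         # nsdBigBufferSize from the dump above

demand_mib=$(( large_threads * bigbuf_bytes / 1024 / 1024 ))
budget_mib=$(( pagepool_bytes * nsdbufspace_pct / 100 / 1024 / 1024 ))
echo "worst-case LARGE buffer demand: ${demand_mib} MiB"
echo "pagepool budget for NSD buffers: ${budget_mib} MiB"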
You could look at the settings for the FC cards to ensure they are configured to do the largest IOs possible. I forget the actual values (have not done this for a while), but there are settings for the adapters that control the maximum IO size that will be sent. I think you want this to be as large as the adapter can handle, to reduce the number of messages needed to complete the large IOs done by GPFS.

Fred

__________________________________________________
Fred Stock | IBM Pittsburgh Lab | 720-430-8821
stockf at us.ibm.com
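For what it's worth, the request-size limits that the Linux block layer is currently applying to the NSD devices can be read straight out of sysfs, which is a quick sanity check alongside the HBA settings (a sketch - the dm-XX names are placeholders for your own NSD devices, and this shows the queue limits rather than the HBA setting itself):

# rough sketch: show current and hardware maximum request sizes (KiB) for
# a couple of NSD devices; the device names are examples only
for dev in dm-92 dm-93; do
    echo "$dev: max_sectors_kb=$(cat /sys/block/$dev/queue/max_sectors_kb), max_hw_sectors_kb=$(cat /sys/block/$dev/queue/max_hw_sectors_kb)"
done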
_______________________________________________
gpfsug-discuss mailing list
gpfsug-discuss at spectrumscale.org
http://gpfsug.org/mailman/listinfo/gpfsug-discuss

-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From Robert.Oesterlin at nuance.com Tue Feb 26 12:38:11 2019
From: Robert.Oesterlin at nuance.com (Oesterlin, Robert)
Date: Tue, 26 Feb 2019 12:38:11 +0000
Subject: [gpfsug-discuss] Save the date: US User Group meeting April 16-17th, NCAR Boulder CO
Message-ID:

It's coming up fast - mark your calendar if you plan on attending. We'll be publishing detailed agenda information and registration soon. If you'd like to present, please drop me a note. We have a limited number of slots available.

Bob Oesterlin
Sr Principal Storage Engineer, Nuance

-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From michael.holliday at crick.ac.uk Tue Feb 26 10:45:32 2019
From: michael.holliday at crick.ac.uk (Michael Holliday)
Date: Tue, 26 Feb 2019 10:45:32 +0000
Subject: [gpfsug-discuss] relion software using GPFS storage
Message-ID:

Hi All,

We've recently had an issue where a job on our client GPFS cluster caused our main storage to go extremely slowly. The job was running relion using MPI (https://www2.mrc-lmb.cam.ac.uk/relion/index.php?title=Main_Page).

It caused waiters across the cluster, and caused the load to spike on the NSDs one at a time. When the spike ended on one NSD, it immediately started on another. There were no obvious errors in the logs, and the issues cleared immediately after the job was cancelled.

Has anyone else seen any issues with relion using GPFS storage?

Michael

Michael Holliday RITTech MBCS
Senior HPC & Research Data Systems Engineer | eMedLab Operations Team
Scientific Computing STP | The Francis Crick Institute
1, Midland Road | London | NW1 1AT | United Kingdom
Tel: 0203 796 3167

The Francis Crick Institute Limited is a registered charity in England and Wales no. 1140062 and a company registered in England and Wales no. 06885462, with its registered office at 1 Midland Road London NW1 1AT

-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From robert at strubi.ox.ac.uk Wed Feb 27 12:49:38 2019
From: robert at strubi.ox.ac.uk (Robert Esnouf)
Date: Wed, 27 Feb 2019 12:49:38 +0000
Subject: [gpfsug-discuss] relion software using GPFS storage
In-Reply-To:
References:
Message-ID: <9aee9d18ed77ad61c1b44859703f2284@strubi.ox.ac.uk>

Dear Michael,

There are settings within relion for parallel file systems; you should check they are enabled if you have SS underneath.
Otherwise, check which version of relion it is and then try to understand the problem being analysed a little more. If the box size is very small and the internal symmetry low, then the user may read 100,000s of small "picked particle" files for each iteration, opening and closing the files each time. I believe that relion3 has some facility for extracting these small particles from the larger raw images, and that is more SS-friendly. Alternatively, the size of the set of picked particles is often only in the 50 GB range, so staging to one or more local machines is quite feasible...

Hope one of those suggestions helps.

Regards,
Robert

--
Dr Robert Esnouf
University Research Lecturer,
Director of Research Computing BDI,
Head of Research Computing Core WHG,
NDM Research Computing Strategy Officer

Main office: Room 10/028, Wellcome Centre for Human Genetics, Old Road Campus, Roosevelt Drive, Oxford OX3 7BN, UK

Emails: robert at strubi.ox.ac.uk / robert at well.ox.ac.uk / robert.esnouf at bdi.ox.ac.uk
Tel: (+44)-1865-287783 (WHG); (+44)-1865-743689 (BDI)
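If staging is the route taken, even a simple per-node copy of the particle set to local scratch before the job starts keeps the per-iteration opens off GPFS (a sketch only - the paths below are invented for illustration):

# rough sketch: stage the picked-particle set from GPFS to node-local scratch
# before launching relion; both paths are placeholders
SRC=/gpfs/project/relion/particles        # placeholder GPFS path
DST=/local/scratch/$USER/particles        # placeholder node-local scratch
mkdir -p "$DST"
rsync -a "$SRC/" "$DST/"
# ...then point the relion job at $DST instead of the GPFS path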
_______________________________________________
gpfsug-discuss mailing list
gpfsug-discuss at spectrumscale.org
http://gpfsug.org/mailman/listinfo/gpfsug-discuss

-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From Robert.Oesterlin at nuance.com Wed Feb 27 20:12:54 2019
From: Robert.Oesterlin at nuance.com (Oesterlin, Robert)
Date: Wed, 27 Feb 2019 20:12:54 +0000
Subject: [gpfsug-discuss] Registration now open! - US User Group Meeting, April 16-17th, NCAR Boulder
Message-ID: <671D229B-C7A1-459D-A42B-DB93502F59FA@nuance.com>

Registration is now open:
https://www.eventbrite.com/e/spectrum-scale-gpfs-user-group-us-spring-2019-meeting-tickets-57035376346

Please note that agenda details are not set yet, but these will be finalized in the next few weeks - when they are, I will post to the registration page and the mailing list.

- April 15th: Informal social gathering on Monday for those arriving early (location TBD)
- April 16th: Full day of talks from IBM and the user community, Social and Networking Event (details TBD)
- April 17th: Talks and breakout sessions (If you have any topics for the breakout sessions, let us know)

Looking forward to seeing everyone in Boulder!

Bob Oesterlin/Kristy Kallback-Rose

-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From stefan.dietrich at desy.de Thu Feb 28 07:56:56 2019
From: stefan.dietrich at desy.de (Dietrich, Stefan)
Date: Thu, 28 Feb 2019 08:56:56 +0100 (CET)
Subject: [gpfsug-discuss] CES Ganesha netgroup caching?
Message-ID: <2121724779.6221169.1551340616921.JavaMail.zimbra@desy.de>

Hi,

I am currently playing around with LDAP netgroups for NFS exports via CES. However, I could not figure out how long Ganesha caches the netgroup entries. There is definitely some caching, as adding a host to the netgroup does not immediately grant access to the share. A "getent netgroup " on the CES node returns the correct result, so this is not some other caching effect. Resetting the cache via "ganesha_mgr purge netgroup" works, but is probably not officially supported.

The CES nodes are running GPFS 5.0.2.3 and gpfs.nfs-ganesha-2.5.3-ibm030.01.el7. CES authentication is set to user-defined; the nodes just use SSSD with an rfc2307bis LDAP server.

Regards,
Stefan

--
------------------------------------------------------------------------
Stefan Dietrich
Deutsches Elektronen-Synchrotron (IT-Systems)
Ein Forschungszentrum der Helmholtz-Gemeinschaft
Notkestr. 85        phone:  +49-40-8998-4696
22607 Hamburg       e-mail: stefan.dietrich at desy.de
Germany
------------------------------------------------------------------------

From mnaineni at in.ibm.com Thu Feb 28 12:33:50 2019
From: mnaineni at in.ibm.com (Malahal R Naineni)
Date: Thu, 28 Feb 2019 12:33:50 +0000
Subject: [gpfsug-discuss] CES Ganesha netgroup caching?
In-Reply-To: <2121724779.6221169.1551340616921.JavaMail.zimbra@desy.de>
References: <2121724779.6221169.1551340616921.JavaMail.zimbra@desy.de>
Message-ID:

An HTML attachment was scrubbed...
URL:
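For anyone else who runs into Stefan's situation, a quick way to compare what each CES node resolves for the netgroup and then flush Ganesha's cached copy is something along these lines (a sketch - the netgroup name is a placeholder, the cesNodes node class is assumed to exist in the cluster, and as noted above the purge command may not be officially supported):

# rough sketch: check the netgroup as each CES node sees it, then purge
# Ganesha's netgroup cache; "nfsclients" is a placeholder netgroup name
mmdsh -N cesNodes 'getent netgroup nfsclients'
mmdsh -N cesNodes 'ganesha_mgr purge netgroup'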