From chair at spectrumscale.org Thu Mar 1 11:26:12 2018 From: chair at spectrumscale.org (Simon Thompson) Date: Thu, 01 Mar 2018 11:26:12 +0000 Subject: [gpfsug-discuss] UK April meeting Message-ID: <26357FF0-F04B-4A37-A8A5-062CB0160D19@spectrumscale.org> Hi All, We?ve just posted the draft agenda for the UK meeting in April at: http://www.spectrumscaleug.org/event/uk-2018-user-group-event/ So far, we?ve issued over 50% of the available places, so if you are planning to attend, please do register now! Please register at: https://www.eventbrite.com/e/spectrum-scale-gpfs-user-group-2018-registration-41489952565?aff=MailingList We?ve also confirmed our evening networking/social event between days 1 and 2 with thanks to our sponsors for supporting this. Please remember that we are currently limiting to two registrations per organisation. We?d like to thank our sponsors from DDN, E8, Ellexus, IBM, Lenovo, NEC and OCF for supporting the event. Simon -------------- next part -------------- An HTML attachment was scrubbed... URL: From john.hearns at asml.com Thu Mar 1 08:41:59 2018 From: john.hearns at asml.com (John Hearns) Date: Thu, 1 Mar 2018 08:41:59 +0000 Subject: [gpfsug-discuss] Problems with remote mount via routed IB In-Reply-To: References: <471B111F-5DAA-4912-829C-9AA75DCB76FA@kit.edu> Message-ID: In reply to Stuart, our setup is entirely Infiniband. We boot and install over IB, and rely heavily on IP over Infiniband. As for users being 'confused' due to multiple IPs, I would appreciate some more depth on that one. Sure, all batch systems are sensitive to hostnames (as I know to my cost!) but once you get that straightened out why should users care? I am not being aggressive, just keen to find out more. -----Original Message----- From: gpfsug-discuss-bounces at spectrumscale.org [mailto:gpfsug-discuss-bounces at spectrumscale.org] On Behalf Of Stuart Barkley Sent: Wednesday, February 28, 2018 6:50 PM To: gpfsug main discussion list Subject: Re: [gpfsug-discuss] Problems with remote mount via routed IB The problem with CM is that it seems to require configuring IP over Infiniband. I'm rather strongly opposed to IP over IB. We did run IPoIB years ago, but pulled it out of our environment as adding unneeded complexity. It requires provisioning IP addresses across the Infiniband infrastructure and possibly adding routers to other portions of the IP infrastructure. It was also confusing some users due to multiple IPs on the compute infrastructure. We have recently been in discussions with a vendor about their support for GPFS over IB and they kept directing us to using CM (which still didn't work). CM wasn't necessary once we found out about the actual problem (we needed the undocumented verbsRdmaUseGidIndexZero configuration option among other things due to their use of SR-IOV based virtual IB interfaces). We don't use routed Infiniband and it might be that CM and IPoIB is required for IB routing, but I doubt it. It sounds like the OP is keeping IB and IP infrastructure separate. Stuart Barkley On Mon, 26 Feb 2018 at 14:16 -0000, Aaron Knister wrote: > Date: Mon, 26 Feb 2018 14:16:34 > From: Aaron Knister > Reply-To: gpfsug main discussion list > > To: gpfsug-discuss at spectrumscale.org > Subject: Re: [gpfsug-discuss] Problems with remote mount via routed IB > > Hi Jan Erik, > > It was my understanding that the IB hardware router required RDMA CM to work. > By default GPFS doesn't use the RDMA Connection Manager but it can be > enabled (e.g. verbsRdmaCm=enable). 
I think this requires a restart on > clients/servers (in both clusters) to take effect. Maybe someone else > on the list can comment in more detail-- I've been told folks have > successfully deployed IB routers with GPFS. > > -Aaron > > On 2/26/18 11:38 AM, Sundermann, Jan Erik (SCC) wrote: > > > > Dear all > > > > we are currently trying to remote mount a file system in a routed > > Infiniband test setup and face problems with dropped RDMA > > connections. The setup is the > > following: > > > > - Spectrum Scale Cluster 1 is setup on four servers which are > > connected to the same infiniband network. Additionally they are > > connected to a fast ethernet providing ip communication in the network 192.168.11.0/24. > > > > - Spectrum Scale Cluster 2 is setup on four additional servers which > > are connected to a second infiniband network. These servers have IPs > > on their IB interfaces in the network 192.168.12.0/24. > > > > - IP is routed between 192.168.11.0/24 and 192.168.12.0/24 on a > > dedicated machine. > > > > - We have a dedicated IB hardware router connected to both IB subnets. > > > > > > We tested that the routing, both IP and IB, is working between the > > two clusters without problems and that RDMA is working fine both for > > internal communication inside cluster 1 and cluster 2 > > > > When trying to remote mount a file system from cluster 1 in cluster > > 2, RDMA communication is not working as expected. Instead we see > > error messages on the remote host (cluster 2) > > > > > > 2018-02-23_13:48:47.037+0100: [I] VERBS RDMA connecting to > > 192.168.11.4 (iccn004-gpfs in gpfsstorage.localdomain) on mlx4_0 > > port 1 fabnum 0 index 2 > > 2018-02-23_13:48:49.890+0100: [I] VERBS RDMA connected to > > 192.168.11.4 (iccn004-gpfs in gpfsstorage.localdomain) on mlx4_0 > > port 1 fabnum 0 sl 0 index 2 > > 2018-02-23_13:48:53.138+0100: [E] VERBS RDMA closed connection to > > 192.168.11.1 (iccn001-gpfs in gpfsstorage.localdomain) on mlx4_0 > > port 1 fabnum 0 error 733 index 3 > > 2018-02-23_13:48:53.854+0100: [I] VERBS RDMA connecting to > > 192.168.11.1 (iccn001-gpfs in gpfsstorage.localdomain) on mlx4_0 > > port 1 fabnum 0 index 3 > > 2018-02-23_13:48:54.954+0100: [E] VERBS RDMA closed connection to > > 192.168.11.3 (iccn003-gpfs in gpfsstorage.localdomain) on mlx4_0 > > port 1 fabnum 0 error 733 index 1 > > 2018-02-23_13:48:55.601+0100: [I] VERBS RDMA connected to > > 192.168.11.1 (iccn001-gpfs in gpfsstorage.localdomain) on mlx4_0 > > port 1 fabnum 0 sl 0 index 3 > > 2018-02-23_13:48:57.775+0100: [I] VERBS RDMA connecting to > > 192.168.11.3 (iccn003-gpfs in gpfsstorage.localdomain) on mlx4_0 > > port 1 fabnum 0 index 1 > > 2018-02-23_13:48:59.557+0100: [I] VERBS RDMA connected to > > 192.168.11.3 (iccn003-gpfs in gpfsstorage.localdomain) on mlx4_0 > > port 1 fabnum 0 sl 0 index 1 > > 2018-02-23_13:48:59.876+0100: [E] VERBS RDMA closed connection to > > 192.168.11.2 (iccn002-gpfs in gpfsstorage.localdomain) on mlx4_0 > > port 1 fabnum 0 error 733 index 0 > > 2018-02-23_13:49:02.020+0100: [I] VERBS RDMA connecting to > > 192.168.11.2 (iccn002-gpfs in gpfsstorage.localdomain) on mlx4_0 > > port 1 fabnum 0 index 0 > > 2018-02-23_13:49:03.477+0100: [I] VERBS RDMA connected to > > 192.168.11.2 (iccn002-gpfs in gpfsstorage.localdomain) on mlx4_0 > > port 1 fabnum 0 sl 0 index 0 > > 2018-02-23_13:49:05.119+0100: [E] VERBS RDMA closed connection to > > 192.168.11.4 (iccn004-gpfs in gpfsstorage.localdomain) on mlx4_0 > > port 1 fabnum 0 error 733 index 2 > > 
2018-02-23_13:49:06.191+0100: [I] VERBS RDMA connecting to > > 192.168.11.4 (iccn004-gpfs in gpfsstorage.localdomain) on mlx4_0 > > port 1 fabnum 0 index 2 > > 2018-02-23_13:49:06.548+0100: [I] VERBS RDMA connected to > > 192.168.11.4 (iccn004-gpfs in gpfsstorage.localdomain) on mlx4_0 > > port 1 fabnum 0 sl 0 index 2 > > 2018-02-23_13:49:11.578+0100: [E] VERBS RDMA closed connection to > > 192.168.11.1 (iccn001-gpfs in gpfsstorage.localdomain) on mlx4_0 > > port 1 fabnum 0 error 733 index 3 > > 2018-02-23_13:49:11.937+0100: [I] VERBS RDMA connecting to > > 192.168.11.1 (iccn001-gpfs in gpfsstorage.localdomain) on mlx4_0 > > port 1 fabnum 0 index 3 > > 2018-02-23_13:49:11.939+0100: [I] VERBS RDMA connected to > > 192.168.11.1 (iccn001-gpfs in gpfsstorage.localdomain) on mlx4_0 > > port 1 fabnum 0 sl 0 index 3 > > > > > > and in the cluster with the file system (cluster 1) > > > > 2018-02-23_13:47:36.112+0100: [E] VERBS RDMA rdma read error > > IBV_WC_RETRY_EXC_ERR to 192.168.12.5 (iccn005-ib in > > gpfsremoteclients.localdomain) on mlx4_0 port 1 fabnum 0 vendor_err > > 129 > > 2018-02-23_13:47:36.112+0100: [E] VERBS RDMA closed connection to > > 192.168.12.5 (iccn005-ib in gpfsremoteclients.localdomain) on mlx4_0 > > port 1 fabnum 0 due to RDMA read error IBV_WC_RETRY_EXC_ERR index 3 > > 2018-02-23_13:47:47.161+0100: [I] VERBS RDMA accepted and connected > > to > > 192.168.12.5 (iccn005-ib in gpfsremoteclients.localdomain) on mlx4_0 > > port 1 fabnum 0 sl 0 index 3 > > 2018-02-23_13:48:04.317+0100: [E] VERBS RDMA rdma read error > > IBV_WC_RETRY_EXC_ERR to 192.168.12.5 (iccn005-ib in > > gpfsremoteclients.localdomain) on mlx4_0 port 1 fabnum 0 vendor_err > > 129 > > 2018-02-23_13:48:04.317+0100: [E] VERBS RDMA closed connection to > > 192.168.12.5 (iccn005-ib in gpfsremoteclients.localdomain) on mlx4_0 > > port 1 fabnum 0 due to RDMA read error IBV_WC_RETRY_EXC_ERR index 3 > > 2018-02-23_13:48:11.560+0100: [I] VERBS RDMA accepted and connected > > to > > 192.168.12.5 (iccn005-ib in gpfsremoteclients.localdomain) on mlx4_0 > > port 1 fabnum 0 sl 0 index 3 > > 2018-02-23_13:48:32.523+0100: [E] VERBS RDMA rdma read error > > IBV_WC_RETRY_EXC_ERR to 192.168.12.5 (iccn005-ib in > > gpfsremoteclients.localdomain) on mlx4_0 port 1 fabnum 0 vendor_err > > 129 > > 2018-02-23_13:48:32.523+0100: [E] VERBS RDMA closed connection to > > 192.168.12.5 (iccn005-ib in gpfsremoteclients.localdomain) on mlx4_0 > > port 1 fabnum 0 due to RDMA read error IBV_WC_RETRY_EXC_ERR index 3 > > 2018-02-23_13:48:35.398+0100: [I] VERBS RDMA accepted and connected > > to > > 192.168.12.5 (iccn005-ib in gpfsremoteclients.localdomain) on mlx4_0 > > port 1 fabnum 0 sl 0 index 3 > > 2018-02-23_13:48:53.135+0100: [E] VERBS RDMA rdma read error > > IBV_WC_RETRY_EXC_ERR to 192.168.12.5 (iccn005-ib in > > gpfsremoteclients.localdomain) on mlx4_0 port 1 fabnum 0 vendor_err > > 129 > > 2018-02-23_13:48:53.135+0100: [E] VERBS RDMA closed connection to > > 192.168.12.5 (iccn005-ib in gpfsremoteclients.localdomain) on mlx4_0 > > port 1 fabnum 0 due to RDMA read error IBV_WC_RETRY_EXC_ERR index 3 > > 2018-02-23_13:48:55.600+0100: [I] VERBS RDMA accepted and connected > > to > > 192.168.12.5 (iccn005-ib in gpfsremoteclients.localdomain) on mlx4_0 > > port 1 fabnum 0 sl 0 index 3 > > 2018-02-23_13:49:11.577+0100: [E] VERBS RDMA rdma read error > > IBV_WC_RETRY_EXC_ERR to 192.168.12.5 (iccn005-ib in > > gpfsremoteclients.localdomain) on mlx4_0 port 1 fabnum 0 vendor_err > > 129 > > 2018-02-23_13:49:11.577+0100: [E] VERBS RDMA closed 
connection to > > 192.168.12.5 (iccn005-ib in gpfsremoteclients.localdomain) on mlx4_0 > > port 1 fabnum 0 due to RDMA read error IBV_WC_RETRY_EXC_ERR index 3 > > 2018-02-23_13:49:11.939+0100: [I] VERBS RDMA accepted and connected > > to > > 192.168.12.5 (iccn005-ib in gpfsremoteclients.localdomain) on mlx4_0 > > port 1 fabnum 0 sl 0 index 3 > > > > > > > > Any advice on how to configure the setup in a way that would allow > > the remote mount via routed IB would be very appreciated. > > > > > > Thank you and best regards > > Jan Erik > > > > > > > > > > _______________________________________________ > > gpfsug-discuss mailing list > > gpfsug-discuss at spectrumscale.org > > https://emea01.safelinks.protection.outlook.com/?url=http%3A%2F%2Fgp > > fsug.org%2Fmailman%2Flistinfo%2Fgpfsug-discuss&data=01%7C01%7Cjohn.h > > earns%40asml.com%7Ce40045550fc3467dd62808d57ed4d0d9%7Caf73baa8f5944e > > b2a39d93e96cad61fc%7C1&sdata=v%2F35G6ZnlHFBm%2BfVddvcuraFd9FRChyOSRE > > YpqcNNP8%3D&reserved=0 > > > > -- > Aaron Knister > NASA Center for Climate Simulation (Code 606.2) Goddard Space Flight > Center > (301) 286-2776 > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > https://emea01.safelinks.protection.outlook.com/?url=http%3A%2F%2Fgpfs > ug.org%2Fmailman%2Flistinfo%2Fgpfsug-discuss&data=01%7C01%7Cjohn.hearn > s%40asml.com%7Ce40045550fc3467dd62808d57ed4d0d9%7Caf73baa8f5944eb2a39d > 93e96cad61fc%7C1&sdata=v%2F35G6ZnlHFBm%2BfVddvcuraFd9FRChyOSREYpqcNNP8 > %3D&reserved=0 > -- I've never been lost; I was once bewildered for three days, but never lost! -- Daniel Boone _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org https://emea01.safelinks.protection.outlook.com/?url=http%3A%2F%2Fgpfsug.org%2Fmailman%2Flistinfo%2Fgpfsug-discuss&data=01%7C01%7Cjohn.hearns%40asml.com%7Ce40045550fc3467dd62808d57ed4d0d9%7Caf73baa8f5944eb2a39d93e96cad61fc%7C1&sdata=v%2F35G6ZnlHFBm%2BfVddvcuraFd9FRChyOSREYpqcNNP8%3D&reserved=0 -- The information contained in this communication and any attachments is confidential and may be privileged, and is for the sole use of the intended recipient(s). Any unauthorized review, use, disclosure or distribution is prohibited. Unless explicitly stated otherwise in the body of this communication or the attachment thereto (if any), the information is provided on an AS-IS basis without any express or implied warranties or liabilities. To the extent you are relying on this information, you are doing so at your own risk. If you are not the intended recipient, please notify the sender immediately by replying to this message and destroy all copies of this message and any attachments. Neither the sender nor the company/group of companies he or she represents shall be liable for the proper and complete transmission of the information contained in this communication, or for any delay in its receipt. From lavila at illinois.edu Thu Mar 1 15:02:24 2018 From: lavila at illinois.edu (Avila-Diaz, Leandro) Date: Thu, 1 Mar 2018 15:02:24 +0000 Subject: [gpfsug-discuss] Meltdown, Spectre, and impacts on GPFS In-Reply-To: References: <5D655862-7F60-47F6-8BD2-A5298F73F70F@vanderbilt.edu> Message-ID: Good morning, Does anyone know if IBM has an official statement and/or perhaps a FAQ document about the Spectre/Meltdown impact on GPFS? 
Thank you

From: on behalf of IBM Spectrum Scale Reply-To: gpfsug main discussion list Date: Thursday, January 4, 2018 at 20:36 To: gpfsug main discussion list Subject: Re: [gpfsug-discuss] Meltdown, Spectre, and impacts on GPFS

Kevin, The team is aware of Meltdown and Spectre. Due to the late availability of production-ready test patches (they became available today) we started today working on evaluating the impact of applying these patches. The focus would be both on any potential functional impacts (especially to the kernel modules shipped with GPFS) and on the performance degradation which affects user/kernel mode transitions. Performance characterization will be complex, as some system calls which may get invoked often by the mmfsd daemon will suddenly become significantly more expensive because of the kernel changes. Depending on the main areas affected, code changes might be possible to alleviate the impact, by reducing frequency of certain calls, etc. Any such changes will be deployed over time. At this point, we can't say what impact this will have on stability or Performance on systems running GPFS - until IBM issues an official statement on this topic. We hope to have some basic answers soon.

Regards, The Spectrum Scale (GPFS) team

------------------------------------------------------------------------------------------------------------------ If you feel that your question can benefit other users of Spectrum Scale (GPFS), then please post it to the public IBM developerWroks Forum at https://www.ibm.com/developerworks/community/forums/html/forum?id=11111111-0000-0000-0000-000000000479. If your query concerns a potential software error in Spectrum Scale (GPFS) and you have an IBM software maintenance contract please contact 1-800-237-5511 in the United States or your local IBM Service Center in other countries. The forum is informally monitored as time permits and should not be used for priority messages to the Spectrum Scale (GPFS) team.

[Inactive hide details for "Buterbaugh, Kevin L" ---01/04/2018 01:11:59 PM---Happy New Year everyone, I'm sure that everyone is]"Buterbaugh, Kevin L" ---01/04/2018 01:11:59 PM---Happy New Year everyone, I'm sure that everyone is aware of Meltdown and Spectre by now - we, like m

From: "Buterbaugh, Kevin L" To: gpfsug main discussion list Date: 01/04/2018 01:11 PM Subject: [gpfsug-discuss] Meltdown, Spectre, and impacts on GPFS Sent by: gpfsug-discuss-bounces at spectrumscale.org

________________________________

Happy New Year everyone, I'm sure that everyone is aware of Meltdown and Spectre by now - we, like many other institutions, will be patching for it at the earliest possible opportunity. Our understanding is that the most serious of the negative performance impacts of these patches will be for things like I/O (disk / network) - given that, we are curious if IBM has any plans for a GPFS update that could help mitigate those impacts? Or is there simply nothing that can be done? If there is a GPFS update planned for this we'd be interested in knowing so that we could coordinate the kernel and GPFS upgrades on our cluster.

Thanks...

Kevin

P.S. The 'Happy New Year' wasn't intended as sarcasm - I hope it is a good year for everyone despite how it's starting out. :-O

-
Kevin Buterbaugh - Senior System Administrator Vanderbilt University - Advanced Computing Center for Research and Education Kevin.Buterbaugh at vanderbilt.edu - (615)875-9633

_______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org https://urldefense.proofpoint.com/v2/url?u=http-3A__gpfsug.org_mailman_listinfo_gpfsug-2Ddiscuss&d=DwICAg&c=jf_iaSHvJObTbx-siA1ZOg&r=IbxtjdkPAM2Sbon4Lbbi4w&m=m7Pdt9KL82CJT_AT-PwkmO3PbHg88-IQ7Jq-dwhDOdY&s=5i66Rx3vse5LcaN4-WlyCwi_TDTOQGQR2-X_XyjbBpw&e=

-------------- next part -------------- An HTML attachment was scrubbed... URL:

From bzhang at ca.ibm.com Thu Mar 1 22:47:57 2018 From: bzhang at ca.ibm.com (Bohai Zhang) Date: Thu, 1 Mar 2018 17:47:57 -0500 Subject: [gpfsug-discuss] Spectrum Scale Support Webinar - File Audit Logging Message-ID:

You are receiving this message because you are an IBM Spectrum Scale Client and in GPFS User Group.

IBM Spectrum Scale Support Webinar File Audit Logging

About this Webinar IBM Spectrum Scale Webinars are hosted by IBM Spectrum Scale Support to share expertise and knowledge of the Spectrum Scale product, as well as product updates and best practices based on various use cases. This webinar will discuss fundamentals of the new File Audit Logging function including configuration and key best practices that will aid you in successful deployment and use of File Audit Logging within Spectrum Scale. Please note that our webinars are free of charge and will be held online via WebEx.

Agenda: - Overview of File Audit Logging - Installation and deployment of File Audit Logging - Using File Audit Logging - Monitoring and troubleshooting File Audit Logging - Q&A

NA/EU Session Date: March 14, 2018 Time: 11 AM - 12PM EDT (4PM GMT) Registration: https://ibm.biz/BdZsZz Audience: Spectrum Scale Administrators

AP/JP Session Date: March 15, 2018 Time: 10AM - 11AM Beijing Time (11AM Tokyo Time) Registration: https://ibm.biz/BdZsZf Audience: Spectrum Scale Administrators

If you have any questions, please contact Robert Simon, Jun Hui Bu, Vlad Spoiala and Bohai Zhang.

Regards, IBM Spectrum Scale Support Team

Regards, IBM Spectrum Computing Bohai Zhang Critical Senior Technical Leader, IBM Systems Situation Tel: 1-905-316-2727 Resolver Mobile: 1-416-897-7488 Expert Badge Email: bzhang at ca.ibm.com 3600 STEELES AVE EAST, MARKHAM, ON, L3R 9Z7, Canada Live Chat at IBMStorageSuptMobile Apps Support Portal | Fix Central | Knowledge Center | Request for Enhancement | Product SMC IBM | dWA We meet our service commitment only when you are very satisfied and EXTREMELY LIKELY to recommend IBM.

-------------- next part -------------- An HTML attachment was scrubbed... URL:

From Greg.Lehmann at csiro.au Fri Mar 2 03:48:44 2018 From: Greg.Lehmann at csiro.au (Greg.Lehmann at csiro.au) Date: Fri, 2 Mar 2018 03:48:44 +0000 Subject: [gpfsug-discuss] wondering about outage free protocols upgrades Message-ID: <5912dbafbb044700ba723195a0e5e2f9@exch1-cdc.nexus.csiro.au>

Hi All, It appears a rolling node by node upgrade of a protocols cluster is not possible. Ctdb is the sticking point as it won't run with 2 different versions at the same time. Are there any plans to address this and make it a real Enterprise product? Cheers, Greg

-------------- next part -------------- An HTML attachment was scrubbed... URL:

From scale at us.ibm.com Fri Mar 2 05:15:21 2018 From: scale at us.ibm.com (IBM Spectrum Scale) Date: Fri, 2 Mar 2018 13:15:21 +0800 Subject: [gpfsug-discuss] Meltdown, Spectre, and impacts on GPFS In-Reply-To: References: <5D655862-7F60-47F6-8BD2-A5298F73F70F@vanderbilt.edu> Message-ID:

Hi, The verification/test work is still ongoing. Hopefully GPFS will publish statement soon. I think it would be available through several channels, such as FAQ.

Regards, The Spectrum Scale (GPFS) team

------------------------------------------------------------------------------------------------------------------ If you feel that your question can benefit other users of Spectrum Scale (GPFS), then please post it to the public IBM developerWroks Forum at https://www.ibm.com/developerworks/community/forums/html/forum?id=11111111-0000-0000-0000-000000000479. If your query concerns a potential software error in Spectrum Scale (GPFS) and you have an IBM software maintenance contract please contact 1-800-237-5511 in the United States or your local IBM Service Center in other countries. The forum is informally monitored as time permits and should not be used for priority messages to the Spectrum Scale (GPFS) team.

From: "Avila-Diaz, Leandro" To: gpfsug main discussion list Date: 03/01/2018 11:17 PM Subject: Re: [gpfsug-discuss] Meltdown, Spectre, and impacts on GPFS Sent by: gpfsug-discuss-bounces at spectrumscale.org

Good morning, Does anyone know if IBM has an official statement and/or perhaps a FAQ document about the Spectre/Meltdown impact on GPFS? Thank you

From: on behalf of IBM Spectrum Scale Reply-To: gpfsug main discussion list Date: Thursday, January 4, 2018 at 20:36 To: gpfsug main discussion list Subject: Re: [gpfsug-discuss] Meltdown, Spectre, and impacts on GPFS

Kevin, The team is aware of Meltdown and Spectre. Due to the late availability of production-ready test patches (they became available today) we started today working on evaluating the impact of applying these patches.
The focus would be both on any potential functional impacts (especially to the kernel modules shipped with GPFS) and on the performance degradation which affects user/kernel mode transitions. Performance characterization will be complex, as some system calls which may get invoked often by the mmfsd daemon will suddenly become significantly more expensive because of the kernel changes. Depending on the main areas affected, code changes might be possible to alleviate the impact, by reducing frequency of certain calls, etc. Any such changes will be deployed over time. At this point, we can't say what impact this will have on stability or Performance on systems running GPFS ? until IBM issues an official statement on this topic. We hope to have some basic answers soon. Regards, The Spectrum Scale (GPFS) team ------------------------------------------------------------------------------------------------------------------ If you feel that your question can benefit other users of Spectrum Scale (GPFS), then please post it to the public IBM developerWroks Forum at https://www.ibm.com/developerworks/community/forums/html/forum?id=11111111-0000-0000-0000-000000000479 . If your query concerns a potential software error in Spectrum Scale (GPFS) and you have an IBM software maintenance contract please contact 1-800-237-5511 in the United States or your local IBM Service Center in other countries. The forum is informally monitored as time permits and should not be used for priority messages to the Spectrum Scale (GPFS) team. Inactive hide details for "Buterbaugh, Kevin L" ---01/04/2018 01:11:59 PM---Happy New Year everyone, I?m sure that everyone is"Buterbaugh, Kevin L" ---01/04/2018 01:11:59 PM---Happy New Year everyone, I?m sure that everyone is aware of Meltdown and Spectre by now ? we, like m From: "Buterbaugh, Kevin L" To: gpfsug main discussion list Date: 01/04/2018 01:11 PM Subject: [gpfsug-discuss] Meltdown, Spectre, and impacts on GPFS Sent by: gpfsug-discuss-bounces at spectrumscale.org Happy New Year everyone, I?m sure that everyone is aware of Meltdown and Spectre by now ? we, like many other institutions, will be patching for it at the earliest possible opportunity. Our understanding is that the most serious of the negative performance impacts of these patches will be for things like I/O (disk / network) ? given that, we are curious if IBM has any plans for a GPFS update that could help mitigate those impacts? Or is there simply nothing that can be done? If there is a GPFS update planned for this we?d be interested in knowing so that we could coordinate the kernel and GPFS upgrades on our cluster. Thanks? Kevin P.S. The ?Happy New Year? wasn?t intended as sarcasm ? I hope it is a good year for everyone despite how it?s starting out. :-O ? 
Kevin Buterbaugh - Senior System Administrator Vanderbilt University - Advanced Computing Center for Research and Education Kevin.Buterbaugh at vanderbilt.edu - (615)875-9633 _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org https://urldefense.proofpoint.com/v2/url?u=http-3A__gpfsug.org_mailman_listinfo_gpfsug-2Ddiscuss&d=DwICAg&c=jf_iaSHvJObTbx-siA1ZOg&r=IbxtjdkPAM2Sbon4Lbbi4w&m=m7Pdt9KL82CJT_AT-PwkmO3PbHg88-IQ7Jq-dwhDOdY&s=5i66Rx3vse5LcaN4-WlyCwi_TDTOQGQR2-X_XyjbBpw&e= _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org https://urldefense.proofpoint.com/v2/url?u=http-3A__gpfsug.org_mailman_listinfo_gpfsug-2Ddiscuss&d=DwICAg&c=jf_iaSHvJObTbx-siA1ZOg&r=IbxtjdkPAM2Sbon4Lbbi4w&m=qFtjLJBRsEewfEfVZBW__Xk8CD9w04bJZpK0sJiCze0&s=LyDrwavwKGQHDl4DVW6-vpW2bjmJBtXrGGcFfDYyI4o&e= -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: graycol.gif Type: image/gif Size: 105 bytes Desc: not available URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: 19119307.gif Type: image/gif Size: 106 bytes Desc: not available URL: From Kevin.Buterbaugh at Vanderbilt.Edu Fri Mar 2 16:33:46 2018 From: Kevin.Buterbaugh at Vanderbilt.Edu (Buterbaugh, Kevin L) Date: Fri, 2 Mar 2018 16:33:46 +0000 Subject: [gpfsug-discuss] Meltdown, Spectre, and impacts on GPFS In-Reply-To: References: <5D655862-7F60-47F6-8BD2-A5298F73F70F@vanderbilt.edu> Message-ID: <6BBDFC67-D61F-4477-BF8A-1551925AF955@vanderbilt.edu> Hi Leandro, I think the silence in response to your question says a lot, don?t you? :-O IBM has said (on this list, I believe) that the Meltdown / Spectre patches do not impact GPFS functionality. They?ve been silent as to performance impacts, which can and will be taken various ways. In the absence of information from IBM, the approach we have chosen to take is to patch everything except our GPFS servers ? only we (the SysAdmins, oh, and the NSA, of course!) can log in to them, so we feel that the risk of not patching them is minimal. HTHAL? Kevin On Mar 1, 2018, at 9:02 AM, Avila-Diaz, Leandro > wrote: Good morning, Does anyone know if IBM has an official statement and/or perhaps a FAQ document about the Spectre/Meltdown impact on GPFS? Thank you From: > on behalf of IBM Spectrum Scale > Reply-To: gpfsug main discussion list > Date: Thursday, January 4, 2018 at 20:36 To: gpfsug main discussion list > Subject: Re: [gpfsug-discuss] Meltdown, Spectre, and impacts on GPFS Kevin, The team is aware of Meltdown and Spectre. Due to the late availability of production-ready test patches (they became available today) we started today working on evaluating the impact of applying these patches. The focus would be both on any potential functional impacts (especially to the kernel modules shipped with GPFS) and on the performance degradation which affects user/kernel mode transitions. Performance characterization will be complex, as some system calls which may get invoked often by the mmfsd daemon will suddenly become significantly more expensive because of the kernel changes. Depending on the main areas affected, code changes might be possible to alleviate the impact, by reducing frequency of certain calls, etc. Any such changes will be deployed over time. At this point, we can't say what impact this will have on stability or Performance on systems running GPFS ? 
until IBM issues an official statement on this topic. We hope to have some basic answers soon. Regards, The Spectrum Scale (GPFS) team ------------------------------------------------------------------------------------------------------------------ If you feel that your question can benefit other users of Spectrum Scale (GPFS), then please post it to the public IBM developerWroks Forum athttps://www.ibm.com/developerworks/community/forums/html/forum?id=11111111-0000-0000-0000-000000000479. If your query concerns a potential software error in Spectrum Scale (GPFS) and you have an IBM software maintenance contract please contact 1-800-237-5511 in the United States or your local IBM Service Center in other countries. The forum is informally monitored as time permits and should not be used for priority messages to the Spectrum Scale (GPFS) team. "Buterbaugh, Kevin L" ---01/04/2018 01:11:59 PM---Happy New Year everyone, I?m sure that everyone is aware of Meltdown and Spectre by now ? we, like m From: "Buterbaugh, Kevin L" > To: gpfsug main discussion list > Date: 01/04/2018 01:11 PM Subject: [gpfsug-discuss] Meltdown, Spectre, and impacts on GPFS Sent by: gpfsug-discuss-bounces at spectrumscale.org ________________________________ Happy New Year everyone, I?m sure that everyone is aware of Meltdown and Spectre by now ? we, like many other institutions, will be patching for it at the earliest possible opportunity. Our understanding is that the most serious of the negative performance impacts of these patches will be for things like I/O (disk / network) ? given that, we are curious if IBM has any plans for a GPFS update that could help mitigate those impacts? Or is there simply nothing that can be done? If there is a GPFS update planned for this we?d be interested in knowing so that we could coordinate the kernel and GPFS upgrades on our cluster. Thanks? Kevin P.S. The ?Happy New Year? wasn?t intended as sarcasm ? I hope it is a good year for everyone despite how it?s starting out. :-O ? Kevin Buterbaugh - Senior System Administrator Vanderbilt University - Advanced Computing Center for Research and Education Kevin.Buterbaugh at vanderbilt.edu - (615)875-9633 _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org https://urldefense.proofpoint.com/v2/url?u=http-3A__gpfsug.org_mailman_listinfo_gpfsug-2Ddiscuss&d=DwICAg&c=jf_iaSHvJObTbx-siA1ZOg&r=IbxtjdkPAM2Sbon4Lbbi4w&m=m7Pdt9KL82CJT_AT-PwkmO3PbHg88-IQ7Jq-dwhDOdY&s=5i66Rx3vse5LcaN4-WlyCwi_TDTOQGQR2-X_XyjbBpw&e= _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org https://na01.safelinks.protection.outlook.com/?url=http%3A%2F%2Fgpfsug.org%2Fmailman%2Flistinfo%2Fgpfsug-discuss&data=02%7C01%7CKevin.Buterbaugh%40vanderbilt.edu%7Ceec49ab3ce144a81db3d08d57f86b59d%7Cba5a7f39e3be4ab3b45067fa80faecad%7C0%7C0%7C636555138937139546&sdata=%2FFS%2FQzdMP4d%2Bgf4wCUPR7KOQxIIV6OABoaNrc0ySHdI%3D&reserved=0 ? Kevin Buterbaugh - Senior System Administrator Vanderbilt University - Advanced Computing Center for Research and Education Kevin.Buterbaugh at vanderbilt.edu - (615)875-9633 -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From Robert.Oesterlin at nuance.com Mon Mar 5 15:01:28 2018 From: Robert.Oesterlin at nuance.com (Oesterlin, Robert) Date: Mon, 5 Mar 2018 15:01:28 +0000 Subject: [gpfsug-discuss] More Details: US Spring Meeting - May 16-17th, Boston Message-ID: A few more details on the Spectrum Scale User Group US meeting. We are still finalizing the agenda, but expect two full days on presentations by IBM, users, and breakout sessions. We?re still looking for user presentations ? please contact me if you would like to present! Or if you have any topics that you?d like to see covered. Dates: Wednesday May 16th and Thursday May 17th Location: IBM Cambridge Innovation Center, One Rogers St , Cambridge, MA 02142-1203 (Near MIT and Boston) https://goo.gl/5oHSKo There are a number of nearby hotels. If you are considering coming, please book early. Boston has good public transport options, so if you book a bit farther out you may get a better price. More details on the agenda and a link to the sign-up coming in a few weeks. Bob Oesterlin Sr Principal Storage Engineer, Nuance -------------- next part -------------- An HTML attachment was scrubbed... URL: From kkr at lbl.gov Mon Mar 5 23:49:04 2018 From: kkr at lbl.gov (Kristy Kallback-Rose) Date: Mon, 5 Mar 2018 15:49:04 -0800 Subject: [gpfsug-discuss] RDMA data from Zimon In-Reply-To: References: Message-ID: <8EB2B774-1640-4AEA-A4ED-2D6DBEC3324E@lbl.gov> Thanks Eric. No one who is a ZIMon developer has jumped up to contradict this, so I?ll go with it :-) Many thanks. This is helpful to understand where the data is coming from and would be a welcome addition to the documentation. Cheers, Kristy > On Feb 15, 2018, at 9:08 AM, Eric Agar wrote: > > Kristy, > > I experimented a bit with this some months ago and looked at the ZIMon source code. I came to the conclusion that ZIMon is reporting values obtained from the IB counters (actually, delta values adjusted for time) and that yes, for port_xmit_data and port_rcv_data, one would need to multiply the values by 4 to make sense of them. > > To obtain a port_xmit_data value, the ZIMon sensor first looks for /sys/class/infiniband//ports//counters_ext/port_xmit_data_64, and if that is not found then looks for /sys/class/infiniband//ports//counters/port_xmit_data. Similarly for other counters/metrics. > > Full disclosure: I am not an IB expert nor a ZIMon developer. > > I hope this helps. > > > Eric M. Agar > agar at us.ibm.com > > > Kristy Kallback-Rose ---02/14/2018 08:47:59 PM---Hi, Can one of the IBMers tell me if port_xmit_data and port_rcv_data from Zimon can be interpreted > > From: Kristy Kallback-Rose > To: gpfsug main discussion list > Date: 02/14/2018 08:47 PM > Subject: [gpfsug-discuss] RDMA data from Zimon > Sent by: gpfsug-discuss-bounces at spectrumscale.org > > > > > Hi, > > Can one of the IBMers tell me if port_xmit_data and port_rcv_data from Zimon can be interpreted as RDMA Bytes/sec? Ideally, also how this data is being collected? I?m looking here: https://www.ibm.com/support/knowledgecenter/en/STXKQY_5.0.0/com.ibm.spectrum.scale.v5r00.doc/bl1hlp_monnetworksmetrics.htm > > But then I also look here: https://community.mellanox.com/docs/DOC-2751 > > and see "Total number of data octets, divided by 4 (lanes), received on all VLs. This is 64 bit counter.? So I wasn?t sure if some multiplication by 4 was in order. > > Please advise. 
> > Cheers, > Kristy_______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > https://urldefense.proofpoint.com/v2/url?u=http-3A__gpfsug.org_mailman_listinfo_gpfsug-2Ddiscuss&d=DwICAg&c=jf_iaSHvJObTbx-siA1ZOg&r=IbxtjdkPAM2Sbon4Lbbi4w&m=zIRb70L9sx_FvvC9IcWVKLOSOOFnx-hIGfjw0kUN7bw&s=D1g4YTG5WeUiHI3rCPr_kkPxbG9V9E-18UGXBeCvfB8&e= > > > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss

-------------- next part -------------- An HTML attachment was scrubbed... URL:

From S.J.Thompson at bham.ac.uk Tue Mar 6 12:49:26 2018 From: S.J.Thompson at bham.ac.uk (Simon Thompson (IT Research Support)) Date: Tue, 6 Mar 2018 12:49:26 +0000 Subject: [gpfsug-discuss] tscCmdPortRange question Message-ID: <95B22F2C-F59C-4271-9528-16BEBCA179C8@bham.ac.uk>

We are looking at setting a value for tscCmdPortRange so that we can apply firewalls to a small number of GPFS nodes in one of our clusters. The docs don't give an indication on the number of ports that are required to be in the range. Could anyone make a suggestion on this? It doesn't appear as a parameter for 'mmchconfig -i', so I assume that it requires the nodes to be restarted, however I'm not clear if we could do a rolling restart on this? Thanks Simon

-------------- next part -------------- An HTML attachment was scrubbed... URL:

From r.sobey at imperial.ac.uk Tue Mar 6 18:48:40 2018 From: r.sobey at imperial.ac.uk (Sobey, Richard A) Date: Tue, 6 Mar 2018 18:48:40 +0000 Subject: [gpfsug-discuss] wondering about outage free protocols upgrades In-Reply-To: <5912dbafbb044700ba723195a0e5e2f9@exch1-cdc.nexus.csiro.au> References: <5912dbafbb044700ba723195a0e5e2f9@exch1-cdc.nexus.csiro.au> Message-ID:

Thanks for raising this, I was going to ask. The last I heard it was baked into the 5.0 release of Scale but the release notes are eerily quiet on the matter. Would be good to get some input from IBM on this. Richard

Get Outlook for Android

________________________________ From: gpfsug-discuss-bounces at spectrumscale.org on behalf of Greg.Lehmann at csiro.au Sent: Friday, March 2, 2018 3:48:44 AM To: gpfsug-discuss at spectrumscale.org Subject: [gpfsug-discuss] wondering about outage free protocols upgrades

Hi All, It appears a rolling node by node upgrade of a protocols cluster is not possible. Ctdb is the sticking point as it won't run with 2 different versions at the same time. Are there any plans to address this and make it a real Enterprise product? Cheers, Greg

-------------- next part -------------- An HTML attachment was scrubbed... URL:

From christof.schmitt at us.ibm.com Tue Mar 6 18:50:00 2018 From: christof.schmitt at us.ibm.com (Christof Schmitt) Date: Tue, 6 Mar 2018 18:50:00 +0000 Subject: [gpfsug-discuss] wondering about outage free protocols upgrades In-Reply-To: <5912dbafbb044700ba723195a0e5e2f9@exch1-cdc.nexus.csiro.au> References: <5912dbafbb044700ba723195a0e5e2f9@exch1-cdc.nexus.csiro.au> Message-ID:

An HTML attachment was scrubbed... URL:

From Kevin.Buterbaugh at Vanderbilt.Edu Tue Mar 6 17:17:59 2018 From: Kevin.Buterbaugh at Vanderbilt.Edu (Buterbaugh, Kevin L) Date: Tue, 6 Mar 2018 17:17:59 +0000 Subject: [gpfsug-discuss] mmfind performance Message-ID:

Hi All, In the README for the mmfind command it says:

mmfind A highly efficient file system traversal tool, designed to serve as a drop-in replacement for the 'find' command as used against GPFS FSes.
And: mmfind is expected to be slower than find on file systems with relatively few inodes. This is due to the overhead of using mmapplypolicy. However, if you make use of the -exec flag to carry out a relatively expensive operation on each file (e.g. compute a checksum), using mmfind should yield a significant performance improvement, even on a file system with relatively few inodes. I have a list of just shy of 50 inode numbers that I need to figure out what file they correspond to, so I decided to give mmfind a try: + cd /usr/lpp/mmfs/samples/ilm + ./mmfind /gpfs23 -inum 113769917 -o -inum 132539418 -o -inum 135584191 -o -inum 136471839 -o -inum 137009371 -o -inum 137314798 -o -inum 137939675 -o -inum 137997971 -o -inum 138013736 -o -inum 138029061 -o -inum 138029065 -o -inum 138029076 -o -inum 138029086 -o -inum 138029093 -o -inum 138029099 -o -inum 138029101 -o -inum 138029102 -o -inum 138029106 -o -inum 138029112 -o -inum 138029113 -o -inum 138029114 -o -inum 138029119 -o -inum 138029120 -o -inum 138029121 -o -inum 138029130 -o -inum 138029131 -o -inum 138029132 -o -inum 138029141 -o -inum 138029146 -o -inum 138029147 -o -inum 138029152 -o -inum 138029153 -o -inum 138029154 -o -inum 138029163 -o -inum 138029164 -o -inum 138029165 -o -inum 138029174 -o -inum 138029175 -o -inum 138029176 -o -inum 138083075 -o -inum 138083148 -o -inum 138083149 -o -inum 138083155 -o -inum 138216465 -o -inum 138216483 -o -inum 138216507 -o -inum 138216535 -o -inum 138235320 -ls I kicked that off last Friday and it is _still_ running. By comparison, I have a Perl script that I have run in the past that simple traverses the entire filesystem tree and stat?s each file and outputs that to a log file. That script would ?only? run ~24 hours. Clearly mmfind as I invoked it is much slower than the corresponding Perl script, so what am I doing wrong? Thanks? Kevin ? Kevin Buterbaugh - Senior System Administrator Vanderbilt University - Advanced Computing Center for Research and Education Kevin.Buterbaugh at vanderbilt.edu - (615)875-9633 -------------- next part -------------- An HTML attachment was scrubbed... URL: From r.sobey at imperial.ac.uk Tue Mar 6 18:54:47 2018 From: r.sobey at imperial.ac.uk (Sobey, Richard A) Date: Tue, 6 Mar 2018 18:54:47 +0000 Subject: [gpfsug-discuss] wondering about outage free protocols upgrades In-Reply-To: References: <5912dbafbb044700ba723195a0e5e2f9@exch1-cdc.nexus.csiro.au>, Message-ID: The sales pitch my colleagues heard suggested it was already in v5.. That's a big shame to hear that we all misunderstood. Get Outlook for Android ________________________________ From: gpfsug-discuss-bounces at spectrumscale.org on behalf of Christof Schmitt Sent: Tuesday, March 6, 2018 6:50:00 PM To: gpfsug-discuss at spectrumscale.org Cc: gpfsug-discuss at spectrumscale.org Subject: Re: [gpfsug-discuss] wondering about outage free protocols upgrades Hi, at this point there are no plans to support "node by node" upgrade for SMB. Some background: The technical reason for this restriction is that the records shared between protocol nodes for the SMB service (ctdb and Samba) are not versioned and no mechanism is in place to handle different versions. Changing this would be a large development task that has not been included in any current plans. 
Note that this only affects the SMB service and that the knowledge center outlines a procedure to minimize the outage, by getting half of the protocol nodes ready with the new Samba version and then only taking a brief outage when switching from the "old" to the "new" Samba version: https://www.ibm.com/support/knowledgecenter/en/STXKQY_4.2.2/com.ibm.spectrum.scale.v4r22.doc/bl1ins_updatingsmb.htm The toolkit follows the same approach during an upgrade to minimize the outage. We know that this is not ideal, but as mentioned above this is limited by the large effort that would be required which has to be weighed against other requirements and priorities. Regards, Christof Schmitt || IBM || Spectrum Scale Development || Tucson, AZ christof.schmitt at us.ibm.com || +1-520-799-2469 (T/L: 321-2469) ----- Original message ----- From: Sent by: gpfsug-discuss-bounces at spectrumscale.org To: Cc: Subject: [gpfsug-discuss] wondering about outage free protocols upgrades Date: Tue, Mar 6, 2018 10:19 AM Hi All, It appears a rolling node by node upgrade of a protocols cluster is not possible. Ctdb is the sticking point as it won?t run with 2 different versions at the same time. Are there any plans to address this and make it a real Enterprise product? Cheers, Greg _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org https://urldefense.proofpoint.com/v2/url?u=http-3A__gpfsug.org_mailman_listinfo_gpfsug-2Ddiscuss&d=DwICAg&c=jf_iaSHvJObTbx-siA1ZOg&r=5Nn7eUPeYe291x8f39jKybESLKv_W_XtkTkS8fTR-NI&m=p5fg7X1tKGwi1BsYiw-wHTxmaG-PLihwHV0yTBQNaUs&s=3ZHS5vAoxeC6ikuOpTRLWNTpvgKEC3thI-qUgyU_hYo&e= -------------- next part -------------- An HTML attachment was scrubbed... URL: From christof.schmitt at us.ibm.com Tue Mar 6 18:57:32 2018 From: christof.schmitt at us.ibm.com (Christof Schmitt) Date: Tue, 6 Mar 2018 18:57:32 +0000 Subject: [gpfsug-discuss] wondering about outage free protocols upgrades In-Reply-To: References: , <5912dbafbb044700ba723195a0e5e2f9@exch1-cdc.nexus.csiro.au> Message-ID: An HTML attachment was scrubbed... URL: From dod2014 at med.cornell.edu Tue Mar 6 18:23:41 2018 From: dod2014 at med.cornell.edu (Douglas Duckworth) Date: Tue, 6 Mar 2018 13:23:41 -0500 Subject: [gpfsug-discuss] 100G RoCEE and Spectrum Scale Performance Message-ID: Hi We are currently running Spectrum Scale over FDR Infiniband. We plan on upgrading to EDR since I have not really encountered documentation saying to abandon the lower-latency advantage found in Infiniband. Our workloads generally benefit from lower latency. It looks like, ignoring GPFS, EDR still has higher throughput and lower latency when compared to 100G RoCEE. http://sc16.supercomputing.org/sc-archive/tech_poster/poster_files/post149s2-file3.pdf However, I wanted to get feedback on how GPFS performs with 100G Ethernet instead of FDR. Thanks very much! Doug Thanks, Douglas Duckworth, MSc, LFCS HPC System Administrator Scientific Computing Unit Physiology and Biophysics Weill Cornell Medicine E: doug at med.cornell.edu O: 212-746-6305 F: 212-746-8690 -------------- next part -------------- An HTML attachment was scrubbed... URL: From olaf.weiser at de.ibm.com Tue Mar 6 19:46:59 2018 From: olaf.weiser at de.ibm.com (Olaf Weiser) Date: Tue, 6 Mar 2018 20:46:59 +0100 Subject: [gpfsug-discuss] tscCmdPortRange question In-Reply-To: <95B22F2C-F59C-4271-9528-16BEBCA179C8@bham.ac.uk> References: <95B22F2C-F59C-4271-9528-16BEBCA179C8@bham.ac.uk> Message-ID: An HTML attachment was scrubbed... 
URL: From knop at us.ibm.com Tue Mar 6 23:11:38 2018 From: knop at us.ibm.com (Felipe Knop) Date: Tue, 6 Mar 2018 18:11:38 -0500 Subject: [gpfsug-discuss] tscCmdPortRange question In-Reply-To: References: <95B22F2C-F59C-4271-9528-16BEBCA179C8@bham.ac.uk> Message-ID: Olaf, Correct. mmchconfig -i is accepted for tscCmdPortRange . The change should take place immediately, upon invocation of the next command. Felipe ---- Felipe Knop knop at us.ibm.com GPFS Development and Security IBM Systems IBM Building 008 2455 South Rd, Poughkeepsie, NY 12601 (845) 433-9314 T/L 293-9314 From: "Olaf Weiser" To: gpfsug main discussion list Date: 03/06/2018 02:47 PM Subject: Re: [gpfsug-discuss] tscCmdPortRange question Sent by: gpfsug-discuss-bounces at spectrumscale.org this parameter is just for administrative commands.. "where" to send the output of a command... and for those admin ports .. so called ephemeral ports... it depends , how much admin commands ( = sessions = sockets) you want to run in parallel in my experience.. 10 ports is more than enough we use those in a range from 50000-50010 to be clear .. demon - to - demon .. communication always uses 1191 cheers From: "Simon Thompson (IT Research Support)" To: "gpfsug-discuss at spectrumscale.org" Date: 03/06/2018 06:55 PM Subject: [gpfsug-discuss] tscCmdPortRange question Sent by: gpfsug-discuss-bounces at spectrumscale.org We are looking at setting a value for tscCmdPortRange so that we can apply firewalls to a small number of GPFS nodes in one of our clusters. The docs don?t give an indication on the number of ports that are required to be in the range. Could anyone make a suggestion on this? It doesn?t appear as a parameter for ?mmchconfig -i?, so I assume that it requires the nodes to be restarted, however I?m not clear if we could do a rolling restart on this? Thanks Simon_______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org https://urldefense.proofpoint.com/v2/url?u=http-3A__gpfsug.org_mailman_listinfo_gpfsug-2Ddiscuss&d=DwICAg&c=jf_iaSHvJObTbx-siA1ZOg&r=oNT2koCZX0xmWlSlLblR9Q&m=pezsJOeWDWSnEkh5d3dp175Vx4opvikABgoTzUt-9pQ&s=S_Qe62jYseR2Y2yjiovXwvVz3d2SFW-jCf0Pw5VB_f4&e= -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: graycol.gif Type: image/gif Size: 105 bytes Desc: not available URL: From makaplan at us.ibm.com Tue Mar 6 22:27:34 2018 From: makaplan at us.ibm.com (Marc A Kaplan) Date: Tue, 6 Mar 2018 17:27:34 -0500 Subject: [gpfsug-discuss] mmfind performance In-Reply-To: References: Message-ID: Please try: mmfind --polFlags '-N a_node_list -g /gpfs23/tmp' directory find-flags ... Where a_node_list is a node list of your choice and /gpfs23/tmp is a temp directory of your choice... And let us know how that goes. Also, you have chosen a special case, just looking for some inode numbers -- so find can skip stating the other inodes... whereas mmfind is not smart enough to do that -- but still with parallelism, I'd guess mmapplypolicy might still beat find in elapsed time to complete, even for this special case. 
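To make that concrete, an illustrative invocation only (the node names and the choice of /gpfs23/tmp below are placeholders -- substitute a node list or node class that exists in your cluster, and a work directory on a filesystem that all of those nodes can reach):

  ./mmfind --polFlags '-N node1,node2,node3 -g /gpfs23/tmp' /gpfs23 -inum 113769917 -o -inum 132539418 -ls

Here -N and -g are the usual mmapplypolicy options (the nodes to run the scan on, and a shared global work directory for their temporary files), which is what lets the directory/inode scan run in parallel rather than only on the node where you start mmfind.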
-- Marc K of GPFS From: "Buterbaugh, Kevin L" To: gpfsug main discussion list Date: 03/06/2018 01:52 PM Subject: [gpfsug-discuss] mmfind performance Sent by: gpfsug-discuss-bounces at spectrumscale.org Hi All, In the README for the mmfind command it says: mmfind A highly efficient file system traversal tool, designed to serve as a drop-in replacement for the 'find' command as used against GPFS FSes. And: mmfind is expected to be slower than find on file systems with relatively few inodes. This is due to the overhead of using mmapplypolicy. However, if you make use of the -exec flag to carry out a relatively expensive operation on each file (e.g. compute a checksum), using mmfind should yield a significant performance improvement, even on a file system with relatively few inodes. I have a list of just shy of 50 inode numbers that I need to figure out what file they correspond to, so I decided to give mmfind a try: + cd /usr/lpp/mmfs/samples/ilm + ./mmfind /gpfs23 -inum 113769917 -o -inum 132539418 -o -inum 135584191 -o -inum 136471839 -o -inum 137009371 -o -inum 137314798 -o -inum 137939675 -o -inum 137997971 -o -inum 138013736 -o -inum 138029061 -o -inum 138029065 -o -inum 138029076 -o -inum 138029086 -o -inum 138029093 -o -inum 138029099 -o -inum 138029101 -o -inum 138029102 -o -inum 138029106 -o -inum 138029112 -o -inum 138029113 -o -inum 138029114 -o -inum 138029119 -o -inum 138029120 -o -inum 138029121 -o -inum 138029130 -o -inum 138029131 -o -inum 138029132 -o -inum 138029141 -o -inum 138029146 -o -inum 138029147 -o -inum 138029152 -o -inum 138029153 -o -inum 138029154 -o -inum 138029163 -o -inum 138029164 -o -inum 138029165 -o -inum 138029174 -o -inum 138029175 -o -inum 138029176 -o -inum 138083075 -o -inum 138083148 -o -inum 138083149 -o -inum 138083155 -o -inum 138216465 -o -inum 138216483 -o -inum 138216507 -o -inum 138216535 -o -inum 138235320 -ls I kicked that off last Friday and it is _still_ running. By comparison, I have a Perl script that I have run in the past that simple traverses the entire filesystem tree and stat?s each file and outputs that to a log file. That script would ?only? run ~24 hours. Clearly mmfind as I invoked it is much slower than the corresponding Perl script, so what am I doing wrong? Thanks? Kevin ? Kevin Buterbaugh - Senior System Administrator Vanderbilt University - Advanced Computing Center for Research and Education Kevin.Buterbaugh at vanderbilt.edu - (615)875-9633 _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org https://urldefense.proofpoint.com/v2/url?u=http-3A__gpfsug.org_mailman_listinfo_gpfsug-2Ddiscuss&d=DwICAg&c=jf_iaSHvJObTbx-siA1ZOg&r=cvpnBBH0j41aQy0RPiG2xRL_M8mTc1izuQD3_PmtjZ8&m=48WYhVkWI1kr_BM-Wg_VaXEOi7xfGusnZcJtkiA98zg&s=IXUhEC_thuGAVwGJ02oazCCnKEuAdGeg890fBelP4kE&e= -------------- next part -------------- An HTML attachment was scrubbed... URL: From aaron.s.knister at nasa.gov Wed Mar 7 01:30:14 2018 From: aaron.s.knister at nasa.gov (Aaron Knister) Date: Tue, 6 Mar 2018 20:30:14 -0500 Subject: [gpfsug-discuss] [non-nasa source] Re: pagepool shrink doesn't release all memory In-Reply-To: <65453649-77df-2efa-8776-eb2775ca9efa@nasa.gov> References: <65453649-77df-2efa-8776-eb2775ca9efa@nasa.gov> Message-ID: Following up on this... On one of the nodes on which I'd bounced the pagepool around I managed to cause what appeared to that node as filesystem corruption (i/o errors and fsstruct errors) on every single fs. 
Thankfully none of the other nodes in the cluster seemed to agree that the fs was corrupt. I'll open a PMR on that but I thought it was interesting none the less. I haven't run an fsck on any of the filesystems but my belief is that they're OK since so far none of the other nodes in the cluster have complained. Secondly, I can see the pagepool allocations that align with registered verbs mr's (looking at mmfsadm dump verbs). In theory one can free an ib mr after registration as long as it's not in use but one has to track that and I could see that being a tricky thing (although in theory given the fact that GPFS has its own page allocator it might be relatively trivial to figure it out but it might also require re-establishing RDMA connections depending on whether or not a given QP is associated with a PD that uses the MR trying to be freed...I think that makes sense). Anyway, I'm wondering if the need to free the ib MR on pagepool shrink could be avoided all together by limiting the amount of memory that gets allocated to verbs MR's (e.g. something like verbsPagePoolMaxMB) so that those regions never need to be freed but the amount of memory available for user caching could grow and shrink as required. It's probably not that simple, though :) Another thought I had was doing something like creating a file in /dev/shm, registering it as a loopback device, and using that as an LROC device. I just don't think that's feasible at scale given the current method of LROC device registration (e.g. via the mmsdrfs file). I think there's much to be gained from the ability to dynamically change the memory-based file cache size on a per-job basis so I'm really hopeful we can find a way to make this work. -Aaron On 2/25/18 11:45 AM, Aaron Knister wrote: > Hmm...interesting. It sure seems to try :) > > The pmap command was this: > > pmap $(pidof mmfsd) | sort -n -k3 | tail > > -Aaron > > On 2/23/18 9:35 AM, IBM Spectrum Scale wrote: >> AFAIK you can increase the pagepool size dynamically but you cannot >> shrink it dynamically. ?To shrink it you must restart the GPFS daemon. >> Also, could you please provide the actual pmap commands you executed? >> >> Regards, The Spectrum Scale (GPFS) team >> >> ------------------------------------------------------------------------------------------------------------------ >> >> If you feel that your question can benefit other users of ?Spectrum >> Scale (GPFS), then please post it to the public IBM developerWroks >> Forum at >> https://www.ibm.com/developerworks/community/forums/html/forum?id=11111111-0000-0000-0000-000000000479. >> >> >> If your query concerns a potential software error in Spectrum Scale >> (GPFS) and you have an IBM software maintenance contract please >> contact ??1-800-237-5511 in the United States or your local IBM >> Service Center in other countries. >> >> The forum is informally monitored as time permits and should not be >> used for priority messages to the Spectrum Scale (GPFS) team. >> >> >> >> From: Aaron Knister >> To: >> Date: 02/22/2018 10:30 PM >> Subject: Re: [gpfsug-discuss] pagepool shrink doesn't release all memory >> Sent by: gpfsug-discuss-bounces at spectrumscale.org >> ------------------------------------------------------------------------ >> >> >> >> This is also interesting (although I don't know what it really means). >> Looking at pmap run against mmfsd I can see what happens after each step: >> >> # baseline >> 00007fffe4639000 ?59164K ? ? ?0K ? ? ?0K ? ? ?0K ? ? ?0K ---p [anon] >> 00007fffd837e000 ?61960K ? ? ?0K ? ? 
?0K ? ? ?0K ? ? ?0K ---p [anon] >> 0000020000000000 1048576K 1048576K 1048576K 1048576K ? ? ?0K rwxp [anon] >> Total: ? ? ? ? ? 1613580K 1191020K 1189650K 1171836K ? ? ?0K >> >> # tschpool 64G >> 00007fffe4639000 ?59164K ? ? ?0K ? ? ?0K ? ? ?0K ? ? ?0K ---p [anon] >> 00007fffd837e000 ?61960K ? ? ?0K ? ? ?0K ? ? ?0K ? ? ?0K ---p [anon] >> 0000020000000000 67108864K 67108864K 67108864K 67108864K ?0K rwxp [anon] >> Total: ? ? ? ? ? 67706636K 67284108K 67282625K 67264920K ? ? ?0K >> >> # tschpool 1G >> 00007fffe4639000 ?59164K ? ? ?0K ? ? ?0K ? ? ?0K ? ? ?0K ---p [anon] >> 00007fffd837e000 ?61960K ? ? ?0K ? ? ?0K ? ? ?0K ? ? ?0K ---p [anon] >> 0000020001400000 139264K 139264K 139264K 139264K ? ? ?0K rwxp [anon] >> 0000020fc9400000 897024K 897024K 897024K 897024K ? ? ?0K rwxp [anon] >> 0000020009c00000 66052096K ? ? ?0K ? ? ?0K ? ? ?0K ? ? ?0K rwxp [anon] >> Total: ? ? ? ? ? 67706636K 1223820K 1222451K 1204632K ? ? ?0K >> >> Even though mmfsd has that 64G chunk allocated there's none of it >> *used*. I wonder why Linux seems to be accounting it as allocated. >> >> -Aaron >> >> On 2/22/18 10:17 PM, Aaron Knister wrote: >> ?> I've been exploring the idea for a while of writing a SLURM SPANK >> plugin >> ?> to allow users to dynamically change the pagepool size on a node. >> Every >> ?> now and then we have some users who would benefit significantly from a >> ?> much larger pagepool on compute nodes but by default keep it on the >> ?> smaller side to make as much physmem available as possible to batch >> work. >> ?> >> ?> In testing, though, it seems as though reducing the pagepool doesn't >> ?> quite release all of the memory. I don't really understand it because >> ?> I've never before seen memory that was previously resident become >> ?> un-resident but still maintain the virtual memory allocation. >> ?> >> ?> Here's what I mean. Let's take a node with 128G and a 1G pagepool. >> ?> >> ?> If I do the following to simulate what might happen as various jobs >> ?> tweak the pagepool: >> ?> >> ?> - tschpool 64G >> ?> - tschpool 1G >> ?> - tschpool 32G >> ?> - tschpool 1G >> ?> - tschpool 32G >> ?> >> ?> I end up with this: >> ?> >> ?> mmfsd thinks there's 32G resident but 64G virt >> ?> # ps -o vsz,rss,comm -p 24397 >> ?> ??? VSZ?? RSS COMMAND >> ?> 67589400 33723236 mmfsd >> ?> >> ?> however, linux thinks there's ~100G used >> ?> >> ?> # free -g >> ?> total?????? used free???? shared??? buffers cached >> ?> Mem:?????????? 125 100???????? 25 0????????? 0 0 >> ?> -/+ buffers/cache: 98???????? 26 >> ?> Swap: 7????????? 0 7 >> ?> >> ?> I can jump back and forth between 1G and 32G *after* allocating 64G >> ?> pagepool and the overall amount of memory in use doesn't balloon but I >> ?> can't seem to shed that original 64G. >> ?> >> ?> I don't understand what's going on... :) Any ideas? This is with Scale >> ?> 4.2.3.6. 
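For anyone who wants to reproduce the effect Aaron describes, a minimal sketch of the same bounce test is below. It assumes a test node you can afford to disturb and that /usr/lpp/mmfs/bin is on the PATH; it drives tschpool only because that is the command used in the post above, while mmchconfig pagepool=<size> -i is the supported way to grow the pagepool on a running daemon, and, as noted elsewhere in the thread, a shrink may not fully take effect until mmfsd is restarted.

#!/bin/bash
# Sketch: cycle the pagepool through the sizes from the post above and
# record how mmfsd and the OS account for the memory after each change.
export PATH=$PATH:/usr/lpp/mmfs/bin

snapshot() {
    echo "=== after pagepool=$1 ==="
    ps -o vsz,rss,comm -p "$(pidof mmfsd)"        # daemon view (VSZ vs RSS)
    free -g                                       # OS view
    pmap "$(pidof mmfsd)" | sort -n -k3 | tail    # largest mappings, as above
}

for size in 64G 1G 32G 1G 32G; do
    tschpool "$size"       # low-level command used in the original post
    sleep 10
    snapshot "$size"
done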
>> ?> >> ?> -Aaron >> ?> >> >> -- >> Aaron Knister >> NASA Center for Climate Simulation (Code 606.2) >> Goddard Space Flight Center >> (301) 286-2776 >> _______________________________________________ >> gpfsug-discuss mailing list >> gpfsug-discuss at spectrumscale.org >> https://urldefense.proofpoint.com/v2/url?u=http-3A__gpfsug.org_mailman_listinfo_gpfsug-2Ddiscuss&d=DwIGaQ&c=jf_iaSHvJObTbx-siA1ZOg&r=IbxtjdkPAM2Sbon4Lbbi4w&m=OrZQeEmI6chBdguG-h4YPHsxXZ4gTU3CtIuN4e3ijdY&s=hvVIRG5kB1zom2Iql2_TOagchsgl99juKiZfJt5S1tM&e= >> >> >> >> >> >> >> >> _______________________________________________ >> gpfsug-discuss mailing list >> gpfsug-discuss at spectrumscale.org >> http://gpfsug.org/mailman/listinfo/gpfsug-discuss >> > -- Aaron Knister NASA Center for Climate Simulation (Code 606.2) Goddard Space Flight Center (301) 286-2776 From Greg.Lehmann at csiro.au Tue Mar 6 23:36:12 2018 From: Greg.Lehmann at csiro.au (Greg.Lehmann at csiro.au) Date: Tue, 6 Mar 2018 23:36:12 +0000 Subject: [gpfsug-discuss] wondering about outage free protocols upgrades In-Reply-To: References: <5912dbafbb044700ba723195a0e5e2f9@exch1-cdc.nexus.csiro.au> Message-ID: <766f8a8ce5f84164af13f2361ef3b3c3@exch1-cdc.nexus.csiro.au> In theory it only affects SMB, but in practice if NFS depends on winbind for authorisation then it is affected too. I can understand the need for changes to happen every so often and that maybe outages will be required then. But, I would like to see some effort to avoid doing this unnecessarily. IBM, please consider my suggestion. The message I get from the ctdb service implies it is the sticking point. Can some consideration be given to keeping the ctdb version compatible between releases? Christof, you are saying something about the SMB service version compatibility. I am unclear as to whether you are talking about the Spectrum Scale Protocols SMB service or the default samba SMB over the wire protocol version being used to communicate between client and server. If the latter, is it possible to peg the version to the older version manually while doing the upgrade so that all nodes can be updated? You can then take an outage at a later time to update the over the wire version. From: gpfsug-discuss-bounces at spectrumscale.org [mailto:gpfsug-discuss-bounces at spectrumscale.org] On Behalf Of Christof Schmitt Sent: Wednesday, 7 March 2018 4:50 AM To: gpfsug-discuss at spectrumscale.org Cc: gpfsug-discuss at spectrumscale.org Subject: Re: [gpfsug-discuss] wondering about outage free protocols upgrades Hi, at this point there are no plans to support "node by node" upgrade for SMB. Some background: The technical reason for this restriction is that the records shared between protocol nodes for the SMB service (ctdb and Samba) are not versioned and no mechanism is in place to handle different versions. Changing this would be a large development task that has not been included in any current plans. Note that this only affects the SMB service and that the knowledge center outlines a procedure to minimize the outage, by getting half of the protocol nodes ready with the new Samba version and then only taking a brief outage when switching from the "old" to the "new" Samba version: https://www.ibm.com/support/knowledgecenter/en/STXKQY_4.2.2/com.ibm.spectrum.scale.v4r22.doc/bl1ins_updatingsmb.htm The toolkit follows the same approach during an upgrade to minimize the outage. 
We know that this is not ideal, but as mentioned above this is limited by the large effort that would be required which has to be weighed against other requirements and priorities. Regards, Christof Schmitt || IBM || Spectrum Scale Development || Tucson, AZ christof.schmitt at us.ibm.com || +1-520-799-2469 (T/L: 321-2469) ----- Original message ----- From: > Sent by: gpfsug-discuss-bounces at spectrumscale.org To: > Cc: Subject: [gpfsug-discuss] wondering about outage free protocols upgrades Date: Tue, Mar 6, 2018 10:19 AM Hi All, It appears a rolling node by node upgrade of a protocols cluster is not possible. Ctdb is the sticking point as it won?t run with 2 different versions at the same time. Are there any plans to address this and make it a real Enterprise product? Cheers, Greg _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org https://urldefense.proofpoint.com/v2/url?u=http-3A__gpfsug.org_mailman_listinfo_gpfsug-2Ddiscuss&d=DwICAg&c=jf_iaSHvJObTbx-siA1ZOg&r=5Nn7eUPeYe291x8f39jKybESLKv_W_XtkTkS8fTR-NI&m=p5fg7X1tKGwi1BsYiw-wHTxmaG-PLihwHV0yTBQNaUs&s=3ZHS5vAoxeC6ikuOpTRLWNTpvgKEC3thI-qUgyU_hYo&e= -------------- next part -------------- An HTML attachment was scrubbed... URL: From S.J.Thompson at bham.ac.uk Wed Mar 7 13:45:24 2018 From: S.J.Thompson at bham.ac.uk (Simon Thompson (IT Research Support)) Date: Wed, 7 Mar 2018 13:45:24 +0000 Subject: [gpfsug-discuss] mmfind performance Message-ID: <90F48570-7294-4032-8A6A-73DD51169A55@bham.ac.uk> I can?t comment on mmfind vs perl, but have you looked at trying ?tsfindinode? ? Simon From: on behalf of "Buterbaugh, Kevin L" Reply-To: "gpfsug-discuss at spectrumscale.org" Date: Tuesday, 6 March 2018 at 18:52 To: "gpfsug-discuss at spectrumscale.org" Subject: [gpfsug-discuss] mmfind performance Hi All, In the README for the mmfind command it says: mmfind A highly efficient file system traversal tool, designed to serve as a drop-in replacement for the 'find' command as used against GPFS FSes. And: mmfind is expected to be slower than find on file systems with relatively few inodes. This is due to the overhead of using mmapplypolicy. However, if you make use of the -exec flag to carry out a relatively expensive operation on each file (e.g. compute a checksum), using mmfind should yield a significant performance improvement, even on a file system with relatively few inodes. 
I have a list of just shy of 50 inode numbers that I need to figure out what file they correspond to, so I decided to give mmfind a try: + cd /usr/lpp/mmfs/samples/ilm + ./mmfind /gpfs23 -inum 113769917 -o -inum 132539418 -o -inum 135584191 -o -inum 136471839 -o -inum 137009371 -o -inum 137314798 -o -inum 137939675 -o -inum 137997971 -o -inum 138013736 -o -inum 138029061 -o -inum 138029065 -o -inum 138029076 -o -inum 138029086 -o -inum 138029093 -o -inum 138029099 -o -inum 138029101 -o -inum 138029102 -o -inum 138029106 -o -inum 138029112 -o -inum 138029113 -o -inum 138029114 -o -inum 138029119 -o -inum 138029120 -o -inum 138029121 -o -inum 138029130 -o -inum 138029131 -o -inum 138029132 -o -inum 138029141 -o -inum 138029146 -o -inum 138029147 -o -inum 138029152 -o -inum 138029153 -o -inum 138029154 -o -inum 138029163 -o -inum 138029164 -o -inum 138029165 -o -inum 138029174 -o -inum 138029175 -o -inum 138029176 -o -inum 138083075 -o -inum 138083148 -o -inum 138083149 -o -inum 138083155 -o -inum 138216465 -o -inum 138216483 -o -inum 138216507 -o -inum 138216535 -o -inum 138235320 -ls I kicked that off last Friday and it is _still_ running. By comparison, I have a Perl script that I have run in the past that simple traverses the entire filesystem tree and stat?s each file and outputs that to a log file. That script would ?only? run ~24 hours. Clearly mmfind as I invoked it is much slower than the corresponding Perl script, so what am I doing wrong? Thanks? Kevin ? Kevin Buterbaugh - Senior System Administrator Vanderbilt University - Advanced Computing Center for Research and Education Kevin.Buterbaugh at vanderbilt.edu - (615)875-9633 -------------- next part -------------- An HTML attachment was scrubbed... URL: From Kevin.Buterbaugh at Vanderbilt.Edu Wed Mar 7 15:18:24 2018 From: Kevin.Buterbaugh at Vanderbilt.Edu (Buterbaugh, Kevin L) Date: Wed, 7 Mar 2018 15:18:24 +0000 Subject: [gpfsug-discuss] mmfind performance In-Reply-To: References: Message-ID: Hi Marc, Thanks, I?m going to give this a try as the first mmfind finally finished overnight, but produced no output: /root root at gpfsmgrb# bash -x ~/bin/klb.sh + cd /usr/lpp/mmfs/samples/ilm + ./mmfind /gpfs23 -inum 113769917 -o -inum 132539418 -o -inum 135584191 -o -inum 136471839 -o -inum 137009371 -o -inum 137314798 -o -inum 137939675 -o -inum 137997971 -o -inum 138013736 -o -inum 138029061 -o -inum 138029065 -o -inum 138029076 -o -inum 138029086 -o -inum 138029093 -o -inum 138029099 -o -inum 138029101 -o -inum 138029102 -o -inum 138029106 -o -inum 138029112 -o -inum 138029113 -o -inum 138029114 -o -inum 138029119 -o -inum 138029120 -o -inum 138029121 -o -inum 138029130 -o -inum 138029131 -o -inum 138029132 -o -inum 138029141 -o -inum 138029146 -o -inum 138029147 -o -inum 138029152 -o -inum 138029153 -o -inum 138029154 -o -inum 138029163 -o -inum 138029164 -o -inum 138029165 -o -inum 138029174 -o -inum 138029175 -o -inum 138029176 -o -inum 138083075 -o -inum 138083148 -o -inum 138083149 -o -inum 138083155 -o -inum 138216465 -o -inum 138216483 -o -inum 138216507 -o -inum 138216535 -o -inum 138235320 -ls /root root at gpfsmgrb# BTW, I had put that in a simple script simply because I had a list of those inodes and it was easier for me to get that in the format I wanted via a script that I was editing than trying to do that on the command line. However, in the log file it was producing it ?hit? on 48 files: [I] Inodes scan: 978275821 files, 99448202 directories, 37189547 other objects, 1967508 'skipped' files and/or errors. 
[I] 2018-03-06 at 23:43:15.988 Policy evaluation. 1114913570 files scanned. [I] 2018-03-06 at 23:43:16.016 Sorting 48 candidate file list records. [I] 2018-03-06 at 23:43:16.040 Sorting 48 candidate file list records. [I] 2018-03-06 at 23:43:16.065 Choosing candidate files. 0 records scanned. [I] 2018-03-06 at 23:43:16.066 Choosing candidate files. 48 records scanned. [I] Summary of Rule Applicability and File Choices: Rule# Hit_Cnt KB_Hit Chosen KB_Chosen KB_Ill Rule 0 48 1274453504 48 1274453504 0 RULE 'mmfind' LIST 'mmfindList' DIRECTORIES_PLUS SHOW(.) WHERE(.) [I] Filesystem objects with no applicable rules: 1112946014. [I] GPFS Policy Decisions and File Choice Totals: Chose to list 1274453504KB: 48 of 48 candidates; Predicted Data Pool Utilization in KB and %: Pool_Name KB_Occupied KB_Total Percent_Occupied gpfs23capacity 564722407424 624917749760 90.367477583% gpfs23data 304797672448 531203506176 57.378701177% system 0 0 0.000000000% (no user data) [I] 2018-03-06 at 23:43:16.066 Policy execution. 0 files dispatched. [I] 2018-03-06 at 23:43:16.102 Policy execution. 0 files dispatched. [I] A total of 0 files have been migrated, deleted or processed by an EXTERNAL EXEC/script; 0 'skipped' files and/or errors. While I?m going to follow your suggestion next, if you (or anyone else on the list) can explain why the ?Hit_Cnt? is 48 but the ?-ls? I passed to mmfind didn?t result in anything being listed, my curiosity is piqued. And I?ll go ahead and say it before someone else does ? I haven?t just chosen a special case, I AM a special case? ;-) Kevin On Mar 6, 2018, at 4:27 PM, Marc A Kaplan > wrote: Please try: mmfind --polFlags '-N a_node_list -g /gpfs23/tmp' directory find-flags ... Where a_node_list is a node list of your choice and /gpfs23/tmp is a temp directory of your choice... And let us know how that goes. Also, you have chosen a special case, just looking for some inode numbers -- so find can skip stating the other inodes... whereas mmfind is not smart enough to do that -- but still with parallelism, I'd guess mmapplypolicy might still beat find in elapsed time to complete, even for this special case. -- Marc K of GPFS From: "Buterbaugh, Kevin L" > To: gpfsug main discussion list > Date: 03/06/2018 01:52 PM Subject: [gpfsug-discuss] mmfind performance Sent by: gpfsug-discuss-bounces at spectrumscale.org ________________________________ Hi All, In the README for the mmfind command it says: mmfind A highly efficient file system traversal tool, designed to serve as a drop-in replacement for the 'find' command as used against GPFS FSes. And: mmfind is expected to be slower than find on file systems with relatively few inodes. This is due to the overhead of using mmapplypolicy. However, if you make use of the -exec flag to carry out a relatively expensive operation on each file (e.g. compute a checksum), using mmfind should yield a significant performance improvement, even on a file system with relatively few inodes. 
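As an aside for anyone facing the same inode-to-pathname problem: under the covers mmfind turns those -inum tests into a single LIST rule, and the equivalent mmapplypolicy run can be launched directly. A minimal sketch follows; the policy file path, the node list and the work directory are placeholders (the -N and -g flags are the same ones the --polFlags suggestion above passes through), and only a few of the inode numbers are shown.

cat > /tmp/byinode.pol <<'EOF'
RULE EXTERNAL LIST 'hits' EXEC ''
RULE 'byInode' LIST 'hits' DIRECTORIES_PLUS
  WHERE INODE = 113769917 OR INODE = 132539418 OR INODE = 138235320
EOF

# Matching pathnames end up in /tmp/byinode.list.hits
mmapplypolicy /gpfs23 -P /tmp/byinode.pol -f /tmp/byinode \
    -I defer -L 1 -N somenodes -g /gpfs23/tmp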
I have a list of just shy of 50 inode numbers that I need to figure out what file they correspond to, so I decided to give mmfind a try: + cd /usr/lpp/mmfs/samples/ilm + ./mmfind /gpfs23 -inum 113769917 -o -inum 132539418 -o -inum 135584191 -o -inum 136471839 -o -inum 137009371 -o -inum 137314798 -o -inum 137939675 -o -inum 137997971 -o -inum 138013736 -o -inum 138029061 -o -inum 138029065 -o -inum 138029076 -o -inum 138029086 -o -inum 138029093 -o -inum 138029099 -o -inum 138029101 -o -inum 138029102 -o -inum 138029106 -o -inum 138029112 -o -inum 138029113 -o -inum 138029114 -o -inum 138029119 -o -inum 138029120 -o -inum 138029121 -o -inum 138029130 -o -inum 138029131 -o -inum 138029132 -o -inum 138029141 -o -inum 138029146 -o -inum 138029147 -o -inum 138029152 -o -inum 138029153 -o -inum 138029154 -o -inum 138029163 -o -inum 138029164 -o -inum 138029165 -o -inum 138029174 -o -inum 138029175 -o -inum 138029176 -o -inum 138083075 -o -inum 138083148 -o -inum 138083149 -o -inum 138083155 -o -inum 138216465 -o -inum 138216483 -o -inum 138216507 -o -inum 138216535 -o -inum 138235320 -ls I kicked that off last Friday and it is _still_ running. By comparison, I have a Perl script that I have run in the past that simple traverses the entire filesystem tree and stat?s each file and outputs that to a log file. That script would ?only? run ~24 hours. Clearly mmfind as I invoked it is much slower than the corresponding Perl script, so what am I doing wrong? Thanks? Kevin ? Kevin Buterbaugh - Senior System Administrator Vanderbilt University - Advanced Computing Center for Research and Education Kevin.Buterbaugh at vanderbilt.edu- (615)875-9633 _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org https://urldefense.proofpoint.com/v2/url?u=http-3A__gpfsug.org_mailman_listinfo_gpfsug-2Ddiscuss&d=DwICAg&c=jf_iaSHvJObTbx-siA1ZOg&r=cvpnBBH0j41aQy0RPiG2xRL_M8mTc1izuQD3_PmtjZ8&m=48WYhVkWI1kr_BM-Wg_VaXEOi7xfGusnZcJtkiA98zg&s=IXUhEC_thuGAVwGJ02oazCCnKEuAdGeg890fBelP4kE&e= _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org https://na01.safelinks.protection.outlook.com/?url=http%3A%2F%2Fgpfsug.org%2Fmailman%2Flistinfo%2Fgpfsug-discuss&data=02%7C01%7CKevin.Buterbaugh%40vanderbilt.edu%7C724521c8034241913d8508d58412dcf8%7Cba5a7f39e3be4ab3b45067fa80faecad%7C0%7C0%7C636560138922366489&sdata=faXozQ%2FGGDf8nARmk52%2B2W5eIEBfnYwNapJyH%2FagqIQ%3D&reserved=0 -------------- next part -------------- An HTML attachment was scrubbed... URL: From olaf.weiser at de.ibm.com Wed Mar 7 16:48:40 2018 From: olaf.weiser at de.ibm.com (Olaf Weiser) Date: Wed, 7 Mar 2018 17:48:40 +0100 Subject: [gpfsug-discuss] 100G RoCEE and Spectrum Scale Performance In-Reply-To: References: Message-ID: An HTML attachment was scrubbed... URL: From makaplan at us.ibm.com Wed Mar 7 19:15:59 2018 From: makaplan at us.ibm.com (Marc A Kaplan) Date: Wed, 7 Mar 2018 14:15:59 -0500 Subject: [gpfsug-discuss] mmfind -ls and so forth In-Reply-To: References: Message-ID: As always when dealing with computers and potentially long running jobs, run a test on a handful of files first, so you can rapidly debug. Did you read the mmfind.README ? It mentions...that this sample utility "some user assembly required..." ... 
mmfindUtil_processOutputFile.c A utility to parse the "list file" produced by mmapplypolicy and to print it in a find-compatible format mmfind invokes it once mmapplypolicy begins to populate the "list file" mmfindUtil_processOutputFile.sampleMakefile copy to 'makefile', modify as needed, and run 'make' to compile mmfindUtil_processOutputFile.c This should produce a binary called mmfindUtil_processOutputFile mmfind will not be able to run until this utility has been compiled on the node from which you launch mmfind. Works for me... [root at n2 ilm]# ./mmfind /goo/zdbig -ls 2463649 256 drwxr-xr-x 2 root root 262144 Feb 9 11:41 /goo/zdbig 6804497 0 -rw-r--r-- 1 root root 0 Feb 9 11:41 /goo/zdbig/xy -------------- next part -------------- An HTML attachment was scrubbed... URL: From christof.schmitt at us.ibm.com Wed Mar 7 21:53:34 2018 From: christof.schmitt at us.ibm.com (Christof Schmitt) Date: Wed, 7 Mar 2018 21:53:34 +0000 Subject: [gpfsug-discuss] wondering about outage free protocols upgrades In-Reply-To: <766f8a8ce5f84164af13f2361ef3b3c3@exch1-cdc.nexus.csiro.au> References: <766f8a8ce5f84164af13f2361ef3b3c3@exch1-cdc.nexus.csiro.au>, <5912dbafbb044700ba723195a0e5e2f9@exch1-cdc.nexus.csiro.au> Message-ID: An HTML attachment was scrubbed... URL: From r.sobey at imperial.ac.uk Thu Mar 8 09:41:56 2018 From: r.sobey at imperial.ac.uk (Sobey, Richard A) Date: Thu, 8 Mar 2018 09:41:56 +0000 Subject: [gpfsug-discuss] wondering about outage free protocols upgrades In-Reply-To: References: <766f8a8ce5f84164af13f2361ef3b3c3@exch1-cdc.nexus.csiro.au>, <5912dbafbb044700ba723195a0e5e2f9@exch1-cdc.nexus.csiro.au> Message-ID: Whether or not you meant it your words ?that is not available today.? Implies that something is coming in the future? Would you be reliant on the Samba/CTDB development team or would you roll your own.. supposing it?s possible in the first place. Thanks Richard From: gpfsug-discuss-bounces at spectrumscale.org [mailto:gpfsug-discuss-bounces at spectrumscale.org] On Behalf Of Christof Schmitt Sent: 07 March 2018 21:54 To: gpfsug-discuss at spectrumscale.org Subject: Re: [gpfsug-discuss] wondering about outage free protocols upgrades The problem with the SMB upgrade is with the data shared between the protocol nodes. It is not tied to the protocol version used between SMB clients and the protocol nodes. Samba stores internal data (e.g. for the SMB state of open files) in tdb database files. ctdb then makes these tdb databases available across all protocol nodes. A concurrent upgrade for SMB would require correct handling of ctdb communications and the tdb records across multiple versions; that is not available today. Regards, Christof Schmitt || IBM || Spectrum Scale Development || Tucson, AZ christof.schmitt at us.ibm.com || +1-520-799-2469 (T/L: 321-2469) ----- Original message ----- From: > Sent by: gpfsug-discuss-bounces at spectrumscale.org To: > Cc: Subject: Re: [gpfsug-discuss] wondering about outage free protocols upgrades Date: Wed, Mar 7, 2018 4:35 AM In theory it only affects SMB, but in practice if NFS depends on winbind for authorisation then it is affected too. I can understand the need for changes to happen every so often and that maybe outages will be required then. But, I would like to see some effort to avoid doing this unnecessarily. IBM, please consider my suggestion. The message I get from the ctdb service implies it is the sticking point. Can some consideration be given to keeping the ctdb version compatible between releases? 
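For anyone wanting to see where their own cluster stands before attempting an upgrade, a few read-only checks of the pieces under discussion are sketched below. The ctdb path is an assumption (the IBM packages normally place the Samba/CTDB tools under /usr/lpp/mmfs/bin); adjust if your installation differs.

# Which gpfs.smb build is installed on each protocol node?
mmdsh -N cesNodes rpm -q gpfs.smb

# Protocol service state as Spectrum Scale sees it
mmces service list -a

# CTDB's own view, run on any one protocol node
/usr/lpp/mmfs/bin/ctdb status
/usr/lpp/mmfs/bin/ctdb version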
Christof, you are saying something about the SMB service version compatibility. I am unclear as to whether you are talking about the Spectrum Scale Protocols SMB service or the default samba SMB over the wire protocol version being used to communicate between client and server. If the latter, is it possible to peg the version to the older version manually while doing the upgrade so that all nodes can be updated? You can then take an outage at a later time to update the over the wire version. From: gpfsug-discuss-bounces at spectrumscale.org [mailto:gpfsug-discuss-bounces at spectrumscale.org] On Behalf Of Christof Schmitt Sent: Wednesday, 7 March 2018 4:50 AM To: gpfsug-discuss at spectrumscale.org Cc: gpfsug-discuss at spectrumscale.org Subject: Re: [gpfsug-discuss] wondering about outage free protocols upgrades Hi, at this point there are no plans to support "node by node" upgrade for SMB. Some background: The technical reason for this restriction is that the records shared between protocol nodes for the SMB service (ctdb and Samba) are not versioned and no mechanism is in place to handle different versions. Changing this would be a large development task that has not been included in any current plans. Note that this only affects the SMB service and that the knowledge center outlines a procedure to minimize the outage, by getting half of the protocol nodes ready with the new Samba version and then only taking a brief outage when switching from the "old" to the "new" Samba version: https://www.ibm.com/support/knowledgecenter/en/STXKQY_4.2.2/com.ibm.spectrum.scale.v4r22.doc/bl1ins_updatingsmb.htm The toolkit follows the same approach during an upgrade to minimize the outage. We know that this is not ideal, but as mentioned above this is limited by the large effort that would be required which has to be weighed against other requirements and priorities. Regards, Christof Schmitt || IBM || Spectrum Scale Development || Tucson, AZ christof.schmitt at us.ibm.com || +1-520-799-2469 (T/L: 321-2469) ----- Original message ----- From: > Sent by: gpfsug-discuss-bounces at spectrumscale.org To: > Cc: Subject: [gpfsug-discuss] wondering about outage free protocols upgrades Date: Tue, Mar 6, 2018 10:19 AM Hi All, It appears a rolling node by node upgrade of a protocols cluster is not possible. Ctdb is the sticking point as it won?t run with 2 different versions at the same time. Are there any plans to address this and make it a real Enterprise product? Cheers, Greg _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org https://urldefense.proofpoint.com/v2/url?u=http-3A__gpfsug.org_mailman_listinfo_gpfsug-2Ddiscuss&d=DwICAg&c=jf_iaSHvJObTbx-siA1ZOg&r=5Nn7eUPeYe291x8f39jKybESLKv_W_XtkTkS8fTR-NI&m=p5fg7X1tKGwi1BsYiw-wHTxmaG-PLihwHV0yTBQNaUs&s=3ZHS5vAoxeC6ikuOpTRLWNTpvgKEC3thI-qUgyU_hYo&e= _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org https://urldefense.proofpoint.com/v2/url?u=http-3A__gpfsug.org_mailman_listinfo_gpfsug-2Ddiscuss&d=DwICAg&c=jf_iaSHvJObTbx-siA1ZOg&r=5Nn7eUPeYe291x8f39jKybESLKv_W_XtkTkS8fTR-NI&m=7bQRypv0JL7swEYobmepaynWFvV8HYtBa2_S1kAyirk&s=2bUeeDk-8VbtwU8KtV4RlENEcbpOr_GQJQvR_8gt-ug&e= -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From john.hearns at asml.com Thu Mar 8 08:29:56 2018 From: john.hearns at asml.com (John Hearns) Date: Thu, 8 Mar 2018 08:29:56 +0000 Subject: [gpfsug-discuss] mmfind -ls and so forth In-Reply-To: References: Message-ID: On the subject of mmfind, I would like to find files which have the misc attributes relevant to AFM. For instance files which have the attribute 'v' The file is newly created, not yet copied to home I can write a policy to do this, and I have a relevant policy written. However I would like to do this using mmfind, which seems a nice general utility. This syntax does not work: mmfind /hpc -eaWithValue MISC_ATTRIBUTES===v Before anyone says it, I am mixing up MISC_ATTRIBUTES and extended attributes! My question really is - has anyone done this sort of serch using mmfind? Thankyou From: gpfsug-discuss-bounces at spectrumscale.org [mailto:gpfsug-discuss-bounces at spectrumscale.org] On Behalf Of Marc A Kaplan Sent: Wednesday, March 07, 2018 8:16 PM To: gpfsug main discussion list Subject: Re: [gpfsug-discuss] mmfind -ls and so forth As always when dealing with computers and potentially long running jobs, run a test on a handful of files first, so you can rapidly debug. Did you read the mmfind.README ? It mentions...that this sample utility "some user assembly required..." ... mmfindUtil_processOutputFile.c A utility to parse the "list file" produced by mmapplypolicy and to print it in a find-compatible format mmfind invokes it once mmapplypolicy begins to populate the "list file" mmfindUtil_processOutputFile.sampleMakefile copy to 'makefile', modify as needed, and run 'make' to compile mmfindUtil_processOutputFile.c This should produce a binary called mmfindUtil_processOutputFile mmfind will not be able to run until this utility has been compiled on the node from which you launch mmfind. Works for me... [root at n2 ilm]# ./mmfind /goo/zdbig -ls 2463649 256 drwxr-xr-x 2 root root 262144 Feb 9 11:41 /goo/zdbig 6804497 0 -rw-r--r-- 1 root root 0 Feb 9 11:41 /goo/zdbig/xy -- The information contained in this communication and any attachments is confidential and may be privileged, and is for the sole use of the intended recipient(s). Any unauthorized review, use, disclosure or distribution is prohibited. Unless explicitly stated otherwise in the body of this communication or the attachment thereto (if any), the information is provided on an AS-IS basis without any express or implied warranties or liabilities. To the extent you are relying on this information, you are doing so at your own risk. If you are not the intended recipient, please notify the sender immediately by replying to this message and destroy all copies of this message and any attachments. Neither the sender nor the company/group of companies he or she represents shall be liable for the proper and complete transmission of the information contained in this communication, or for any delay in its receipt. -------------- next part -------------- An HTML attachment was scrubbed... URL: From marc.caubet at psi.ch Thu Mar 8 11:10:24 2018 From: marc.caubet at psi.ch (Caubet Serrabou Marc (PSI)) Date: Thu, 8 Mar 2018 11:10:24 +0000 Subject: [gpfsug-discuss] GPFS autoload - wait for IB ports to become active Message-ID: <0081EB235765E14395278B9AE1DF341846510A@MBX214.d.ethz.ch> Hi all, with autoload = yes we do not ensure that GPFS will be started after the IB link becomes up. Is there a way to force GPFS waiting to start until IB ports are up? 
This can be probably done by adding something like After=network-online.target and Wants=network-online.target in the systemd file but I would like to know if this is natively possible from the GPFS configuration. Thanks a lot, Marc _________________________________________ Paul Scherrer Institut High Performance Computing Marc Caubet Serrabou WHGA/036 5232 Villigen PSI Switzerland Telephone: +41 56 310 46 67 E-Mail: marc.caubet at psi.ch -------------- next part -------------- An HTML attachment was scrubbed... URL: From david_johnson at brown.edu Thu Mar 8 12:33:41 2018 From: david_johnson at brown.edu (david_johnson at brown.edu) Date: Thu, 8 Mar 2018 07:33:41 -0500 Subject: [gpfsug-discuss] GPFS autoload - wait for IB ports to become active In-Reply-To: <0081EB235765E14395278B9AE1DF341846510A@MBX214.d.ethz.ch> References: <0081EB235765E14395278B9AE1DF341846510A@MBX214.d.ethz.ch> Message-ID: <4AD44D34-5275-4ADB-8CC7-8E80170DDA7F@brown.edu> Until IBM provides a solution, here is my workaround. Add it so it runs before the gpfs script, I call it from our custom xcat diskless boot scripts. Based on rhel7, not fully systemd integrated. YMMV! Regards, ? ddj ??- [ddj at storage041 ~]$ cat /etc/init.d/ibready #! /bin/bash # # chkconfig: 2345 06 94 # /etc/rc.d/init.d/ibready # written in 2016 David D Johnson (ddj brown.edu) # ### BEGIN INIT INFO # Provides: ibready # Required-Start: # Required-Stop: # Default-Stop: # Description: Block until infiniband is ready # Short-Description: Block until infiniband is ready ### END INIT INFO RETVAL=0 if [[ -d /sys/class/infiniband ]] then IBDEVICE=$(dirname $(grep -il infiniband /sys/class/infiniband/*/ports/1/link* | head -n 1)) fi # See how we were called. case "$1" in start) if [[ -n $IBDEVICE && -f $IBDEVICE/state ]] then echo -n "Polling for InfiniBand link up: " for (( count = 60; count > 0; count-- )) do if grep -q ACTIVE $IBDEVICE/state then echo ACTIVE break fi echo -n "." sleep 5 done if (( count <= 0 )) then echo DOWN - $0 timed out fi fi ;; stop|restart|reload|force-reload|condrestart|try-restart) ;; status) if [[ -n $IBDEVICE && -f $IBDEVICE/state ]] then echo "$IBDEVICE is $(< $IBDEVICE/state) $(< $IBDEVICE/rate)" else echo "No IBDEVICE found" fi ;; *) echo "Usage: ibready {start|stop|status|restart|reload|force-reload|condrestart|try-restart}" exit 2 esac exit ${RETVAL} ???? -- ddj Dave Johnson > On Mar 8, 2018, at 6:10 AM, Caubet Serrabou Marc (PSI) wrote: > > Hi all, > > with autoload = yes we do not ensure that GPFS will be started after the IB link becomes up. Is there a way to force GPFS waiting to start until IB ports are up? This can be probably done by adding something like After=network-online.target and Wants=network-online.target in the systemd file but I would like to know if this is natively possible from the GPFS configuration. > > Thanks a lot, > Marc > _________________________________________ > Paul Scherrer Institut > High Performance Computing > Marc Caubet Serrabou > WHGA/036 > 5232 Villigen PSI > Switzerland > > Telephone: +41 56 310 46 67 > E-Mail: marc.caubet at psi.ch > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From stockf at us.ibm.com Thu Mar 8 12:42:47 2018 From: stockf at us.ibm.com (Frederick Stock) Date: Thu, 8 Mar 2018 07:42:47 -0500 Subject: [gpfsug-discuss] GPFS autoload - wait for IB ports to becomeactive In-Reply-To: <4AD44D34-5275-4ADB-8CC7-8E80170DDA7F@brown.edu> References: <0081EB235765E14395278B9AE1DF341846510A@MBX214.d.ethz.ch> <4AD44D34-5275-4ADB-8CC7-8E80170DDA7F@brown.edu> Message-ID: You could also use the GPFS prestartup callback (mmaddcallback) to execute a script synchronously that waits for the IB ports to become available before returning and allowing GPFS to continue. Not systemd integrated but it should work. Fred __________________________________________________ Fred Stock | IBM Pittsburgh Lab | 720-430-8821 stockf at us.ibm.com From: david_johnson at brown.edu To: gpfsug main discussion list Date: 03/08/2018 07:34 AM Subject: Re: [gpfsug-discuss] GPFS autoload - wait for IB ports to become active Sent by: gpfsug-discuss-bounces at spectrumscale.org Until IBM provides a solution, here is my workaround. Add it so it runs before the gpfs script, I call it from our custom xcat diskless boot scripts. Based on rhel7, not fully systemd integrated. YMMV! Regards, ? ddj ??- [ddj at storage041 ~]$ cat /etc/init.d/ibready #! /bin/bash # # chkconfig: 2345 06 94 # /etc/rc.d/init.d/ibready # written in 2016 David D Johnson (ddj brown.edu) # ### BEGIN INIT INFO # Provides: ibready # Required-Start: # Required-Stop: # Default-Stop: # Description: Block until infiniband is ready # Short-Description: Block until infiniband is ready ### END INIT INFO RETVAL=0 if [[ -d /sys/class/infiniband ]] then IBDEVICE=$(dirname $(grep -il infiniband /sys/class/infiniband/*/ports/1/link* | head -n 1)) fi # See how we were called. case "$1" in start) if [[ -n $IBDEVICE && -f $IBDEVICE/state ]] then echo -n "Polling for InfiniBand link up: " for (( count = 60; count > 0; count-- )) do if grep -q ACTIVE $IBDEVICE/state then echo ACTIVE break fi echo -n "." sleep 5 done if (( count <= 0 )) then echo DOWN - $0 timed out fi fi ;; stop|restart|reload|force-reload|condrestart|try-restart) ;; status) if [[ -n $IBDEVICE && -f $IBDEVICE/state ]] then echo "$IBDEVICE is $(< $IBDEVICE/state) $(< $IBDEVICE/rate)" else echo "No IBDEVICE found" fi ;; *) echo "Usage: ibready {start|stop|status|restart|reload|force-reload|condrestart|try-restart}" exit 2 esac exit ${RETVAL} ???? -- ddj Dave Johnson On Mar 8, 2018, at 6:10 AM, Caubet Serrabou Marc (PSI) wrote: Hi all, with autoload = yes we do not ensure that GPFS will be started after the IB link becomes up. Is there a way to force GPFS waiting to start until IB ports are up? This can be probably done by adding something like After=network-online.target and Wants=network-online.target in the systemd file but I would like to know if this is natively possible from the GPFS configuration. 
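A sketch of the callback approach Fred mentions: a small wait script registered on the preStartup event so that daemon startup blocks until an InfiniBand port reports ACTIVE. The script path, retry count and timeout below are arbitrary choices, and the polling loop is essentially the one from the ibready init script earlier in this thread.

cat > /usr/local/sbin/wait-for-ib <<'EOF'
#!/bin/bash
# Wait (up to ~5 minutes) for any InfiniBand port to report ACTIVE.
for i in $(seq 60); do
    grep -q ACTIVE /sys/class/infiniband/*/ports/*/state 2>/dev/null && exit 0
    sleep 5
done
echo "wait-for-ib: no ACTIVE IB port found" >&2
exit 1
EOF
chmod +x /usr/local/sbin/wait-for-ib

# Register it so GPFS waits for the script to return before continuing startup.
mmaddcallback ibWait --command /usr/local/sbin/wait-for-ib \
    --event preStartup --sync --timeout 330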
Thanks a lot, Marc _________________________________________ Paul Scherrer Institut High Performance Computing Marc Caubet Serrabou WHGA/036 5232 Villigen PSI Switzerland Telephone: +41 56 310 46 67 E-Mail: marc.caubet at psi.ch _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org https://urldefense.proofpoint.com/v2/url?u=http-3A__gpfsug.org_mailman_listinfo_gpfsug-2Ddiscuss&d=DwICAg&c=jf_iaSHvJObTbx-siA1ZOg&r=p_1XEUyoJ7-VJxF_w8h9gJh8_Wj0Pey73LCLLoxodpw&m=u-EMob09-dkE6jZbD3dTjBi3vWhmDXtxiOK3nqFyIgY&s=JCfJgq6pZnKUI6d-rIgJXVcdZh7vmA5ypB1_goP_FFA&e= -------------- next part -------------- An HTML attachment was scrubbed... URL: From makaplan at us.ibm.com Thu Mar 8 13:59:27 2018 From: makaplan at us.ibm.com (Marc A Kaplan) Date: Thu, 8 Mar 2018 08:59:27 -0500 Subject: [gpfsug-discuss] mmfind -ls and so forth In-Reply-To: References: Message-ID: (John Hearns, et. al.) Some minor script hacking would be the easiest way add test(s) for other MISC_ATTRIBUTES Notice mmfind concentrates on providing the most popular classic(POSIX) and Linux predicates, BUT also adds a few gpfs specific predicates (mmfind --help show you these) -ea -eaWithValue -gpfsImmut -gpfsAppOnly Look at the implementation of -gpfsImmut in tr_findToPol.pl ... sub tr_gpfsImmut{ return "( /* -gpfsImmut */ MISC_ATTRIBUTES LIKE '%X%')"; } So easy to extend this for any or all the others.... True it's perl, but you don't have to be a perl expert to cut-paste-hack another predicate into the script. Let us know how you make out with this... Perhaps we shall add a general predicate -gpfsMiscAttrLike '...' to the next version... -- Marc K of GPFS From: John Hearns To: gpfsug main discussion list Date: 03/08/2018 04:59 AM Subject: Re: [gpfsug-discuss] mmfind -ls and so forth Sent by: gpfsug-discuss-bounces at spectrumscale.org On the subject of mmfind, I would like to find files which have the misc attributes relevant to AFM. For instance files which have the attribute ?v? The file is newly created, not yet copied to home I can write a policy to do this, and I have a relevant policy written. However I would like to do this using mmfind, which seems a nice general utility. This syntax does not work: mmfind /hpc -eaWithValue MISC_ATTRIBUTES===v Before anyone says it, I am mixing up MISC_ATTRIBUTES and extended attributes! My question really is ? has anyone done this sort of serch using mmfind? Thankyou From: gpfsug-discuss-bounces at spectrumscale.org [ mailto:gpfsug-discuss-bounces at spectrumscale.org] On Behalf Of Marc A Kaplan Sent: Wednesday, March 07, 2018 8:16 PM To: gpfsug main discussion list Subject: Re: [gpfsug-discuss] mmfind -ls and so forth As always when dealing with computers and potentially long running jobs, run a test on a handful of files first, so you can rapidly debug. Did you read the mmfind.README ? It mentions...that this sample utility "some user assembly required..." ... 
mmfindUtil_processOutputFile.c A utility to parse the "list file" produced by mmapplypolicy and to print it in a find-compatible format mmfind invokes it once mmapplypolicy begins to populate the "list file" mmfindUtil_processOutputFile.sampleMakefile copy to 'makefile', modify as needed, and run 'make' to compile mmfindUtil_processOutputFile.c This should produce a binary called mmfindUtil_processOutputFile mmfind will not be able to run until this utility has been compiled on the node from which you launch mmfind. Works for me... [root at n2 ilm]# ./mmfind /goo/zdbig -ls 2463649 256 drwxr-xr-x 2 root root 262144 Feb 9 11:41 /goo/zdbig 6804497 0 -rw-r--r-- 1 root root 0 Feb 9 11:41 /goo/zdbig/xy -- The information contained in this communication and any attachments is confidential and may be privileged, and is for the sole use of the intended recipient(s). Any unauthorized review, use, disclosure or distribution is prohibited. Unless explicitly stated otherwise in the body of this communication or the attachment thereto (if any), the information is provided on an AS-IS basis without any express or implied warranties or liabilities. To the extent you are relying on this information, you are doing so at your own risk. If you are not the intended recipient, please notify the sender immediately by replying to this message and destroy all copies of this message and any attachments. Neither the sender nor the company/group of companies he or she represents shall be liable for the proper and complete transmission of the information contained in this communication, or for any delay in its receipt. _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org https://urldefense.proofpoint.com/v2/url?u=http-3A__gpfsug.org_mailman_listinfo_gpfsug-2Ddiscuss&d=DwICAg&c=jf_iaSHvJObTbx-siA1ZOg&r=cvpnBBH0j41aQy0RPiG2xRL_M8mTc1izuQD3_PmtjZ8&m=LDC-t-w-jkuH2fJZ1lME_JUjzABDz3y90ptTlYWM3rc&s=xrFd1LD5dWq9GogfeOGs9ZCtqoptErjmGfJzD3eXhz4&e= -------------- next part -------------- An HTML attachment was scrubbed... URL: From Kevin.Buterbaugh at Vanderbilt.Edu Thu Mar 8 15:16:10 2018 From: Kevin.Buterbaugh at Vanderbilt.Edu (Buterbaugh, Kevin L) Date: Thu, 8 Mar 2018 15:16:10 +0000 Subject: [gpfsug-discuss] mmfind -ls and so forth In-Reply-To: References: Message-ID: <8D4EED0B-A9F8-46FB-8BA2-359A3CF1C630@vanderbilt.edu> Hi Marc, I test in production ? just kidding. But - not kidding - I did read the entire mmfind.README, compiled the binary as described therein, and read the output of ?mmfind -h?. But what I forgot was that when you run a bash shell script with ?bash -x? it doesn?t show you the redirection you did to a file ? and since the mmfind ran for ~5 days, including over a weekend, and including Monday which I took off from work to have our 16 1/2 year old Siberian Husky put to sleep, I simply forgot that in the script itself I had redirected the output to a file. Stupid of me, I know, but unlike Delusional Donald, I?ll admit my stupid mistakes. Thanks, and sorry. I will try the mmfind as you suggested in your previous response the next time I need to run one to see if that significantly improves the performance? Kevin On Mar 7, 2018, at 1:15 PM, Marc A Kaplan > wrote: As always when dealing with computers and potentially long running jobs, run a test on a handful of files first, so you can rapidly debug. Did you read the mmfind.README ? It mentions...that this sample utility "some user assembly required..." ... 
mmfindUtil_processOutputFile.c A utility to parse the "list file" produced by mmapplypolicy and to print it in a find-compatible format mmfind invokes it once mmapplypolicy begins to populate the "list file" mmfindUtil_processOutputFile.sampleMakefile copy to 'makefile', modify as needed, and run 'make' to compile mmfindUtil_processOutputFile.c This should produce a binary called mmfindUtil_processOutputFile mmfind will not be able to run until this utility has been compiled on the node from which you launch mmfind. Works for me... [root at n2 ilm]# ./mmfind /goo/zdbig -ls 2463649 256 drwxr-xr-x 2 root root 262144 Feb 9 11:41 /goo/zdbig 6804497 0 -rw-r--r-- 1 root root 0 Feb 9 11:41 /goo/zdbig/xy _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org https://na01.safelinks.protection.outlook.com/?url=http%3A%2F%2Fgpfsug.org%2Fmailman%2Flistinfo%2Fgpfsug-discuss&data=02%7C01%7CKevin.Buterbaugh%40vanderbilt.edu%7C7c170869f3294124be3608d5845fdecc%7Cba5a7f39e3be4ab3b45067fa80faecad%7C0%7C0%7C636560469687764985&sdata=yNvpm34DY0AtEm2Y4OIMll5IW1v5kP3X3vHx3sQ%2B8Rs%3D&reserved=0 -------------- next part -------------- An HTML attachment was scrubbed... URL: From oluwasijibomi.saula at ndsu.edu Thu Mar 8 15:06:03 2018 From: oluwasijibomi.saula at ndsu.edu (Saula, Oluwasijibomi) Date: Thu, 8 Mar 2018 15:06:03 +0000 Subject: [gpfsug-discuss] Thoughts on GPFS on IB & MTU sizes Message-ID: Hi Folks, As this is my first post to the group, let me start by saying I applaud the commentary from the user group as it has been a resource to those of us watching from the sidelines. That said, we have a GPFS layered on IPoIB, and recently, we started having some issues on our IB FDR fabric which manifested when GPFS began sending persistent expel messages to particular nodes. Shortly after, we embarked on a tuning exercise using IBM tuning recommendations but this page is quite old and we've run into some snags, specifically with setting 4k MTUs using mlx4_core/mlx4_en module options. While setting 4k MTUs as the guide recommends is our general inclination, I'd like to solicit some advice as to whether 4k MTUs are a good idea and any hitch-free steps to accomplishing this. I'm getting some conflicting remarks from Mellanox support asking why we'd want to use 4k MTUs with Unreliable Datagram mode. Also, any pointers to best practices or resources for network configurations for heavy I/O clusters would be much appreciated. Thanks, Siji Saula HPC System Administrator Center for Computationally Assisted Science & Technology NORTH DAKOTA STATE UNIVERSITY Research 2 Building ? Room 220B Dept 4100, PO Box 6050 / Fargo, ND 58108-6050 p:701.231.7749 www.ccast.ndsu.edu | www.ndsu.edu -------------- next part -------------- An HTML attachment was scrubbed... URL: From christof.schmitt at us.ibm.com Thu Mar 8 17:37:12 2018 From: christof.schmitt at us.ibm.com (Christof Schmitt) Date: Thu, 8 Mar 2018 17:37:12 +0000 Subject: [gpfsug-discuss] wondering about outage free protocols upgrades In-Reply-To: Message-ID: An HTML attachment was scrubbed... URL: From Wei1.Guo at UTSouthwestern.edu Thu Mar 8 21:50:11 2018 From: Wei1.Guo at UTSouthwestern.edu (Wei Guo) Date: Thu, 8 Mar 2018 21:50:11 +0000 Subject: [gpfsug-discuss] Thoughts on GPFS on IB & MTU sizes In-Reply-To: References: Message-ID: <1520545811808.33125@UTSouthwestern.edu> Hi, Saula, Can the expelled node and expelling node ping each other? 
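Not an answer to the 4k question, but a few quick read-only checks that help when chasing this kind of expel problem; they cover both the IPoIB MTU/mode settings being discussed and the netmask mismatch Wei goes on to describe. The interface name ib0 is an assumption.

# IPoIB transport mode and IP MTU on this node
cat /sys/class/net/ib0/mode     # "datagram" (UD) or "connected"
cat /sys/class/net/ib0/mtu      # 2044 or 4092 in datagram mode (IB MTU minus 4)

# MTU the HCA actually negotiated on the fabric
ibv_devinfo | grep -E 'max_mtu|active_mtu'

# Address and prefix length; catches the /24 vs /20 mismatch case
ip -o -4 addr show ib0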
We expanded our gpfs IB network from /24 to /20 but some clients still used /24, they cannot talk to the added new clients using /20 and expelled the new clients persistently. Changing the netmask all to /20 works out. FYI. Wei Guo HPC Administartor UT Southwestern Medical Center wei1.guo at utsouthwestern.edu ________________________________________ From: gpfsug-discuss-bounces at spectrumscale.org on behalf of gpfsug-discuss-request at spectrumscale.org Sent: Thursday, March 8, 2018 11:37 AM To: gpfsug-discuss at spectrumscale.org Subject: gpfsug-discuss Digest, Vol 74, Issue 17 Send gpfsug-discuss mailing list submissions to gpfsug-discuss at spectrumscale.org To subscribe or unsubscribe via the World Wide Web, visit http://gpfsug.org/mailman/listinfo/gpfsug-discuss or, via email, send a message with subject or body 'help' to gpfsug-discuss-request at spectrumscale.org You can reach the person managing the list at gpfsug-discuss-owner at spectrumscale.org When replying, please edit your Subject line so it is more specific than "Re: Contents of gpfsug-discuss digest..." Today's Topics: 1. Thoughts on GPFS on IB & MTU sizes (Saula, Oluwasijibomi) 2. Re: wondering about outage free protocols upgrades (Christof Schmitt) ---------------------------------------------------------------------- Message: 1 Date: Thu, 8 Mar 2018 15:06:03 +0000 From: "Saula, Oluwasijibomi" To: "gpfsug-discuss at spectrumscale.org" Subject: [gpfsug-discuss] Thoughts on GPFS on IB & MTU sizes Message-ID: Content-Type: text/plain; charset="windows-1252" Hi Folks, As this is my first post to the group, let me start by saying I applaud the commentary from the user group as it has been a resource to those of us watching from the sidelines. That said, we have a GPFS layered on IPoIB, and recently, we started having some issues on our IB FDR fabric which manifested when GPFS began sending persistent expel messages to particular nodes. Shortly after, we embarked on a tuning exercise using IBM tuning recommendations but this page is quite old and we've run into some snags, specifically with setting 4k MTUs using mlx4_core/mlx4_en module options. While setting 4k MTUs as the guide recommends is our general inclination, I'd like to solicit some advice as to whether 4k MTUs are a good idea and any hitch-free steps to accomplishing this. I'm getting some conflicting remarks from Mellanox support asking why we'd want to use 4k MTUs with Unreliable Datagram mode. Also, any pointers to best practices or resources for network configurations for heavy I/O clusters would be much appreciated. Thanks, Siji Saula HPC System Administrator Center for Computationally Assisted Science & Technology NORTH DAKOTA STATE UNIVERSITY Research 2 Building ? Room 220B Dept 4100, PO Box 6050 / Fargo, ND 58108-6050 p:701.231.7749 www.ccast.ndsu.edu | www.ndsu.edu -------------- next part -------------- An HTML attachment was scrubbed... URL: ------------------------------ Message: 2 Date: Thu, 8 Mar 2018 17:37:12 +0000 From: "Christof Schmitt" To: gpfsug-discuss at spectrumscale.org Subject: Re: [gpfsug-discuss] wondering about outage free protocols upgrades Message-ID: Content-Type: text/plain; charset="us-ascii" An HTML attachment was scrubbed... 
URL: ------------------------------ _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss End of gpfsug-discuss Digest, Vol 74, Issue 17 ********************************************** ________________________________ UT Southwestern Medical Center The future of medicine, today. From Greg.Lehmann at csiro.au Fri Mar 9 00:23:10 2018 From: Greg.Lehmann at csiro.au (Greg.Lehmann at csiro.au) Date: Fri, 9 Mar 2018 00:23:10 +0000 Subject: [gpfsug-discuss] wondering about outage free protocols upgrades In-Reply-To: References: <766f8a8ce5f84164af13f2361ef3b3c3@exch1-cdc.nexus.csiro.au>, <5912dbafbb044700ba723195a0e5e2f9@exch1-cdc.nexus.csiro.au> Message-ID: <2b7547fd8aec467a958d8e10e88bd1e6@exch1-cdc.nexus.csiro.au> That last little bit ?not available today? gives me hope. It would be nice to get there ?one day.? Our situation is we are using NFS for access to images that VMs run from. An outage means shutting down a lot of guests. An NFS outage of even short duration would result in the system disks of VMs going read only due to IO timeouts. Cheers, Greg From: gpfsug-discuss-bounces at spectrumscale.org [mailto:gpfsug-discuss-bounces at spectrumscale.org] On Behalf Of Christof Schmitt Sent: Thursday, 8 March 2018 7:54 AM To: gpfsug-discuss at spectrumscale.org Subject: Re: [gpfsug-discuss] wondering about outage free protocols upgrades The problem with the SMB upgrade is with the data shared between the protocol nodes. It is not tied to the protocol version used between SMB clients and the protocol nodes. Samba stores internal data (e.g. for the SMB state of open files) in tdb database files. ctdb then makes these tdb databases available across all protocol nodes. A concurrent upgrade for SMB would require correct handling of ctdb communications and the tdb records across multiple versions; that is not available today. Regards, Christof Schmitt || IBM || Spectrum Scale Development || Tucson, AZ christof.schmitt at us.ibm.com || +1-520-799-2469 (T/L: 321-2469) ----- Original message ----- From: > Sent by: gpfsug-discuss-bounces at spectrumscale.org To: > Cc: Subject: Re: [gpfsug-discuss] wondering about outage free protocols upgrades Date: Wed, Mar 7, 2018 4:35 AM In theory it only affects SMB, but in practice if NFS depends on winbind for authorisation then it is affected too. I can understand the need for changes to happen every so often and that maybe outages will be required then. But, I would like to see some effort to avoid doing this unnecessarily. IBM, please consider my suggestion. The message I get from the ctdb service implies it is the sticking point. Can some consideration be given to keeping the ctdb version compatible between releases? Christof, you are saying something about the SMB service version compatibility. I am unclear as to whether you are talking about the Spectrum Scale Protocols SMB service or the default samba SMB over the wire protocol version being used to communicate between client and server. If the latter, is it possible to peg the version to the older version manually while doing the upgrade so that all nodes can be updated? You can then take an outage at a later time to update the over the wire version. 
From: gpfsug-discuss-bounces at spectrumscale.org [mailto:gpfsug-discuss-bounces at spectrumscale.org] On Behalf Of Christof Schmitt Sent: Wednesday, 7 March 2018 4:50 AM To: gpfsug-discuss at spectrumscale.org Cc: gpfsug-discuss at spectrumscale.org Subject: Re: [gpfsug-discuss] wondering about outage free protocols upgrades Hi, at this point there are no plans to support "node by node" upgrade for SMB. Some background: The technical reason for this restriction is that the records shared between protocol nodes for the SMB service (ctdb and Samba) are not versioned and no mechanism is in place to handle different versions. Changing this would be a large development task that has not been included in any current plans. Note that this only affects the SMB service and that the knowledge center outlines a procedure to minimize the outage, by getting half of the protocol nodes ready with the new Samba version and then only taking a brief outage when switching from the "old" to the "new" Samba version: https://www.ibm.com/support/knowledgecenter/en/STXKQY_4.2.2/com.ibm.spectrum.scale.v4r22.doc/bl1ins_updatingsmb.htm The toolkit follows the same approach during an upgrade to minimize the outage. We know that this is not ideal, but as mentioned above this is limited by the large effort that would be required which has to be weighed against other requirements and priorities. Regards, Christof Schmitt || IBM || Spectrum Scale Development || Tucson, AZ christof.schmitt at us.ibm.com || +1-520-799-2469 (T/L: 321-2469) ----- Original message ----- From: > Sent by: gpfsug-discuss-bounces at spectrumscale.org To: > Cc: Subject: [gpfsug-discuss] wondering about outage free protocols upgrades Date: Tue, Mar 6, 2018 10:19 AM Hi All, It appears a rolling node by node upgrade of a protocols cluster is not possible. Ctdb is the sticking point as it won?t run with 2 different versions at the same time. Are there any plans to address this and make it a real Enterprise product? Cheers, Greg _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org https://urldefense.proofpoint.com/v2/url?u=http-3A__gpfsug.org_mailman_listinfo_gpfsug-2Ddiscuss&d=DwICAg&c=jf_iaSHvJObTbx-siA1ZOg&r=5Nn7eUPeYe291x8f39jKybESLKv_W_XtkTkS8fTR-NI&m=p5fg7X1tKGwi1BsYiw-wHTxmaG-PLihwHV0yTBQNaUs&s=3ZHS5vAoxeC6ikuOpTRLWNTpvgKEC3thI-qUgyU_hYo&e= _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org https://urldefense.proofpoint.com/v2/url?u=http-3A__gpfsug.org_mailman_listinfo_gpfsug-2Ddiscuss&d=DwICAg&c=jf_iaSHvJObTbx-siA1ZOg&r=5Nn7eUPeYe291x8f39jKybESLKv_W_XtkTkS8fTR-NI&m=7bQRypv0JL7swEYobmepaynWFvV8HYtBa2_S1kAyirk&s=2bUeeDk-8VbtwU8KtV4RlENEcbpOr_GQJQvR_8gt-ug&e= -------------- next part -------------- An HTML attachment was scrubbed... URL: From willi.engeli at id.ethz.ch Fri Mar 9 12:21:27 2018 From: willi.engeli at id.ethz.ch (Engeli Willi (ID SD)) Date: Fri, 9 Mar 2018 12:21:27 +0000 Subject: [gpfsug-discuss] asking for your vote for an RFE to support NFS V4.1 Message-ID: Hello Group, I?ve just created a request for enhancement (RFE) to have ganesha supporting NFS V4.1. It is important, to have this new Protocol version supported, since our Linux clients default support is more then 80% based in this version by default and Linux distributions are actively pushing this Protocol. The protocol also brings important corrections and enhancements with it. I would like to ask you all very kindly to vote for this RFE please. 
You find it here: https://www.ibm.com/developerworks/rfe/execute Headline:NFS V4.1 Support ID:117398 Freundliche Grüsse Willi Engeli ETH Zuerich ID Speicherdienste Weinbergstrasse 11 WEC C 18 8092 Zuerich Tel: +41 44 632 02 69 -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: smime.p7s Type: application/pkcs7-signature Size: 5461 bytes Desc: not available URL: From jonathan.buzzard at strath.ac.uk Fri Mar 9 12:37:22 2018 From: jonathan.buzzard at strath.ac.uk (Jonathan Buzzard) Date: Fri, 09 Mar 2018 12:37:22 +0000 Subject: [gpfsug-discuss] wondering about outage free protocols upgrades In-Reply-To: References: <766f8a8ce5f84164af13f2361ef3b3c3@exch1-cdc.nexus.csiro.au> , <5912dbafbb044700ba723195a0e5e2f9@exch1-cdc.nexus.csiro.au> Message-ID: <1520599042.1554.1.camel@strath.ac.uk> On Thu, 2018-03-08 at 09:41 +0000, Sobey, Richard A wrote: > Whether or not you meant it your words "that is not available today." > Implies that something is coming in the future? Would you be reliant > on the Samba/CTDB development team or would you roll your own.. > supposing it's possible in the first place. > Back in the day when one had to roll your own Samba for this stuff, rolling Samba upgrades worked. What changed or was it never supported? JAB. -- Jonathan A. Buzzard Tel: +44141-5483420 HPC System Administrator, ARCHIE-WeSt. University of Strathclyde, John Anderson Building, Glasgow. G4 0NG From stijn.deweirdt at ugent.be Fri Mar 9 12:42:50 2018 From: stijn.deweirdt at ugent.be (Stijn De Weirdt) Date: Fri, 9 Mar 2018 13:42:50 +0100 Subject: [gpfsug-discuss] asking for your vote for an RFE to support NFS V4.1 In-Reply-To: References: Message-ID: <8e5684b5-8981-0710-c5ca-693f6ad56890@ugent.be> hi all, i would second this request to upvote this. the fact that 4.1 support was dropped in a subsubminor update (4.2.3.5 to 4.2.3.6 afaik) was already pretty bad to discover, but at the very least there should be an option to reenable it. i'm also interested why this was removed (or actively prevented from being enabled). i can understand that eg pnfs is not supported, but basic protocol features wrt HA are a must have. only with 4.1 are we able to do ces+ganesha failover without IO error, something that should be a basic feature nowadays. stijn On 03/09/2018 01:21 PM, Engeli Willi (ID SD) wrote: > Hello Group, > > I've just created a request for enhancement (RFE) to have ganesha supporting > NFS V4.1. > > It is important, to have this new Protocol version supported, since our > Linux clients default support is more then 80% based in this version by > default and Linux distributions are actively pushing this Protocol. > > The protocol also brings important corrections and enhancements with it. > > > > I would like to ask you all very kindly to vote for this RFE please.
> > You find it here: https://www.ibm.com/developerworks/rfe/execute > > Headline:NFS V4.1 Support > > ID:117398 > > > > > > Freundliche Gr?sse > > > > Willi Engeli > > ETH Zuerich > > ID Speicherdienste > > Weinbergstrasse 11 > > WEC C 18 > > 8092 Zuerich > > > > Tel: +41 44 632 02 69 > > > > > > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > From Marcelo.Garcia at EMEA.NEC.COM Fri Mar 9 12:51:22 2018 From: Marcelo.Garcia at EMEA.NEC.COM (Marcelo Garcia) Date: Fri, 9 Mar 2018 12:51:22 +0000 Subject: [gpfsug-discuss] asking for your vote for an RFE to support NFS V4.1 In-Reply-To: References: Message-ID: Hi I got the following error when trying the URL below: {e: 'Exception usecase string is null'} Regards mg. From: gpfsug-discuss-bounces at spectrumscale.org [mailto:gpfsug-discuss-bounces at spectrumscale.org] On Behalf Of Engeli Willi (ID SD) Sent: Freitag, 9. M?rz 2018 13:21 To: gpfsug-discuss at spectrumscale.org Subject: [gpfsug-discuss] asking for your vote for an RFE to support NFS V4.1 Hello Group, I've just created a request for enhancement (RFE) to have ganesha supporting NFS V4.1. It is important, to have this new Protocol version supported, since our Linux clients default support is more then 80% based in this version by default and Linux distributions are actively pushing this Protocol. The protocol also brings important corrections and enhancements with it. I would like to ask you all very kindly to vote for this RFE please. You find it here: https://www.ibm.com/developerworks/rfe/execute Headline:NFS V4.1 Support ID:117398 Freundliche Gr?sse Willi Engeli ETH Zuerich ID Speicherdienste Weinbergstrasse 11 WEC C 18 8092 Zuerich Tel: +41 44 632 02 69 -------------- next part -------------- An HTML attachment was scrubbed... URL: From stijn.deweirdt at ugent.be Fri Mar 9 14:09:59 2018 From: stijn.deweirdt at ugent.be (Stijn De Weirdt) Date: Fri, 9 Mar 2018 15:09:59 +0100 Subject: [gpfsug-discuss] asking for your vote for an RFE to support NFS V4.1 In-Reply-To: References: Message-ID: <9091b9ef-b734-bfe0-fa19-57559b5866cf@ugent.be> hi marcelo, can you try https://www.ibm.com/developerworks/rfe/execute?use_case=viewRfe&CR_ID=117398 stijn On 03/09/2018 01:51 PM, Marcelo Garcia wrote: > Hi > > I got the following error when trying the URL below: > {e: 'Exception usecase string is null'} > > Regards > > mg. > > > From: gpfsug-discuss-bounces at spectrumscale.org [mailto:gpfsug-discuss-bounces at spectrumscale.org] On Behalf Of Engeli Willi (ID SD) > Sent: Freitag, 9. M?rz 2018 13:21 > To: gpfsug-discuss at spectrumscale.org > Subject: [gpfsug-discuss] asking for your vote for an RFE to support NFS V4.1 > > Hello Group, > I've just created a request for enhancement (RFE) to have ganesha supporting NFS V4.1. > It is important, to have this new Protocol version supported, since our Linux clients default support is more then 80% based in this version by default and Linux distributions are actively pushing this Protocol. > The protocol also brings important corrections and enhancements with it. > > I would like to ask you all very kindly to vote for this RFE please. 
> You find it here: https://www.ibm.com/developerworks/rfe/execute > > Headline:NFS V4.1 Support > > ID:117398 > > > Freundliche Gr?sse > > Willi Engeli > ETH Zuerich > ID Speicherdienste > Weinbergstrasse 11 > WEC C 18 > 8092 Zuerich > > Tel: +41 44 632 02 69 > > > > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > From Marcelo.Garcia at EMEA.NEC.COM Fri Mar 9 14:11:35 2018 From: Marcelo.Garcia at EMEA.NEC.COM (Marcelo Garcia) Date: Fri, 9 Mar 2018 14:11:35 +0000 Subject: [gpfsug-discuss] asking for your vote for an RFE to support NFS V4.1 In-Reply-To: <9091b9ef-b734-bfe0-fa19-57559b5866cf@ugent.be> References: <9091b9ef-b734-bfe0-fa19-57559b5866cf@ugent.be> Message-ID: Hi stijn Now it's working. Cheers m. -----Original Message----- From: gpfsug-discuss-bounces at spectrumscale.org [mailto:gpfsug-discuss-bounces at spectrumscale.org] On Behalf Of Stijn De Weirdt Sent: Freitag, 9. M?rz 2018 15:10 To: gpfsug-discuss at spectrumscale.org Subject: Re: [gpfsug-discuss] asking for your vote for an RFE to support NFS V4.1 hi marcelo, can you try https://www.ibm.com/developerworks/rfe/execute?use_case=viewRfe&CR_ID=117398 stijn On 03/09/2018 01:51 PM, Marcelo Garcia wrote: > Hi > > I got the following error when trying the URL below: > {e: 'Exception usecase string is null'} > > Regards > > mg. > > > From: gpfsug-discuss-bounces at spectrumscale.org [mailto:gpfsug-discuss-bounces at spectrumscale.org] On Behalf Of Engeli Willi (ID SD) > Sent: Freitag, 9. M?rz 2018 13:21 > To: gpfsug-discuss at spectrumscale.org > Subject: [gpfsug-discuss] asking for your vote for an RFE to support NFS V4.1 > > Hello Group, > I've just created a request for enhancement (RFE) to have ganesha supporting NFS V4.1. > It is important, to have this new Protocol version supported, since our Linux clients default support is more then 80% based in this version by default and Linux distributions are actively pushing this Protocol. > The protocol also brings important corrections and enhancements with it. > > I would like to ask you all very kindly to vote for this RFE please. > You find it here: https://www.ibm.com/developerworks/rfe/execute > > Headline:NFS V4.1 Support > > ID:117398 > > > Freundliche Gr?sse > > Willi Engeli > ETH Zuerich > ID Speicherdienste > Weinbergstrasse 11 > WEC C 18 > 8092 Zuerich > > Tel: +41 44 632 02 69 > > > > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss Click https://www.mailcontrol.com/sr/NavEVlEkpX3GX2PQPOmvUqrlA1!9RTN2ec8I4RU35plgh6Q4vQM4vfVPrCpIvwaSEkP!v72X8H9IWrzEXY2ZCw== to report this email as spam. From ewahl at osc.edu Fri Mar 9 14:19:10 2018 From: ewahl at osc.edu (Edward Wahl) Date: Fri, 9 Mar 2018 09:19:10 -0500 Subject: [gpfsug-discuss] Thoughts on GPFS on IB & MTU sizes In-Reply-To: References: Message-ID: <20180309091910.0334604a@osc.edu> Welcome to the list. If Hideaki Kikuchi is still around CCAST, say "Oh hisashiburi, des ne?" for me. Though I recall he may have left. A couple of questions as I, unfortunately, have a good deal of expel experience. -Are you set up to use verbs or only IPoIB? "mmlsconfig verbsRdma" -Are you using the IB as the administrative IP network? 
-As Wei asked, can nodes sending the expel requests ping the victim over whatever interface is being used administratively? Other interfaces do NOT matter for expels. Nodes that cannot even mount the file systems can still request expels. Many many things can cause issues here from routing and firewalls to bad switch software which will not update ARP tables, and you get nodes trying to expel each other. -are your NSDs logging the expels in /tmp/mmfs? You can mmchconfig expelDataCollectionDailyLimit if you need more captures to narrow down what is happening outside the mmfs.log.latest. Just be wary of the disk space if you have "expel storms". -That tuning page is very out of date and appears to be mostly focused on GPFS 3.5.x tuning. While there is also a Spectrum Scale wiki, it's Linux tuning page does not appear to be kernel and network focused and is dated even older. Ed On Thu, 8 Mar 2018 15:06:03 +0000 "Saula, Oluwasijibomi" wrote: > Hi Folks, > > > As this is my first post to the group, let me start by saying I applaud the > commentary from the user group as it has been a resource to those of us > watching from the sidelines. > > > That said, we have a GPFS layered on IPoIB, and recently, we started having > some issues on our IB FDR fabric which manifested when GPFS began sending > persistent expel messages to particular nodes. > > > Shortly after, we embarked on a tuning exercise using IBM tuning > recommendations > but this page is quite old and we've run into some snags, specifically with > setting 4k MTUs using mlx4_core/mlx4_en module options. > > > While setting 4k MTUs as the guide recommends is our general inclination, I'd > like to solicit some advice as to whether 4k MTUs are a good idea and any > hitch-free steps to accomplishing this. I'm getting some conflicting remarks > from Mellanox support asking why we'd want to use 4k MTUs with Unreliable > Datagram mode. > > > Also, any pointers to best practices or resources for network configurations > for heavy I/O clusters would be much appreciated. > > > Thanks, > > Siji Saula > HPC System Administrator > Center for Computationally Assisted Science & Technology > NORTH DAKOTA STATE UNIVERSITY > > > Research 2 > Building > ? Room 220B Dept 4100, PO Box 6050 / Fargo, ND 58108-6050 p:701.231.7749 > www.ccast.ndsu.edu | > www.ndsu.edu > -- Ed Wahl Ohio Supercomputer Center 614-292-9302 From christof.schmitt at us.ibm.com Fri Mar 9 16:16:41 2018 From: christof.schmitt at us.ibm.com (Christof Schmitt) Date: Fri, 9 Mar 2018 16:16:41 +0000 Subject: [gpfsug-discuss] wondering about outage free protocols upgrades In-Reply-To: <1520599042.1554.1.camel@strath.ac.uk> References: <1520599042.1554.1.camel@strath.ac.uk>, <766f8a8ce5f84164af13f2361ef3b3c3@exch1-cdc.nexus.csiro.au>, <5912dbafbb044700ba723195a0e5e2f9@exch1-cdc.nexus.csiro.au> Message-ID: An HTML attachment was scrubbed... URL: From oluwasijibomi.saula at ndsu.edu Sat Mar 10 14:29:33 2018 From: oluwasijibomi.saula at ndsu.edu (Saula, Oluwasijibomi) Date: Sat, 10 Mar 2018 14:29:33 +0000 Subject: [gpfsug-discuss] Thoughts on GPFS on IB & MTU sizes In-Reply-To: <20180309091910.0334604a@osc.edu> References: , <20180309091910.0334604a@osc.edu> Message-ID: Wei - So the expelled node could ping the rest of the cluster just fine. In fact, after adding this new node to the cluster I could traverse the filesystem for simple lookups, however, heavy data moves in or out of the filesystem seemed to trigger the expel messages to the new node. 
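Since mixed IPoIB modes and MTUs keep coming up in this thread, a quick way to audit what each node is actually running — only a sketch; the interface name ib0 and the mmdsh node class are assumptions:

# "datagram" mode caps the IPoIB MTU at the IB MTU minus headers (2044 or 4092 bytes);
# "connected" mode is what allows the 65520 value
cat /sys/class/net/ib0/mode
ip -o link show ib0 | grep -o 'mtu [0-9]*'

# IB-level MTU the HCA port negotiated with the fabric
ibv_devinfo | grep -E 'active_mtu|max_mtu'

# Same check across the cluster in one go
mmdsh -N all 'cat /sys/class/net/ib0/mode; ip -o link show ib0'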
This experience prompted my tunning exercise on the node and has since resolved the expel messages to node even during times of high I/O activity. Nevertheless, I still have this nagging feeling that the IPoIB tuning for GPFS may not be optimal. To answer your questions, Ed - IB supports both administrative and daemon communications, and we have verbsRdma configured. Currently, we have both 2044 and 65520 MTU nodes on our IB network and I've been told this should not be the case. I'm hoping to settle on 4096 MTU nodes for the entire cluster but I fear there may be some caveats - any thoughts on this? (Oh, Ed - Hideaki was my mentor for a short while when I began my HPC career with NDSU but he left us shortly after. Maybe like you I can tune up my Japanese as well once my GPFS issues are put to rest! ? ) Thanks, Siji Saula HPC System Administrator Center for Computationally Assisted Science & Technology NORTH DAKOTA STATE UNIVERSITY Research 2 Building ? Room 220B Dept 4100, PO Box 6050 / Fargo, ND 58108-6050 p:701.231.7749 www.ccast.ndsu.edu | www.ndsu.edu ________________________________ From: Edward Wahl Sent: Friday, March 9, 2018 8:19:10 AM To: gpfsug-discuss at spectrumscale.org Cc: Saula, Oluwasijibomi Subject: Re: [gpfsug-discuss] Thoughts on GPFS on IB & MTU sizes Welcome to the list. If Hideaki Kikuchi is still around CCAST, say "Oh hisashiburi, des ne?" for me. Though I recall he may have left. A couple of questions as I, unfortunately, have a good deal of expel experience. -Are you set up to use verbs or only IPoIB? "mmlsconfig verbsRdma" -Are you using the IB as the administrative IP network? -As Wei asked, can nodes sending the expel requests ping the victim over whatever interface is being used administratively? Other interfaces do NOT matter for expels. Nodes that cannot even mount the file systems can still request expels. Many many things can cause issues here from routing and firewalls to bad switch software which will not update ARP tables, and you get nodes trying to expel each other. -are your NSDs logging the expels in /tmp/mmfs? You can mmchconfig expelDataCollectionDailyLimit if you need more captures to narrow down what is happening outside the mmfs.log.latest. Just be wary of the disk space if you have "expel storms". -That tuning page is very out of date and appears to be mostly focused on GPFS 3.5.x tuning. While there is also a Spectrum Scale wiki, it's Linux tuning page does not appear to be kernel and network focused and is dated even older. Ed On Thu, 8 Mar 2018 15:06:03 +0000 "Saula, Oluwasijibomi" wrote: > Hi Folks, > > > As this is my first post to the group, let me start by saying I applaud the > commentary from the user group as it has been a resource to those of us > watching from the sidelines. > > > That said, we have a GPFS layered on IPoIB, and recently, we started having > some issues on our IB FDR fabric which manifested when GPFS began sending > persistent expel messages to particular nodes. > > > Shortly after, we embarked on a tuning exercise using IBM tuning > recommendations > but this page is quite old and we've run into some snags, specifically with > setting 4k MTUs using mlx4_core/mlx4_en module options. > > > While setting 4k MTUs as the guide recommends is our general inclination, I'd > like to solicit some advice as to whether 4k MTUs are a good idea and any > hitch-free steps to accomplishing this. I'm getting some conflicting remarks > from Mellanox support asking why we'd want to use 4k MTUs with Unreliable > Datagram mode. 
> > > Also, any pointers to best practices or resources for network configurations > for heavy I/O clusters would be much appreciated. > > > Thanks, > > Siji Saula > HPC System Administrator > Center for Computationally Assisted Science & Technology > NORTH DAKOTA STATE UNIVERSITY > > > Research 2 > Building > ? Room 220B Dept 4100, PO Box 6050 / Fargo, ND 58108-6050 p:701.231.7749 > www.ccast.ndsu.edu | > www.ndsu.edu > -- Ed Wahl Ohio Supercomputer Center 614-292-9302 -------------- next part -------------- An HTML attachment was scrubbed... URL: From Wei1.Guo at UTSouthwestern.edu Sat Mar 10 16:31:36 2018 From: Wei1.Guo at UTSouthwestern.edu (Wei Guo) Date: Sat, 10 Mar 2018 16:31:36 +0000 Subject: [gpfsug-discuss] Thoughts on GPFS on IB & MTU sizes In-Reply-To: References: , <20180309091910.0334604a@osc.edu>, Message-ID: <419134D7E7CA914C.fd9e46bf-5e64-4c46-9022-56a976185334@mail.outlook.com> Hi, Saula, This sounds like the problem with the jumbo frame. Ping or metadata query use small packets, so any time you can ping or ls file. However, data transferring use large packets, the MTU size. Your MTU 65536 nodes send out large packets, but they get dropped to the 2044 nodes, because the packet size cannot fit in 2044 size limit. The reverse is ok. I think the gpfs client nodes always communicate with each other to sync the sdr repo files, or other user job mpi communications if there are any. I think all the nodes should agree on a single MTU. I guess ipoib supports up to 4096. I might missed your Ethernet network switch part whether jumbo frame is enabled or not, if you are using any. Wei Guo On Sat, Mar 10, 2018 at 8:29 AM -0600, "Saula, Oluwasijibomi" > wrote: Wei - So the expelled node could ping the rest of the cluster just fine. In fact, after adding this new node to the cluster I could traverse the filesystem for simple lookups, however, heavy data moves in or out of the filesystem seemed to trigger the expel messages to the new node. This experience prompted my tunning exercise on the node and has since resolved the expel messages to node even during times of high I/O activity. Nevertheless, I still have this nagging feeling that the IPoIB tuning for GPFS may not be optimal. To answer your questions, Ed - IB supports both administrative and daemon communications, and we have verbsRdma configured. Currently, we have both 2044 and 65520 MTU nodes on our IB network and I've been told this should not be the case. I'm hoping to settle on 4096 MTU nodes for the entire cluster but I fear there may be some caveats - any thoughts on this? (Oh, Ed - Hideaki was my mentor for a short while when I began my HPC career with NDSU but he left us shortly after. Maybe like you I can tune up my Japanese as well once my GPFS issues are put to rest! ? ) Thanks, Siji Saula HPC System Administrator Center for Computationally Assisted Science & Technology NORTH DAKOTA STATE UNIVERSITY Research 2 Building ? Room 220B Dept 4100, PO Box 6050 / Fargo, ND 58108-6050 p:701.231.7749 www.ccast.ndsu.edu | www.ndsu.edu ________________________________ From: Edward Wahl Sent: Friday, March 9, 2018 8:19:10 AM To: gpfsug-discuss at spectrumscale.org Cc: Saula, Oluwasijibomi Subject: Re: [gpfsug-discuss] Thoughts on GPFS on IB & MTU sizes Welcome to the list. If Hideaki Kikuchi is still around CCAST, say "Oh hisashiburi, des ne?" for me. Though I recall he may have left. A couple of questions as I, unfortunately, have a good deal of expel experience. -Are you set up to use verbs or only IPoIB? 
"mmlsconfig verbsRdma" -Are you using the IB as the administrative IP network? -As Wei asked, can nodes sending the expel requests ping the victim over whatever interface is being used administratively? Other interfaces do NOT matter for expels. Nodes that cannot even mount the file systems can still request expels. Many many things can cause issues here from routing and firewalls to bad switch software which will not update ARP tables, and you get nodes trying to expel each other. -are your NSDs logging the expels in /tmp/mmfs? You can mmchconfig expelDataCollectionDailyLimit if you need more captures to narrow down what is happening outside the mmfs.log.latest. Just be wary of the disk space if you have "expel storms". -That tuning page is very out of date and appears to be mostly focused on GPFS 3.5.x tuning. While there is also a Spectrum Scale wiki, it's Linux tuning page does not appear to be kernel and network focused and is dated even older. Ed On Thu, 8 Mar 2018 15:06:03 +0000 "Saula, Oluwasijibomi" wrote: > Hi Folks, > > > As this is my first post to the group, let me start by saying I applaud the > commentary from the user group as it has been a resource to those of us > watching from the sidelines. > > > That said, we have a GPFS layered on IPoIB, and recently, we started having > some issues on our IB FDR fabric which manifested when GPFS began sending > persistent expel messages to particular nodes. > > > Shortly after, we embarked on a tuning exercise using IBM tuning > recommendations > but this page is quite old and we've run into some snags, specifically with > setting 4k MTUs using mlx4_core/mlx4_en module options. > > > While setting 4k MTUs as the guide recommends is our general inclination, I'd > like to solicit some advice as to whether 4k MTUs are a good idea and any > hitch-free steps to accomplishing this. I'm getting some conflicting remarks > from Mellanox support asking why we'd want to use 4k MTUs with Unreliable > Datagram mode. > > > Also, any pointers to best practices or resources for network configurations > for heavy I/O clusters would be much appreciated. > > > Thanks, > > Siji Saula > HPC System Administrator > Center for Computationally Assisted Science & Technology > NORTH DAKOTA STATE UNIVERSITY > > > Research 2 > Building > ? Room 220B Dept 4100, PO Box 6050 / Fargo, ND 58108-6050 p:701.231.7749 > www.ccast.ndsu.edu | > www.ndsu.edu > -- Ed Wahl Ohio Supercomputer Center 614-292-9302 ________________________________ UT Southwestern Medical Center The future of medicine, today. -------------- next part -------------- An HTML attachment was scrubbed... URL: From aaron.s.knister at nasa.gov Sat Mar 10 16:57:49 2018 From: aaron.s.knister at nasa.gov (Aaron Knister) Date: Sat, 10 Mar 2018 11:57:49 -0500 Subject: [gpfsug-discuss] Thoughts on GPFS on IB & MTU sizes In-Reply-To: <419134D7E7CA914C.fd9e46bf-5e64-4c46-9022-56a976185334@mail.outlook.com> References: <20180309091910.0334604a@osc.edu> <419134D7E7CA914C.fd9e46bf-5e64-4c46-9022-56a976185334@mail.outlook.com> Message-ID: <8fff8715-e67f-b048-f37d-2498c0cac2f7@nasa.gov> I, personally, haven't been burned by mixing UD and RC IPoIB clients on the same fabric but that doesn't mean it can't happen. What I *have* been bitten by a couple times is not having enough entries in the arp cache after bringing a bunch of new nodes online (that made for a long Christmas Eve one year...). You can toggle that via the gc_thresh settings. These settings work for ~3700 nodes (and technically could go much higher). 
net.ipv4.neigh.default.gc_thresh3 = 10240 net.ipv4.neigh.default.gc_thresh2 = 9216 net.ipv4.neigh.default.gc_thresh1 = 8192 It's the kind of thing that will bite you when you expand the cluster and it may make sense that it's exacerbated by metadata operations because those may require initiating connections to many nodes in the cluster which could blow your arp cache. -Aaron On 3/10/18 11:31 AM, Wei Guo wrote: > Hi, Saula, > > This sounds like the problem with the jumbo frame. > > Ping or metadata query use small packets, so any time you can ping or ls > file. > > However, data transferring use large packets, the MTU size. Your MTU > 65536 nodes send out large packets, but they get dropped to the 2044 > nodes, because the packet size cannot fit in 2044 size limit. The > reverse is ok. > > I think the gpfs client nodes always communicate with each other to sync > the sdr repo files, or other user job mpi communications if there are > any. I think all the nodes should agree on a single MTU. I guess ipoib > supports up to 4096. > > I might missed your Ethernet network switch part whether jumbo frame is > enabled or not, if you are using any. > > Wei Guo > > > > > > > On Sat, Mar 10, 2018 at 8:29 AM -0600, "Saula, Oluwasijibomi" > > wrote: > > Wei -? So the expelled node could ping the rest of the cluster just > fine. In fact, after adding this new node to the cluster I could > traverse the filesystem for simple lookups, however, heavy data > moves in or out of the filesystem seemed to trigger the expel > messages to the new node. > > > This experience prompted my?tunning exercise on the node and has > since resolved the expel messages to node even during times of high > I/O activity. > > > Nevertheless, I still have this nagging feeling that the IPoIB > tuning for GPFS may not be optimal. > > > To answer your questions,?Ed - IB supports both administrative and > daemon communications, and we have verbsRdma configured. > > > Currently, we have both 2044 and 65520 MTU nodes on our IB network > and I've been told this should not be the case. I'm hoping to settle > on 4096 MTU nodes for the entire cluster but I fear there may be > some caveats - any thoughts on this? > > > (Oh, Ed - Hideaki was my mentor for a short while when I began my > HPC career with NDSU but he left us shortly after. Maybe like you I > can tune up my Japanese as well once my GPFS issues are put to rest! > ? ) > > > Thanks, > Siji Saula > HPC System Administrator > Center for Computationally Assisted Science & Technology > *NORTH DAKOTA STATE UNIVERSITY* > > Research 2 > Building > ?? > Room 220B > Dept 4100, PO Box 6050? / Fargo, ND 58108-6050 > p:701.231.7749 > www.ccast.ndsu.edu > ?| > www.ndsu.edu > > ------------------------------------------------------------------------ > *From:* Edward Wahl > *Sent:* Friday, March 9, 2018 8:19:10 AM > *To:* gpfsug-discuss at spectrumscale.org > *Cc:* Saula, Oluwasijibomi > *Subject:* Re: [gpfsug-discuss] Thoughts on GPFS on IB & MTU sizes > > Welcome to the list. > > If Hideaki Kikuchi is still around CCAST, say "Oh hisashiburi, des > ne?" for me. > Though I recall he may have left. > > > A couple of questions as I, unfortunately, have a good deal of expel > experience. > > -Are you set up to use verbs or only IPoIB? "mmlsconfig verbsRdma" > > -Are you using the IB as the administrative IP network? > > -As Wei asked, can nodes sending the expel requests ping the victim over > whatever interface is being used administratively?? Other interfaces > do NOT > matter for expels. 
Nodes that cannot even mount the file systems can > still > request expels.? Many many things can cause issues here from routing and > firewalls to bad switch software which will not update ARP tables, > and you get > nodes trying to expel each other. > > -are your NSDs logging the expels in /tmp/mmfs?? You can mmchconfig > expelDataCollectionDailyLimit if you need more captures to narrow > down what is > happening outside the mmfs.log.latest.? Just be wary of the disk > space if you > have "expel storms". > > -That tuning page is very out of date and appears to be mostly > focused on GPFS > 3.5.x tuning.?? While there is also a Spectrum Scale wiki, it's > Linux tuning > page does not appear to be kernel and network focused and is dated > even older. > > > Ed > > > > On Thu, 8 Mar 2018 15:06:03 +0000 > "Saula, Oluwasijibomi" wrote: > > > Hi Folks, > > > > > > As this is my first post to the group, let me start by saying I applaud the > > commentary from the user group as it has been a resource to those of us > > watching from the sidelines. > > > > > > That said, we have a GPFS layered on IPoIB, and recently, we started having > > some issues on our IB FDR fabric which manifested when GPFS began sending > > persistent expel messages to particular nodes. > > > > > > Shortly after, we embarked on a tuning exercise using IBM tuning > > recommendations > > but this page is quite old and we've run into some snags, specifically with > > setting 4k MTUs using mlx4_core/mlx4_en module options. > > > > > > While setting 4k MTUs as the guide recommends is our general inclination, I'd > > like to solicit some advice as to whether 4k MTUs are a good idea and any > > hitch-free steps to accomplishing this. I'm getting some conflicting remarks > > from Mellanox support asking why we'd want to use 4k MTUs with Unreliable > > Datagram mode. > > > > > > Also, any pointers to best practices or resources for network configurations > > for heavy I/O clusters would be much appreciated. > > > > > > Thanks, > > > > Siji Saula > > HPC System Administrator > > Center for Computationally Assisted Science & Technology > > NORTH DAKOTA STATE UNIVERSITY > > > > > > Research 2 > > Building > > ? Room 220B Dept 4100, PO Box 6050? / Fargo, ND 58108-6050 p:701.231.7749 > > www.ccast.ndsu.edu | > > www.ndsu.edu > > > > > > -- > > Ed Wahl > Ohio Supercomputer Center > 614-292-9302 > > > ------------------------------------------------------------------------ > > UTSouthwestern > > Medical Center > > The future of medicine, today. > > > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > -- Aaron Knister NASA Center for Climate Simulation (Code 606.2) Goddard Space Flight Center (301) 286-2776 From aaron.s.knister at nasa.gov Sat Mar 10 21:39:28 2018 From: aaron.s.knister at nasa.gov (Aaron Knister) Date: Sat, 10 Mar 2018 16:39:28 -0500 Subject: [gpfsug-discuss] gpfs.snap taking a really long time (4.2.3.6 efix17) Message-ID: <96bf7c94-f5ee-c046-d835-de500bd20c51@nasa.gov> Hey All, I've noticed after upgrading to 4.1 to 4.2.3.6 efix17 that a gpfs.snap now takes a really long time as in... a *really* long time. Digging into it I can see that the snap command is actually done but the sshd child is left waiting on a sleep process on the clients (a sleep 600 at that). Trying to get 3500 nodes snapped in chunks of 64 nodes that each take 10 minutes looks like it'll take a good 10 hours. 
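If you want to confirm the same symptom on your own nodes, the leftover timeout helper is easy to spot while a snap appears hung — nothing GPFS-specific here, just a process-tree check:

# On a client while gpfs.snap seems stuck: is an sshd child parked on the 600s timeout sleep?
ps -eo pid,ppid,etime,cmd --forest | grep -B2 'sleep 600'

# or simply
pgrep -af 'sleep 600'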
It seems the trouble is in the runCommand function in gpfs.snap. The function creates a child process to act as a sort of alarm to kill the specified command if it exceeds the timeout. The problem while the alarm process gets killed the kill signal isn't passed to the sleep process (because the sleep command is run as a process inside the "alarm" child shell process). In gpfs.snap changing this: [[ -n $sleepingAgentPid ]] && $kill -9 $sleepingAgentPid > /dev/null 2>&1 to this: [[ -n $sleepingAgentPid ]] && $kill -9 $(findDescendants $sleepingAgentPid) $sleepingAgentPid > /dev/null 2>&1 seems to fix the behavior. I'll open a PMR for this shortly but I'm just wondering if anyone else has seen this. -Aaron -- Aaron Knister NASA Center for Climate Simulation (Code 606.2) Goddard Space Flight Center (301) 286-2776 From aaron.s.knister at nasa.gov Sat Mar 10 21:44:39 2018 From: aaron.s.knister at nasa.gov (Aaron Knister) Date: Sat, 10 Mar 2018 16:44:39 -0500 Subject: [gpfsug-discuss] spontaneous tracing? Message-ID: <3925216a-3146-357a-9aa0-0acc84bdb7b2@nasa.gov> I found myself with a little treat this morning to the tune of tracing running on the entire cluster of 3500 nodes. There were no logs I could find to indicate *why* the tracing had started but it was clear it was initiated by the cluster manager. Some sleuthing (thanks, collectl!) allowed me to figure out that the tracing started as the command: /usr/lpp/mmfs/bin/mmksh /usr/lpp/mmfs/bin/mmcommon notifyOverload _asmgr I thought that running "mmchocnfig deadlockOverloadThreshold=0 -i" would stop this from happening again but lo and behold tracing kicked off *again* (with the same caller) some time later even after setting that parameter. What's odd is there are no log events to indicate an overload occurred. Has anyone seen similar behavior? We're on 4.2.3.6 efix17. -Aaron -- Aaron Knister NASA Center for Climate Simulation (Code 606.2) Goddard Space Flight Center (301) 286-2776 From mnaineni at in.ibm.com Mon Mar 12 09:54:50 2018 From: mnaineni at in.ibm.com (Malahal R Naineni) Date: Mon, 12 Mar 2018 09:54:50 +0000 Subject: [gpfsug-discuss] asking for your vote for an RFE to support NFS V4.1 In-Reply-To: <8e5684b5-8981-0710-c5ca-693f6ad56890@ugent.be> References: <8e5684b5-8981-0710-c5ca-693f6ad56890@ugent.be>, Message-ID: An HTML attachment was scrubbed... URL: From stijn.deweirdt at ugent.be Mon Mar 12 10:01:15 2018 From: stijn.deweirdt at ugent.be (Stijn De Weirdt) Date: Mon, 12 Mar 2018 11:01:15 +0100 Subject: [gpfsug-discuss] asking for your vote for an RFE to support NFS V4.1 In-Reply-To: References: <8e5684b5-8981-0710-c5ca-693f6ad56890@ugent.be> Message-ID: <364c69aa-3964-0c7c-21ae-6f4fd36ab9e8@ugent.be> hi malahal, we already figured that out but were hesitant to share it in case ibm wanted to remove this loophole. but can we assume that manuanlly editing the ganesha.conf and pushing it to ccr is supported? the config file is heavily edited / rewritten when certain mm commands, so we want to make sure we can always do this. it would be even better if the main.conf that is generated/edited by the ccr commands just had an include statement so we can edit another file locally instead of doing mmccr magic. stijn On 03/12/2018 10:54 AM, Malahal R Naineni wrote: > Upstream Ganesha code allows all NFS versions including NFSv4.2. Most Linux > clients were defaulting to NFSv4.0, but now they started using NFS4.1 which IBM > doesn't support. To avoid people accidentally using NFSv4.1, we decided to > remove it by default. 
> We don't support NFSv4.1, so there is no spectrum command to enable NFSv4.1 > support with PTF6. Of course, if you are familiar with mmccr, you can change the > config and let it use NFSv4.1 but any issues with NFS4.1 will go to /dev/null. :-) > You need to add "minor_versions = 0,1;" to NFSv4{} block > in /var/mmfs/ces/nfs-config/gpfs.ganesha.main.conf to allow NFSv4.0 and NFsv4.1, > and make sure you use mmccr command to make this change permanent. > Regards, Malahal. > > ----- Original message ----- > From: Stijn De Weirdt > Sent by: gpfsug-discuss-bounces at spectrumscale.org > To: gpfsug-discuss at spectrumscale.org > Cc: > Subject: Re: [gpfsug-discuss] asking for your vote for an RFE to support NFS > V4.1 > Date: Fri, Mar 9, 2018 6:13 PM > hi all, > > i would second this request to upvote this. the fact that 4.1 support > was dropped in a subsubminor update (4.2.3.5 to 4.3.26 afaik) was > already pretty bad to discover, but at the very least there should be an > option to reenable it. > > i'm also interested why this was removed (or actively prevented to > enable). i can understand that eg pnfs is not support, bu basic protocol > features wrt HA are a must have. > only with 4.1 are we able to do ces+ganesha failover without IO error, > something that should be basic feature nowadays. > > stijn > > On 03/09/2018 01:21 PM, Engeli Willi (ID SD) wrote: > > Hello Group, > > > > I?ve just created a request for enhancement (RFE) to have ganesha supporting > > NFS V4.1. > > > > It is important, to have this new Protocol version supported, since our > > Linux clients default support is more then 80% based in this version by > > default and Linux distributions are actively pushing this Protocol. > > > > The protocol also brings important corrections and enhancements with it. > > > > > > > > I would like to ask you all very kindly to vote for this RFE please. 
> > > > You find it here: https://www.ibm.com/developerworks/rfe/execute > > > > Headline:NFS V4.1 Support > > > > ID:117398 > > > > > > > > > > > > Freundliche Gr?sse > > > > > > > > Willi Engeli > > > > ETH Zuerich > > > > ID Speicherdienste > > > > Weinbergstrasse 11 > > > > WEC C 18 > > > > 8092 Zuerich > > > > > > > > Tel: +41 44 632 02 69 > > > > > > > > > > > > > > _______________________________________________ > > gpfsug-discuss mailing list > > gpfsug-discuss at spectrumscale.org > > > https://urldefense.proofpoint.com/v2/url?u=http-3A__gpfsug.org_mailman_listinfo_gpfsug-2Ddiscuss&d=DwIF-g&c=jf_iaSHvJObTbx-siA1ZOg&r=oaQVLOYto6Ftb8wAbynvIiIdh2UEjHxQByDz70-6a_0&m=yq4xoVKCPWQTqZVp0BgG8fBpXrS2FehGlAua1Eixci4&s=9DJi6qkF4eRc81vv6SlC3gxKL9oJJ4efkktzNaZAnkA&e= > > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > https://urldefense.proofpoint.com/v2/url?u=http-3A__gpfsug.org_mailman_listinfo_gpfsug-2Ddiscuss&d=DwIF-g&c=jf_iaSHvJObTbx-siA1ZOg&r=oaQVLOYto6Ftb8wAbynvIiIdh2UEjHxQByDz70-6a_0&m=yq4xoVKCPWQTqZVp0BgG8fBpXrS2FehGlAua1Eixci4&s=9DJi6qkF4eRc81vv6SlC3gxKL9oJJ4efkktzNaZAnkA&e= > > > > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > From xhejtman at ics.muni.cz Mon Mar 12 14:51:05 2018 From: xhejtman at ics.muni.cz (Lukas Hejtmanek) Date: Mon, 12 Mar 2018 15:51:05 +0100 Subject: [gpfsug-discuss] Preferred NSD Message-ID: <20180312145105.yb24bzo6chlpthwm@ics.muni.cz> Hello, I'm thinking about the following setup: ~ 60 nodes, each with two enterprise NVMe SSDs, FDR IB interconnected I would like to setup shared scratch area using GPFS and those NVMe SSDs. Each SSDs as on NSD. I don't think like 5 or more data/metadata replicas are practical here. On the other hand, multiple node failures is something really expected. Is there a way to instrument that local NSD is strongly preferred to store data? I.e. node failure most probably does not result in unavailable data for the other nodes? Or is there any other recommendation/solution to build shared scratch with GPFS in such setup? (Do not do it including.) -- Luk?? Hejtm?nek From scale at us.ibm.com Mon Mar 12 15:13:00 2018 From: scale at us.ibm.com (IBM Spectrum Scale) Date: Mon, 12 Mar 2018 09:13:00 -0600 Subject: [gpfsug-discuss] spontaneous tracing? In-Reply-To: <3925216a-3146-357a-9aa0-0acc84bdb7b2@nasa.gov> References: <3925216a-3146-357a-9aa0-0acc84bdb7b2@nasa.gov> Message-ID: /usr/lpp/mmfs/bin/mmcommon notifyOverload will not cause tracing to be started. One can verify that using the underlying command being called as shown in the following example with /tmp/n containing node names one each line that will get the notification and the IP address being the file system manager from which the command is issued. /usr/lpp/mmfs/bin/mmsdrcli notifyOverload /tmp/n 1191 192.168.117.131 3 8 The only case that deadlock detection code will initiate tracing is that debugDataControl is set to "heavy" and tracing is not started. Then on deadlock detection tracing is turned on for 20 seconds and turned off. That can be tested using command like /usr/lpp/mmfs/bin/mmsdrcli notifyDeadlock /tmp/n 1191 192.168.117.131 3 8 And then mmfs.log will tell you what's going on. That's not a silent action. 
2018-03-12_10:16:11.243-0400: [N] sdrServ: Received deadlock notification from 192.168.117.131 2018-03-12_10:16:11.243-0400: [N] GPFS will attempt to collect debug data on this node. 2018-03-12_10:16:11.953-0400: [I] Tracing in overwrite mode <== tracing started Trace started: Wait 20 seconds before cut and stop trace 2018-03-12_10:16:37.147-0400: [I] Tracing disabled <== tracing stopped 20 seconds later mmtrace: move /tmp/mmfs/lxtrace.trc.c69bc2xn01.cpu0 /tmp/mmfs/trcfile.2018-03-12_10.16.11.2982.deadlock.c69bc2xn01.cpu0 mmtrace: formatting /tmp/mmfs/trcfile.2018-03-12_10.16.11.2982.deadlock.c69bc2xn01 to /tmp/mmfs/trcrpt.2018-03-12_10.16.11.2982.deadlock.c69bc2xn01.gz > What's odd is there are no log events to indicate an overload occurred. Overload msg is only seen in mmfs.log when debugDataControl is "heavy". mmdiag --deadlock shows overload related info starting from 4.2.3. # mmdiag --deadlock === mmdiag: deadlock === Effective deadlock detection threshold on c69bc2xn01 is 1800 seconds Effective deadlock detection threshold on c69bc2xn01 is 360 seconds for short waiters Cluster c69bc2xn01.gpfs.net is overloaded. The overload index on c69bc2xn01 is 0.01812 <== -------------- next part -------------- An HTML attachment was scrubbed... URL: From aaron.s.knister at nasa.gov Mon Mar 12 15:14:10 2018 From: aaron.s.knister at nasa.gov (Knister, Aaron S. (GSFC-606.2)[COMPUTER SCIENCE CORP]) Date: Mon, 12 Mar 2018 15:14:10 +0000 Subject: [gpfsug-discuss] Preferred NSD In-Reply-To: <20180312145105.yb24bzo6chlpthwm@ics.muni.cz> References: <20180312145105.yb24bzo6chlpthwm@ics.muni.cz> Message-ID: Hi Lukas, Check out FPO mode. That mimics Hadoop?s data placement features. You can have up to 3 replicas both data and metadata but still the downside, though, as you say is the wrong node failures will take your cluster down. You might want to check out something like Excelero?s NVMesh (note: not an endorsement since I can?t give such things) which can create logical volumes across all your NVMe drives. The product has erasure coding on their roadmap. I?m not sure if they?ve released that feature yet but in theory it will give better fault tolerance *and* you?ll get more efficient usage of your SSDs. I?m sure there are other ways to skin this cat too. -Aaron On March 12, 2018 at 10:59:35 EDT, Lukas Hejtmanek wrote: Hello, I'm thinking about the following setup: ~ 60 nodes, each with two enterprise NVMe SSDs, FDR IB interconnected I would like to setup shared scratch area using GPFS and those NVMe SSDs. Each SSDs as on NSD. I don't think like 5 or more data/metadata replicas are practical here. On the other hand, multiple node failures is something really expected. Is there a way to instrument that local NSD is strongly preferred to store data? I.e. node failure most probably does not result in unavailable data for the other nodes? Or is there any other recommendation/solution to build shared scratch with GPFS in such setup? (Do not do it including.) -- Luk?? Hejtm?nek _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss -------------- next part -------------- An HTML attachment was scrubbed... 
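To make the FPO suggestion concrete, below is a minimal sketch of the kind of stanzas involved — every name, size and value here is invented, the metadata/system-pool stanzas are omitted, and the FPO documentation should be checked before copying any of it:

# fpo_stanzas.txt -- a data pool with write affinity, one NSD per local NVMe drive
%pool:
  pool=fpodata
  blockSize=2M
  layoutMap=cluster
  allowWriteAffinity=yes
  writeAffinityDepth=1
  blockGroupFactor=128

# one stanza per node/drive, each node in its own failure group
%nsd: nsd=node01_nvme0 device=/dev/nvme0n1 servers=node01 usage=dataOnly failureGroup=1 pool=fpodata
%nsd: nsd=node02_nvme0 device=/dev/nvme0n1 servers=node02 usage=dataOnly failureGroup=2 pool=fpodata

# mmcrnsd -F fpo_stanzas.txt
# mmcrfs scratch -F fpo_stanzas.txt -m 2 -M 3 -r 2 -R 3 -T /gpfs/scratch

With writeAffinityDepth=1 the first copy of the data lands on the NVMe of the node doing the writing, which is the "local NSD strongly preferred" behaviour being asked for. Note also that Scale replication tops out at three copies, so the "5 or more replicas" option is off the table regardless.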
URL: From valdis.kletnieks at vt.edu Mon Mar 12 15:18:40 2018 From: valdis.kletnieks at vt.edu (valdis.kletnieks at vt.edu) Date: Mon, 12 Mar 2018 11:18:40 -0400 Subject: [gpfsug-discuss] Preferred NSD In-Reply-To: <20180312145105.yb24bzo6chlpthwm@ics.muni.cz> References: <20180312145105.yb24bzo6chlpthwm@ics.muni.cz> Message-ID: <188417.1520867920@turing-police.cc.vt.edu> On Mon, 12 Mar 2018 15:51:05 +0100, Lukas Hejtmanek said: > I don't think like 5 or more data/metadata replicas are practical here. On the > other hand, multiple node failures is something really expected. Umm.. do I want to ask *why*, out of only 60 nodes, multiple node failures are an expected event - to the point that you're thinking about needing 5 replicas to keep things running? From xhejtman at ics.muni.cz Mon Mar 12 15:23:17 2018 From: xhejtman at ics.muni.cz (Lukas Hejtmanek) Date: Mon, 12 Mar 2018 16:23:17 +0100 Subject: [gpfsug-discuss] Preferred NSD In-Reply-To: <188417.1520867920@turing-police.cc.vt.edu> References: <20180312145105.yb24bzo6chlpthwm@ics.muni.cz> <188417.1520867920@turing-police.cc.vt.edu> Message-ID: <20180312152317.mfc6xhsvthticrlh@ics.muni.cz> On Mon, Mar 12, 2018 at 11:18:40AM -0400, valdis.kletnieks at vt.edu wrote: > On Mon, 12 Mar 2018 15:51:05 +0100, Lukas Hejtmanek said: > > I don't think like 5 or more data/metadata replicas are practical here. On the > > other hand, multiple node failures is something really expected. > > Umm.. do I want to ask *why*, out of only 60 nodes, multiple node > failures are an expected event - to the point that you're thinking > about needing 5 replicas to keep things running? as of my experience with cluster management, we have multiple nodes down on regular basis. (HW failure, SW maintenance and so on.) I'm basically thinking that 2-3 replicas might not be enough while 5 or more are becoming too expensive (both disk space and required bandwidth being scratch space - high i/o load expected). -- Luk?? Hejtm?nek Linux Administrator only because Full Time Multitasking Ninja is not an official job title From mnaineni at in.ibm.com Mon Mar 12 17:41:41 2018 From: mnaineni at in.ibm.com (Malahal R Naineni) Date: Mon, 12 Mar 2018 17:41:41 +0000 Subject: [gpfsug-discuss] asking for your vote for an RFE to support NFS V4.1 In-Reply-To: <364c69aa-3964-0c7c-21ae-6f4fd36ab9e8@ugent.be> References: <364c69aa-3964-0c7c-21ae-6f4fd36ab9e8@ugent.be>, <8e5684b5-8981-0710-c5ca-693f6ad56890@ugent.be> Message-ID: An HTML attachment was scrubbed... URL: From Philipp.Rehs at uni-duesseldorf.de Mon Mar 12 20:09:14 2018 From: Philipp.Rehs at uni-duesseldorf.de (Philipp Helo Rehs) Date: Mon, 12 Mar 2018 21:09:14 +0100 Subject: [gpfsug-discuss] RDMA read error IBV_WC_RETRY_EXC_ERR Message-ID: <4c6f6655-5712-663e-d551-375a42d562d8@uni-duesseldorf.de> Hello, I am reading your mailing-list since some weeks and I am quiete impressed about the knowledge and shared information here. We have a gpfs cluster with 4 nsds and 120 clients on Infiniband. Our NSD-Server have two infiniband ports on seperate cards mlx5_0 and mlx5_1. We have RDMA-CM enabled and ipv6 enabled on all nodes. We have added an IPoIB IP to all interfaces. 
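A couple of sanity checks for a dual-rail setup like this, independent of GPFS — only a sketch, using the mlx5_1 device named above:

# Is the second port actually up, with a LID and the expected active MTU?
ibstat mlx5_1
ibv_devinfo -d mlx5_1 | grep -E 'state|active_mtu'

# Fabric-wide view: link states and per-port error counters (from infiniband-diags)
iblinkinfo
ibqueryerrors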
But when we enable the second interface we get the following error from all nodes: 2018-03-12_20:49:38.923+0100: [E] VERBS RDMA closed connection to 10.100.0.83 (hilbert83-ib) on mlx5_1 port 1 fabnum 0 due to RDMA read error IBV_WC_RETRY_EXC_ERR index 45 2018-03-12_20:49:38.923+0100: [E] VERBS RDMA rdma read error IBV_WC_RETRY_EXC_ERR to 10.100.0.129 (hilbert129-ib) on mlx5_1 port 1 fabnum 0 vendor_err 129 2018-03-12_20:49:38.923+0100: [E] VERBS RDMA closed connection to 10.100.0.129 (hilbert129-ib) on mlx5_1 port 1 fabnum 0 due to RDMA read error IBV_WC_RETRY_EXC_ERR index 31 2018-03-12_20:49:38.923+0100: [E] VERBS RDMA rdma read error IBV_WC_RETRY_EXC_ERR to 10.100.0.134 (hilbert134-ib) on mlx5_1 port 1 fabnum 0 vendor_err 129 I have read that this issue can happen when verbsRdmasPerConnection is to low. We tried to increase the value and it got better but the problem is not fixed. Current config: minReleaseLevel 4.2.3.0 maxblocksize 16m cipherList AUTHONLY cesSharedRoot /ces ccrEnabled yes failureDetectionTime 40 leaseRecoveryWait 40 [hilbert1-ib,hilbert2-ib] worker1Threads 256 maxReceiverThreads 256 [common] tiebreakerDisks vd3;vd5;vd7 minQuorumNodes 2 verbsLibName libibverbs.so.1 verbsRdma enable verbsRdmasPerNode 256 verbsRdmaSend no scatterBufferSize 262144 pagepool 16g verbsPorts mlx4_0/1 [nsdNodes] verbsPorts mlx5_0/1 mlx5_1/1 [hilbert200-ib,hilbert201-ib,hilbert202-ib,hilbert203-ib,hilbert204-ib,hilbert205-ib,hilbert206-ib] verbsPorts mlx4_0/1 mlx4_1/1 [common] maxMBpS 11200 [common] verbsRdmaCm enable verbsRdmasPerConnection 14 adminMode central Kind regards Philipp Rehs --------------------------- Zentrum f?r Informations- und Medientechnologie Kompetenzzentrum f?r wissenschaftliches Rechnen und Speichern Heinrich-Heine-Universit?t D?sseldorf Universit?tsstr. 1 Raum 25.41.00.51 40225 D?sseldorf / Germany Tel: +49-211-81-15557 From zmance at ucar.edu Mon Mar 12 22:10:06 2018 From: zmance at ucar.edu (Zachary Mance) Date: Mon, 12 Mar 2018 16:10:06 -0600 Subject: [gpfsug-discuss] Problems with remote mount via routed IB In-Reply-To: References: <471B111F-5DAA-4912-829C-9AA75DCB76FA@kit.edu> Message-ID: Since I am testing out remote mounting with EDR IB routers, I'll add to the discussion. In my lab environment I was seeing the same rdma connections being established and then disconnected shortly after. The remote filesystem would eventually mount on the clients, but it look a quite a while (~2mins). Even after mounting, accessing files or any metadata operations would take a while to execute, but eventually it happened. After enabling verbsRdmaCm, everything mounted just fine and in a timely manner. Spectrum Scale was using the librdmacm.so library. I would first double check that you have both clusters able to talk to each other on their IPoIB address, then make sure you enable verbsRdmaCm on both clusters. --------------------------------------------------------------------------------------------------------------- Zach Mance zmance at ucar.edu (303) 497-1883 HPC Data Infrastructure Group / CISL / NCAR --------------------------------------------------------------------------------------------------------------- On Thu, Mar 1, 2018 at 1:41 AM, John Hearns wrote: > In reply to Stuart, > our setup is entirely Infiniband. We boot and install over IB, and rely > heavily on IP over Infiniband. > > As for users being 'confused' due to multiple IPs, I would appreciate some > more depth on that one. > Sure, all batch systems are sensitive to hostnames (as I know to my cost!) 
> but once you get that straightened out why should users care? > I am not being aggressive, just keen to find out more. > > > > -----Original Message----- > From: gpfsug-discuss-bounces at spectrumscale.org [mailto:gpfsug-discuss- > bounces at spectrumscale.org] On Behalf Of Stuart Barkley > Sent: Wednesday, February 28, 2018 6:50 PM > To: gpfsug main discussion list > Subject: Re: [gpfsug-discuss] Problems with remote mount via routed IB > > The problem with CM is that it seems to require configuring IP over > Infiniband. > > I'm rather strongly opposed to IP over IB. We did run IPoIB years ago, > but pulled it out of our environment as adding unneeded complexity. It > requires provisioning IP addresses across the Infiniband infrastructure and > possibly adding routers to other portions of the IP infrastructure. It was > also confusing some users due to multiple IPs on the compute infrastructure. > > We have recently been in discussions with a vendor about their support for > GPFS over IB and they kept directing us to using CM (which still didn't > work). CM wasn't necessary once we found out about the actual problem (we > needed the undocumented verbsRdmaUseGidIndexZero configuration option among > other things due to their use of SR-IOV based virtual IB interfaces). > > We don't use routed Infiniband and it might be that CM and IPoIB is > required for IB routing, but I doubt it. It sounds like the OP is keeping > IB and IP infrastructure separate. > > Stuart Barkley > > On Mon, 26 Feb 2018 at 14:16 -0000, Aaron Knister wrote: > > > Date: Mon, 26 Feb 2018 14:16:34 > > From: Aaron Knister > > Reply-To: gpfsug main discussion list > > > > To: gpfsug-discuss at spectrumscale.org > > Subject: Re: [gpfsug-discuss] Problems with remote mount via routed IB > > > > Hi Jan Erik, > > > > It was my understanding that the IB hardware router required RDMA CM to > work. > > By default GPFS doesn't use the RDMA Connection Manager but it can be > > enabled (e.g. verbsRdmaCm=enable). I think this requires a restart on > > clients/servers (in both clusters) to take effect. Maybe someone else > > on the list can comment in more detail-- I've been told folks have > > successfully deployed IB routers with GPFS. > > > > -Aaron > > > > On 2/26/18 11:38 AM, Sundermann, Jan Erik (SCC) wrote: > > > > > > Dear all > > > > > > we are currently trying to remote mount a file system in a routed > > > Infiniband test setup and face problems with dropped RDMA > > > connections. The setup is the > > > following: > > > > > > - Spectrum Scale Cluster 1 is setup on four servers which are > > > connected to the same infiniband network. Additionally they are > > > connected to a fast ethernet providing ip communication in the network > 192.168.11.0/24. > > > > > > - Spectrum Scale Cluster 2 is setup on four additional servers which > > > are connected to a second infiniband network. These servers have IPs > > > on their IB interfaces in the network 192.168.12.0/24. > > > > > > - IP is routed between 192.168.11.0/24 and 192.168.12.0/24 on a > > > dedicated machine. > > > > > > - We have a dedicated IB hardware router connected to both IB subnets. > > > > > > > > > We tested that the routing, both IP and IB, is working between the > > > two clusters without problems and that RDMA is working fine both for > > > internal communication inside cluster 1 and cluster 2 > > > > > > When trying to remote mount a file system from cluster 1 in cluster > > > 2, RDMA communication is not working as expected. 
Instead we see > > > error messages on the remote host (cluster 2) > > > > > > > > > 2018-02-23_13:48:47.037+0100: [I] VERBS RDMA connecting to > > > 192.168.11.4 (iccn004-gpfs in gpfsstorage.localdomain) on mlx4_0 > > > port 1 fabnum 0 index 2 > > > 2018-02-23_13:48:49.890+0100: [I] VERBS RDMA connected to > > > 192.168.11.4 (iccn004-gpfs in gpfsstorage.localdomain) on mlx4_0 > > > port 1 fabnum 0 sl 0 index 2 > > > 2018-02-23_13:48:53.138+0100: [E] VERBS RDMA closed connection to > > > 192.168.11.1 (iccn001-gpfs in gpfsstorage.localdomain) on mlx4_0 > > > port 1 fabnum 0 error 733 index 3 > > > 2018-02-23_13:48:53.854+0100: [I] VERBS RDMA connecting to > > > 192.168.11.1 (iccn001-gpfs in gpfsstorage.localdomain) on mlx4_0 > > > port 1 fabnum 0 index 3 > > > 2018-02-23_13:48:54.954+0100: [E] VERBS RDMA closed connection to > > > 192.168.11.3 (iccn003-gpfs in gpfsstorage.localdomain) on mlx4_0 > > > port 1 fabnum 0 error 733 index 1 > > > 2018-02-23_13:48:55.601+0100: [I] VERBS RDMA connected to > > > 192.168.11.1 (iccn001-gpfs in gpfsstorage.localdomain) on mlx4_0 > > > port 1 fabnum 0 sl 0 index 3 > > > 2018-02-23_13:48:57.775+0100: [I] VERBS RDMA connecting to > > > 192.168.11.3 (iccn003-gpfs in gpfsstorage.localdomain) on mlx4_0 > > > port 1 fabnum 0 index 1 > > > 2018-02-23_13:48:59.557+0100: [I] VERBS RDMA connected to > > > 192.168.11.3 (iccn003-gpfs in gpfsstorage.localdomain) on mlx4_0 > > > port 1 fabnum 0 sl 0 index 1 > > > 2018-02-23_13:48:59.876+0100: [E] VERBS RDMA closed connection to > > > 192.168.11.2 (iccn002-gpfs in gpfsstorage.localdomain) on mlx4_0 > > > port 1 fabnum 0 error 733 index 0 > > > 2018-02-23_13:49:02.020+0100: [I] VERBS RDMA connecting to > > > 192.168.11.2 (iccn002-gpfs in gpfsstorage.localdomain) on mlx4_0 > > > port 1 fabnum 0 index 0 > > > 2018-02-23_13:49:03.477+0100: [I] VERBS RDMA connected to > > > 192.168.11.2 (iccn002-gpfs in gpfsstorage.localdomain) on mlx4_0 > > > port 1 fabnum 0 sl 0 index 0 > > > 2018-02-23_13:49:05.119+0100: [E] VERBS RDMA closed connection to > > > 192.168.11.4 (iccn004-gpfs in gpfsstorage.localdomain) on mlx4_0 > > > port 1 fabnum 0 error 733 index 2 > > > 2018-02-23_13:49:06.191+0100: [I] VERBS RDMA connecting to > > > 192.168.11.4 (iccn004-gpfs in gpfsstorage.localdomain) on mlx4_0 > > > port 1 fabnum 0 index 2 > > > 2018-02-23_13:49:06.548+0100: [I] VERBS RDMA connected to > > > 192.168.11.4 (iccn004-gpfs in gpfsstorage.localdomain) on mlx4_0 > > > port 1 fabnum 0 sl 0 index 2 > > > 2018-02-23_13:49:11.578+0100: [E] VERBS RDMA closed connection to > > > 192.168.11.1 (iccn001-gpfs in gpfsstorage.localdomain) on mlx4_0 > > > port 1 fabnum 0 error 733 index 3 > > > 2018-02-23_13:49:11.937+0100: [I] VERBS RDMA connecting to > > > 192.168.11.1 (iccn001-gpfs in gpfsstorage.localdomain) on mlx4_0 > > > port 1 fabnum 0 index 3 > > > 2018-02-23_13:49:11.939+0100: [I] VERBS RDMA connected to > > > 192.168.11.1 (iccn001-gpfs in gpfsstorage.localdomain) on mlx4_0 > > > port 1 fabnum 0 sl 0 index 3 > > > > > > > > > and in the cluster with the file system (cluster 1) > > > > > > 2018-02-23_13:47:36.112+0100: [E] VERBS RDMA rdma read error > > > IBV_WC_RETRY_EXC_ERR to 192.168.12.5 (iccn005-ib in > > > gpfsremoteclients.localdomain) on mlx4_0 port 1 fabnum 0 vendor_err > > > 129 > > > 2018-02-23_13:47:36.112+0100: [E] VERBS RDMA closed connection to > > > 192.168.12.5 (iccn005-ib in gpfsremoteclients.localdomain) on mlx4_0 > > > port 1 fabnum 0 due to RDMA read error IBV_WC_RETRY_EXC_ERR index 3 > > > 2018-02-23_13:47:47.161+0100: 
[I] VERBS RDMA accepted and connected > > > to > > > 192.168.12.5 (iccn005-ib in gpfsremoteclients.localdomain) on mlx4_0 > > > port 1 fabnum 0 sl 0 index 3 > > > 2018-02-23_13:48:04.317+0100: [E] VERBS RDMA rdma read error > > > IBV_WC_RETRY_EXC_ERR to 192.168.12.5 (iccn005-ib in > > > gpfsremoteclients.localdomain) on mlx4_0 port 1 fabnum 0 vendor_err > > > 129 > > > 2018-02-23_13:48:04.317+0100: [E] VERBS RDMA closed connection to > > > 192.168.12.5 (iccn005-ib in gpfsremoteclients.localdomain) on mlx4_0 > > > port 1 fabnum 0 due to RDMA read error IBV_WC_RETRY_EXC_ERR index 3 > > > 2018-02-23_13:48:11.560+0100: [I] VERBS RDMA accepted and connected > > > to > > > 192.168.12.5 (iccn005-ib in gpfsremoteclients.localdomain) on mlx4_0 > > > port 1 fabnum 0 sl 0 index 3 > > > 2018-02-23_13:48:32.523+0100: [E] VERBS RDMA rdma read error > > > IBV_WC_RETRY_EXC_ERR to 192.168.12.5 (iccn005-ib in > > > gpfsremoteclients.localdomain) on mlx4_0 port 1 fabnum 0 vendor_err > > > 129 > > > 2018-02-23_13:48:32.523+0100: [E] VERBS RDMA closed connection to > > > 192.168.12.5 (iccn005-ib in gpfsremoteclients.localdomain) on mlx4_0 > > > port 1 fabnum 0 due to RDMA read error IBV_WC_RETRY_EXC_ERR index 3 > > > 2018-02-23_13:48:35.398+0100: [I] VERBS RDMA accepted and connected > > > to > > > 192.168.12.5 (iccn005-ib in gpfsremoteclients.localdomain) on mlx4_0 > > > port 1 fabnum 0 sl 0 index 3 > > > 2018-02-23_13:48:53.135+0100: [E] VERBS RDMA rdma read error > > > IBV_WC_RETRY_EXC_ERR to 192.168.12.5 (iccn005-ib in > > > gpfsremoteclients.localdomain) on mlx4_0 port 1 fabnum 0 vendor_err > > > 129 > > > 2018-02-23_13:48:53.135+0100: [E] VERBS RDMA closed connection to > > > 192.168.12.5 (iccn005-ib in gpfsremoteclients.localdomain) on mlx4_0 > > > port 1 fabnum 0 due to RDMA read error IBV_WC_RETRY_EXC_ERR index 3 > > > 2018-02-23_13:48:55.600+0100: [I] VERBS RDMA accepted and connected > > > to > > > 192.168.12.5 (iccn005-ib in gpfsremoteclients.localdomain) on mlx4_0 > > > port 1 fabnum 0 sl 0 index 3 > > > 2018-02-23_13:49:11.577+0100: [E] VERBS RDMA rdma read error > > > IBV_WC_RETRY_EXC_ERR to 192.168.12.5 (iccn005-ib in > > > gpfsremoteclients.localdomain) on mlx4_0 port 1 fabnum 0 vendor_err > > > 129 > > > 2018-02-23_13:49:11.577+0100: [E] VERBS RDMA closed connection to > > > 192.168.12.5 (iccn005-ib in gpfsremoteclients.localdomain) on mlx4_0 > > > port 1 fabnum 0 due to RDMA read error IBV_WC_RETRY_EXC_ERR index 3 > > > 2018-02-23_13:49:11.939+0100: [I] VERBS RDMA accepted and connected > > > to > > > 192.168.12.5 (iccn005-ib in gpfsremoteclients.localdomain) on mlx4_0 > > > port 1 fabnum 0 sl 0 index 3 > > > > > > > > > > > > Any advice on how to configure the setup in a way that would allow > > > the remote mount via routed IB would be very appreciated. 
> > > > > > > > > Thank you and best regards > > > Jan Erik > > > > > > > > > > > > > > > _______________________________________________ > > > gpfsug-discuss mailing list > > > gpfsug-discuss at spectrumscale.org > > > https://emea01.safelinks.protection.outlook.com/?url=http%3A%2F%2Fgp > > > fsug.org%2Fmailman%2Flistinfo%2Fgpfsug-discuss&data=01%7C01%7Cjohn.h > > > earns%40asml.com%7Ce40045550fc3467dd62808d57ed4d0d9%7Caf73baa8f5944e > > > b2a39d93e96cad61fc%7C1&sdata=v%2F35G6ZnlHFBm%2BfVddvcuraFd9FRChyOSRE > > > YpqcNNP8%3D&reserved=0 > > > > > > > -- > > Aaron Knister > > NASA Center for Climate Simulation (Code 606.2) Goddard Space Flight > > Center > > (301) 286-2776 > > _______________________________________________ > > gpfsug-discuss mailing list > > gpfsug-discuss at spectrumscale.org > > https://emea01.safelinks.protection.outlook.com/?url=http%3A%2F%2Fgpfs > > ug.org%2Fmailman%2Flistinfo%2Fgpfsug-discuss&data=01%7C01%7Cjohn.hearn > > s%40asml.com%7Ce40045550fc3467dd62808d57ed4d0d9%7Caf73baa8f5944eb2a39d > > 93e96cad61fc%7C1&sdata=v%2F35G6ZnlHFBm%2BfVddvcuraFd9FRChyOSREYpqcNNP8 > > %3D&reserved=0 > > > > -- > I've never been lost; I was once bewildered for three days, but never lost! > -- Daniel Boone > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > https://emea01.safelinks.protection.outlook.com/?url= > http%3A%2F%2Fgpfsug.org%2Fmailman%2Flistinfo%2Fgpfsug- > discuss&data=01%7C01%7Cjohn.hearns%40asml.com% > 7Ce40045550fc3467dd62808d57ed4d0d9%7Caf73baa8f5944eb2a39d93e96cad > 61fc%7C1&sdata=v%2F35G6ZnlHFBm%2BfVddvcuraFd9FRChyOSREYpqcNNP > 8%3D&reserved=0 > -- The information contained in this communication and any attachments is > confidential and may be privileged, and is for the sole use of the intended > recipient(s). Any unauthorized review, use, disclosure or distribution is > prohibited. Unless explicitly stated otherwise in the body of this > communication or the attachment thereto (if any), the information is > provided on an AS-IS basis without any express or implied warranties or > liabilities. To the extent you are relying on this information, you are > doing so at your own risk. If you are not the intended recipient, please > notify the sender immediately by replying to this message and destroy all > copies of this message and any attachments. Neither the sender nor the > company/group of companies he or she represents shall be liable for the > proper and complete transmission of the information contained in this > communication, or for any delay in its receipt. > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > -------------- next part -------------- An HTML attachment was scrubbed... URL: From Wei1.Guo at UTSouthwestern.edu Tue Mar 13 03:06:34 2018 From: Wei1.Guo at UTSouthwestern.edu (Wei Guo) Date: Tue, 13 Mar 2018 03:06:34 +0000 Subject: [gpfsug-discuss] RDMA read error IBV_WC_RETRY_EXC_ERR (Philipp Helo Rehs) Message-ID: <7b8dd0540c4542668f24c1a20c7aee76@SWMS13MAIL10.swmed.org> Hi, Philipp, FYI. We had exactly the same IBV_WC_RETRY_EXC_ERR error message in our gpfs client log along with other client error kernel: ib0: ipoib_cm_handle_tx_wc_rss: failed cm send event (status=12, wrid=83 vend_err 81) in the syslog. The root cause was a bad IB cable connecting a leaf switch to the core switch where the client used as route. 
When we changed a new cable, the problem was solved and no more errors. We don't really have ipoib setup. The problem might be different from yours, but does the error message suggest that when your gpfs daemon tries to use mlx5_1, the packets are discarded so no connection? Did you do an IB bonding? Wei Guo HPC Administrator UTSW Message: 1 Date: Mon, 12 Mar 2018 21:09:14 +0100 From: Philipp Helo Rehs To: gpfsug-discuss at spectrumscale.org Subject: [gpfsug-discuss] RDMA read error IBV_WC_RETRY_EXC_ERR Message-ID: <4c6f6655-5712-663e-d551-375a42d562d8 at uni-duesseldorf.de> Content-Type: text/plain; charset=utf-8 Hello, I am reading your mailing-list since some weeks and I am quiete impressed about the knowledge and shared information here. We have a gpfs cluster with 4 nsds and 120 clients on Infiniband. Our NSD-Server have two infiniband ports on seperate cards mlx5_0 and mlx5_1. We have RDMA-CM enabled and ipv6 enabled on all nodes. We have added an IPoIB IP to all interfaces. But when we enable the second interface we get the following error from all nodes: 2018-03-12_20:49:38.923+0100: [E] VERBS RDMA closed connection to 10.100.0.83 (hilbert83-ib) on mlx5_1 port 1 fabnum 0 due to RDMA read error IBV_WC_RETRY_EXC_ERR index 45 2018-03-12_20:49:38.923+0100: [E] VERBS RDMA rdma read error IBV_WC_RETRY_EXC_ERR to 10.100.0.129 (hilbert129-ib) on mlx5_1 port 1 fabnum 0 vendor_err 129 2018-03-12_20:49:38.923+0100: [E] VERBS RDMA closed connection to 10.100.0.129 (hilbert129-ib) on mlx5_1 port 1 fabnum 0 due to RDMA read error IBV_WC_RETRY_EXC_ERR index 31 2018-03-12_20:49:38.923+0100: [E] VERBS RDMA rdma read error IBV_WC_RETRY_EXC_ERR to 10.100.0.134 (hilbert134-ib) on mlx5_1 port 1 fabnum 0 vendor_err 129 I have read that this issue can happen when verbsRdmasPerConnection is to low. We tried to increase the value and it got better but the problem is not fixed. Current config: minReleaseLevel 4.2.3.0 maxblocksize 16m cipherList AUTHONLY cesSharedRoot /ces ccrEnabled yes failureDetectionTime 40 leaseRecoveryWait 40 [hilbert1-ib,hilbert2-ib] worker1Threads 256 maxReceiverThreads 256 [common] tiebreakerDisks vd3;vd5;vd7 minQuorumNodes 2 verbsLibName libibverbs.so.1 verbsRdma enable verbsRdmasPerNode 256 verbsRdmaSend no scatterBufferSize 262144 pagepool 16g verbsPorts mlx4_0/1 [nsdNodes] verbsPorts mlx5_0/1 mlx5_1/1 [hilbert200-ib,hilbert201-ib,hilbert202-ib,hilbert203-ib,hilbert204-ib,hilbert205-ib,hilbert206-ib] verbsPorts mlx4_0/1 mlx4_1/1 [common] maxMBpS 11200 [common] verbsRdmaCm enable verbsRdmasPerConnection 14 adminMode central Kind regards Philipp Rehs --------------------------- Zentrum f?r Informations- und Medientechnologie Kompetenzzentrum f?r wissenschaftliches Rechnen und Speichern Heinrich-Heine-Universit?t D?sseldorf Universit?tsstr. 1 Raum 25.41.00.51 40225 D?sseldorf / Germany Tel: +49-211-81-15557 ________________________________ UT Southwestern Medical Center The future of medicine, today. From aaron.s.knister at nasa.gov Tue Mar 13 04:49:33 2018 From: aaron.s.knister at nasa.gov (Aaron Knister) Date: Tue, 13 Mar 2018 00:49:33 -0400 Subject: [gpfsug-discuss] spontaneous tracing? In-Reply-To: References: <3925216a-3146-357a-9aa0-0acc84bdb7b2@nasa.gov> Message-ID: Thanks! I admit, I'm confused because "/usr/lpp/mmfs/bin/mmcommon notifyOverload" does in fact start tracing for me on one of our clusters (technically 2, one in dev, one in prod). It did *not* start it on another test cluster. It looks to me like the difference is the mmsdrservport settings. 
On clusters where it's set to 0 tracing *does* start. On clusters where it's set to the default of 1191 (didn't try any other value) tracing *does not* start. I can toggle the behavior by changing the value of mmsdrservport back and forth. I do have a PMR open for this so I'll follow up there too. Thanks again for the help. -Aaron On 3/12/18 11:13 AM, IBM Spectrum Scale wrote: > /usr/lpp/mmfs/bin/mmcommon notifyOverload will not cause tracing to be > started. ?One can verify that using the underlying command being called > as shown in the following example with /tmp/n containing node names one > each line that will get the notification and the IP address being the > file system manager from which the command is issued. > > */usr/lpp/mmfs/bin/mmsdrcli notifyOverload /tmp/n 1191 192.168.117.131 3 8* > > The only case that deadlock detection code will initiate tracing is that > debugDataControl is set to "heavy" and tracing is not started. Then on > deadlock detection tracing is turned on for 20 seconds and turned off. > > That can be tested using command like > */usr/lpp/mmfs/bin/mmsdrcli notifyDeadlock /tmp/n 1191 192.168.117.131 3 8* > > And then mmfs.log will tell you what's going on. That's not a silent action. > > *2018-03-12_10:16:11.243-0400: [N] sdrServ: Received deadlock > notification from 192.168.117.131* > *2018-03-12_10:16:11.243-0400: [N] GPFS will attempt to collect debug > data on this node.* > *2018-03-12_10:16:11.953-0400: [I] Tracing in overwrite mode <== tracing > started* > *Trace started: Wait 20 seconds before cut and stop trace* > *2018-03-12_10:16:37.147-0400: [I] Tracing disabled <== tracing stopped > 20 seconds later* > *mmtrace: move /tmp/mmfs/lxtrace.trc.c69bc2xn01.cpu0 > /tmp/mmfs/trcfile.2018-03-12_10.16.11.2982.deadlock.c69bc2xn01.cpu0* > *mmtrace: formatting > /tmp/mmfs/trcfile.2018-03-12_10.16.11.2982.deadlock.c69bc2xn01 to > /tmp/mmfs/trcrpt.2018-03-12_10.16.11.2982.deadlock.c69bc2xn01.gz* > > > What's odd is there are no log events to indicate an overload occurred. > > Overload msg is only seen in mmfs.log when debugDataControl is "heavy". > mmdiag --deadlock shows overload related info starting from 4.2.3. > > *# mmdiag --deadlock* > > *=== mmdiag: deadlock ===* > > *Effective deadlock detection threshold on c69bc2xn01 is 1800 seconds* > *Effective deadlock detection threshold on c69bc2xn01 is 360 seconds for > short waiters* > > *Cluster c69bc2xn01.gpfs.net is overloaded. The overload index on > c69bc2xn01 is 0.01812 <==* > > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > -- Aaron Knister NASA Center for Climate Simulation (Code 606.2) Goddard Space Flight Center (301) 286-2776 From john.hearns at asml.com Tue Mar 13 10:37:43 2018 From: john.hearns at asml.com (John Hearns) Date: Tue, 13 Mar 2018 10:37:43 +0000 Subject: [gpfsug-discuss] Preferred NSD In-Reply-To: References: <20180312145105.yb24bzo6chlpthwm@ics.muni.cz> Message-ID: Lukas, It looks like you are proposing a setup which uses your compute servers as storage servers also? * I'm thinking about the following setup: ~ 60 nodes, each with two enterprise NVMe SSDs, FDR IB interconnected There is nothing wrong with this concept, for instance see https://www.beegfs.io/wiki/BeeOND I have an NVMe filesystem which uses 60 drives, but there are 10 servers. You should look at "failure zones" also. 
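As a rough sketch of how failure zones can be expressed in Spectrum Scale (the node and device names below are invented, and this is an illustration rather than a tested configuration), each local NVMe device becomes an NSD and the NSDs served by nodes in the same zone share a failure group, so that replicas are forced into different zones:

%nsd: nsd=node01_nvme0 device=/dev/nvme0n1 servers=node01 usage=dataAndMetadata failureGroup=1
%nsd: nsd=node02_nvme0 device=/dev/nvme0n1 servers=node02 usage=dataAndMetadata failureGroup=1
%nsd: nsd=node31_nvme0 device=/dev/nvme0n1 servers=node31 usage=dataAndMetadata failureGroup=2
%nsd: nsd=node32_nvme0 device=/dev/nvme0n1 servers=node32 usage=dataAndMetadata failureGroup=2

With data and metadata replication of two, GPFS places the copies in different failure groups, so losing several nodes inside one zone should not make data unavailable; FPO setups refine this further by using topology vectors (rack,position,node) as failure groups.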
From: gpfsug-discuss-bounces at spectrumscale.org [mailto:gpfsug-discuss-bounces at spectrumscale.org] On Behalf Of Knister, Aaron S. (GSFC-606.2)[COMPUTER SCIENCE CORP] Sent: Monday, March 12, 2018 4:14 PM To: gpfsug main discussion list Subject: Re: [gpfsug-discuss] Preferred NSD Hi Lukas, Check out FPO mode. That mimics Hadoop's data placement features. You can have up to 3 replicas both data and metadata but still the downside, though, as you say is the wrong node failures will take your cluster down. You might want to check out something like Excelero's NVMesh (note: not an endorsement since I can't give such things) which can create logical volumes across all your NVMe drives. The product has erasure coding on their roadmap. I'm not sure if they've released that feature yet but in theory it will give better fault tolerance *and* you'll get more efficient usage of your SSDs. I'm sure there are other ways to skin this cat too. -Aaron On March 12, 2018 at 10:59:35 EDT, Lukas Hejtmanek > wrote: Hello, I'm thinking about the following setup: ~ 60 nodes, each with two enterprise NVMe SSDs, FDR IB interconnected I would like to setup shared scratch area using GPFS and those NVMe SSDs. Each SSDs as on NSD. I don't think like 5 or more data/metadata replicas are practical here. On the other hand, multiple node failures is something really expected. Is there a way to instrument that local NSD is strongly preferred to store data? I.e. node failure most probably does not result in unavailable data for the other nodes? Or is there any other recommendation/solution to build shared scratch with GPFS in such setup? (Do not do it including.) -- Luk?? Hejtm?nek _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss -- The information contained in this communication and any attachments is confidential and may be privileged, and is for the sole use of the intended recipient(s). Any unauthorized review, use, disclosure or distribution is prohibited. Unless explicitly stated otherwise in the body of this communication or the attachment thereto (if any), the information is provided on an AS-IS basis without any express or implied warranties or liabilities. To the extent you are relying on this information, you are doing so at your own risk. If you are not the intended recipient, please notify the sender immediately by replying to this message and destroy all copies of this message and any attachments. Neither the sender nor the company/group of companies he or she represents shall be liable for the proper and complete transmission of the information contained in this communication, or for any delay in its receipt. -------------- next part -------------- An HTML attachment was scrubbed... URL: From xhejtman at ics.muni.cz Tue Mar 13 14:16:30 2018 From: xhejtman at ics.muni.cz (Lukas Hejtmanek) Date: Tue, 13 Mar 2018 15:16:30 +0100 Subject: [gpfsug-discuss] Preferred NSD In-Reply-To: References: <20180312145105.yb24bzo6chlpthwm@ics.muni.cz> Message-ID: <20180313141630.k2lqvcndz7mrakgs@ics.muni.cz> On Tue, Mar 13, 2018 at 10:37:43AM +0000, John Hearns wrote: > Lukas, > It looks like you are proposing a setup which uses your compute servers as storage servers also? yes, exactly. I would like to utilise NVMe SSDs that are in every compute servers.. Using them as a shared scratch area with GPFS is one of the options. 
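One way such a shared scratch is often put together is GPFS FPO, which was suggested earlier in this thread. Purely as an untested sketch with invented names (the stanza file, pool name and mount point are placeholders), the pool stanza enables write affinity so that the first replica of newly written data lands on the writing node's own NVMe NSDs, and the file system is created with two data and two metadata replicas:

%pool: pool=datapool blockSize=1M layoutMap=cluster allowWriteAffinity=yes writeAffinityDepth=1 blockGroupFactor=128

mmcrfs scratchfs -F nvme_nsds.stanza -m 2 -M 3 -r 2 -R 3 -A yes -T /scratch

Reads of the local copy then avoid the network, but two replicas halve the usable capacity and every write still sends a second copy across the interconnect, which is part of the trade-off discussed in the rest of this thread.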
> > * I'm thinking about the following setup: > ~ 60 nodes, each with two enterprise NVMe SSDs, FDR IB interconnected > > There is nothing wrong with this concept, for instance see > https://www.beegfs.io/wiki/BeeOND > > I have an NVMe filesystem which uses 60 drives, but there are 10 servers. > You should look at "failure zones" also. you still need the storage servers and local SSDs to use only for caching, do I understand correctly? > > From: gpfsug-discuss-bounces at spectrumscale.org [mailto:gpfsug-discuss-bounces at spectrumscale.org] On Behalf Of Knister, Aaron S. (GSFC-606.2)[COMPUTER SCIENCE CORP] > Sent: Monday, March 12, 2018 4:14 PM > To: gpfsug main discussion list > Subject: Re: [gpfsug-discuss] Preferred NSD > > Hi Lukas, > > Check out FPO mode. That mimics Hadoop's data placement features. You can have up to 3 replicas both data and metadata but still the downside, though, as you say is the wrong node failures will take your cluster down. > > You might want to check out something like Excelero's NVMesh (note: not an endorsement since I can't give such things) which can create logical volumes across all your NVMe drives. The product has erasure coding on their roadmap. I'm not sure if they've released that feature yet but in theory it will give better fault tolerance *and* you'll get more efficient usage of your SSDs. > > I'm sure there are other ways to skin this cat too. > > -Aaron > > > > On March 12, 2018 at 10:59:35 EDT, Lukas Hejtmanek > wrote: > Hello, > > I'm thinking about the following setup: > ~ 60 nodes, each with two enterprise NVMe SSDs, FDR IB interconnected > > I would like to setup shared scratch area using GPFS and those NVMe SSDs. Each > SSDs as on NSD. > > I don't think like 5 or more data/metadata replicas are practical here. On the > other hand, multiple node failures is something really expected. > > Is there a way to instrument that local NSD is strongly preferred to store > data? I.e. node failure most probably does not result in unavailable data for > the other nodes? > > Or is there any other recommendation/solution to build shared scratch with > GPFS in such setup? (Do not do it including.) > > -- > Luk?? Hejtm?nek > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > -- The information contained in this communication and any attachments is confidential and may be privileged, and is for the sole use of the intended recipient(s). Any unauthorized review, use, disclosure or distribution is prohibited. Unless explicitly stated otherwise in the body of this communication or the attachment thereto (if any), the information is provided on an AS-IS basis without any express or implied warranties or liabilities. To the extent you are relying on this information, you are doing so at your own risk. If you are not the intended recipient, please notify the sender immediately by replying to this message and destroy all copies of this message and any attachments. Neither the sender nor the company/group of companies he or she represents shall be liable for the proper and complete transmission of the information contained in this communication, or for any delay in its receipt. > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss -- Luk?? 
Hejtm?nek Linux Administrator only because Full Time Multitasking Ninja is not an official job title From jan.sundermann at kit.edu Tue Mar 13 14:35:36 2018 From: jan.sundermann at kit.edu (Jan Erik Sundermann) Date: Tue, 13 Mar 2018 15:35:36 +0100 Subject: [gpfsug-discuss] Problems with remote mount via routed IB In-Reply-To: References: <471B111F-5DAA-4912-829C-9AA75DCB76FA@kit.edu> Message-ID: Hi John We try to route infiniband traffic. The IP traffic is routed separately. The two clusters we try to connect are configured differently, one with IP over IB the other one with dedicated ethernet adapters. Jan Erik On 02/27/2018 10:17 AM, John Hearns wrote: > Jan Erik, > Can you clarify if you are routing IP traffic between the two Infiniband networks. > Or are you routing Infiniband traffic? > > > If I can be of help I manage an Infiniband network which connects to other IP networks using Mellanox VPI gateways, which proxy arp between IB and Ethernet. But I am not running GPFS traffic over these. > > > > -----Original Message----- > From: gpfsug-discuss-bounces at spectrumscale.org [mailto:gpfsug-discuss-bounces at spectrumscale.org] On Behalf Of Sundermann, Jan Erik (SCC) > Sent: Monday, February 26, 2018 5:39 PM > To: gpfsug-discuss at spectrumscale.org > Subject: [gpfsug-discuss] Problems with remote mount via routed IB > > > Dear all > > we are currently trying to remote mount a file system in a routed Infiniband test setup and face problems with dropped RDMA connections. The setup is the following: > > - Spectrum Scale Cluster 1 is setup on four servers which are connected to the same infiniband network. Additionally they are connected to a fast ethernet providing ip communication in the network 192.168.11.0/24. > > - Spectrum Scale Cluster 2 is setup on four additional servers which are connected to a second infiniband network. These servers have IPs on their IB interfaces in the network 192.168.12.0/24. > > - IP is routed between 192.168.11.0/24 and 192.168.12.0/24 on a dedicated machine. > > - We have a dedicated IB hardware router connected to both IB subnets. > > > We tested that the routing, both IP and IB, is working between the two clusters without problems and that RDMA is working fine both for internal communication inside cluster 1 and cluster 2 > > When trying to remote mount a file system from cluster 1 in cluster 2, RDMA communication is not working as expected. 
Instead we see error messages on the remote host (cluster 2) > > > 2018-02-23_13:48:47.037+0100: [I] VERBS RDMA connecting to 192.168.11.4 (iccn004-gpfs in gpfsstorage.localdomain) on mlx4_0 port 1 fabnum 0 index 2 > 2018-02-23_13:48:49.890+0100: [I] VERBS RDMA connected to 192.168.11.4 (iccn004-gpfs in gpfsstorage.localdomain) on mlx4_0 port 1 fabnum 0 sl 0 index 2 > 2018-02-23_13:48:53.138+0100: [E] VERBS RDMA closed connection to 192.168.11.1 (iccn001-gpfs in gpfsstorage.localdomain) on mlx4_0 port 1 fabnum 0 error 733 index 3 > 2018-02-23_13:48:53.854+0100: [I] VERBS RDMA connecting to 192.168.11.1 (iccn001-gpfs in gpfsstorage.localdomain) on mlx4_0 port 1 fabnum 0 index 3 > 2018-02-23_13:48:54.954+0100: [E] VERBS RDMA closed connection to 192.168.11.3 (iccn003-gpfs in gpfsstorage.localdomain) on mlx4_0 port 1 fabnum 0 error 733 index 1 > 2018-02-23_13:48:55.601+0100: [I] VERBS RDMA connected to 192.168.11.1 (iccn001-gpfs in gpfsstorage.localdomain) on mlx4_0 port 1 fabnum 0 sl 0 index 3 > 2018-02-23_13:48:57.775+0100: [I] VERBS RDMA connecting to 192.168.11.3 (iccn003-gpfs in gpfsstorage.localdomain) on mlx4_0 port 1 fabnum 0 index 1 > 2018-02-23_13:48:59.557+0100: [I] VERBS RDMA connected to 192.168.11.3 (iccn003-gpfs in gpfsstorage.localdomain) on mlx4_0 port 1 fabnum 0 sl 0 index 1 > 2018-02-23_13:48:59.876+0100: [E] VERBS RDMA closed connection to 192.168.11.2 (iccn002-gpfs in gpfsstorage.localdomain) on mlx4_0 port 1 fabnum 0 error 733 index 0 > 2018-02-23_13:49:02.020+0100: [I] VERBS RDMA connecting to 192.168.11.2 (iccn002-gpfs in gpfsstorage.localdomain) on mlx4_0 port 1 fabnum 0 index 0 > 2018-02-23_13:49:03.477+0100: [I] VERBS RDMA connected to 192.168.11.2 (iccn002-gpfs in gpfsstorage.localdomain) on mlx4_0 port 1 fabnum 0 sl 0 index 0 > 2018-02-23_13:49:05.119+0100: [E] VERBS RDMA closed connection to 192.168.11.4 (iccn004-gpfs in gpfsstorage.localdomain) on mlx4_0 port 1 fabnum 0 error 733 index 2 > 2018-02-23_13:49:06.191+0100: [I] VERBS RDMA connecting to 192.168.11.4 (iccn004-gpfs in gpfsstorage.localdomain) on mlx4_0 port 1 fabnum 0 index 2 > 2018-02-23_13:49:06.548+0100: [I] VERBS RDMA connected to 192.168.11.4 (iccn004-gpfs in gpfsstorage.localdomain) on mlx4_0 port 1 fabnum 0 sl 0 index 2 > 2018-02-23_13:49:11.578+0100: [E] VERBS RDMA closed connection to 192.168.11.1 (iccn001-gpfs in gpfsstorage.localdomain) on mlx4_0 port 1 fabnum 0 error 733 index 3 > 2018-02-23_13:49:11.937+0100: [I] VERBS RDMA connecting to 192.168.11.1 (iccn001-gpfs in gpfsstorage.localdomain) on mlx4_0 port 1 fabnum 0 index 3 > 2018-02-23_13:49:11.939+0100: [I] VERBS RDMA connected to 192.168.11.1 (iccn001-gpfs in gpfsstorage.localdomain) on mlx4_0 port 1 fabnum 0 sl 0 index 3 > > > and in the cluster with the file system (cluster 1) > > 2018-02-23_13:47:36.112+0100: [E] VERBS RDMA rdma read error IBV_WC_RETRY_EXC_ERR to 192.168.12.5 (iccn005-ib in gpfsremoteclients.localdomain) on mlx4_0 port 1 fabnum 0 vendor_err 129 > 2018-02-23_13:47:36.112+0100: [E] VERBS RDMA closed connection to 192.168.12.5 (iccn005-ib in gpfsremoteclients.localdomain) on mlx4_0 port 1 fabnum 0 due to RDMA read error IBV_WC_RETRY_EXC_ERR index 3 > 2018-02-23_13:47:47.161+0100: [I] VERBS RDMA accepted and connected to 192.168.12.5 (iccn005-ib in gpfsremoteclients.localdomain) on mlx4_0 port 1 fabnum 0 sl 0 index 3 > 2018-02-23_13:48:04.317+0100: [E] VERBS RDMA rdma read error IBV_WC_RETRY_EXC_ERR to 192.168.12.5 (iccn005-ib in gpfsremoteclients.localdomain) on mlx4_0 port 1 fabnum 0 vendor_err 129 > 
2018-02-23_13:48:04.317+0100: [E] VERBS RDMA closed connection to 192.168.12.5 (iccn005-ib in gpfsremoteclients.localdomain) on mlx4_0 port 1 fabnum 0 due to RDMA read error IBV_WC_RETRY_EXC_ERR index 3 > 2018-02-23_13:48:11.560+0100: [I] VERBS RDMA accepted and connected to 192.168.12.5 (iccn005-ib in gpfsremoteclients.localdomain) on mlx4_0 port 1 fabnum 0 sl 0 index 3 > 2018-02-23_13:48:32.523+0100: [E] VERBS RDMA rdma read error IBV_WC_RETRY_EXC_ERR to 192.168.12.5 (iccn005-ib in gpfsremoteclients.localdomain) on mlx4_0 port 1 fabnum 0 vendor_err 129 > 2018-02-23_13:48:32.523+0100: [E] VERBS RDMA closed connection to 192.168.12.5 (iccn005-ib in gpfsremoteclients.localdomain) on mlx4_0 port 1 fabnum 0 due to RDMA read error IBV_WC_RETRY_EXC_ERR index 3 > 2018-02-23_13:48:35.398+0100: [I] VERBS RDMA accepted and connected to 192.168.12.5 (iccn005-ib in gpfsremoteclients.localdomain) on mlx4_0 port 1 fabnum 0 sl 0 index 3 > 2018-02-23_13:48:53.135+0100: [E] VERBS RDMA rdma read error IBV_WC_RETRY_EXC_ERR to 192.168.12.5 (iccn005-ib in gpfsremoteclients.localdomain) on mlx4_0 port 1 fabnum 0 vendor_err 129 > 2018-02-23_13:48:53.135+0100: [E] VERBS RDMA closed connection to 192.168.12.5 (iccn005-ib in gpfsremoteclients.localdomain) on mlx4_0 port 1 fabnum 0 due to RDMA read error IBV_WC_RETRY_EXC_ERR index 3 > 2018-02-23_13:48:55.600+0100: [I] VERBS RDMA accepted and connected to 192.168.12.5 (iccn005-ib in gpfsremoteclients.localdomain) on mlx4_0 port 1 fabnum 0 sl 0 index 3 > 2018-02-23_13:49:11.577+0100: [E] VERBS RDMA rdma read error IBV_WC_RETRY_EXC_ERR to 192.168.12.5 (iccn005-ib in gpfsremoteclients.localdomain) on mlx4_0 port 1 fabnum 0 vendor_err 129 > 2018-02-23_13:49:11.577+0100: [E] VERBS RDMA closed connection to 192.168.12.5 (iccn005-ib in gpfsremoteclients.localdomain) on mlx4_0 port 1 fabnum 0 due to RDMA read error IBV_WC_RETRY_EXC_ERR index 3 > 2018-02-23_13:49:11.939+0100: [I] VERBS RDMA accepted and connected to 192.168.12.5 (iccn005-ib in gpfsremoteclients.localdomain) on mlx4_0 port 1 fabnum 0 sl 0 index 3 > > > > Any advice on how to configure the setup in a way that would allow the remote mount via routed IB would be very appreciated. > > > Thank you and best regards > Jan Erik > > > -- The information contained in this communication and any attachments is confidential and may be privileged, and is for the sole use of the intended recipient(s). Any unauthorized review, use, disclosure or distribution is prohibited. Unless explicitly stated otherwise in the body of this communication or the attachment thereto (if any), the information is provided on an AS-IS basis without any express or implied warranties or liabilities. To the extent you are relying on this information, you are doing so at your own risk. If you are not the intended recipient, please notify the sender immediately by replying to this message and destroy all copies of this message and any attachments. Neither the sender nor the company/group of companies he or she represents shall be liable for the proper and complete transmission of the information contained in this communication, or for any delay in its receipt. 
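As background for anyone reproducing this setup, the remote mount that fails above sits on top of the usual multi-cluster configuration, which is established roughly as follows (a generic, untested outline; cluster names, key paths and file system names are placeholders, while the contact node names are taken from the logs above):

# on both clusters
mmauth genkey new
# exchange the resulting /var/mmfs/ssl/id_rsa.pub files between the clusters

# on the owning cluster (cluster 1)
mmauth add cluster2.example -k /tmp/cluster2_id_rsa.pub
mmauth grant cluster2.example -f fs1

# on the accessing cluster (cluster 2)
mmremotecluster add cluster1.example -n iccn001-gpfs,iccn002-gpfs -k /tmp/cluster1_id_rsa.pub
mmremotefs add fs1_remote -f fs1 -C cluster1.example -T /gpfs/fs1_remote

The TCP path between the daemon networks of the two clusters has to work for this part; the verbs/RDMA problems discussed in this thread only appear once that is in place.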
> _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > -- Karlsruhe Institute of Technology (KIT) Steinbuch Centre for Computing (SCC) Jan Erik Sundermann Hermann-von-Helmholtz-Platz 1, Building 449, Room 226 D-76344 Eggenstein-Leopoldshafen Tel: +49 721 608 26191 Email: jan.sundermann at kit.edu www.scc.kit.edu KIT – The Research University in the Helmholtz Association Since 2010, KIT has been certified as a family-friendly university. From Robert.Oesterlin at nuance.com Tue Mar 13 14:42:24 2018 From: Robert.Oesterlin at nuance.com (Oesterlin, Robert) Date: Tue, 13 Mar 2018 14:42:24 +0000 Subject: [gpfsug-discuss] SSUG USA Spring Meeting - Registration and call for speakers is now open! Message-ID: <1289B944-B4F5-40E8-861C-33423B318457@nuance.com> The registration for the Spring meeting of the SSUG-USA is now open. You can register here: https://www.eventbrite.com/e/spectrum-scale-gpfs-user-group-us-spring-2018-meeting-tickets-43662759489 DATE AND TIME Wed, May 16, 2018, 9:00 AM – Thu, May 17, 2018, 5:00 PM EDT LOCATION IBM Cambridge Innovation Center One Rogers Street Cambridge, MA 02142-1203 Please note that we have limited meeting space, so please register only if you're sure you can attend. A detailed agenda will be published in the coming weeks. If you are interested in presenting, please contact me. I do have several speakers lined up already, but we can use a few more. Bob Oesterlin Sr Principal Storage Engineer, Nuance From jan.sundermann at kit.edu Tue Mar 13 15:24:13 2018 From: jan.sundermann at kit.edu (Jan Erik Sundermann) Date: Tue, 13 Mar 2018 16:24:13 +0100 Subject: [gpfsug-discuss] Problems with remote mount via routed IB In-Reply-To: References: <471B111F-5DAA-4912-829C-9AA75DCB76FA@kit.edu> Message-ID: Hello Zachary We are currently changing our setup to have IP over IB on all machines to be able to enable verbsRdmaCm. According to Mellanox (https://community.mellanox.com/docs/DOC-2384) ibacm requires pre-populated caches to be distributed to all end hosts with the mapping of IP to the routable GIDs (of both IB subnets). Was this also required in your successful deployment? Best Jan Erik On 03/12/2018 11:10 PM, Zachary Mance wrote: > Since I am testing out remote mounting with EDR IB routers, I'll add to > the discussion. > > In my lab environment I was seeing the same rdma connections being > established and then disconnected shortly after. The remote filesystem > would eventually mount on the clients, but it took quite a while > (~2 mins). Even after mounting, accessing files or any metadata > operations would take a while to execute, but eventually it happened. > > After enabling verbsRdmaCm, everything mounted just fine and in a timely > manner. Spectrum Scale was using the librdmacm.so library. > > I would first double check that you have both clusters able to talk to > each other on their IPoIB address, then make sure you enable verbsRdmaCm > on both clusters.
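As a concrete sketch of that advice (the commands are standard Spectrum Scale administration commands, but treat the exact sequence as illustrative rather than a verified procedure): check plain IPoIB reachability across the router first, then enable RDMA CM on both clusters and restart the daemons so that it takes effect.

# from a node in cluster 1 (192.168.11.0/24) towards cluster 2, and in the reverse direction
ping -c 3 192.168.12.5

# on each cluster
mmchconfig verbsRdmaCm=enable
mmshutdown -a && mmstartup -a     # or restart only the affected nodes with -N
mmlsconfig | grep -i verbs

Earlier in the thread it was noted that verbsRdmaCm only takes effect after a restart of clients and servers in both clusters, which is what the mmshutdown/mmstartup step is for.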
> > > --------------------------------------------------------------------------------------------------------------- > Zach Mance zmance at ucar.edu ?(303) 497-1883 > HPC Data Infrastructure Group?/ CISL / NCAR > --------------------------------------------------------------------------------------------------------------- > > > On Thu, Mar 1, 2018 at 1:41 AM, John Hearns > wrote: > > In reply to Stuart, > our setup is entirely Infiniband. We boot and install over IB, and > rely heavily on IP over Infiniband. > > As for users being 'confused' due to multiple IPs, I would > appreciate some more depth on that one. > Sure, all batch systems are sensitive to hostnames (as I know to my > cost!) but once you get that straightened out why should users care? > I am not being aggressive, just keen to find out more. > > > > -----Original Message----- > From: gpfsug-discuss-bounces at spectrumscale.org > > [mailto:gpfsug-discuss-bounces at spectrumscale.org > ] On Behalf Of > Stuart Barkley > Sent: Wednesday, February 28, 2018 6:50 PM > To: gpfsug main discussion list > > Subject: Re: [gpfsug-discuss] Problems with remote mount via routed IB > > The problem with CM is that it seems to require configuring IP over > Infiniband. > > I'm rather strongly opposed to IP over IB.? We did run IPoIB years > ago, but pulled it out of our environment as adding unneeded > complexity.? It requires provisioning IP addresses across the > Infiniband infrastructure and possibly adding routers to other > portions of the IP infrastructure.? It was also confusing some users > due to multiple IPs on the compute infrastructure. > > We have recently been in discussions with a vendor about their > support for GPFS over IB and they kept directing us to using CM > (which still didn't work).? CM wasn't necessary once we found out > about the actual problem (we needed the undocumented > verbsRdmaUseGidIndexZero configuration option among other things due > to their use of SR-IOV based virtual IB interfaces). > > We don't use routed Infiniband and it might be that CM and IPoIB is > required for IB routing, but I doubt it.? It sounds like the OP is > keeping IB and IP infrastructure separate. > > Stuart Barkley > > On Mon, 26 Feb 2018 at 14:16 -0000, Aaron Knister wrote: > > > Date: Mon, 26 Feb 2018 14:16:34 > > From: Aaron Knister > > > Reply-To: gpfsug main discussion list > > > > > To: gpfsug-discuss at spectrumscale.org > > > Subject: Re: [gpfsug-discuss] Problems with remote mount via > routed IB > > > > Hi Jan Erik, > > > > It was my understanding that the IB hardware router required RDMA > CM to work. > > By default GPFS doesn't use the RDMA Connection Manager but it can be > > enabled (e.g. verbsRdmaCm=enable). I think this requires a restart on > > clients/servers (in both clusters) to take effect. Maybe someone else > > on the list can comment in more detail-- I've been told folks have > > successfully deployed IB routers with GPFS. > > > > -Aaron > > > > On 2/26/18 11:38 AM, Sundermann, Jan Erik (SCC) wrote: > > > > > > Dear all > > > > > > we are currently trying to remote mount a file system in a routed > > > Infiniband test setup and face problems with dropped RDMA > > > connections. The setup is the > > > following: > > > > > > - Spectrum Scale Cluster 1 is setup on four servers which are > > > connected to the same infiniband network. Additionally they are > > > connected to a fast ethernet providing ip communication in the > network 192.168.11.0/24 . 
> > > > > > - Spectrum Scale Cluster 2 is setup on four additional servers > which > > > are connected to a second infiniband network. These servers > have IPs > > > on their IB interfaces in the network 192.168.12.0/24 > . > > > > > > - IP is routed between 192.168.11.0/24 > and 192.168.12.0/24 on a > > > dedicated machine. > > > > > > - We have a dedicated IB hardware router connected to both IB > subnets. > > > > > > > > > We tested that the routing, both IP and IB, is working between the > > > two clusters without problems and that RDMA is working fine > both for > > > internal communication inside cluster 1 and cluster 2 > > > > > > When trying to remote mount a file system from cluster 1 in cluster > > > 2, RDMA communication is not working as expected. Instead we see > > > error messages on the remote host (cluster 2) > > > > > > > > > 2018-02-23_13:48:47.037+0100: [I] VERBS RDMA connecting to > > > 192.168.11.4 (iccn004-gpfs in gpfsstorage.localdomain) on mlx4_0 > > > port 1 fabnum 0 index 2 > > > 2018-02-23_13:48:49.890+0100: [I] VERBS RDMA connected to > > > 192.168.11.4 (iccn004-gpfs in gpfsstorage.localdomain) on mlx4_0 > > > port 1 fabnum 0 sl 0 index 2 > > > 2018-02-23_13:48:53.138+0100: [E] VERBS RDMA closed connection to > > > 192.168.11.1 (iccn001-gpfs in gpfsstorage.localdomain) on mlx4_0 > > > port 1 fabnum 0 error 733 index 3 > > > 2018-02-23_13:48:53.854+0100: [I] VERBS RDMA connecting to > > > 192.168.11.1 (iccn001-gpfs in gpfsstorage.localdomain) on mlx4_0 > > > port 1 fabnum 0 index 3 > > > 2018-02-23_13:48:54.954+0100: [E] VERBS RDMA closed connection to > > > 192.168.11.3 (iccn003-gpfs in gpfsstorage.localdomain) on mlx4_0 > > > port 1 fabnum 0 error 733 index 1 > > > 2018-02-23_13:48:55.601+0100: [I] VERBS RDMA connected to > > > 192.168.11.1 (iccn001-gpfs in gpfsstorage.localdomain) on mlx4_0 > > > port 1 fabnum 0 sl 0 index 3 > > > 2018-02-23_13:48:57.775+0100: [I] VERBS RDMA connecting to > > > 192.168.11.3 (iccn003-gpfs in gpfsstorage.localdomain) on mlx4_0 > > > port 1 fabnum 0 index 1 > > > 2018-02-23_13:48:59.557+0100: [I] VERBS RDMA connected to > > > 192.168.11.3 (iccn003-gpfs in gpfsstorage.localdomain) on mlx4_0 > > > port 1 fabnum 0 sl 0 index 1 > > > 2018-02-23_13:48:59.876+0100: [E] VERBS RDMA closed connection to > > > 192.168.11.2 (iccn002-gpfs in gpfsstorage.localdomain) on mlx4_0 > > > port 1 fabnum 0 error 733 index 0 > > > 2018-02-23_13:49:02.020+0100: [I] VERBS RDMA connecting to > > > 192.168.11.2 (iccn002-gpfs in gpfsstorage.localdomain) on mlx4_0 > > > port 1 fabnum 0 index 0 > > > 2018-02-23_13:49:03.477+0100: [I] VERBS RDMA connected to > > > 192.168.11.2 (iccn002-gpfs in gpfsstorage.localdomain) on mlx4_0 > > > port 1 fabnum 0 sl 0 index 0 > > > 2018-02-23_13:49:05.119+0100: [E] VERBS RDMA closed connection to > > > 192.168.11.4 (iccn004-gpfs in gpfsstorage.localdomain) on mlx4_0 > > > port 1 fabnum 0 error 733 index 2 > > > 2018-02-23_13:49:06.191+0100: [I] VERBS RDMA connecting to > > > 192.168.11.4 (iccn004-gpfs in gpfsstorage.localdomain) on mlx4_0 > > > port 1 fabnum 0 index 2 > > > 2018-02-23_13:49:06.548+0100: [I] VERBS RDMA connected to > > > 192.168.11.4 (iccn004-gpfs in gpfsstorage.localdomain) on mlx4_0 > > > port 1 fabnum 0 sl 0 index 2 > > > 2018-02-23_13:49:11.578+0100: [E] VERBS RDMA closed connection to > > > 192.168.11.1 (iccn001-gpfs in gpfsstorage.localdomain) on mlx4_0 > > > port 1 fabnum 0 error 733 index 3 > > > 2018-02-23_13:49:11.937+0100: [I] VERBS RDMA connecting to > > > 192.168.11.1 (iccn001-gpfs in 
gpfsstorage.localdomain) on mlx4_0 > > > port 1 fabnum 0 index 3 > > > 2018-02-23_13:49:11.939+0100: [I] VERBS RDMA connected to > > > 192.168.11.1 (iccn001-gpfs in gpfsstorage.localdomain) on mlx4_0 > > > port 1 fabnum 0 sl 0 index 3 > > > > > > > > > and in the cluster with the file system (cluster 1) > > > > > > 2018-02-23_13:47:36.112+0100: [E] VERBS RDMA rdma read error > > > IBV_WC_RETRY_EXC_ERR to 192.168.12.5 (iccn005-ib in > > > gpfsremoteclients.localdomain) on mlx4_0 port 1 fabnum 0 vendor_err > > > 129 > > > 2018-02-23_13:47:36.112+0100: [E] VERBS RDMA closed connection to > > > 192.168.12.5 (iccn005-ib in gpfsremoteclients.localdomain) on > mlx4_0 > > > port 1 fabnum 0 due to RDMA read error IBV_WC_RETRY_EXC_ERR index 3 > > > 2018-02-23_13:47:47.161+0100: [I] VERBS RDMA accepted and connected > > > to > > > 192.168.12.5 (iccn005-ib in gpfsremoteclients.localdomain) on > mlx4_0 > > > port 1 fabnum 0 sl 0 index 3 > > > 2018-02-23_13:48:04.317+0100: [E] VERBS RDMA rdma read error > > > IBV_WC_RETRY_EXC_ERR to 192.168.12.5 (iccn005-ib in > > > gpfsremoteclients.localdomain) on mlx4_0 port 1 fabnum 0 vendor_err > > > 129 > > > 2018-02-23_13:48:04.317+0100: [E] VERBS RDMA closed connection to > > > 192.168.12.5 (iccn005-ib in gpfsremoteclients.localdomain) on > mlx4_0 > > > port 1 fabnum 0 due to RDMA read error IBV_WC_RETRY_EXC_ERR index 3 > > > 2018-02-23_13:48:11.560+0100: [I] VERBS RDMA accepted and connected > > > to > > > 192.168.12.5 (iccn005-ib in gpfsremoteclients.localdomain) on > mlx4_0 > > > port 1 fabnum 0 sl 0 index 3 > > > 2018-02-23_13:48:32.523+0100: [E] VERBS RDMA rdma read error > > > IBV_WC_RETRY_EXC_ERR to 192.168.12.5 (iccn005-ib in > > > gpfsremoteclients.localdomain) on mlx4_0 port 1 fabnum 0 vendor_err > > > 129 > > > 2018-02-23_13:48:32.523+0100: [E] VERBS RDMA closed connection to > > > 192.168.12.5 (iccn005-ib in gpfsremoteclients.localdomain) on > mlx4_0 > > > port 1 fabnum 0 due to RDMA read error IBV_WC_RETRY_EXC_ERR index 3 > > > 2018-02-23_13:48:35.398+0100: [I] VERBS RDMA accepted and connected > > > to > > > 192.168.12.5 (iccn005-ib in gpfsremoteclients.localdomain) on > mlx4_0 > > > port 1 fabnum 0 sl 0 index 3 > > > 2018-02-23_13:48:53.135+0100: [E] VERBS RDMA rdma read error > > > IBV_WC_RETRY_EXC_ERR to 192.168.12.5 (iccn005-ib in > > > gpfsremoteclients.localdomain) on mlx4_0 port 1 fabnum 0 vendor_err > > > 129 > > > 2018-02-23_13:48:53.135+0100: [E] VERBS RDMA closed connection to > > > 192.168.12.5 (iccn005-ib in gpfsremoteclients.localdomain) on > mlx4_0 > > > port 1 fabnum 0 due to RDMA read error IBV_WC_RETRY_EXC_ERR index 3 > > > 2018-02-23_13:48:55.600+0100: [I] VERBS RDMA accepted and connected > > > to > > > 192.168.12.5 (iccn005-ib in gpfsremoteclients.localdomain) on > mlx4_0 > > > port 1 fabnum 0 sl 0 index 3 > > > 2018-02-23_13:49:11.577+0100: [E] VERBS RDMA rdma read error > > > IBV_WC_RETRY_EXC_ERR to 192.168.12.5 (iccn005-ib in > > > gpfsremoteclients.localdomain) on mlx4_0 port 1 fabnum 0 vendor_err > > > 129 > > > 2018-02-23_13:49:11.577+0100: [E] VERBS RDMA closed connection to > > > 192.168.12.5 (iccn005-ib in gpfsremoteclients.localdomain) on > mlx4_0 > > > port 1 fabnum 0 due to RDMA read error IBV_WC_RETRY_EXC_ERR index 3 > > > 2018-02-23_13:49:11.939+0100: [I] VERBS RDMA accepted and connected > > > to > > > 192.168.12.5 (iccn005-ib in gpfsremoteclients.localdomain) on > mlx4_0 > > > port 1 fabnum 0 sl 0 index 3 > > > > > > > > > > > > Any advice on how to configure the setup in a way that would allow > > > the remote 
mount via routed IB would be very appreciated. > > > > > > Thank you and best regards > > > Jan Erik > > > _______________________________________________ > > > gpfsug-discuss mailing list > > > gpfsug-discuss at spectrumscale.org > > > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > -- > > Aaron Knister > > NASA Center for Climate Simulation (Code 606.2) Goddard Space Flight Center > > (301) 286-2776 > > _______________________________________________ > > gpfsug-discuss mailing list > > gpfsug-discuss at spectrumscale.org > > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > -- > I've never been lost; I was once bewildered for three days, but never lost! -- Daniel Boone > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss -- Karlsruhe Institute of Technology (KIT) Steinbuch Centre for Computing (SCC) Jan Erik Sundermann Hermann-von-Helmholtz-Platz 1, Building 449, Room 226 D-76344 Eggenstein-Leopoldshafen Tel: +49 721 608 26191 Email: jan.sundermann at kit.edu www.scc.kit.edu KIT – The Research University in the Helmholtz Association Since 2010, KIT has been certified as a family-friendly university. -------------- next part -------------- A non-text attachment was scrubbed...
Name: smime.p7s Type: application/pkcs7-signature Size: 5382 bytes Desc: S/MIME Cryptographic Signature URL: From alex at calicolabs.com Tue Mar 13 17:48:21 2018 From: alex at calicolabs.com (Alex Chekholko) Date: Tue, 13 Mar 2018 10:48:21 -0700 Subject: [gpfsug-discuss] Preferred NSD In-Reply-To: <20180313141630.k2lqvcndz7mrakgs@ics.muni.cz> References: <20180312145105.yb24bzo6chlpthwm@ics.muni.cz> <20180313141630.k2lqvcndz7mrakgs@ics.muni.cz> Message-ID: Hi Lukas, I would like to discourage you from building a large distributed clustered filesystem made of many unreliable components. You will need to overprovision your interconnect and will also spend a lot of time in "healing" or "degraded" state. It is typically cheaper to centralize the storage into a subset of nodes and configure those to be more highly available. E.g. of your 60 nodes, take 8 and put all the storage into those and make that a dedicated GPFS cluster with no compute jobs on those nodes. Again, you'll still need really beefy and reliable interconnect to make this work. Stepping back; what is the actual problem you're trying to solve? I have certainly been in that situation before, where the problem is more like: "I have a fixed hardware configuration that I can't change, and I want to try to shoehorn a parallel filesystem onto that." I would recommend looking closer at your actual workloads. If this is a "scratch" filesystem and file access is mostly from one node at a time, it's not very useful to make two additional copies of that data on other nodes, and it will only slow you down. Regards, Alex On Tue, Mar 13, 2018 at 7:16 AM, Lukas Hejtmanek wrote: > On Tue, Mar 13, 2018 at 10:37:43AM +0000, John Hearns wrote: > > Lukas, > > It looks like you are proposing a setup which uses your compute servers > as storage servers also? > > yes, exactly. I would like to utilise NVMe SSDs that are in every compute > servers.. Using them as a shared scratch area with GPFS is one of the > options. > > > > > * I'm thinking about the following setup: > > ~ 60 nodes, each with two enterprise NVMe SSDs, FDR IB interconnected > > > > There is nothing wrong with this concept, for instance see > > https://www.beegfs.io/wiki/BeeOND > > > > I have an NVMe filesystem which uses 60 drives, but there are 10 servers. > > You should look at "failure zones" also. > > you still need the storage servers and local SSDs to use only for caching, > do > I understand correctly? > > > > > From: gpfsug-discuss-bounces at spectrumscale.org [mailto:gpfsug-discuss- > bounces at spectrumscale.org] On Behalf Of Knister, Aaron S. > (GSFC-606.2)[COMPUTER SCIENCE CORP] > > Sent: Monday, March 12, 2018 4:14 PM > > To: gpfsug main discussion list > > Subject: Re: [gpfsug-discuss] Preferred NSD > > > > Hi Lukas, > > > > Check out FPO mode. That mimics Hadoop's data placement features. You > can have up to 3 replicas both data and metadata but still the downside, > though, as you say is the wrong node failures will take your cluster down. > > > > You might want to check out something like Excelero's NVMesh (note: not > an endorsement since I can't give such things) which can create logical > volumes across all your NVMe drives. The product has erasure coding on > their roadmap. I'm not sure if they've released that feature yet but in > theory it will give better fault tolerance *and* you'll get more efficient > usage of your SSDs. > > > > I'm sure there are other ways to skin this cat too. 
> > > > -Aaron > > > > > > > > On March 12, 2018 at 10:59:35 EDT, Lukas Hejtmanek > wrote: > > Hello, > > > > I'm thinking about the following setup: > > ~ 60 nodes, each with two enterprise NVMe SSDs, FDR IB interconnected > > > > I would like to setup shared scratch area using GPFS and those NVMe > SSDs. Each > > SSDs as on NSD. > > > > I don't think like 5 or more data/metadata replicas are practical here. > On the > > other hand, multiple node failures is something really expected. > > > > Is there a way to instrument that local NSD is strongly preferred to > store > > data? I.e. node failure most probably does not result in unavailable > data for > > the other nodes? > > > > Or is there any other recommendation/solution to build shared scratch > with > > GPFS in such setup? (Do not do it including.) > > > > -- > > Luk?? Hejtm?nek > > _______________________________________________ > > gpfsug-discuss mailing list > > gpfsug-discuss at spectrumscale.org > > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > -- The information contained in this communication and any attachments > is confidential and may be privileged, and is for the sole use of the > intended recipient(s). Any unauthorized review, use, disclosure or > distribution is prohibited. Unless explicitly stated otherwise in the body > of this communication or the attachment thereto (if any), the information > is provided on an AS-IS basis without any express or implied warranties or > liabilities. To the extent you are relying on this information, you are > doing so at your own risk. If you are not the intended recipient, please > notify the sender immediately by replying to this message and destroy all > copies of this message and any attachments. Neither the sender nor the > company/group of companies he or she represents shall be liable for the > proper and complete transmission of the information contained in this > communication, or for any delay in its receipt. > > > _______________________________________________ > > gpfsug-discuss mailing list > > gpfsug-discuss at spectrumscale.org > > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > > -- > Luk?? Hejtm?nek > > Linux Administrator only because > Full Time Multitasking Ninja > is not an official job title > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > -------------- next part -------------- An HTML attachment was scrubbed... URL: From zmance at ucar.edu Tue Mar 13 19:38:48 2018 From: zmance at ucar.edu (Zachary Mance) Date: Tue, 13 Mar 2018 13:38:48 -0600 Subject: [gpfsug-discuss] Problems with remote mount via routed IB In-Reply-To: References: <471B111F-5DAA-4912-829C-9AA75DCB76FA@kit.edu> Message-ID: Hi Jan, I am NOT using the pre-populated cache that mellanox refers to in it's documentation. After chatting with support, I don't believe that's necessary anymore (I didn't get a straight answer out of them). For the subnet prefix, make sure to use one from the range 0xfec0000000000000-0xfec000000000001f. 
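For context on where that prefix usually gets set (assuming each IB subnet runs its own OpenSM instance; this is an illustrative fragment, not a complete router configuration), each subnet manager is given a distinct prefix from the routable range mentioned above, for example in /etc/opensm/opensm.conf:

# subnet manager for fabric A
subnet_prefix 0xfec0000000000000

# subnet manager for fabric B
subnet_prefix 0xfec0000000000001

The default prefix 0xfe80000000000000 is link-local, so leaving both fabrics on it makes GIDs ambiguous once an IB router is involved, which is why a distinct routable prefix per subnet matters here.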
--------------------------------------------------------------------------------------------------------------- Zach Mance zmance at ucar.edu (303) 497-1883 HPC Data Infrastructure Group / CISL / NCAR --------------------------------------------------------------------------------------------------------------- On Tue, Mar 13, 2018 at 9:24 AM, Jan Erik Sundermann wrote: > Hello Zachary > > We are currently changing out setup to have IP over IB on all machines to > be able to enable verbsRdmaCm. > > According to Mellanox (https://community.mellanox.com/docs/DOC-2384) > ibacm requires pre-populated caches to be distributed to all end hosts with > the mapping of IP to the routable GIDs (of both IB subnets). Was this also > required in your successful deployment? > > Best > Jan Erik > > > > On 03/12/2018 11:10 PM, Zachary Mance wrote: > >> Since I am testing out remote mounting with EDR IB routers, I'll add to >> the discussion. >> >> In my lab environment I was seeing the same rdma connections being >> established and then disconnected shortly after. The remote filesystem >> would eventually mount on the clients, but it look a quite a while >> (~2mins). Even after mounting, accessing files or any metadata operations >> would take a while to execute, but eventually it happened. >> >> After enabling verbsRdmaCm, everything mounted just fine and in a timely >> manner. Spectrum Scale was using the librdmacm.so library. >> >> I would first double check that you have both clusters able to talk to >> each other on their IPoIB address, then make sure you enable verbsRdmaCm on >> both clusters. >> >> >> ------------------------------------------------------------ >> --------------------------------------------------- >> Zach Mance zmance at ucar.edu (303) 497-1883 >> HPC Data Infrastructure Group / CISL / NCAR >> ------------------------------------------------------------ >> --------------------------------------------------- >> >> On Thu, Mar 1, 2018 at 1:41 AM, John Hearns > > wrote: >> >> In reply to Stuart, >> our setup is entirely Infiniband. We boot and install over IB, and >> rely heavily on IP over Infiniband. >> >> As for users being 'confused' due to multiple IPs, I would >> appreciate some more depth on that one. >> Sure, all batch systems are sensitive to hostnames (as I know to my >> cost!) but once you get that straightened out why should users care? >> I am not being aggressive, just keen to find out more. >> >> >> >> -----Original Message----- >> From: gpfsug-discuss-bounces at spectrumscale.org >> >> [mailto:gpfsug-discuss-bounces at spectrumscale.org >> ] On Behalf Of >> Stuart Barkley >> Sent: Wednesday, February 28, 2018 6:50 PM >> To: gpfsug main discussion list > > >> Subject: Re: [gpfsug-discuss] Problems with remote mount via routed IB >> >> The problem with CM is that it seems to require configuring IP over >> Infiniband. >> >> I'm rather strongly opposed to IP over IB. We did run IPoIB years >> ago, but pulled it out of our environment as adding unneeded >> complexity. It requires provisioning IP addresses across the >> Infiniband infrastructure and possibly adding routers to other >> portions of the IP infrastructure. It was also confusing some users >> due to multiple IPs on the compute infrastructure. >> >> We have recently been in discussions with a vendor about their >> support for GPFS over IB and they kept directing us to using CM >> (which still didn't work). 
CM wasn't necessary once we found out >> about the actual problem (we needed the undocumented >> verbsRdmaUseGidIndexZero configuration option among other things due >> to their use of SR-IOV based virtual IB interfaces). >> >> We don't use routed Infiniband and it might be that CM and IPoIB is >> required for IB routing, but I doubt it. It sounds like the OP is >> keeping IB and IP infrastructure separate. >> >> Stuart Barkley >> >> On Mon, 26 Feb 2018 at 14:16 -0000, Aaron Knister wrote: >> >> > Date: Mon, 26 Feb 2018 14:16:34 >> > From: Aaron Knister > > >> > Reply-To: gpfsug main discussion list >> > > > >> > To: gpfsug-discuss at spectrumscale.org >> >> > Subject: Re: [gpfsug-discuss] Problems with remote mount via >> routed IB >> > >> > Hi Jan Erik, >> > >> > It was my understanding that the IB hardware router required RDMA >> CM to work. >> > By default GPFS doesn't use the RDMA Connection Manager but it can >> be >> > enabled (e.g. verbsRdmaCm=enable). I think this requires a restart >> on >> > clients/servers (in both clusters) to take effect. Maybe someone >> else >> > on the list can comment in more detail-- I've been told folks have >> > successfully deployed IB routers with GPFS. >> > >> > -Aaron >> > >> > On 2/26/18 11:38 AM, Sundermann, Jan Erik (SCC) wrote: >> > > >> > > Dear all >> > > >> > > we are currently trying to remote mount a file system in a routed >> > > Infiniband test setup and face problems with dropped RDMA >> > > connections. The setup is the >> > > following: >> > > >> > > - Spectrum Scale Cluster 1 is setup on four servers which are >> > > connected to the same infiniband network. Additionally they are >> > > connected to a fast ethernet providing ip communication in the >> network 192.168.11.0/24 . >> > > >> > > - Spectrum Scale Cluster 2 is setup on four additional servers >> which >> > > are connected to a second infiniband network. These servers >> have IPs >> > > on their IB interfaces in the network 192.168.12.0/24 >> . >> > > >> > > - IP is routed between 192.168.11.0/24 >> and 192.168.12.0/24 on a >> >> > > dedicated machine. >> > > >> > > - We have a dedicated IB hardware router connected to both IB >> subnets. >> > > >> > > >> > > We tested that the routing, both IP and IB, is working between >> the >> > > two clusters without problems and that RDMA is working fine >> both for >> > > internal communication inside cluster 1 and cluster 2 >> > > >> > > When trying to remote mount a file system from cluster 1 in >> cluster >> > > 2, RDMA communication is not working as expected. 
Instead we see >> > > error messages on the remote host (cluster 2) >> > > >> > > >> > > 2018-02-23_13:48:47.037+0100: [I] VERBS RDMA connecting to >> > > 192.168.11.4 (iccn004-gpfs in gpfsstorage.localdomain) on mlx4_0 >> > > port 1 fabnum 0 index 2 >> > > 2018-02-23_13:48:49.890+0100: [I] VERBS RDMA connected to >> > > 192.168.11.4 (iccn004-gpfs in gpfsstorage.localdomain) on mlx4_0 >> > > port 1 fabnum 0 sl 0 index 2 >> > > 2018-02-23_13:48:53.138+0100: [E] VERBS RDMA closed connection to >> > > 192.168.11.1 (iccn001-gpfs in gpfsstorage.localdomain) on mlx4_0 >> > > port 1 fabnum 0 error 733 index 3 >> > > 2018-02-23_13:48:53.854+0100: [I] VERBS RDMA connecting to >> > > 192.168.11.1 (iccn001-gpfs in gpfsstorage.localdomain) on mlx4_0 >> > > port 1 fabnum 0 index 3 >> > > 2018-02-23_13:48:54.954+0100: [E] VERBS RDMA closed connection to >> > > 192.168.11.3 (iccn003-gpfs in gpfsstorage.localdomain) on mlx4_0 >> > > port 1 fabnum 0 error 733 index 1 >> > > 2018-02-23_13:48:55.601+0100: [I] VERBS RDMA connected to >> > > 192.168.11.1 (iccn001-gpfs in gpfsstorage.localdomain) on mlx4_0 >> > > port 1 fabnum 0 sl 0 index 3 >> > > 2018-02-23_13:48:57.775+0100: [I] VERBS RDMA connecting to >> > > 192.168.11.3 (iccn003-gpfs in gpfsstorage.localdomain) on mlx4_0 >> > > port 1 fabnum 0 index 1 >> > > 2018-02-23_13:48:59.557+0100: [I] VERBS RDMA connected to >> > > 192.168.11.3 (iccn003-gpfs in gpfsstorage.localdomain) on mlx4_0 >> > > port 1 fabnum 0 sl 0 index 1 >> > > 2018-02-23_13:48:59.876+0100: [E] VERBS RDMA closed connection to >> > > 192.168.11.2 (iccn002-gpfs in gpfsstorage.localdomain) on mlx4_0 >> > > port 1 fabnum 0 error 733 index 0 >> > > 2018-02-23_13:49:02.020+0100: [I] VERBS RDMA connecting to >> > > 192.168.11.2 (iccn002-gpfs in gpfsstorage.localdomain) on mlx4_0 >> > > port 1 fabnum 0 index 0 >> > > 2018-02-23_13:49:03.477+0100: [I] VERBS RDMA connected to >> > > 192.168.11.2 (iccn002-gpfs in gpfsstorage.localdomain) on mlx4_0 >> > > port 1 fabnum 0 sl 0 index 0 >> > > 2018-02-23_13:49:05.119+0100: [E] VERBS RDMA closed connection to >> > > 192.168.11.4 (iccn004-gpfs in gpfsstorage.localdomain) on mlx4_0 >> > > port 1 fabnum 0 error 733 index 2 >> > > 2018-02-23_13:49:06.191+0100: [I] VERBS RDMA connecting to >> > > 192.168.11.4 (iccn004-gpfs in gpfsstorage.localdomain) on mlx4_0 >> > > port 1 fabnum 0 index 2 >> > > 2018-02-23_13:49:06.548+0100: [I] VERBS RDMA connected to >> > > 192.168.11.4 (iccn004-gpfs in gpfsstorage.localdomain) on mlx4_0 >> > > port 1 fabnum 0 sl 0 index 2 >> > > 2018-02-23_13:49:11.578+0100: [E] VERBS RDMA closed connection to >> > > 192.168.11.1 (iccn001-gpfs in gpfsstorage.localdomain) on mlx4_0 >> > > port 1 fabnum 0 error 733 index 3 >> > > 2018-02-23_13:49:11.937+0100: [I] VERBS RDMA connecting to >> > > 192.168.11.1 (iccn001-gpfs in gpfsstorage.localdomain) on mlx4_0 >> > > port 1 fabnum 0 index 3 >> > > 2018-02-23_13:49:11.939+0100: [I] VERBS RDMA connected to >> > > 192.168.11.1 (iccn001-gpfs in gpfsstorage.localdomain) on mlx4_0 >> > > port 1 fabnum 0 sl 0 index 3 >> > > >> > > >> > > and in the cluster with the file system (cluster 1) >> > > >> > > 2018-02-23_13:47:36.112+0100: [E] VERBS RDMA rdma read error >> > > IBV_WC_RETRY_EXC_ERR to 192.168.12.5 (iccn005-ib in >> > > gpfsremoteclients.localdomain) on mlx4_0 port 1 fabnum 0 >> vendor_err >> > > 129 >> > > 2018-02-23_13:47:36.112+0100: [E] VERBS RDMA closed connection to >> > > 192.168.12.5 (iccn005-ib in gpfsremoteclients.localdomain) on >> mlx4_0 >> > > port 1 fabnum 0 due to RDMA read 
error IBV_WC_RETRY_EXC_ERR >> index 3 >> > > 2018-02-23_13:47:47.161+0100: [I] VERBS RDMA accepted and >> connected >> > > to >> > > 192.168.12.5 (iccn005-ib in gpfsremoteclients.localdomain) on >> mlx4_0 >> > > port 1 fabnum 0 sl 0 index 3 >> > > 2018-02-23_13:48:04.317+0100: [E] VERBS RDMA rdma read error >> > > IBV_WC_RETRY_EXC_ERR to 192.168.12.5 (iccn005-ib in >> > > gpfsremoteclients.localdomain) on mlx4_0 port 1 fabnum 0 >> vendor_err >> > > 129 >> > > 2018-02-23_13:48:04.317+0100: [E] VERBS RDMA closed connection to >> > > 192.168.12.5 (iccn005-ib in gpfsremoteclients.localdomain) on >> mlx4_0 >> > > port 1 fabnum 0 due to RDMA read error IBV_WC_RETRY_EXC_ERR >> index 3 >> > > 2018-02-23_13:48:11.560+0100: [I] VERBS RDMA accepted and >> connected >> > > to >> > > 192.168.12.5 (iccn005-ib in gpfsremoteclients.localdomain) on >> mlx4_0 >> > > port 1 fabnum 0 sl 0 index 3 >> > > 2018-02-23_13:48:32.523+0100: [E] VERBS RDMA rdma read error >> > > IBV_WC_RETRY_EXC_ERR to 192.168.12.5 (iccn005-ib in >> > > gpfsremoteclients.localdomain) on mlx4_0 port 1 fabnum 0 >> vendor_err >> > > 129 >> > > 2018-02-23_13:48:32.523+0100: [E] VERBS RDMA closed connection to >> > > 192.168.12.5 (iccn005-ib in gpfsremoteclients.localdomain) on >> mlx4_0 >> > > port 1 fabnum 0 due to RDMA read error IBV_WC_RETRY_EXC_ERR >> index 3 >> > > 2018-02-23_13:48:35.398+0100: [I] VERBS RDMA accepted and >> connected >> > > to >> > > 192.168.12.5 (iccn005-ib in gpfsremoteclients.localdomain) on >> mlx4_0 >> > > port 1 fabnum 0 sl 0 index 3 >> > > 2018-02-23_13:48:53.135+0100: [E] VERBS RDMA rdma read error >> > > IBV_WC_RETRY_EXC_ERR to 192.168.12.5 (iccn005-ib in >> > > gpfsremoteclients.localdomain) on mlx4_0 port 1 fabnum 0 >> vendor_err >> > > 129 >> > > 2018-02-23_13:48:53.135+0100: [E] VERBS RDMA closed connection to >> > > 192.168.12.5 (iccn005-ib in gpfsremoteclients.localdomain) on >> mlx4_0 >> > > port 1 fabnum 0 due to RDMA read error IBV_WC_RETRY_EXC_ERR >> index 3 >> > > 2018-02-23_13:48:55.600+0100: [I] VERBS RDMA accepted and >> connected >> > > to >> > > 192.168.12.5 (iccn005-ib in gpfsremoteclients.localdomain) on >> mlx4_0 >> > > port 1 fabnum 0 sl 0 index 3 >> > > 2018-02-23_13:49:11.577+0100: [E] VERBS RDMA rdma read error >> > > IBV_WC_RETRY_EXC_ERR to 192.168.12.5 (iccn005-ib in >> > > gpfsremoteclients.localdomain) on mlx4_0 port 1 fabnum 0 >> vendor_err >> > > 129 >> > > 2018-02-23_13:49:11.577+0100: [E] VERBS RDMA closed connection to >> > > 192.168.12.5 (iccn005-ib in gpfsremoteclients.localdomain) on >> mlx4_0 >> > > port 1 fabnum 0 due to RDMA read error IBV_WC_RETRY_EXC_ERR >> index 3 >> > > 2018-02-23_13:49:11.939+0100: [I] VERBS RDMA accepted and >> connected >> > > to >> > > 192.168.12.5 (iccn005-ib in gpfsremoteclients.localdomain) on >> mlx4_0 >> > > port 1 fabnum 0 sl 0 index 3 >> > > >> > > >> > > >> > > Any advice on how to configure the setup in a way that would >> allow >> > > the remote mount via routed IB would be very appreciated. 
>> > > >> > > >> > > Thank you and best regards >> > > Jan Erik >> > > >> > > >> > > >> > > >> > > _______________________________________________ >> > > gpfsug-discuss mailing list >> > > gpfsug-discuss at spectrumscale.org >> > > >> https://emea01.safelinks.protection.outlook.com/?url=http%3A%2F%2Fgp >> > > >> > > fsug.org >> %2Fmailman%2Flistinfo%2Fgpfsug-discuss&data >> =01%7C01%7Cjohn.h >> > > earns%40asml.com >> %7Ce40045550fc3467dd62808d57ed4d0d9% >> 7Caf73baa8f5944e >> > > >> b2a39d93e96cad61fc%7C1&sdata=v%2F35G6ZnlHFBm%2BfVddvcuraFd9FRChyOSRE >> > > YpqcNNP8%3D&reserved=0 >> > > >> > >> > -- >> > Aaron Knister >> > NASA Center for Climate Simulation (Code 606.2) Goddard Space Flight >> > Center >> > (301) 286-2776 >> > _______________________________________________ >> > gpfsug-discuss mailing list >> > gpfsug-discuss at spectrumscale.org >> > >> https://emea01.safelinks.protection.outlook.com/?url=http% >> 3A%2F%2Fgpfs >> > 3A%2F%2Fgpfs> >> > ug.org >> %2Fmailman%2Flistinfo%2Fgpfsug-discuss&data= >> 01%7C01%7Cjohn.hearn >> > s%40asml.com >> %7Ce40045550fc3467dd62808d57ed4d0d9% >> 7Caf73baa8f5944eb2a39d >> > >> 93e96cad61fc%7C1&sdata=v%2F35G6ZnlHFBm%2BfVddvcuraFd9FRChyOS >> REYpqcNNP8 >> > %3D&reserved=0 >> > >> >> -- >> I've never been lost; I was once bewildered for three days, but >> never lost! >> -- Daniel Boone >> _______________________________________________ >> gpfsug-discuss mailing list >> gpfsug-discuss at spectrumscale.org >> https://emea01.safelinks.protection.outlook.com/?url=http% >> 3A%2F%2Fgpfsug.org%2Fmailman%2Flistinfo%2Fgpfsug-discuss& >> data=01%7C01%7Cjohn.hearns%40asml.com%7Ce40045550fc3467dd >> 62808d57ed4d0d9%7Caf73baa8f5944eb2a39d93e96cad61fc%7C1& >> sdata=v%2F35G6ZnlHFBm%2BfVddvcuraFd9FRChyOSREYpqcNNP8%3D&reserved=0 >> > 3A%2F%2Fgpfsug.org%2Fmailman%2Flistinfo%2Fgpfsug-discuss& >> data=01%7C01%7Cjohn.hearns%40asml.com%7Ce40045550fc3467dd >> 62808d57ed4d0d9%7Caf73baa8f5944eb2a39d93e96cad61fc%7C1& >> sdata=v%2F35G6ZnlHFBm%2BfVddvcuraFd9FRChyOSREYpqcNNP8%3D&reserved=0> >> -- The information contained in this communication and any >> attachments is confidential and may be privileged, and is for the >> sole use of the intended recipient(s). Any unauthorized review, use, >> disclosure or distribution is prohibited. Unless explicitly stated >> otherwise in the body of this communication or the attachment >> thereto (if any), the information is provided on an AS-IS basis >> without any express or implied warranties or liabilities. To the >> extent you are relying on this information, you are doing so at your >> own risk. If you are not the intended recipient, please notify the >> sender immediately by replying to this message and destroy all >> copies of this message and any attachments. Neither the sender nor >> the company/group of companies he or she represents shall be liable >> for the proper and complete transmission of the information >> contained in this communication, or for any delay in its receipt. 
>> _______________________________________________ >> gpfsug-discuss mailing list >> gpfsug-discuss at spectrumscale.org >> http://gpfsug.org/mailman/listinfo/gpfsug-discuss >> >> >> >> >> >> _______________________________________________ >> gpfsug-discuss mailing list >> gpfsug-discuss at spectrumscale.org >> http://gpfsug.org/mailman/listinfo/gpfsug-discuss >> >> > -- > > Karlsruhe Institute of Technology (KIT) > Steinbuch Centre for Computing (SCC) > > Jan Erik Sundermann > > Hermann-von-Helmholtz-Platz 1, Building 449, Room 226 > D-76344 Eggenstein-Leopoldshafen > > Tel: +49 721 608 26191 > Email: jan.sundermann at kit.edu > www.scc.kit.edu > > KIT ? The Research University in the Helmholtz Association > > Since 2010, KIT has been certified as a family-friendly university. > > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From xhejtman at ics.muni.cz Wed Mar 14 09:28:15 2018 From: xhejtman at ics.muni.cz (Lukas Hejtmanek) Date: Wed, 14 Mar 2018 10:28:15 +0100 Subject: [gpfsug-discuss] Preferred NSD In-Reply-To: References: <20180312145105.yb24bzo6chlpthwm@ics.muni.cz> <20180313141630.k2lqvcndz7mrakgs@ics.muni.cz> Message-ID: <20180314092815.k7thtymu33wg65xt@ics.muni.cz> Hello, thank you for insight. Well, the point is, that I will get ~60 with 120 NVMe disks in it, each about 2TB size. It means that I will have 240TB in NVMe SSD that could build nice shared scratch. Moreover, I have no different HW or place to put these SSDs into. They have to be in the compute nodes. On Tue, Mar 13, 2018 at 10:48:21AM -0700, Alex Chekholko wrote: > I would like to discourage you from building a large distributed clustered > filesystem made of many unreliable components. You will need to > overprovision your interconnect and will also spend a lot of time in > "healing" or "degraded" state. > > It is typically cheaper to centralize the storage into a subset of nodes > and configure those to be more highly available. E.g. of your 60 nodes, > take 8 and put all the storage into those and make that a dedicated GPFS > cluster with no compute jobs on those nodes. Again, you'll still need > really beefy and reliable interconnect to make this work. > > Stepping back; what is the actual problem you're trying to solve? I have > certainly been in that situation before, where the problem is more like: "I > have a fixed hardware configuration that I can't change, and I want to try > to shoehorn a parallel filesystem onto that." > > I would recommend looking closer at your actual workloads. If this is a > "scratch" filesystem and file access is mostly from one node at a time, > it's not very useful to make two additional copies of that data on other > nodes, and it will only slow you down. > > Regards, > Alex > > On Tue, Mar 13, 2018 at 7:16 AM, Lukas Hejtmanek > wrote: > > > On Tue, Mar 13, 2018 at 10:37:43AM +0000, John Hearns wrote: > > > Lukas, > > > It looks like you are proposing a setup which uses your compute servers > > as storage servers also? > > > > yes, exactly. I would like to utilise NVMe SSDs that are in every compute > > servers.. Using them as a shared scratch area with GPFS is one of the > > options. 
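To make the node-local NVMe idea concrete, here is a rough sketch of the FPO-style approach that comes up later in this thread: a data pool with write affinity, so the first data replica lands on the NSD of the node doing the writing. Every name, block size and failure group below is made up, and FPO licensing plus careful failure-group planning across nodes still apply.

    # illustrative stanza file (nvme_scratch.stanza), one NSD per local NVMe device
    %pool: pool=data blockSize=2M layoutMap=cluster allowWriteAffinity=yes writeAffinityDepth=1 blockGroupFactor=128
    %nsd: nsd=node01_nvme0 device=/dev/nvme0n1 servers=node01 usage=dataOnly failureGroup=1,0,1 pool=data
    %nsd: nsd=node02_nvme0 device=/dev/nvme0n1 servers=node02 usage=dataOnly failureGroup=2,0,1 pool=data
    # metadata NSDs for the system pool are still needed, ideally on sturdier nodes

    mmcrnsd -F nvme_scratch.stanza
    mmcrfs scratch -F nvme_scratch.stanza -m 2 -M 2 -r 2 -R 2

With writeAffinityDepth=1 the local copy is preferred for writes, which is roughly the behaviour being asked about here, while the second replica still protects against a single node loss.
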
> > > > > > > > * I'm thinking about the following setup: > > > ~ 60 nodes, each with two enterprise NVMe SSDs, FDR IB interconnected > > > > > > There is nothing wrong with this concept, for instance see > > > https://www.beegfs.io/wiki/BeeOND > > > > > > I have an NVMe filesystem which uses 60 drives, but there are 10 servers. > > > You should look at "failure zones" also. > > > > you still need the storage servers and local SSDs to use only for caching, > > do > > I understand correctly? > > > > > > > > From: gpfsug-discuss-bounces at spectrumscale.org [mailto:gpfsug-discuss- > > bounces at spectrumscale.org] On Behalf Of Knister, Aaron S. > > (GSFC-606.2)[COMPUTER SCIENCE CORP] > > > Sent: Monday, March 12, 2018 4:14 PM > > > To: gpfsug main discussion list > > > Subject: Re: [gpfsug-discuss] Preferred NSD > > > > > > Hi Lukas, > > > > > > Check out FPO mode. That mimics Hadoop's data placement features. You > > can have up to 3 replicas both data and metadata but still the downside, > > though, as you say is the wrong node failures will take your cluster down. > > > > > > You might want to check out something like Excelero's NVMesh (note: not > > an endorsement since I can't give such things) which can create logical > > volumes across all your NVMe drives. The product has erasure coding on > > their roadmap. I'm not sure if they've released that feature yet but in > > theory it will give better fault tolerance *and* you'll get more efficient > > usage of your SSDs. > > > > > > I'm sure there are other ways to skin this cat too. > > > > > > -Aaron > > > > > > > > > > > > On March 12, 2018 at 10:59:35 EDT, Lukas Hejtmanek > > wrote: > > > Hello, > > > > > > I'm thinking about the following setup: > > > ~ 60 nodes, each with two enterprise NVMe SSDs, FDR IB interconnected > > > > > > I would like to setup shared scratch area using GPFS and those NVMe > > SSDs. Each > > > SSDs as on NSD. > > > > > > I don't think like 5 or more data/metadata replicas are practical here. > > On the > > > other hand, multiple node failures is something really expected. > > > > > > Is there a way to instrument that local NSD is strongly preferred to > > store > > > data? I.e. node failure most probably does not result in unavailable > > data for > > > the other nodes? > > > > > > Or is there any other recommendation/solution to build shared scratch > > with > > > GPFS in such setup? (Do not do it including.) > > > > > > -- > > > Luk?? Hejtm?nek > > > _______________________________________________ > > > gpfsug-discuss mailing list > > > gpfsug-discuss at spectrumscale.org > > > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > > -- The information contained in this communication and any attachments > > is confidential and may be privileged, and is for the sole use of the > > intended recipient(s). Any unauthorized review, use, disclosure or > > distribution is prohibited. Unless explicitly stated otherwise in the body > > of this communication or the attachment thereto (if any), the information > > is provided on an AS-IS basis without any express or implied warranties or > > liabilities. To the extent you are relying on this information, you are > > doing so at your own risk. If you are not the intended recipient, please > > notify the sender immediately by replying to this message and destroy all > > copies of this message and any attachments. 
Neither the sender nor the > > company/group of companies he or she represents shall be liable for the > > proper and complete transmission of the information contained in this > > communication, or for any delay in its receipt. > > > > > _______________________________________________ > > > gpfsug-discuss mailing list > > > gpfsug-discuss at spectrumscale.org > > > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > > > > > -- > > Luk?? Hejtm?nek > > > > Linux Administrator only because > > Full Time Multitasking Ninja > > is not an official job title > > _______________________________________________ > > gpfsug-discuss mailing list > > gpfsug-discuss at spectrumscale.org > > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss -- Luk?? Hejtm?nek Linux Administrator only because Full Time Multitasking Ninja is not an official job title From luis.bolinches at fi.ibm.com Wed Mar 14 10:11:31 2018 From: luis.bolinches at fi.ibm.com (Luis Bolinches) Date: Wed, 14 Mar 2018 10:11:31 +0000 Subject: [gpfsug-discuss] Preferred NSD In-Reply-To: <20180314092815.k7thtymu33wg65xt@ics.muni.cz> Message-ID: Hi For reads only have you look at possibility of using LROC? For writes in the setup you mention you are down to maximum of half your network speed (best case) assuming no restripes no reboots on going at any given time. -- Yst?v?llisin terveisin / Kind regards / Saludos cordiales / Salutations Luis Bolinches Consultant IT Specialist Mobile Phone: +358503112585 https://www.youracclaim.com/user/luis-bolinches "If you always give you will always have" -- Anonymous > On 14 Mar 2018, at 5.28, Lukas Hejtmanek wrote: > > Hello, > > thank you for insight. Well, the point is, that I will get ~60 with 120 NVMe > disks in it, each about 2TB size. It means that I will have 240TB in NVMe SSD > that could build nice shared scratch. Moreover, I have no different HW or place > to put these SSDs into. They have to be in the compute nodes. > >> On Tue, Mar 13, 2018 at 10:48:21AM -0700, Alex Chekholko wrote: >> I would like to discourage you from building a large distributed clustered >> filesystem made of many unreliable components. You will need to >> overprovision your interconnect and will also spend a lot of time in >> "healing" or "degraded" state. >> >> It is typically cheaper to centralize the storage into a subset of nodes >> and configure those to be more highly available. E.g. of your 60 nodes, >> take 8 and put all the storage into those and make that a dedicated GPFS >> cluster with no compute jobs on those nodes. Again, you'll still need >> really beefy and reliable interconnect to make this work. >> >> Stepping back; what is the actual problem you're trying to solve? I have >> certainly been in that situation before, where the problem is more like: "I >> have a fixed hardware configuration that I can't change, and I want to try >> to shoehorn a parallel filesystem onto that." >> >> I would recommend looking closer at your actual workloads. If this is a >> "scratch" filesystem and file access is mostly from one node at a time, >> it's not very useful to make two additional copies of that data on other >> nodes, and it will only slow you down. 
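On the LROC suggestion above: a client's spare NVMe device can be handed to the daemon as a local read cache instead of becoming a shared NSD, which sidesteps the node-failure problem for reads. A minimal sketch with made-up node and device names:

    # stanza for the cache device on one client (lroc.stanza), repeated per node
    %nsd: nsd=node01_lroc device=/dev/nvme1n1 servers=node01 usage=localCache
    mmcrnsd -F lroc.stanza
    # let that node cache data, inodes and directory blocks locally
    mmchconfig lrocData=yes,lrocInodes=yes,lrocDirectories=yes -N node01

A localCache NSD does not belong to any file system; it only extends that node's local cache, so losing the node or the device costs nothing but cached data.
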
>> >> Regards, >> Alex >> >> On Tue, Mar 13, 2018 at 7:16 AM, Lukas Hejtmanek >> wrote: >> >>>> On Tue, Mar 13, 2018 at 10:37:43AM +0000, John Hearns wrote: >>>> Lukas, >>>> It looks like you are proposing a setup which uses your compute servers >>> as storage servers also? >>> >>> yes, exactly. I would like to utilise NVMe SSDs that are in every compute >>> servers.. Using them as a shared scratch area with GPFS is one of the >>> options. >>> >>>> >>>> * I'm thinking about the following setup: >>>> ~ 60 nodes, each with two enterprise NVMe SSDs, FDR IB interconnected >>>> >>>> There is nothing wrong with this concept, for instance see >>>> https://urldefense.proofpoint.com/v2/url?u=https-3A__www.beegfs.io_wiki_BeeOND&d=DwIFBA&c=jf_iaSHvJObTbx-siA1ZOg&r=1mZ896psa5caYzBeaugTlc7TtRejJp3uvKYxas3S7Xc&m=clDRNGIKhf6SYQ2ZZpZvBniUiqx1GU1bYEUbcbCunuo&s=ZUDwVonh6dmGRFw0n9p9QPC2-DFuVyY75gOuD02c07I&e= >>>> >>>> I have an NVMe filesystem which uses 60 drives, but there are 10 servers. >>>> You should look at "failure zones" also. >>> >>> you still need the storage servers and local SSDs to use only for caching, >>> do >>> I understand correctly? >>> >>>> >>>> From: gpfsug-discuss-bounces at spectrumscale.org [mailto:gpfsug-discuss- >>> bounces at spectrumscale.org] On Behalf Of Knister, Aaron S. >>> (GSFC-606.2)[COMPUTER SCIENCE CORP] >>>> Sent: Monday, March 12, 2018 4:14 PM >>>> To: gpfsug main discussion list >>>> Subject: Re: [gpfsug-discuss] Preferred NSD >>>> >>>> Hi Lukas, >>>> >>>> Check out FPO mode. That mimics Hadoop's data placement features. You >>> can have up to 3 replicas both data and metadata but still the downside, >>> though, as you say is the wrong node failures will take your cluster down. >>>> >>>> You might want to check out something like Excelero's NVMesh (note: not >>> an endorsement since I can't give such things) which can create logical >>> volumes across all your NVMe drives. The product has erasure coding on >>> their roadmap. I'm not sure if they've released that feature yet but in >>> theory it will give better fault tolerance *and* you'll get more efficient >>> usage of your SSDs. >>>> >>>> I'm sure there are other ways to skin this cat too. >>>> >>>> -Aaron >>>> >>>> >>>> >>>> On March 12, 2018 at 10:59:35 EDT, Lukas Hejtmanek >> > wrote: >>>> Hello, >>>> >>>> I'm thinking about the following setup: >>>> ~ 60 nodes, each with two enterprise NVMe SSDs, FDR IB interconnected >>>> >>>> I would like to setup shared scratch area using GPFS and those NVMe >>> SSDs. Each >>>> SSDs as on NSD. >>>> >>>> I don't think like 5 or more data/metadata replicas are practical here. >>> On the >>>> other hand, multiple node failures is something really expected. >>>> >>>> Is there a way to instrument that local NSD is strongly preferred to >>> store >>>> data? I.e. node failure most probably does not result in unavailable >>> data for >>>> the other nodes? >>>> >>>> Or is there any other recommendation/solution to build shared scratch >>> with >>>> GPFS in such setup? (Do not do it including.) >>>> >>>> -- >>>> Luk?? 
Hejtm?nek >>>> _______________________________________________ >>>> gpfsug-discuss mailing list >>>> gpfsug-discuss at spectrumscale.org >>>> https://urldefense.proofpoint.com/v2/url?u=http-3A__gpfsug.org_mailman_listinfo_gpfsug-2Ddiscuss&d=DwIFBA&c=jf_iaSHvJObTbx-siA1ZOg&r=1mZ896psa5caYzBeaugTlc7TtRejJp3uvKYxas3S7Xc&m=clDRNGIKhf6SYQ2ZZpZvBniUiqx1GU1bYEUbcbCunuo&s=ZLEoHFOFkjfuvNw57WqNn6-EVjHbASRHgnmRc2YYXpM&e= >>>> -- The information contained in this communication and any attachments >>> is confidential and may be privileged, and is for the sole use of the >>> intended recipient(s). Any unauthorized review, use, disclosure or >>> distribution is prohibited. Unless explicitly stated otherwise in the body >>> of this communication or the attachment thereto (if any), the information >>> is provided on an AS-IS basis without any express or implied warranties or >>> liabilities. To the extent you are relying on this information, you are >>> doing so at your own risk. If you are not the intended recipient, please >>> notify the sender immediately by replying to this message and destroy all >>> copies of this message and any attachments. Neither the sender nor the >>> company/group of companies he or she represents shall be liable for the >>> proper and complete transmission of the information contained in this >>> communication, or for any delay in its receipt. >>> >>>> _______________________________________________ >>>> gpfsug-discuss mailing list >>>> gpfsug-discuss at spectrumscale.org >>>> https://urldefense.proofpoint.com/v2/url?u=http-3A__gpfsug.org_mailman_listinfo_gpfsug-2Ddiscuss&d=DwIFBA&c=jf_iaSHvJObTbx-siA1ZOg&r=1mZ896psa5caYzBeaugTlc7TtRejJp3uvKYxas3S7Xc&m=clDRNGIKhf6SYQ2ZZpZvBniUiqx1GU1bYEUbcbCunuo&s=ZLEoHFOFkjfuvNw57WqNn6-EVjHbASRHgnmRc2YYXpM&e= >>> >>> >>> -- >>> Luk?? Hejtm?nek >>> >>> Linux Administrator only because >>> Full Time Multitasking Ninja >>> is not an official job title >>> _______________________________________________ >>> gpfsug-discuss mailing list >>> gpfsug-discuss at spectrumscale.org >>> https://urldefense.proofpoint.com/v2/url?u=http-3A__gpfsug.org_mailman_listinfo_gpfsug-2Ddiscuss&d=DwIFBA&c=jf_iaSHvJObTbx-siA1ZOg&r=1mZ896psa5caYzBeaugTlc7TtRejJp3uvKYxas3S7Xc&m=clDRNGIKhf6SYQ2ZZpZvBniUiqx1GU1bYEUbcbCunuo&s=ZLEoHFOFkjfuvNw57WqNn6-EVjHbASRHgnmRc2YYXpM&e= >>> > >> _______________________________________________ >> gpfsug-discuss mailing list >> gpfsug-discuss at spectrumscale.org >> https://urldefense.proofpoint. com/v2/url?u=http-3A__gpfsug.org_mailman_listinfo_gpfsug-2Ddiscuss&d=DwIFBA&c=jf_iaSHvJObTbx-siA1ZOg&r=1mZ896psa5caYzBeaugTlc7TtRejJp3uvKYxas3S7Xc&m=clDRNGIKhf6SYQ2ZZpZvBniUiqx1GU1bYEUbcbCunuo&s=ZLEoHFOFkjfuvNw57WqNn6-EVjHbASRHgnmRc2YYXpM&e= > > > -- > Luk?? Hejtm?nek > > Linux Administrator only because > Full Time Multitasking Ninja > is not an official job title > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > https://urldefense.proofpoint.com/v2/url?u=http-3A__gpfsug.org_mailman_listinfo_gpfsug-2Ddiscuss&d=DwIFBA&c=jf_iaSHvJObTbx-siA1ZOg&r=1mZ896psa5caYzBeaugTlc7TtRejJp3uvKYxas3S7Xc&m=clDRNGIKhf6SYQ2ZZpZvBniUiqx1GU1bYEUbcbCunuo&s=ZLEoHFOFkjfuvNw57WqNn6-EVjHbASRHgnmRc2YYXpM&e= > Ellei edell? ole toisin mainittu: / Unless stated otherwise above: Oy IBM Finland Ab PL 265, 00101 Helsinki, Finland Business ID, Y-tunnus: 0195876-3 Registered in Finland -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From S.J.Thompson at bham.ac.uk Wed Mar 14 10:24:39 2018 From: S.J.Thompson at bham.ac.uk (Simon Thompson (IT Research Support)) Date: Wed, 14 Mar 2018 10:24:39 +0000 Subject: [gpfsug-discuss] Preferred NSD In-Reply-To: <20180314092815.k7thtymu33wg65xt@ics.muni.cz> References: <20180312145105.yb24bzo6chlpthwm@ics.muni.cz> <20180313141630.k2lqvcndz7mrakgs@ics.muni.cz> <20180314092815.k7thtymu33wg65xt@ics.muni.cz> Message-ID: I would look at using LROC and possibly using HAWC ... Note you need to be a bit careful with HAWC client side and failure group placement. Simon ?On 14/03/2018, 09:28, "gpfsug-discuss-bounces at spectrumscale.org on behalf of xhejtman at ics.muni.cz" wrote: Hello, thank you for insight. Well, the point is, that I will get ~60 with 120 NVMe disks in it, each about 2TB size. It means that I will have 240TB in NVMe SSD that could build nice shared scratch. Moreover, I have no different HW or place to put these SSDs into. They have to be in the compute nodes. On Tue, Mar 13, 2018 at 10:48:21AM -0700, Alex Chekholko wrote: > I would like to discourage you from building a large distributed clustered > filesystem made of many unreliable components. You will need to > overprovision your interconnect and will also spend a lot of time in > "healing" or "degraded" state. > > It is typically cheaper to centralize the storage into a subset of nodes > and configure those to be more highly available. E.g. of your 60 nodes, > take 8 and put all the storage into those and make that a dedicated GPFS > cluster with no compute jobs on those nodes. Again, you'll still need > really beefy and reliable interconnect to make this work. > > Stepping back; what is the actual problem you're trying to solve? I have > certainly been in that situation before, where the problem is more like: "I > have a fixed hardware configuration that I can't change, and I want to try > to shoehorn a parallel filesystem onto that." > > I would recommend looking closer at your actual workloads. If this is a > "scratch" filesystem and file access is mostly from one node at a time, > it's not very useful to make two additional copies of that data on other > nodes, and it will only slow you down. > > Regards, > Alex > > On Tue, Mar 13, 2018 at 7:16 AM, Lukas Hejtmanek > wrote: > > > On Tue, Mar 13, 2018 at 10:37:43AM +0000, John Hearns wrote: > > > Lukas, > > > It looks like you are proposing a setup which uses your compute servers > > as storage servers also? > > > > yes, exactly. I would like to utilise NVMe SSDs that are in every compute > > servers.. Using them as a shared scratch area with GPFS is one of the > > options. > > > > > > > > * I'm thinking about the following setup: > > > ~ 60 nodes, each with two enterprise NVMe SSDs, FDR IB interconnected > > > > > > There is nothing wrong with this concept, for instance see > > > https://www.beegfs.io/wiki/BeeOND > > > > > > I have an NVMe filesystem which uses 60 drives, but there are 10 servers. > > > You should look at "failure zones" also. > > > > you still need the storage servers and local SSDs to use only for caching, > > do > > I understand correctly? > > > > > > > > From: gpfsug-discuss-bounces at spectrumscale.org [mailto:gpfsug-discuss- > > bounces at spectrumscale.org] On Behalf Of Knister, Aaron S. > > (GSFC-606.2)[COMPUTER SCIENCE CORP] > > > Sent: Monday, March 12, 2018 4:14 PM > > > To: gpfsug main discussion list > > > Subject: Re: [gpfsug-discuss] Preferred NSD > > > > > > Hi Lukas, > > > > > > Check out FPO mode. 
That mimics Hadoop's data placement features. You > > can have up to 3 replicas both data and metadata but still the downside, > > though, as you say is the wrong node failures will take your cluster down. > > > > > > You might want to check out something like Excelero's NVMesh (note: not > > an endorsement since I can't give such things) which can create logical > > volumes across all your NVMe drives. The product has erasure coding on > > their roadmap. I'm not sure if they've released that feature yet but in > > theory it will give better fault tolerance *and* you'll get more efficient > > usage of your SSDs. > > > > > > I'm sure there are other ways to skin this cat too. > > > > > > -Aaron > > > > > > > > > > > > On March 12, 2018 at 10:59:35 EDT, Lukas Hejtmanek > > wrote: > > > Hello, > > > > > > I'm thinking about the following setup: > > > ~ 60 nodes, each with two enterprise NVMe SSDs, FDR IB interconnected > > > > > > I would like to setup shared scratch area using GPFS and those NVMe > > SSDs. Each > > > SSDs as on NSD. > > > > > > I don't think like 5 or more data/metadata replicas are practical here. > > On the > > > other hand, multiple node failures is something really expected. > > > > > > Is there a way to instrument that local NSD is strongly preferred to > > store > > > data? I.e. node failure most probably does not result in unavailable > > data for > > > the other nodes? > > > > > > Or is there any other recommendation/solution to build shared scratch > > with > > > GPFS in such setup? (Do not do it including.) > > > > > > -- > > > Luk?? Hejtm?nek > > > _______________________________________________ > > > gpfsug-discuss mailing list > > > gpfsug-discuss at spectrumscale.org > > > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > > -- The information contained in this communication and any attachments > > is confidential and may be privileged, and is for the sole use of the > > intended recipient(s). Any unauthorized review, use, disclosure or > > distribution is prohibited. Unless explicitly stated otherwise in the body > > of this communication or the attachment thereto (if any), the information > > is provided on an AS-IS basis without any express or implied warranties or > > liabilities. To the extent you are relying on this information, you are > > doing so at your own risk. If you are not the intended recipient, please > > notify the sender immediately by replying to this message and destroy all > > copies of this message and any attachments. Neither the sender nor the > > company/group of companies he or she represents shall be liable for the > > proper and complete transmission of the information contained in this > > communication, or for any delay in its receipt. > > > > > _______________________________________________ > > > gpfsug-discuss mailing list > > > gpfsug-discuss at spectrumscale.org > > > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > > > > > -- > > Luk?? Hejtm?nek > > > > Linux Administrator only because > > Full Time Multitasking Ninja > > is not an official job title > > _______________________________________________ > > gpfsug-discuss mailing list > > gpfsug-discuss at spectrumscale.org > > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss -- Luk?? 
Hejtm?nek Linux Administrator only because Full Time Multitasking Ninja is not an official job title _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss From zacekm at img.cas.cz Wed Mar 14 10:57:36 2018 From: zacekm at img.cas.cz (Michal Zacek) Date: Wed, 14 Mar 2018 11:57:36 +0100 Subject: [gpfsug-discuss] Preferred NSD In-Reply-To: <20180312152317.mfc6xhsvthticrlh@ics.muni.cz> References: <20180312145105.yb24bzo6chlpthwm@ics.muni.cz> <188417.1520867920@turing-police.cc.vt.edu> <20180312152317.mfc6xhsvthticrlh@ics.muni.cz> Message-ID: <1e9df22c-5a50-5e16-f2e2-ec0ef5427712@img.cas.cz> Hi, I don't think the GPFS is good choice for your setup. Did you consider GlusterFS? It's used at Max Planck Institute at Dresden for HPC computing of? Molecular Biology data. They have similar setup,? tens (hundreds) of computers with shared local storage in glusterfs. But you will need 10Gb network. Michal Dne 12.3.2018 v 16:23 Lukas Hejtmanek napsal(a): > On Mon, Mar 12, 2018 at 11:18:40AM -0400, valdis.kletnieks at vt.edu wrote: >> On Mon, 12 Mar 2018 15:51:05 +0100, Lukas Hejtmanek said: >>> I don't think like 5 or more data/metadata replicas are practical here. On the >>> other hand, multiple node failures is something really expected. >> Umm.. do I want to ask *why*, out of only 60 nodes, multiple node >> failures are an expected event - to the point that you're thinking >> about needing 5 replicas to keep things running? > as of my experience with cluster management, we have multiple nodes down on > regular basis. (HW failure, SW maintenance and so on.) > > I'm basically thinking that 2-3 replicas might not be enough while 5 or more > are becoming too expensive (both disk space and required bandwidth being > scratch space - high i/o load expected). > -------------- next part -------------- A non-text attachment was scrubbed... Name: smime.p7s Type: application/pkcs7-signature Size: 3776 bytes Desc: Elektronicky podpis S/MIME URL: From aaron.s.knister at nasa.gov Wed Mar 14 15:28:53 2018 From: aaron.s.knister at nasa.gov (Aaron Knister) Date: Wed, 14 Mar 2018 11:28:53 -0400 Subject: [gpfsug-discuss] Preferred NSD In-Reply-To: <1e9df22c-5a50-5e16-f2e2-ec0ef5427712@img.cas.cz> References: <20180312145105.yb24bzo6chlpthwm@ics.muni.cz> <188417.1520867920@turing-police.cc.vt.edu> <20180312152317.mfc6xhsvthticrlh@ics.muni.cz> <1e9df22c-5a50-5e16-f2e2-ec0ef5427712@img.cas.cz> Message-ID: <7797cde3-d4be-7f37-315d-c4e672c3a49f@nasa.gov> I don't want to start a religious filesystem war, but I'd give pause to GlusterFS based on a number of operational issues I've personally experienced and seen others experience with it. I'm curious how glusterfs would resolve the issue here of multiple clients failing simultaneously (unless you're talking about using disperse volumes)? That does, actually, bring up an interesting question to IBM which is -- when will mestor see the light of day? This is admittedly something other filesystems can do that GPFS cannot. -Aaron On 3/14/18 6:57 AM, Michal Zacek wrote: > Hi, > > I don't think the GPFS is good choice for your setup. Did you consider > GlusterFS? It's used at Max Planck Institute at Dresden for HPC > computing of? Molecular Biology data. They have similar setup,? tens > (hundreds) of computers with shared local storage in glusterfs. But you > will need 10Gb network. 
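For anyone unfamiliar with the disperse volumes mentioned above: that is Gluster's erasure-coded volume type, which is the feature that would let such a setup ride out a couple of simultaneous node losses without keeping full replicas. A hedged sketch only, with invented host and brick names, and not an endorsement either way:

    # 4+2 erasure coding across six bricks; any two bricks (or nodes) may fail
    gluster volume create scratch disperse 6 redundancy 2 \
        node{01..06}:/bricks/nvme0
    gluster volume start scratch
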
> > Michal > > > Dne 12.3.2018 v 16:23 Lukas Hejtmanek napsal(a): >> On Mon, Mar 12, 2018 at 11:18:40AM -0400, valdis.kletnieks at vt.edu wrote: >>> On Mon, 12 Mar 2018 15:51:05 +0100, Lukas Hejtmanek said: >>>> I don't think like 5 or more data/metadata replicas are practical here. On the >>>> other hand, multiple node failures is something really expected. >>> Umm.. do I want to ask *why*, out of only 60 nodes, multiple node >>> failures are an expected event - to the point that you're thinking >>> about needing 5 replicas to keep things running? >> as of my experience with cluster management, we have multiple nodes down on >> regular basis. (HW failure, SW maintenance and so on.) >> >> I'm basically thinking that 2-3 replicas might not be enough while 5 or more >> are becoming too expensive (both disk space and required bandwidth being >> scratch space - high i/o load expected). >> > > > > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > -- Aaron Knister NASA Center for Climate Simulation (Code 606.2) Goddard Space Flight Center (301) 286-2776 From skylar2 at u.washington.edu Wed Mar 14 15:42:37 2018 From: skylar2 at u.washington.edu (Skylar Thompson) Date: Wed, 14 Mar 2018 15:42:37 +0000 Subject: [gpfsug-discuss] Preferred NSD In-Reply-To: <7797cde3-d4be-7f37-315d-c4e672c3a49f@nasa.gov> References: <20180312145105.yb24bzo6chlpthwm@ics.muni.cz> <188417.1520867920@turing-police.cc.vt.edu> <20180312152317.mfc6xhsvthticrlh@ics.muni.cz> <1e9df22c-5a50-5e16-f2e2-ec0ef5427712@img.cas.cz> <7797cde3-d4be-7f37-315d-c4e672c3a49f@nasa.gov> Message-ID: <20180314154237.u4d3hqraqcn6a4xl@utumno.gs.washington.edu> I agree. We have a small Gluster filesystem we use to perform failover of our job scheduler, but it predates our use of GPFS. We've run into a number of strange failures and "soft failures" (i.e. filesystem admin tools don't work but the filesystem is available), and the logging is much more cryptic and jumbled than mmfs.log. We'll soon be retiring it in favor of GPFS. On Wed, Mar 14, 2018 at 11:28:53AM -0400, Aaron Knister wrote: > I don't want to start a religious filesystem war, but I'd give pause to > GlusterFS based on a number of operational issues I've personally > experienced and seen others experience with it. > > I'm curious how glusterfs would resolve the issue here of multiple clients > failing simultaneously (unless you're talking about using disperse volumes)? > That does, actually, bring up an interesting question to IBM which is -- > when will mestor see the light of day? This is admittedly something other > filesystems can do that GPFS cannot. > > -Aaron > > On 3/14/18 6:57 AM, Michal Zacek wrote: > > Hi, > > > > I don't think the GPFS is good choice for your setup. Did you consider > > GlusterFS? It's used at Max Planck Institute at Dresden for HPC > > computing of? Molecular Biology data. They have similar setup,? tens > > (hundreds) of computers with shared local storage in glusterfs. But you > > will need 10Gb network. > > > > Michal > > > > > > Dne 12.3.2018 v 16:23 Lukas Hejtmanek napsal(a): > > > On Mon, Mar 12, 2018 at 11:18:40AM -0400, valdis.kletnieks at vt.edu wrote: > > > > On Mon, 12 Mar 2018 15:51:05 +0100, Lukas Hejtmanek said: > > > > > I don't think like 5 or more data/metadata replicas are practical here. On the > > > > > other hand, multiple node failures is something really expected. > > > > Umm.. 
do I want to ask *why*, out of only 60 nodes, multiple node > > > > failures are an expected event - to the point that you're thinking > > > > about needing 5 replicas to keep things running? > > > as of my experience with cluster management, we have multiple nodes down on > > > regular basis. (HW failure, SW maintenance and so on.) > > > > > > I'm basically thinking that 2-3 replicas might not be enough while 5 or more > > > are becoming too expensive (both disk space and required bandwidth being > > > scratch space - high i/o load expected). > > > > > > > > > > > > > _______________________________________________ > > gpfsug-discuss mailing list > > gpfsug-discuss at spectrumscale.org > > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > > > -- > Aaron Knister > NASA Center for Climate Simulation (Code 606.2) > Goddard Space Flight Center > (301) 286-2776 > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss -- -- Skylar Thompson (skylar2 at u.washington.edu) -- Genome Sciences Department, System Administrator -- Foege Building S046, (206)-685-7354 -- University of Washington School of Medicine From JRLang at uwyo.edu Wed Mar 14 14:11:35 2018 From: JRLang at uwyo.edu (Jeffrey R. Lang) Date: Wed, 14 Mar 2018 14:11:35 +0000 Subject: [gpfsug-discuss] Preferred NSD In-Reply-To: <20180314092815.k7thtymu33wg65xt@ics.muni.cz> References: <20180312145105.yb24bzo6chlpthwm@ics.muni.cz> <20180313141630.k2lqvcndz7mrakgs@ics.muni.cz> <20180314092815.k7thtymu33wg65xt@ics.muni.cz> Message-ID: Something I haven't heard in this discussion, it that of licensing of GPFS. I believe that once you export disks from a node it then becomes a server node and the license may need to be changed, from client to server. There goes the budget. -----Original Message----- From: gpfsug-discuss-bounces at spectrumscale.org On Behalf Of Lukas Hejtmanek Sent: Wednesday, March 14, 2018 4:28 AM To: gpfsug main discussion list Subject: Re: [gpfsug-discuss] Preferred NSD Hello, thank you for insight. Well, the point is, that I will get ~60 with 120 NVMe disks in it, each about 2TB size. It means that I will have 240TB in NVMe SSD that could build nice shared scratch. Moreover, I have no different HW or place to put these SSDs into. They have to be in the compute nodes. On Tue, Mar 13, 2018 at 10:48:21AM -0700, Alex Chekholko wrote: > I would like to discourage you from building a large distributed > clustered filesystem made of many unreliable components. You will > need to overprovision your interconnect and will also spend a lot of > time in "healing" or "degraded" state. > > It is typically cheaper to centralize the storage into a subset of > nodes and configure those to be more highly available. E.g. of your > 60 nodes, take 8 and put all the storage into those and make that a > dedicated GPFS cluster with no compute jobs on those nodes. Again, > you'll still need really beefy and reliable interconnect to make this work. > > Stepping back; what is the actual problem you're trying to solve? I > have certainly been in that situation before, where the problem is > more like: "I have a fixed hardware configuration that I can't change, > and I want to try to shoehorn a parallel filesystem onto that." > > I would recommend looking closer at your actual workloads. 
If this is > a "scratch" filesystem and file access is mostly from one node at a > time, it's not very useful to make two additional copies of that data > on other nodes, and it will only slow you down. > > Regards, > Alex > > On Tue, Mar 13, 2018 at 7:16 AM, Lukas Hejtmanek > > wrote: > > > On Tue, Mar 13, 2018 at 10:37:43AM +0000, John Hearns wrote: > > > Lukas, > > > It looks like you are proposing a setup which uses your compute > > > servers > > as storage servers also? > > > > yes, exactly. I would like to utilise NVMe SSDs that are in every > > compute servers.. Using them as a shared scratch area with GPFS is > > one of the options. > > > > > > > > * I'm thinking about the following setup: > > > ~ 60 nodes, each with two enterprise NVMe SSDs, FDR IB > > > interconnected > > > > > > There is nothing wrong with this concept, for instance see > > > https://www.beegfs.io/wiki/BeeOND > > > > > > I have an NVMe filesystem which uses 60 drives, but there are 10 servers. > > > You should look at "failure zones" also. > > > > you still need the storage servers and local SSDs to use only for > > caching, do I understand correctly? > > > > > > > > From: gpfsug-discuss-bounces at spectrumscale.org > > > [mailto:gpfsug-discuss- > > bounces at spectrumscale.org] On Behalf Of Knister, Aaron S. > > (GSFC-606.2)[COMPUTER SCIENCE CORP] > > > Sent: Monday, March 12, 2018 4:14 PM > > > To: gpfsug main discussion list > > > Subject: Re: [gpfsug-discuss] Preferred NSD > > > > > > Hi Lukas, > > > > > > Check out FPO mode. That mimics Hadoop's data placement features. > > > You > > can have up to 3 replicas both data and metadata but still the > > downside, though, as you say is the wrong node failures will take your cluster down. > > > > > > You might want to check out something like Excelero's NVMesh > > > (note: not > > an endorsement since I can't give such things) which can create > > logical volumes across all your NVMe drives. The product has erasure > > coding on their roadmap. I'm not sure if they've released that > > feature yet but in theory it will give better fault tolerance *and* > > you'll get more efficient usage of your SSDs. > > > > > > I'm sure there are other ways to skin this cat too. > > > > > > -Aaron > > > > > > > > > > > > On March 12, 2018 at 10:59:35 EDT, Lukas Hejtmanek > > > > > wrote: > > > Hello, > > > > > > I'm thinking about the following setup: > > > ~ 60 nodes, each with two enterprise NVMe SSDs, FDR IB > > > interconnected > > > > > > I would like to setup shared scratch area using GPFS and those > > > NVMe > > SSDs. Each > > > SSDs as on NSD. > > > > > > I don't think like 5 or more data/metadata replicas are practical here. > > On the > > > other hand, multiple node failures is something really expected. > > > > > > Is there a way to instrument that local NSD is strongly preferred > > > to > > store > > > data? I.e. node failure most probably does not result in > > > unavailable > > data for > > > the other nodes? > > > > > > Or is there any other recommendation/solution to build shared > > > scratch > > with > > > GPFS in such setup? (Do not do it including.) > > > > > > -- > > > Luk?? Hejtm?nek > > > _______________________________________________ > > > gpfsug-discuss mailing list > > > gpfsug-discuss at spectrumscale.org > > > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > > -- The information contained in this communication and any > > > attachments > > is confidential and may be privileged, and is for the sole use of > > the intended recipient(s). 
Any unauthorized review, use, disclosure > > or distribution is prohibited. Unless explicitly stated otherwise in > > the body of this communication or the attachment thereto (if any), > > the information is provided on an AS-IS basis without any express or > > implied warranties or liabilities. To the extent you are relying on > > this information, you are doing so at your own risk. If you are not > > the intended recipient, please notify the sender immediately by > > replying to this message and destroy all copies of this message and > > any attachments. Neither the sender nor the company/group of > > companies he or she represents shall be liable for the proper and > > complete transmission of the information contained in this communication, or for any delay in its receipt. > > > > > _______________________________________________ > > > gpfsug-discuss mailing list > > > gpfsug-discuss at spectrumscale.org > > > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > > > > > -- > > Luk?? Hejtm?nek > > > > Linux Administrator only because > > Full Time Multitasking Ninja > > is not an official job title > > _______________________________________________ > > gpfsug-discuss mailing list > > gpfsug-discuss at spectrumscale.org > > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss -- Luk?? Hejtm?nek Linux Administrator only because Full Time Multitasking Ninja is not an official job title _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss From S.J.Thompson at bham.ac.uk Wed Mar 14 16:54:16 2018 From: S.J.Thompson at bham.ac.uk (Simon Thompson (IT Research Support)) Date: Wed, 14 Mar 2018 16:54:16 +0000 Subject: [gpfsug-discuss] Preferred NSD In-Reply-To: References: <20180312145105.yb24bzo6chlpthwm@ics.muni.cz> <20180313141630.k2lqvcndz7mrakgs@ics.muni.cz> <20180314092815.k7thtymu33wg65xt@ics.muni.cz>, Message-ID: Not always true. 1. Use them with socket licenses as HAWC or LROC is OK on a client. 2. Have data management edition and capacity license the amount of storage. Simon ________________________________________ From: gpfsug-discuss-bounces at spectrumscale.org [gpfsug-discuss-bounces at spectrumscale.org] on behalf of Jeffrey R. Lang [JRLang at uwyo.edu] Sent: 14 March 2018 14:11 To: gpfsug main discussion list Subject: Re: [gpfsug-discuss] Preferred NSD Something I haven't heard in this discussion, it that of licensing of GPFS. I believe that once you export disks from a node it then becomes a server node and the license may need to be changed, from client to server. There goes the budget. -----Original Message----- From: gpfsug-discuss-bounces at spectrumscale.org On Behalf Of Lukas Hejtmanek Sent: Wednesday, March 14, 2018 4:28 AM To: gpfsug main discussion list Subject: Re: [gpfsug-discuss] Preferred NSD Hello, thank you for insight. Well, the point is, that I will get ~60 with 120 NVMe disks in it, each about 2TB size. It means that I will have 240TB in NVMe SSD that could build nice shared scratch. Moreover, I have no different HW or place to put these SSDs into. They have to be in the compute nodes. On Tue, Mar 13, 2018 at 10:48:21AM -0700, Alex Chekholko wrote: > I would like to discourage you from building a large distributed > clustered filesystem made of many unreliable components. 
You will > need to overprovision your interconnect and will also spend a lot of > time in "healing" or "degraded" state. > > It is typically cheaper to centralize the storage into a subset of > nodes and configure those to be more highly available. E.g. of your > 60 nodes, take 8 and put all the storage into those and make that a > dedicated GPFS cluster with no compute jobs on those nodes. Again, > you'll still need really beefy and reliable interconnect to make this work. > > Stepping back; what is the actual problem you're trying to solve? I > have certainly been in that situation before, where the problem is > more like: "I have a fixed hardware configuration that I can't change, > and I want to try to shoehorn a parallel filesystem onto that." > > I would recommend looking closer at your actual workloads. If this is > a "scratch" filesystem and file access is mostly from one node at a > time, it's not very useful to make two additional copies of that data > on other nodes, and it will only slow you down. > > Regards, > Alex > > On Tue, Mar 13, 2018 at 7:16 AM, Lukas Hejtmanek > > wrote: > > > On Tue, Mar 13, 2018 at 10:37:43AM +0000, John Hearns wrote: > > > Lukas, > > > It looks like you are proposing a setup which uses your compute > > > servers > > as storage servers also? > > > > yes, exactly. I would like to utilise NVMe SSDs that are in every > > compute servers.. Using them as a shared scratch area with GPFS is > > one of the options. > > > > > > > > * I'm thinking about the following setup: > > > ~ 60 nodes, each with two enterprise NVMe SSDs, FDR IB > > > interconnected > > > > > > There is nothing wrong with this concept, for instance see > > > https://www.beegfs.io/wiki/BeeOND > > > > > > I have an NVMe filesystem which uses 60 drives, but there are 10 servers. > > > You should look at "failure zones" also. > > > > you still need the storage servers and local SSDs to use only for > > caching, do I understand correctly? > > > > > > > > From: gpfsug-discuss-bounces at spectrumscale.org > > > [mailto:gpfsug-discuss- > > bounces at spectrumscale.org] On Behalf Of Knister, Aaron S. > > (GSFC-606.2)[COMPUTER SCIENCE CORP] > > > Sent: Monday, March 12, 2018 4:14 PM > > > To: gpfsug main discussion list > > > Subject: Re: [gpfsug-discuss] Preferred NSD > > > > > > Hi Lukas, > > > > > > Check out FPO mode. That mimics Hadoop's data placement features. > > > You > > can have up to 3 replicas both data and metadata but still the > > downside, though, as you say is the wrong node failures will take your cluster down. > > > > > > You might want to check out something like Excelero's NVMesh > > > (note: not > > an endorsement since I can't give such things) which can create > > logical volumes across all your NVMe drives. The product has erasure > > coding on their roadmap. I'm not sure if they've released that > > feature yet but in theory it will give better fault tolerance *and* > > you'll get more efficient usage of your SSDs. > > > > > > I'm sure there are other ways to skin this cat too. > > > > > > -Aaron > > > > > > > > > > > > On March 12, 2018 at 10:59:35 EDT, Lukas Hejtmanek > > > > > wrote: > > > Hello, > > > > > > I'm thinking about the following setup: > > > ~ 60 nodes, each with two enterprise NVMe SSDs, FDR IB > > > interconnected > > > > > > I would like to setup shared scratch area using GPFS and those > > > NVMe > > SSDs. Each > > > SSDs as on NSD. > > > > > > I don't think like 5 or more data/metadata replicas are practical here. 
> > On the > > > other hand, multiple node failures is something really expected. > > > > > > Is there a way to instrument that local NSD is strongly preferred > > > to > > store > > > data? I.e. node failure most probably does not result in > > > unavailable > > data for > > > the other nodes? > > > > > > Or is there any other recommendation/solution to build shared > > > scratch > > with > > > GPFS in such setup? (Do not do it including.) > > > > > > -- > > > Luk?? Hejtm?nek > > > _______________________________________________ > > > gpfsug-discuss mailing list > > > gpfsug-discuss at spectrumscale.org > > > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > > -- The information contained in this communication and any > > > attachments > > is confidential and may be privileged, and is for the sole use of > > the intended recipient(s). Any unauthorized review, use, disclosure > > or distribution is prohibited. Unless explicitly stated otherwise in > > the body of this communication or the attachment thereto (if any), > > the information is provided on an AS-IS basis without any express or > > implied warranties or liabilities. To the extent you are relying on > > this information, you are doing so at your own risk. If you are not > > the intended recipient, please notify the sender immediately by > > replying to this message and destroy all copies of this message and > > any attachments. Neither the sender nor the company/group of > > companies he or she represents shall be liable for the proper and > > complete transmission of the information contained in this communication, or for any delay in its receipt. > > > > > _______________________________________________ > > > gpfsug-discuss mailing list > > > gpfsug-discuss at spectrumscale.org > > > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > > > > > -- > > Luk?? Hejtm?nek > > > > Linux Administrator only because > > Full Time Multitasking Ninja > > is not an official job title > > _______________________________________________ > > gpfsug-discuss mailing list > > gpfsug-discuss at spectrumscale.org > > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss -- Luk?? Hejtm?nek Linux Administrator only because Full Time Multitasking Ninja is not an official job title _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss From r.sobey at imperial.ac.uk Wed Mar 14 17:33:02 2018 From: r.sobey at imperial.ac.uk (Sobey, Richard A) Date: Wed, 14 Mar 2018 17:33:02 +0000 Subject: [gpfsug-discuss] Preferred NSD In-Reply-To: References: <20180312145105.yb24bzo6chlpthwm@ics.muni.cz> <20180313141630.k2lqvcndz7mrakgs@ics.muni.cz> <20180314092815.k7thtymu33wg65xt@ics.muni.cz>, Message-ID: >> 2. Have data management edition and capacity license the amount of storage. There goes the budget ? Richard -----Original Message----- From: gpfsug-discuss-bounces at spectrumscale.org On Behalf Of Simon Thompson (IT Research Support) Sent: 14 March 2018 16:54 To: gpfsug main discussion list Subject: Re: [gpfsug-discuss] Preferred NSD Not always true. 1. Use them with socket licenses as HAWC or LROC is OK on a client. 2. 
Have data management edition and capacity license the amount of storage. Simon ________________________________________ From: gpfsug-discuss-bounces at spectrumscale.org [gpfsug-discuss-bounces at spectrumscale.org] on behalf of Jeffrey R. Lang [JRLang at uwyo.edu] Sent: 14 March 2018 14:11 To: gpfsug main discussion list Subject: Re: [gpfsug-discuss] Preferred NSD Something I haven't heard in this discussion, it that of licensing of GPFS. I believe that once you export disks from a node it then becomes a server node and the license may need to be changed, from client to server. There goes the budget. -----Original Message----- From: gpfsug-discuss-bounces at spectrumscale.org On Behalf Of Lukas Hejtmanek Sent: Wednesday, March 14, 2018 4:28 AM To: gpfsug main discussion list Subject: Re: [gpfsug-discuss] Preferred NSD Hello, thank you for insight. Well, the point is, that I will get ~60 with 120 NVMe disks in it, each about 2TB size. It means that I will have 240TB in NVMe SSD that could build nice shared scratch. Moreover, I have no different HW or place to put these SSDs into. They have to be in the compute nodes. On Tue, Mar 13, 2018 at 10:48:21AM -0700, Alex Chekholko wrote: > I would like to discourage you from building a large distributed > clustered filesystem made of many unreliable components. You will > need to overprovision your interconnect and will also spend a lot of > time in "healing" or "degraded" state. > > It is typically cheaper to centralize the storage into a subset of > nodes and configure those to be more highly available. E.g. of your > 60 nodes, take 8 and put all the storage into those and make that a > dedicated GPFS cluster with no compute jobs on those nodes. Again, > you'll still need really beefy and reliable interconnect to make this work. > > Stepping back; what is the actual problem you're trying to solve? I > have certainly been in that situation before, where the problem is > more like: "I have a fixed hardware configuration that I can't change, > and I want to try to shoehorn a parallel filesystem onto that." > > I would recommend looking closer at your actual workloads. If this is > a "scratch" filesystem and file access is mostly from one node at a > time, it's not very useful to make two additional copies of that data > on other nodes, and it will only slow you down. > > Regards, > Alex > > On Tue, Mar 13, 2018 at 7:16 AM, Lukas Hejtmanek > > wrote: > > > On Tue, Mar 13, 2018 at 10:37:43AM +0000, John Hearns wrote: > > > Lukas, > > > It looks like you are proposing a setup which uses your compute > > > servers > > as storage servers also? > > > > yes, exactly. I would like to utilise NVMe SSDs that are in every > > compute servers.. Using them as a shared scratch area with GPFS is > > one of the options. > > > > > > > > * I'm thinking about the following setup: > > > ~ 60 nodes, each with two enterprise NVMe SSDs, FDR IB > > > interconnected > > > > > > There is nothing wrong with this concept, for instance see > > > https://www.beegfs.io/wiki/BeeOND > > > > > > I have an NVMe filesystem which uses 60 drives, but there are 10 servers. > > > You should look at "failure zones" also. > > > > you still need the storage servers and local SSDs to use only for > > caching, do I understand correctly? > > > > > > > > From: gpfsug-discuss-bounces at spectrumscale.org > > > [mailto:gpfsug-discuss- > > bounces at spectrumscale.org] On Behalf Of Knister, Aaron S. 
> > (GSFC-606.2)[COMPUTER SCIENCE CORP] > > > Sent: Monday, March 12, 2018 4:14 PM > > > To: gpfsug main discussion list > > > Subject: Re: [gpfsug-discuss] Preferred NSD > > > > > > Hi Lukas, > > > > > > Check out FPO mode. That mimics Hadoop's data placement features. > > > You > > can have up to 3 replicas both data and metadata but still the > > downside, though, as you say is the wrong node failures will take your cluster down. > > > > > > You might want to check out something like Excelero's NVMesh > > > (note: not > > an endorsement since I can't give such things) which can create > > logical volumes across all your NVMe drives. The product has erasure > > coding on their roadmap. I'm not sure if they've released that > > feature yet but in theory it will give better fault tolerance *and* > > you'll get more efficient usage of your SSDs. > > > > > > I'm sure there are other ways to skin this cat too. > > > > > > -Aaron > > > > > > > > > > > > On March 12, 2018 at 10:59:35 EDT, Lukas Hejtmanek > > > > > wrote: > > > Hello, > > > > > > I'm thinking about the following setup: > > > ~ 60 nodes, each with two enterprise NVMe SSDs, FDR IB > > > interconnected > > > > > > I would like to setup shared scratch area using GPFS and those > > > NVMe > > SSDs. Each > > > SSDs as on NSD. > > > > > > I don't think like 5 or more data/metadata replicas are practical here. > > On the > > > other hand, multiple node failures is something really expected. > > > > > > Is there a way to instrument that local NSD is strongly preferred > > > to > > store > > > data? I.e. node failure most probably does not result in > > > unavailable > > data for > > > the other nodes? > > > > > > Or is there any other recommendation/solution to build shared > > > scratch > > with > > > GPFS in such setup? (Do not do it including.) > > > > > > -- > > > Luk?? Hejtm?nek > > > _______________________________________________ > > > gpfsug-discuss mailing list > > > gpfsug-discuss at spectrumscale.org > > > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > > -- The information contained in this communication and any > > > attachments > > is confidential and may be privileged, and is for the sole use of > > the intended recipient(s). Any unauthorized review, use, disclosure > > or distribution is prohibited. Unless explicitly stated otherwise in > > the body of this communication or the attachment thereto (if any), > > the information is provided on an AS-IS basis without any express or > > implied warranties or liabilities. To the extent you are relying on > > this information, you are doing so at your own risk. If you are not > > the intended recipient, please notify the sender immediately by > > replying to this message and destroy all copies of this message and > > any attachments. Neither the sender nor the company/group of > > companies he or she represents shall be liable for the proper and > > complete transmission of the information contained in this communication, or for any delay in its receipt. > > > > > _______________________________________________ > > > gpfsug-discuss mailing list > > > gpfsug-discuss at spectrumscale.org > > > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > > > > > -- > > Luk?? 
Hejtm?nek > > > > Linux Administrator only because > > Full Time Multitasking Ninja > > is not an official job title > > _______________________________________________ > > gpfsug-discuss mailing list > > gpfsug-discuss at spectrumscale.org > > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss -- Luk?? Hejtm?nek Linux Administrator only because Full Time Multitasking Ninja is not an official job title _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss From ulmer at ulmer.org Wed Mar 14 18:59:29 2018 From: ulmer at ulmer.org (Stephen Ulmer) Date: Wed, 14 Mar 2018 14:59:29 -0400 Subject: [gpfsug-discuss] Preferred NSD In-Reply-To: References: <20180312145105.yb24bzo6chlpthwm@ics.muni.cz> <20180313141630.k2lqvcndz7mrakgs@ics.muni.cz> <20180314092815.k7thtymu33wg65xt@ics.muni.cz> Message-ID: <11B42072-E3CA-4FCE-BD1C-1E3DCA16626A@ulmer.org> Depending on the size... I just quoted something both ways and DME (which is Advanced Edition equivalent) was about $400K cheaper than Standard Edition socket pricing for this particular customer and use case. It all depends. Also, for the case where the OP wants to distribute the file system around on NVMe in *every* node, there is always the FPO license. The FPO license can share NSDs with other FPO licensed nodes and servers (just not clients). -- Stephen > On Mar 14, 2018, at 1:33 PM, Sobey, Richard A > wrote: > >>> 2. Have data management edition and capacity license the amount of storage. > There goes the budget ? > > Richard > > -----Original Message----- > From: gpfsug-discuss-bounces at spectrumscale.org > On Behalf Of Simon Thompson (IT Research Support) > Sent: 14 March 2018 16:54 > To: gpfsug main discussion list > > Subject: Re: [gpfsug-discuss] Preferred NSD > > Not always true. > > 1. Use them with socket licenses as HAWC or LROC is OK on a client. > 2. Have data management edition and capacity license the amount of storage. > > Simon > ________________________________________ > From: gpfsug-discuss-bounces at spectrumscale.org [gpfsug-discuss-bounces at spectrumscale.org ] on behalf of Jeffrey R. Lang [JRLang at uwyo.edu ] > Sent: 14 March 2018 14:11 > To: gpfsug main discussion list > Subject: Re: [gpfsug-discuss] Preferred NSD > > Something I haven't heard in this discussion, it that of licensing of GPFS. > > I believe that once you export disks from a node it then becomes a server node and the license may need to be changed, from client to server. There goes the budget. > > > > -----Original Message----- > From: gpfsug-discuss-bounces at spectrumscale.org > On Behalf Of Lukas Hejtmanek > Sent: Wednesday, March 14, 2018 4:28 AM > To: gpfsug main discussion list > > Subject: Re: [gpfsug-discuss] Preferred NSD > > Hello, > > thank you for insight. Well, the point is, that I will get ~60 with 120 NVMe disks in it, each about 2TB size. It means that I will have 240TB in NVMe SSD that could build nice shared scratch. 
Moreover, I have no different HW or place to put these SSDs into. They have to be in the compute nodes. > > On Tue, Mar 13, 2018 at 10:48:21AM -0700, Alex Chekholko wrote: >> I would like to discourage you from building a large distributed >> clustered filesystem made of many unreliable components. You will >> need to overprovision your interconnect and will also spend a lot of >> time in "healing" or "degraded" state. >> >> It is typically cheaper to centralize the storage into a subset of >> nodes and configure those to be more highly available. E.g. of your >> 60 nodes, take 8 and put all the storage into those and make that a >> dedicated GPFS cluster with no compute jobs on those nodes. Again, >> you'll still need really beefy and reliable interconnect to make this work. >> >> Stepping back; what is the actual problem you're trying to solve? I >> have certainly been in that situation before, where the problem is >> more like: "I have a fixed hardware configuration that I can't change, >> and I want to try to shoehorn a parallel filesystem onto that." >> >> I would recommend looking closer at your actual workloads. If this is >> a "scratch" filesystem and file access is mostly from one node at a >> time, it's not very useful to make two additional copies of that data >> on other nodes, and it will only slow you down. >> >> Regards, >> Alex >> >> On Tue, Mar 13, 2018 at 7:16 AM, Lukas Hejtmanek >> > >> wrote: >> >>> On Tue, Mar 13, 2018 at 10:37:43AM +0000, John Hearns wrote: >>>> Lukas, >>>> It looks like you are proposing a setup which uses your compute >>>> servers >>> as storage servers also? >>> >>> yes, exactly. I would like to utilise NVMe SSDs that are in every >>> compute servers.. Using them as a shared scratch area with GPFS is >>> one of the options. >>> >>>> >>>> * I'm thinking about the following setup: >>>> ~ 60 nodes, each with two enterprise NVMe SSDs, FDR IB >>>> interconnected >>>> >>>> There is nothing wrong with this concept, for instance see >>>> https://www.beegfs.io/wiki/BeeOND >>>> >>>> I have an NVMe filesystem which uses 60 drives, but there are 10 servers. >>>> You should look at "failure zones" also. >>> >>> you still need the storage servers and local SSDs to use only for >>> caching, do I understand correctly? >>> >>>> >>>> From: gpfsug-discuss-bounces at spectrumscale.org >>>> [mailto:gpfsug-discuss- >>> bounces at spectrumscale.org ] On Behalf Of Knister, Aaron S. >>> (GSFC-606.2)[COMPUTER SCIENCE CORP] >>>> Sent: Monday, March 12, 2018 4:14 PM >>>> To: gpfsug main discussion list > >>>> Subject: Re: [gpfsug-discuss] Preferred NSD >>>> >>>> Hi Lukas, >>>> >>>> Check out FPO mode. That mimics Hadoop's data placement features. >>>> You >>> can have up to 3 replicas both data and metadata but still the >>> downside, though, as you say is the wrong node failures will take your cluster down. >>>> >>>> You might want to check out something like Excelero's NVMesh >>>> (note: not >>> an endorsement since I can't give such things) which can create >>> logical volumes across all your NVMe drives. The product has erasure >>> coding on their roadmap. I'm not sure if they've released that >>> feature yet but in theory it will give better fault tolerance *and* >>> you'll get more efficient usage of your SSDs. >>>> >>>> I'm sure there are other ways to skin this cat too. 
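(As a concrete illustration of the FPO approach Aaron describes above -- keeping the first data replica on the NSD local to the writing node -- the sketch below shows the kind of NSD/pool stanza file an FPO setup uses. The pool attributes layoutMap, allowWriteAffinity, writeAffinityDepth and blockGroupFactor are the documented FPO write-affinity settings; the device names, node names, failure-group topology vectors and replica counts are purely illustrative assumptions, not details from this thread.)

--------------------------------------------------------------------------
# Sketch of an FPO-style stanza file. Everything node/device-specific below
# is hypothetical; only the pool attributes are the documented FPO knobs.
# (System-pool/metadata NSDs are omitted for brevity.)
cat > /tmp/fpo-stanza.txt <<'EOF'
%pool:
  pool=fpodata
  layoutMap=cluster
  allowWriteAffinity=yes
  writeAffinityDepth=1
  blockGroupFactor=128

%nsd:
  nsd=node01_nvme0
  device=/dev/nvme0n1
  servers=node01
  usage=dataOnly
  pool=fpodata
  failureGroup=1,0,1

%nsd:
  nsd=node02_nvme0
  device=/dev/nvme0n1
  servers=node02
  usage=dataOnly
  pool=fpodata
  failureGroup=1,0,2
EOF

# Then, roughly:
#   mmcrnsd -F /tmp/fpo-stanza.txt
#   mmcrfs fpofs -F /tmp/fpo-stanza.txt -m 2 -M 2 -r 2 -R 2
# With writeAffinityDepth=1 the first copy of each block lands on the
# writing node's own NSD; the second replica is what protects against the
# single-node failures being debated in this thread.
--------------------------------------------------------------------------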
>>>> >>>> -Aaron >>>> >>>> >>>> >>>> On March 12, 2018 at 10:59:35 EDT, Lukas Hejtmanek >>>> >>> >> wrote: >>>> Hello, >>>> >>>> I'm thinking about the following setup: >>>> ~ 60 nodes, each with two enterprise NVMe SSDs, FDR IB >>>> interconnected >>>> >>>> I would like to setup shared scratch area using GPFS and those >>>> NVMe >>> SSDs. Each >>>> SSDs as on NSD. >>>> >>>> I don't think like 5 or more data/metadata replicas are practical here. >>> On the >>>> other hand, multiple node failures is something really expected. >>>> >>>> Is there a way to instrument that local NSD is strongly preferred >>>> to >>> store >>>> data? I.e. node failure most probably does not result in >>>> unavailable >>> data for >>>> the other nodes? >>>> >>>> Or is there any other recommendation/solution to build shared >>>> scratch >>> with >>>> GPFS in such setup? (Do not do it including.) >>>> >>>> -- >>>> Luk?? Hejtm?nek >>>> _______________________________________________ >>>> gpfsug-discuss mailing list >>>> gpfsug-discuss at spectrumscale.org >>>> http://gpfsug.org/mailman/listinfo/gpfsug-discuss >>>> -- The information contained in this communication and any >>>> attachments >>> is confidential and may be privileged, and is for the sole use of >>> the intended recipient(s). Any unauthorized review, use, disclosure >>> or distribution is prohibited. Unless explicitly stated otherwise in >>> the body of this communication or the attachment thereto (if any), >>> the information is provided on an AS-IS basis without any express or >>> implied warranties or liabilities. To the extent you are relying on >>> this information, you are doing so at your own risk. If you are not >>> the intended recipient, please notify the sender immediately by >>> replying to this message and destroy all copies of this message and >>> any attachments. Neither the sender nor the company/group of >>> companies he or she represents shall be liable for the proper and >>> complete transmission of the information contained in this communication, or for any delay in its receipt. >>> >>>> _______________________________________________ >>>> gpfsug-discuss mailing list >>>> gpfsug-discuss at spectrumscale.org >>>> http://gpfsug.org/mailman/listinfo/gpfsug-discuss >>> >>> >>> -- >>> Luk?? Hejtm?nek >>> >>> Linux Administrator only because >>> Full Time Multitasking Ninja >>> is not an official job title >>> _______________________________________________ >>> gpfsug-discuss mailing list >>> gpfsug-discuss at spectrumscale.org >>> http://gpfsug.org/mailman/listinfo/gpfsug-discuss >>> > >> _______________________________________________ >> gpfsug-discuss mailing list >> gpfsug-discuss at spectrumscale.org >> http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > > -- > Luk?? 
Hejtm?nek > > Linux Administrator only because > Full Time Multitasking Ninja > is not an official job title > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss -------------- next part -------------- An HTML attachment was scrubbed... URL: From stockf at us.ibm.com Wed Mar 14 19:23:18 2018 From: stockf at us.ibm.com (Frederick Stock) Date: Wed, 14 Mar 2018 14:23:18 -0500 Subject: [gpfsug-discuss] Preferred NSD In-Reply-To: <11B42072-E3CA-4FCE-BD1C-1E3DCA16626A@ulmer.org> References: <20180312145105.yb24bzo6chlpthwm@ics.muni.cz><20180313141630.k2lqvcndz7mrakgs@ics.muni.cz><20180314092815.k7thtymu33wg65xt@ics.muni.cz> <11B42072-E3CA-4FCE-BD1C-1E3DCA16626A@ulmer.org> Message-ID: My understanding is that with Spectrum Scale 5.0 there is no longer a standard edition, only data management and advanced, and the pricing is all done via storage not sockets. Now there may be some grandfathering for those with existing socket licenses but I really do not know. My point is that data management is not the same as advanced edition. Again I could be wrong because I tend not to concern myself with how the product is licensed. Fred __________________________________________________ Fred Stock | IBM Pittsburgh Lab | 720-430-8821 stockf at us.ibm.com From: Stephen Ulmer To: gpfsug main discussion list Date: 03/14/2018 03:06 PM Subject: Re: [gpfsug-discuss] Preferred NSD Sent by: gpfsug-discuss-bounces at spectrumscale.org Depending on the size... I just quoted something both ways and DME (which is Advanced Edition equivalent) was about $400K cheaper than Standard Edition socket pricing for this particular customer and use case. It all depends. Also, for the case where the OP wants to distribute the file system around on NVMe in *every* node, there is always the FPO license. The FPO license can share NSDs with other FPO licensed nodes and servers (just not clients). -- Stephen On Mar 14, 2018, at 1:33 PM, Sobey, Richard A wrote: 2. Have data management edition and capacity license the amount of storage. There goes the budget ? Richard -----Original Message----- From: gpfsug-discuss-bounces at spectrumscale.org < gpfsug-discuss-bounces at spectrumscale.org> On Behalf Of Simon Thompson (IT Research Support) Sent: 14 March 2018 16:54 To: gpfsug main discussion list Subject: Re: [gpfsug-discuss] Preferred NSD Not always true. 1. Use them with socket licenses as HAWC or LROC is OK on a client. 2. Have data management edition and capacity license the amount of storage. Simon ________________________________________ From: gpfsug-discuss-bounces at spectrumscale.org [ gpfsug-discuss-bounces at spectrumscale.org] on behalf of Jeffrey R. Lang [ JRLang at uwyo.edu] Sent: 14 March 2018 14:11 To: gpfsug main discussion list Subject: Re: [gpfsug-discuss] Preferred NSD Something I haven't heard in this discussion, it that of licensing of GPFS. 
I believe that once you export disks from a node it then becomes a server node and the license may need to be changed, from client to server. There goes the budget. -----Original Message----- From: gpfsug-discuss-bounces at spectrumscale.org < gpfsug-discuss-bounces at spectrumscale.org> On Behalf Of Lukas Hejtmanek Sent: Wednesday, March 14, 2018 4:28 AM To: gpfsug main discussion list Subject: Re: [gpfsug-discuss] Preferred NSD Hello, thank you for insight. Well, the point is, that I will get ~60 with 120 NVMe disks in it, each about 2TB size. It means that I will have 240TB in NVMe SSD that could build nice shared scratch. Moreover, I have no different HW or place to put these SSDs into. They have to be in the compute nodes. On Tue, Mar 13, 2018 at 10:48:21AM -0700, Alex Chekholko wrote: I would like to discourage you from building a large distributed clustered filesystem made of many unreliable components. You will need to overprovision your interconnect and will also spend a lot of time in "healing" or "degraded" state. It is typically cheaper to centralize the storage into a subset of nodes and configure those to be more highly available. E.g. of your 60 nodes, take 8 and put all the storage into those and make that a dedicated GPFS cluster with no compute jobs on those nodes. Again, you'll still need really beefy and reliable interconnect to make this work. Stepping back; what is the actual problem you're trying to solve? I have certainly been in that situation before, where the problem is more like: "I have a fixed hardware configuration that I can't change, and I want to try to shoehorn a parallel filesystem onto that." I would recommend looking closer at your actual workloads. If this is a "scratch" filesystem and file access is mostly from one node at a time, it's not very useful to make two additional copies of that data on other nodes, and it will only slow you down. Regards, Alex On Tue, Mar 13, 2018 at 7:16 AM, Lukas Hejtmanek wrote: On Tue, Mar 13, 2018 at 10:37:43AM +0000, John Hearns wrote: Lukas, It looks like you are proposing a setup which uses your compute servers as storage servers also? yes, exactly. I would like to utilise NVMe SSDs that are in every compute servers.. Using them as a shared scratch area with GPFS is one of the options. * I'm thinking about the following setup: ~ 60 nodes, each with two enterprise NVMe SSDs, FDR IB interconnected There is nothing wrong with this concept, for instance see https://www.beegfs.io/wiki/BeeOND I have an NVMe filesystem which uses 60 drives, but there are 10 servers. You should look at "failure zones" also. you still need the storage servers and local SSDs to use only for caching, do I understand correctly? From: gpfsug-discuss-bounces at spectrumscale.org [mailto:gpfsug-discuss- bounces at spectrumscale.org] On Behalf Of Knister, Aaron S. (GSFC-606.2)[COMPUTER SCIENCE CORP] Sent: Monday, March 12, 2018 4:14 PM To: gpfsug main discussion list Subject: Re: [gpfsug-discuss] Preferred NSD Hi Lukas, Check out FPO mode. That mimics Hadoop's data placement features. You can have up to 3 replicas both data and metadata but still the downside, though, as you say is the wrong node failures will take your cluster down. You might want to check out something like Excelero's NVMesh (note: not an endorsement since I can't give such things) which can create logical volumes across all your NVMe drives. The product has erasure coding on their roadmap. 
I'm not sure if they've released that feature yet but in theory it will give better fault tolerance *and* you'll get more efficient usage of your SSDs. I'm sure there are other ways to skin this cat too. -Aaron On March 12, 2018 at 10:59:35 EDT, Lukas Hejtmanek > wrote: Hello, I'm thinking about the following setup: ~ 60 nodes, each with two enterprise NVMe SSDs, FDR IB interconnected I would like to setup shared scratch area using GPFS and those NVMe SSDs. Each SSDs as on NSD. I don't think like 5 or more data/metadata replicas are practical here. On the other hand, multiple node failures is something really expected. Is there a way to instrument that local NSD is strongly preferred to store data? I.e. node failure most probably does not result in unavailable data for the other nodes? Or is there any other recommendation/solution to build shared scratch with GPFS in such setup? (Do not do it including.) -- Luk?? Hejtm?nek _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss -- The information contained in this communication and any attachments is confidential and may be privileged, and is for the sole use of the intended recipient(s). Any unauthorized review, use, disclosure or distribution is prohibited. Unless explicitly stated otherwise in the body of this communication or the attachment thereto (if any), the information is provided on an AS-IS basis without any express or implied warranties or liabilities. To the extent you are relying on this information, you are doing so at your own risk. If you are not the intended recipient, please notify the sender immediately by replying to this message and destroy all copies of this message and any attachments. Neither the sender nor the company/group of companies he or she represents shall be liable for the proper and complete transmission of the information contained in this communication, or for any delay in its receipt. _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss -- Luk?? Hejtm?nek Linux Administrator only because Full Time Multitasking Ninja is not an official job title _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss -- Luk?? 
Hejtm?nek Linux Administrator only because Full Time Multitasking Ninja is not an official job title _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org https://urldefense.proofpoint.com/v2/url?u=http-3A__gpfsug.org_mailman_listinfo_gpfsug-2Ddiscuss&d=DwICAg&c=jf_iaSHvJObTbx-siA1ZOg&r=p_1XEUyoJ7-VJxF_w8h9gJh8_Wj0Pey73LCLLoxodpw&m=kB88vNQV9x5UFOu3tBxpRKmS3rSCi68KIBxOa_D5ji8&s=R9wxUL1IMkjtWZsFkSAXRUmuKi8uS1jpQRYVTvOYq3g&e= -------------- next part -------------- An HTML attachment was scrubbed... URL: From S.J.Thompson at bham.ac.uk Wed Mar 14 19:27:57 2018 From: S.J.Thompson at bham.ac.uk (Simon Thompson (IT Research Support)) Date: Wed, 14 Mar 2018 19:27:57 +0000 Subject: [gpfsug-discuss] Preferred NSD In-Reply-To: References: <20180312145105.yb24bzo6chlpthwm@ics.muni.cz><20180313141630.k2lqvcndz7mrakgs@ics.muni.cz><20180314092815.k7thtymu33wg65xt@ics.muni.cz> <11B42072-E3CA-4FCE-BD1C-1E3DCA16626A@ulmer.org>, Message-ID: I don't think this is correct. My understanding is: There is no longer express edition. Grand fathered to standard. Standard edition (sockets) remains. Advanced edition (sockets) is available for existing advanced customers only. Grand fathering to DME available. Data management (mostly capacity but per disk in ESS and DSS-G configs, different cost for flash or spinning drives). I'm sure Carl can correct me if I'm wrong here. Simon ________________________________________ From: gpfsug-discuss-bounces at spectrumscale.org [gpfsug-discuss-bounces at spectrumscale.org] on behalf of stockf at us.ibm.com [stockf at us.ibm.com] Sent: 14 March 2018 19:23 To: gpfsug main discussion list Subject: Re: [gpfsug-discuss] Preferred NSD My understanding is that with Spectrum Scale 5.0 there is no longer a standard edition, only data management and advanced, and the pricing is all done via storage not sockets. Now there may be some grandfathering for those with existing socket licenses but I really do not know. My point is that data management is not the same as advanced edition. Again I could be wrong because I tend not to concern myself with how the product is licensed. Fred __________________________________________________ Fred Stock | IBM Pittsburgh Lab | 720-430-8821 stockf at us.ibm.com From: Stephen Ulmer To: gpfsug main discussion list Date: 03/14/2018 03:06 PM Subject: Re: [gpfsug-discuss] Preferred NSD Sent by: gpfsug-discuss-bounces at spectrumscale.org ________________________________ Depending on the size... I just quoted something both ways and DME (which is Advanced Edition equivalent) was about $400K cheaper than Standard Edition socket pricing for this particular customer and use case. It all depends. Also, for the case where the OP wants to distribute the file system around on NVMe in *every* node, there is always the FPO license. 
The FPO license can share NSDs with other FPO licensed nodes and servers (just not clients). -- Stephen On Mar 14, 2018, at 1:33 PM, Sobey, Richard A > wrote: 2. Have data management edition and capacity license the amount of storage. There goes the budget ? Richard -----Original Message----- From: gpfsug-discuss-bounces at spectrumscale.org> On Behalf Of Simon Thompson (IT Research Support) Sent: 14 March 2018 16:54 To: gpfsug main discussion list > Subject: Re: [gpfsug-discuss] Preferred NSD Not always true. 1. Use them with socket licenses as HAWC or LROC is OK on a client. 2. Have data management edition and capacity license the amount of storage. Simon ________________________________________ From: gpfsug-discuss-bounces at spectrumscale.org[gpfsug-discuss-bounces at spectrumscale.org] on behalf of Jeffrey R. Lang [JRLang at uwyo.edu] Sent: 14 March 2018 14:11 To: gpfsug main discussion list Subject: Re: [gpfsug-discuss] Preferred NSD Something I haven't heard in this discussion, it that of licensing of GPFS. I believe that once you export disks from a node it then becomes a server node and the license may need to be changed, from client to server. There goes the budget. -----Original Message----- From: gpfsug-discuss-bounces at spectrumscale.org> On Behalf Of Lukas Hejtmanek Sent: Wednesday, March 14, 2018 4:28 AM To: gpfsug main discussion list > Subject: Re: [gpfsug-discuss] Preferred NSD Hello, thank you for insight. Well, the point is, that I will get ~60 with 120 NVMe disks in it, each about 2TB size. It means that I will have 240TB in NVMe SSD that could build nice shared scratch. Moreover, I have no different HW or place to put these SSDs into. They have to be in the compute nodes. On Tue, Mar 13, 2018 at 10:48:21AM -0700, Alex Chekholko wrote: I would like to discourage you from building a large distributed clustered filesystem made of many unreliable components. You will need to overprovision your interconnect and will also spend a lot of time in "healing" or "degraded" state. It is typically cheaper to centralize the storage into a subset of nodes and configure those to be more highly available. E.g. of your 60 nodes, take 8 and put all the storage into those and make that a dedicated GPFS cluster with no compute jobs on those nodes. Again, you'll still need really beefy and reliable interconnect to make this work. Stepping back; what is the actual problem you're trying to solve? I have certainly been in that situation before, where the problem is more like: "I have a fixed hardware configuration that I can't change, and I want to try to shoehorn a parallel filesystem onto that." I would recommend looking closer at your actual workloads. If this is a "scratch" filesystem and file access is mostly from one node at a time, it's not very useful to make two additional copies of that data on other nodes, and it will only slow you down. Regards, Alex On Tue, Mar 13, 2018 at 7:16 AM, Lukas Hejtmanek > wrote: On Tue, Mar 13, 2018 at 10:37:43AM +0000, John Hearns wrote: Lukas, It looks like you are proposing a setup which uses your compute servers as storage servers also? yes, exactly. I would like to utilise NVMe SSDs that are in every compute servers.. Using them as a shared scratch area with GPFS is one of the options. 
* I'm thinking about the following setup: ~ 60 nodes, each with two enterprise NVMe SSDs, FDR IB interconnected There is nothing wrong with this concept, for instance see https://www.beegfs.io/wiki/BeeOND I have an NVMe filesystem which uses 60 drives, but there are 10 servers. You should look at "failure zones" also. you still need the storage servers and local SSDs to use only for caching, do I understand correctly? From: gpfsug-discuss-bounces at spectrumscale.org [mailto:gpfsug-discuss- bounces at spectrumscale.org] On Behalf Of Knister, Aaron S. (GSFC-606.2)[COMPUTER SCIENCE CORP] Sent: Monday, March 12, 2018 4:14 PM To: gpfsug main discussion list > Subject: Re: [gpfsug-discuss] Preferred NSD Hi Lukas, Check out FPO mode. That mimics Hadoop's data placement features. You can have up to 3 replicas both data and metadata but still the downside, though, as you say is the wrong node failures will take your cluster down. You might want to check out something like Excelero's NVMesh (note: not an endorsement since I can't give such things) which can create logical volumes across all your NVMe drives. The product has erasure coding on their roadmap. I'm not sure if they've released that feature yet but in theory it will give better fault tolerance *and* you'll get more efficient usage of your SSDs. I'm sure there are other ways to skin this cat too. -Aaron On March 12, 2018 at 10:59:35 EDT, Lukas Hejtmanek > wrote: Hello, I'm thinking about the following setup: ~ 60 nodes, each with two enterprise NVMe SSDs, FDR IB interconnected I would like to setup shared scratch area using GPFS and those NVMe SSDs. Each SSDs as on NSD. I don't think like 5 or more data/metadata replicas are practical here. On the other hand, multiple node failures is something really expected. Is there a way to instrument that local NSD is strongly preferred to store data? I.e. node failure most probably does not result in unavailable data for the other nodes? Or is there any other recommendation/solution to build shared scratch with GPFS in such setup? (Do not do it including.) -- Luk?? Hejtm?nek _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss -- The information contained in this communication and any attachments is confidential and may be privileged, and is for the sole use of the intended recipient(s). Any unauthorized review, use, disclosure or distribution is prohibited. Unless explicitly stated otherwise in the body of this communication or the attachment thereto (if any), the information is provided on an AS-IS basis without any express or implied warranties or liabilities. To the extent you are relying on this information, you are doing so at your own risk. If you are not the intended recipient, please notify the sender immediately by replying to this message and destroy all copies of this message and any attachments. Neither the sender nor the company/group of companies he or she represents shall be liable for the proper and complete transmission of the information contained in this communication, or for any delay in its receipt. _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss -- Luk?? 
Hejtm?nek Linux Administrator only because Full Time Multitasking Ninja is not an official job title _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss -- Luk?? Hejtm?nek Linux Administrator only because Full Time Multitasking Ninja is not an official job title _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org https://urldefense.proofpoint.com/v2/url?u=http-3A__gpfsug.org_mailman_listinfo_gpfsug-2Ddiscuss&d=DwICAg&c=jf_iaSHvJObTbx-siA1ZOg&r=p_1XEUyoJ7-VJxF_w8h9gJh8_Wj0Pey73LCLLoxodpw&m=kB88vNQV9x5UFOu3tBxpRKmS3rSCi68KIBxOa_D5ji8&s=R9wxUL1IMkjtWZsFkSAXRUmuKi8uS1jpQRYVTvOYq3g&e= From makaplan at us.ibm.com Wed Mar 14 20:02:15 2018 From: makaplan at us.ibm.com (Marc A Kaplan) Date: Wed, 14 Mar 2018 15:02:15 -0500 Subject: [gpfsug-discuss] Editions and Licensing / Preferred NSD In-Reply-To: <11B42072-E3CA-4FCE-BD1C-1E3DCA16626A@ulmer.org> References: <20180312145105.yb24bzo6chlpthwm@ics.muni.cz><20180313141630.k2lqvcndz7mrakgs@ics.muni.cz><20180314092815.k7thtymu33wg65xt@ics.muni.cz> <11B42072-E3CA-4FCE-BD1C-1E3DCA16626A@ulmer.org> Message-ID: Thread seems to have gone off on a product editions and Licensing tangents -- refer to IBM website for official statements: https://www.ibm.com/support/knowledgecenter/en/STXKQY_5.0.0/com.ibm.spectrum.scale.v5r00.doc/bl1in_IntroducingIBMSpectrumScale.htm -------------- next part -------------- An HTML attachment was scrubbed... URL: From Mark.Bush at siriuscom.com Wed Mar 14 15:36:32 2018 From: Mark.Bush at siriuscom.com (Mark Bush) Date: Wed, 14 Mar 2018 15:36:32 +0000 Subject: [gpfsug-discuss] Underlying LUN mirroring NSD impact Message-ID: Is it possible (albeit not advisable) to mirror LUNs that are NSD's to another storage array in another site basically for DR purposes? Once it's mirrored to a new cluster elsewhere what would be the step to get the filesystem back up and running. I know that AFM-DR is meant for this but in this case my client only has Standard edition and has mirroring software purchased with the underlying disk array. Is this even doable? Mark -------------- next part -------------- An HTML attachment was scrubbed... URL: From carlz at us.ibm.com Wed Mar 14 20:31:01 2018 From: carlz at us.ibm.com (Carl Zetie) Date: Wed, 14 Mar 2018 20:31:01 +0000 Subject: [gpfsug-discuss] Editions and Licensing / Preferred NSD In-Reply-To: References: Message-ID: Simon's description is correct. For those who don't have it readily to hand I'll reiterate it here (in my own words): We discontinued Express a while back; everybody on that edition got a free upgrade to Standard. 
Standard continues to be licensed on sockets. This has certain advantages (clients and FPOs nodes are cheap, but as noted in the thread if you need to change them to servers, they get more expensive) Advanced was retired; those already on it were "grandfathered in" can continue to buy it, so no forced conversion. But no new customers. In place of Advanced, Data Management Edition is licensed by the TiB. This has the advantage of simplicity -- it is completely flat regardless of topology. It also allows you to add and subtract nodes, including clients, or change a client node to a server node, at will without having to go through a licensing transaction or keep count of clients or pay a penalty for putting clients in a separate compute cluster or ... BTW, I'll be at the UG in London and (probably) in Boston, if anybody wants to talk licensing... Carl Zetie Offering Manager for Spectrum Scale, IBM (540) 882 9353 ][ Research Triangle Park carlz at us.ibm.com ********************************************** From olaf.weiser at de.ibm.com Wed Mar 14 23:19:03 2018 From: olaf.weiser at de.ibm.com (Olaf Weiser) Date: Thu, 15 Mar 2018 00:19:03 +0100 Subject: [gpfsug-discuss] Underlying LUN mirroring NSD impact In-Reply-To: References: Message-ID: An HTML attachment was scrubbed... URL: From secretary at gpfsug.org Thu Mar 15 10:00:08 2018 From: secretary at gpfsug.org (Secretary GPFS UG) Date: Thu, 15 Mar 2018 10:00:08 +0000 Subject: [gpfsug-discuss] Meetup at the IBM System Z Technical University Message-ID: <738c1046e602fb96e1dc6e5772c0a65a@webmail.gpfsug.org> Dear members, We have another meet up opportunity for you! There's a Spectrum Scale Meet Up taking place at the System Z Technical University on 14th May in London. It's free to attend and is an ideal opportunity to learn about Spectrum Scale on IBM Z in particular and hear from the UK Met Office. Please email your registration to Par Hettinga par at nl.ibm.com and if you have any questions, please contact Par. Date: Monday 14th May 2018 Time: 4.15pm - 6:15 PM Agenda: 16.15 - Welcome & Introductions 16.25 - IBM Spectrum Scale and Industry Use Cases for IBM System Z 17.10 - UK Met Office - Why IBM Spectrum Scale with System Z 17.40 - Spectrum Scale on IBM Z 18.10 - Questions & Close 18.15 - Drinks & Networking Location: Room B4 Beaujolais Novotel London West 1 Shortlands London W6 8DR United Kingdom 020 7660 0680 Thanks, -- Claire O'Toole Spectrum Scale/GPFS User Group Secretary +44 (0)7508 033896 www.spectrumscaleug.org -------------- next part -------------- An HTML attachment was scrubbed... URL: From makaplan at us.ibm.com Thu Mar 15 14:57:41 2018 From: makaplan at us.ibm.com (Marc A Kaplan) Date: Thu, 15 Mar 2018 09:57:41 -0500 Subject: [gpfsug-discuss] Underlying LUN mirroring NSD impact In-Reply-To: References: Message-ID: Does the mirrored-storage vendor guarantee the sequence of all writes to all the LUNs at the remote-site exactly matches the sequence of writes to the local site....? If not.. the file system on the remote-site could be left in an inconsistent state when the communications line is cut... Guaranteeing sequencing to each LUN is not sufficient, because a typical GPFS file system has its data and metadata spread over several LUNs. From: "Olaf Weiser" To: gpfsug main discussion list Date: 03/14/2018 07:19 PM Subject: Re: [gpfsug-discuss] Underlying LUN mirroring NSD impact Sent by: gpfsug-discuss-bounces at spectrumscale.org HI Mark.. yes.. that's possible... at least , I'm sure.. 
there was a chapter in the former advanced admin guide of older releases with PPRC .. how to do that.. similar to PPRC , you might use other methods , but from gpfs perspective this should'nt make a difference.. and I had have a german customer, who was doing this for years... (but it is some years back meanwhile ... hihi time flies...) From: Mark Bush To: "gpfsug-discuss at spectrumscale.org" Date: 03/14/2018 09:11 PM Subject: [gpfsug-discuss] Underlying LUN mirroring NSD impact Sent by: gpfsug-discuss-bounces at spectrumscale.org Is it possible (albeit not advisable) to mirror LUNs that are NSD?s to another storage array in another site basically for DR purposes? Once it?s mirrored to a new cluster elsewhere what would be the step to get the filesystem back up and running. I know that AFM-DR is meant for this but in this case my client only has Standard edition and has mirroring software purchased with the underlying disk array. Is this even doable? Mark _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org https://urldefense.proofpoint.com/v2/url?u=http-3A__gpfsug.org_mailman_listinfo_gpfsug-2Ddiscuss&d=DwICAg&c=jf_iaSHvJObTbx-siA1ZOg&r=cvpnBBH0j41aQy0RPiG2xRL_M8mTc1izuQD3_PmtjZ8&m=vq-nGaYTObfhVeW9E8fpLCJ9MIi9SNCiO5yYfXwJWhY&s=9o--h1_iFfwOmI2jRmxRjZSJX7IfQSFwUi6AfFhEas0&e= -------------- next part -------------- An HTML attachment was scrubbed... URL: From valdis.kletnieks at vt.edu Thu Mar 15 15:07:30 2018 From: valdis.kletnieks at vt.edu (valdis.kletnieks at vt.edu) Date: Thu, 15 Mar 2018 11:07:30 -0400 Subject: [gpfsug-discuss] Underlying LUN mirroring NSD impact In-Reply-To: References: Message-ID: <26547.1521126450@turing-police.cc.vt.edu> On Wed, 14 Mar 2018 15:36:32 -0000, Mark Bush said: > Is it possible (albeit not advisable) to mirror LUNs that are NSD's to > another storage array in another site basically for DR purposes? Once it's > mirrored to a new cluster elsewhere what would be the step to get the > filesystem back up and running. I know that AFM-DR is meant for this but in > this case my client only has Standard edition and has mirroring software > purchased with the underlying disk array. > Is this even doable? We had a discussion on the list about this recently. The upshot is that it's sort of doable, but depends on what failure modes you're trying to protect against. The basic problem is that if you're doing mirroring at the array level, there's a certain amount of skew delay where GPFS has written stuff on the local disk and it hasn't been copied to the remote disk (basically the same reason why running fsck on a mounted disk partition can be problematic). There's also issues if things are scribbling on the local file system and generating enough traffic to saturate the network link you're doing the mirroring over, for a long enough time to overwhelm the mirroring mechanism (both sync and async mirroring have their good and bad sides in that scenario) We're using a stretch cluster with GPFS replication to storage about 95 cable miles away - that has the advantage that then GPFS knows there's a remote replica and can take more steps to make sure the remote copy is consistent. 
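(To make the consistency concerns above concrete: the usual way to avoid an inconsistent remote image with array-based mirroring is to quiesce the file system so every LUN is frozen at the same logical point before the mirror is split, along the lines of the storage-based replication procedure in the Scale documentation. A rough sketch follows -- the file system name and the vendor mirror-split command are placeholders, and the exact procedure should be taken from the documentation for your storage and Scale release.)

--------------------------------------------------------------------------
#!/bin/bash
# Sketch: take a consistent array-mirrored copy of a GPFS file system.
FS=fs1                               # placeholder file system name

# 1. Flush dirty data and suspend I/O so all NSDs/LUNs stop at one point.
/usr/lpp/mmfs/bin/mmfsctl $FS suspend

# 2. Split or snapshot the mirror on the storage array (vendor-specific).
# vendor_split_mirror --consistency-group gpfs_cg       # placeholder

# 3. Resume normal operation on the production side.
/usr/lpp/mmfs/bin/mmfsctl $FS resume

# Keep the file system configuration available to the recovery cluster so
# the mirrored copy can be brought up there, e.g.:
#   mmfsctl $FS syncFSconfig -n /path/to/remote.nodes.file
--------------------------------------------------------------------------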
In particular, if it knows there's replication that needs to be done and it's getting backlogged, it can present a slow-down to the local writers and ensure that the remote set of disks don't fall too far behind.... (There's some funkyness having to do with quorum - it's *really* hard to set up so you have both protection against split-brain and the ability to start up the remote site stand-alone - mostly because from the remote point of view, starting up stand-alone after the main site fails looks identical to split-brain) -------------- next part -------------- A non-text attachment was scrubbed... Name: not available Type: application/pgp-signature Size: 486 bytes Desc: not available URL: From janfrode at tanso.net Thu Mar 15 17:12:23 2018 From: janfrode at tanso.net (Jan-Frode Myklebust) Date: Thu, 15 Mar 2018 18:12:23 +0100 Subject: [gpfsug-discuss] GPFS autoload - wait for IB ports to becomeactive In-Reply-To: References: <0081EB235765E14395278B9AE1DF341846510A@MBX214.d.ethz.ch> <4AD44D34-5275-4ADB-8CC7-8E80170DDA7F@brown.edu> Message-ID: I found some discussion on this at https://www.ibm.com/developerworks/community/forums/html/threadTopic?id=77777777-0000-0000-0000-000014471957&ps=25 and there it's claimed that none of the callback events are early enough to resolve this. That we need a pre-preStartup trigger. Any idea if this has changed -- or is the callback option then only to do a "--onerror shutdown" if it has failed to connect IB ? On Thu, Mar 8, 2018 at 1:42 PM, Frederick Stock wrote: > You could also use the GPFS prestartup callback (mmaddcallback) to execute > a script synchronously that waits for the IB ports to become available > before returning and allowing GPFS to continue. Not systemd integrated but > it should work. > > Fred > __________________________________________________ > Fred Stock | IBM Pittsburgh Lab | 720-430-8821 <(720)%20430-8821> > stockf at us.ibm.com > > > > From: david_johnson at brown.edu > To: gpfsug main discussion list > Date: 03/08/2018 07:34 AM > Subject: Re: [gpfsug-discuss] GPFS autoload - wait for IB ports to > become active > Sent by: gpfsug-discuss-bounces at spectrumscale.org > ------------------------------ > > > > Until IBM provides a solution, here is my workaround. Add it so it runs > before the gpfs script, I call it from our custom xcat diskless boot > scripts. Based on rhel7, not fully systemd integrated. YMMV! > > Regards, > ? ddj > ??- > [ddj at storage041 ~]$ cat /etc/init.d/ibready > #! /bin/bash > # > # chkconfig: 2345 06 94 > # /etc/rc.d/init.d/ibready > # written in 2016 David D Johnson (ddj *brown.edu* > > ) > # > ### BEGIN INIT INFO > # Provides: ibready > # Required-Start: > # Required-Stop: > # Default-Stop: > # Description: Block until infiniband is ready > # Short-Description: Block until infiniband is ready > ### END INIT INFO > > RETVAL=0 > if [[ -d /sys/class/infiniband ]] > then > IBDEVICE=$(dirname $(grep -il infiniband > /sys/class/infiniband/*/ports/1/link* | head -n 1)) > fi > # See how we were called. > case "$1" in > start) > if [[ -n $IBDEVICE && -f $IBDEVICE/state ]] > then > echo -n "Polling for InfiniBand link up: " > for (( count = 60; count > 0; count-- )) > do > if grep -q ACTIVE $IBDEVICE/state > then > echo ACTIVE > break > fi > echo -n "." 
> sleep 5 > done > if (( count <= 0 )) > then > echo DOWN - $0 timed out > fi > fi > ;; > stop|restart|reload|force-reload|condrestart|try-restart) > ;; > status) > if [[ -n $IBDEVICE && -f $IBDEVICE/state ]] > then > echo "$IBDEVICE is $(< $IBDEVICE/state) $(< > $IBDEVICE/rate)" > else > echo "No IBDEVICE found" > fi > ;; > *) > echo "Usage: ibready {start|stop|status|restart| > reload|force-reload|condrestart|try-restart}" > exit 2 > esac > exit ${RETVAL} > ???? > > -- ddj > Dave Johnson > > On Mar 8, 2018, at 6:10 AM, Caubet Serrabou Marc (PSI) < > *marc.caubet at psi.ch* > wrote: > > Hi all, > > with autoload = yes we do not ensure that GPFS will be started after the > IB link becomes up. Is there a way to force GPFS waiting to start until IB > ports are up? This can be probably done by adding something like > After=network-online.target and Wants=network-online.target in the systemd > file but I would like to know if this is natively possible from the GPFS > configuration. > > Thanks a lot, > Marc > _________________________________________ > Paul Scherrer Institut > High Performance Computing > Marc Caubet Serrabou > WHGA/036 > 5232 Villigen PSI > Switzerland > > Telephone: +41 56 310 46 67 <+41%2056%20310%2046%2067> > E-Mail: *marc.caubet at psi.ch* > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at *spectrumscale.org* > > *http://gpfsug.org/mailman/listinfo/gpfsug-discuss* > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > https://urldefense.proofpoint.com/v2/url?u=http-3A__gpfsug. > org_mailman_listinfo_gpfsug-2Ddiscuss&d=DwICAg&c=jf_ > iaSHvJObTbx-siA1ZOg&r=p_1XEUyoJ7-VJxF_w8h9gJh8_ > Wj0Pey73LCLLoxodpw&m=u-EMob09-dkE6jZbD3dTjBi3vWhmDXtxiOK3nqF > yIgY&s=JCfJgq6pZnKUI6d-rIgJXVcdZh7vmA5ypB1_goP_FFA&e= > > > > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From stockf at us.ibm.com Thu Mar 15 17:23:38 2018 From: stockf at us.ibm.com (Frederick Stock) Date: Thu, 15 Mar 2018 12:23:38 -0500 Subject: [gpfsug-discuss] GPFS autoload - wait for IB ports tobecomeactive In-Reply-To: References: <0081EB235765E14395278B9AE1DF341846510A@MBX214.d.ethz.ch><4AD44D34-5275-4ADB-8CC7-8E80170DDA7F@brown.edu> Message-ID: The callback is the only way I know to use the "--onerror shutdown" option. Fred __________________________________________________ Fred Stock | IBM Pittsburgh Lab | 720-430-8821 stockf at us.ibm.com From: Jan-Frode Myklebust To: gpfsug main discussion list Date: 03/15/2018 01:14 PM Subject: Re: [gpfsug-discuss] GPFS autoload - wait for IB ports to becomeactive Sent by: gpfsug-discuss-bounces at spectrumscale.org I found some discussion on this at https://www.ibm.com/developerworks/community/forums/html/threadTopic?id=77777777-0000-0000-0000-000014471957&ps=25 and there it's claimed that none of the callback events are early enough to resolve this. That we need a pre-preStartup trigger. Any idea if this has changed -- or is the callback option then only to do a "--onerror shutdown" if it has failed to connect IB ? On Thu, Mar 8, 2018 at 1:42 PM, Frederick Stock wrote: You could also use the GPFS prestartup callback (mmaddcallback) to execute a script synchronously that waits for the IB ports to become available before returning and allowing GPFS to continue. 
Not systemd integrated but it should work. Fred __________________________________________________ Fred Stock | IBM Pittsburgh Lab | 720-430-8821 stockf at us.ibm.com From: david_johnson at brown.edu To: gpfsug main discussion list Date: 03/08/2018 07:34 AM Subject: Re: [gpfsug-discuss] GPFS autoload - wait for IB ports to become active Sent by: gpfsug-discuss-bounces at spectrumscale.org Until IBM provides a solution, here is my workaround. Add it so it runs before the gpfs script, I call it from our custom xcat diskless boot scripts. Based on rhel7, not fully systemd integrated. YMMV! Regards, ? ddj ??- [ddj at storage041 ~]$ cat /etc/init.d/ibready #! /bin/bash # # chkconfig: 2345 06 94 # /etc/rc.d/init.d/ibready # written in 2016 David D Johnson (ddj brown.edu) # ### BEGIN INIT INFO # Provides: ibready # Required-Start: # Required-Stop: # Default-Stop: # Description: Block until infiniband is ready # Short-Description: Block until infiniband is ready ### END INIT INFO RETVAL=0 if [[ -d /sys/class/infiniband ]] then IBDEVICE=$(dirname $(grep -il infiniband /sys/class/infiniband/*/ports/1/link* | head -n 1)) fi # See how we were called. case "$1" in start) if [[ -n $IBDEVICE && -f $IBDEVICE/state ]] then echo -n "Polling for InfiniBand link up: " for (( count = 60; count > 0; count-- )) do if grep -q ACTIVE $IBDEVICE/state then echo ACTIVE break fi echo -n "." sleep 5 done if (( count <= 0 )) then echo DOWN - $0 timed out fi fi ;; stop|restart|reload|force-reload|condrestart|try-restart) ;; status) if [[ -n $IBDEVICE && -f $IBDEVICE/state ]] then echo "$IBDEVICE is $(< $IBDEVICE/state) $(< $IBDEVICE/rate)" else echo "No IBDEVICE found" fi ;; *) echo "Usage: ibready {start|stop|status|restart|reload|force-reload|condrestart|try-restart}" exit 2 esac exit ${RETVAL} ???? -- ddj Dave Johnson On Mar 8, 2018, at 6:10 AM, Caubet Serrabou Marc (PSI) wrote: Hi all, with autoload = yes we do not ensure that GPFS will be started after the IB link becomes up. Is there a way to force GPFS waiting to start until IB ports are up? This can be probably done by adding something like After=network-online.target and Wants=network-online.target in the systemd file but I would like to know if this is natively possible from the GPFS configuration. 
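(One way to get the ordering asked about here without touching GPFS itself is a small systemd drop-in that makes the GPFS unit wait for an "IB ready" oneshot service. The sketch below assumes the GPFS systemd unit is named gpfs.service on this distribution and that /usr/local/sbin/wait-for-ib is a hypothetical polling helper along the lines of the ibready init script quoted earlier in the thread -- both are assumptions to verify locally.)

--------------------------------------------------------------------------
#!/bin/bash
# Sketch only: order GPFS startup after an "InfiniBand ready" oneshot unit.

cat > /etc/systemd/system/ibready.service <<'EOF'
[Unit]
Description=Block until an InfiniBand port is ACTIVE
Before=gpfs.service

[Service]
Type=oneshot
ExecStart=/usr/local/sbin/wait-for-ib
RemainAfterExit=yes

[Install]
WantedBy=multi-user.target
EOF

mkdir -p /etc/systemd/system/gpfs.service.d
cat > /etc/systemd/system/gpfs.service.d/ibready.conf <<'EOF'
[Unit]
Requires=ibready.service
After=ibready.service
EOF

systemctl daemon-reload
systemctl enable ibready.service
--------------------------------------------------------------------------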
Thanks a lot, Marc _________________________________________ Paul Scherrer Institut High Performance Computing Marc Caubet Serrabou WHGA/036 5232 Villigen PSI Switzerland Telephone: +41 56 310 46 67 E-Mail: marc.caubet at psi.ch _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org https://urldefense.proofpoint.com/v2/url?u=http-3A__gpfsug.org_mailman_listinfo_gpfsug-2Ddiscuss&d=DwICAg&c=jf_iaSHvJObTbx-siA1ZOg&r=p_1XEUyoJ7-VJxF_w8h9gJh8_Wj0Pey73LCLLoxodpw&m=u-EMob09-dkE6jZbD3dTjBi3vWhmDXtxiOK3nqFyIgY&s=JCfJgq6pZnKUI6d-rIgJXVcdZh7vmA5ypB1_goP_FFA&e= _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org https://urldefense.proofpoint.com/v2/url?u=http-3A__gpfsug.org_mailman_listinfo_gpfsug-2Ddiscuss&d=DwICAg&c=jf_iaSHvJObTbx-siA1ZOg&r=p_1XEUyoJ7-VJxF_w8h9gJh8_Wj0Pey73LCLLoxodpw&m=79jdzLLNtYEi36P6EifUd1cEI2GcLu2QWCwYwln12xg&s=AgoxRgQ2Ht0ZWCfogYsyg72RZn33CfTEyW7h1JQWRrM&e= -------------- next part -------------- An HTML attachment was scrubbed... URL: From olaf.weiser at de.ibm.com Thu Mar 15 17:30:49 2018 From: olaf.weiser at de.ibm.com (Olaf Weiser) Date: Thu, 15 Mar 2018 18:30:49 +0100 Subject: [gpfsug-discuss] GPFS autoload - wait for IB ports tobecomeactive In-Reply-To: References: <0081EB235765E14395278B9AE1DF341846510A@MBX214.d.ethz.ch><4AD44D34-5275-4ADB-8CC7-8E80170DDA7F@brown.edu> Message-ID: An HTML attachment was scrubbed... URL: From chris.schlipalius at pawsey.org.au Fri Mar 16 06:11:39 2018 From: chris.schlipalius at pawsey.org.au (Chris Schlipalius) Date: Fri, 16 Mar 2018 14:11:39 +0800 Subject: [gpfsug-discuss] Reminder, 2018 March 26th Singapore Spectrum Scale User Group event is on soon. In-Reply-To: <54D1282A-175F-42AD-8BCA-DDA48326B80C@pawsey.org.au> References: <54D1282A-175F-42AD-8BCA-DDA48326B80C@pawsey.org.au> Message-ID: <988B0149-D942-41AD-93B9-E9A0ACAF7D9F@pawsey.org.au> Hello, This is a reminder for the the inaugural Spectrum Scale Usergroup Singapore on Monday 26th March 2018, Sentosa, Singapore. This event occurs just before SCA18 starts and is being held in conjunction with SCA18 https://sc-asia.org/ All current Singapore Spectrum Scale User Group event details can be found here: http://goo.gl/dXtqvS Feel free to circulate this event link to all that may need it. Please reserve your tickets now as tickets for places will close soon. There are some great speakers and topics, for details please see the agenda on Eventbrite. We are looking forwards to a great new Usergroup in a fabulous venue. Thanks again to NSCC and IBM for helping to arrange the venue and event booking. 
Regards, Chris Schlipalius IBM Champion 2018 Team Lead, Storage Infrastructure, Data & Visualisation, The Pawsey Supercomputing Centre (CSIRO) 13 Burvill Court Kensington WA 6151 Australia Tel +61 8 6436 8815 Email chris.schlipalius at pawsey.org.au Web www.pawsey.org.au From janfrode at tanso.net Fri Mar 16 08:29:59 2018 From: janfrode at tanso.net (Jan-Frode Myklebust) Date: Fri, 16 Mar 2018 09:29:59 +0100 Subject: [gpfsug-discuss] GPFS autoload - wait for IB ports tobecomeactive In-Reply-To: References: <0081EB235765E14395278B9AE1DF341846510A@MBX214.d.ethz.ch> <4AD44D34-5275-4ADB-8CC7-8E80170DDA7F@brown.edu> Message-ID: Thanks Olaf, but we don't use NetworkManager on this cluster.. I now created this simple script: ------------------------------------------------------------------------------------------------------------------------------------------------------------- #! /bin/bash - # # Fail mmstartup if not all configured IB ports are active. # # Install with: # # mmaddcallback fail-if-ibfail --command /var/mmfs/etc/fail-if-ibfail --event preStartup --sync --onerror shutdown # for port in $(/usr/lpp/mmfs/bin/mmdiag --config|grep verbsPorts | cut -f 4- -d " ") do grep -q ACTIVE /sys/class/infiniband/${port%/*}/ports/${port##*/}/state || exit 1 done ------------------------------------------------------------------------------------------------------------------------------------------------------------- which I haven't tested, but assume should work. Suggestions for improvements would be much appreciated! -jf On Thu, Mar 15, 2018 at 6:30 PM, Olaf Weiser wrote: > > you can try : > systemctl enable NetworkManager-wait-online > ln -s '/usr/lib/systemd/system/NetworkManager-wait-online.service' > '/etc/systemd/system/multi-user.target.wants/NetworkManager-wait-online. > service' > > in many cases .. it helps .. > > > > > > From: Jan-Frode Myklebust > To: gpfsug main discussion list > Date: 03/15/2018 06:18 PM > Subject: Re: [gpfsug-discuss] GPFS autoload - wait for IB ports to > becomeactive > Sent by: gpfsug-discuss-bounces at spectrumscale.org > ------------------------------ > > > > I found some discussion on this at > *https://www.ibm.com/developerworks/community/forums/html/threadTopic?id=77777777-0000-0000-0000-000014471957&ps=25* > and > there it's claimed that none of the callback events are early enough to > resolve this. That we need a pre-preStartup trigger. Any idea if this has > changed -- or is the callback option then only to do a "--onerror > shutdown" if it has failed to connect IB ? > > > On Thu, Mar 8, 2018 at 1:42 PM, Frederick Stock <*stockf at us.ibm.com* > > wrote: > You could also use the GPFS prestartup callback (mmaddcallback) to execute > a script synchronously that waits for the IB ports to become available > before returning and allowing GPFS to continue. Not systemd integrated but > it should work. > > Fred > __________________________________________________ > Fred Stock | IBM Pittsburgh Lab | *720-430-8821* <(720)%20430-8821> > *stockf at us.ibm.com* > > > > From: *david_johnson at brown.edu* > To: gpfsug main discussion list <*gpfsug-discuss at spectrumscale.org* > > > Date: 03/08/2018 07:34 AM > Subject: Re: [gpfsug-discuss] GPFS autoload - wait for IB ports to > become active > Sent by: *gpfsug-discuss-bounces at spectrumscale.org* > > ------------------------------ > > > > > Until IBM provides a solution, here is my workaround. Add it so it runs > before the gpfs script, I call it from our custom xcat diskless boot > scripts. 
Based on rhel7, not fully systemd integrated. YMMV! > > Regards, > ? ddj > ??- > [ddj at storage041 ~]$ cat /etc/init.d/ibready > #! /bin/bash > # > # chkconfig: 2345 06 94 > # /etc/rc.d/init.d/ibready > # written in 2016 David D Johnson (ddj *brown.edu* > > ) > # > ### BEGIN INIT INFO > # Provides: ibready > # Required-Start: > # Required-Stop: > # Default-Stop: > # Description: Block until infiniband is ready > # Short-Description: Block until infiniband is ready > ### END INIT INFO > > RETVAL=0 > if [[ -d /sys/class/infiniband ]] > then > IBDEVICE=$(dirname $(grep -il infiniband > /sys/class/infiniband/*/ports/1/link* | head -n 1)) > fi > # See how we were called. > case "$1" in > start) > if [[ -n $IBDEVICE && -f $IBDEVICE/state ]] > then > echo -n "Polling for InfiniBand link up: " > for (( count = 60; count > 0; count-- )) > do > if grep -q ACTIVE $IBDEVICE/state > then > echo ACTIVE > break > fi > echo -n "." > sleep 5 > done > if (( count <= 0 )) > then > echo DOWN - $0 timed out > fi > fi > ;; > stop|restart|reload|force-reload|condrestart|try-restart) > ;; > status) > if [[ -n $IBDEVICE && -f $IBDEVICE/state ]] > then > echo "$IBDEVICE is $(< $IBDEVICE/state) $(< > $IBDEVICE/rate)" > else > echo "No IBDEVICE found" > fi > ;; > *) > echo "Usage: ibready {start|stop|status|restart| > reload|force-reload|condrestart|try-restart}" > exit 2 > esac > exit ${RETVAL} > ???? > > -- ddj > Dave Johnson > > On Mar 8, 2018, at 6:10 AM, Caubet Serrabou Marc (PSI) < > *marc.caubet at psi.ch* > wrote: > > Hi all, > > with autoload = yes we do not ensure that GPFS will be started after the > IB link becomes up. Is there a way to force GPFS waiting to start until IB > ports are up? This can be probably done by adding something like > After=network-online.target and Wants=network-online.target in the systemd > file but I would like to know if this is natively possible from the GPFS > configuration. > > Thanks a lot, > Marc > _________________________________________ > Paul Scherrer Institut > High Performance Computing > Marc Caubet Serrabou > WHGA/036 > 5232 Villigen PSI > Switzerland > > Telephone: *+41 56 310 46 67* <+41%2056%20310%2046%2067> > E-Mail: *marc.caubet at psi.ch* > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at *spectrumscale.org* > > *http://gpfsug.org/mailman/listinfo/gpfsug-discuss* > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at *spectrumscale.org* > > *https://urldefense.proofpoint.com/v2/url?u=http-3A__gpfsug.org_mailman_listinfo_gpfsug-2Ddiscuss&d=DwICAg&c=jf_iaSHvJObTbx-siA1ZOg&r=p_1XEUyoJ7-VJxF_w8h9gJh8_Wj0Pey73LCLLoxodpw&m=u-EMob09-dkE6jZbD3dTjBi3vWhmDXtxiOK3nqFyIgY&s=JCfJgq6pZnKUI6d-rIgJXVcdZh7vmA5ypB1_goP_FFA&e=* > > > > > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at *spectrumscale.org* > *http://gpfsug.org/mailman/listinfo/gpfsug-discuss* > > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > > > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From YARD at il.ibm.com Fri Mar 16 08:46:37 2018 From: YARD at il.ibm.com (Yaron Daniel) Date: Fri, 16 Mar 2018 10:46:37 +0200 Subject: [gpfsug-discuss] Underlying LUN mirroring NSD impact In-Reply-To: References: Message-ID: Hi You can have few options: 1) Active/Active GPFS sites - with sync replication of the storage - take into account the latency you have. 2) Active/StandBy Gpfs sites- with a-sync replication of the storage. All info can be found at : https://www.ibm.com/support/knowledgecenter/en/STXKQY_4.2.1/com.ibm.spectrum.scale.v4r21.doc/bl1adv_continous_replication_SSdata.htm Synchronous mirroring with GPFS replication In a configuration utilizing GPFS? replication, a single GPFS cluster is defined over three geographically-separate sites consisting of two production sites and a third tiebreaker site. One or more file systems are created, mounted, and accessed concurrently from the two active production sites. Synchronous mirroring utilizing storage based replication This topic describes synchronous mirroring utilizing storage-based replication. Point In Time Copy of IBM Spectrum Scale data Most storage systems provides functionality to make a point-in-time copy of data as an online backup mechanism. This function provides an instantaneous copy of the original data on the target disk, while the actual copy of data takes place asynchronously and is fully transparent to the user. Regards Yaron Daniel 94 Em Ha'Moshavot Rd Storage Architect Petach Tiqva, 49527 IBM Global Markets, Systems HW Sales Israel Phone: +972-3-916-5672 Fax: +972-3-916-5672 Mobile: +972-52-8395593 e-mail: yard at il.ibm.com IBM Israel From: Mark Bush To: "gpfsug-discuss at spectrumscale.org" Date: 03/14/2018 10:10 PM Subject: [gpfsug-discuss] Underlying LUN mirroring NSD impact Sent by: gpfsug-discuss-bounces at spectrumscale.org Is it possible (albeit not advisable) to mirror LUNs that are NSD?s to another storage array in another site basically for DR purposes? Once it?s mirrored to a new cluster elsewhere what would be the step to get the filesystem back up and running. I know that AFM-DR is meant for this but in this case my client only has Standard edition and has mirroring software purchased with the underlying disk array. Is this even doable? Mark _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org https://urldefense.proofpoint.com/v2/url?u=http-3A__gpfsug.org_mailman_listinfo_gpfsug-2Ddiscuss&d=DwICAg&c=jf_iaSHvJObTbx-siA1ZOg&r=Bn1XE9uK2a9CZQ8qKnJE3Q&m=c9HNr6pLit8n4hQKpcYyyRg9ZnITpo_2OiEx6hbukYA&s=qFgC1ebi1SJvnCRlc92cI4hZqZYpK7EneZ0Sati5s5E&e= -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: not available Type: image/gif Size: 1851 bytes Desc: not available URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: not available Type: image/gif Size: 4376 bytes Desc: not available URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: not available Type: image/gif Size: 5093 bytes Desc: not available URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: not available Type: image/gif Size: 4746 bytes Desc: not available URL: -------------- next part -------------- A non-text attachment was scrubbed... 
Name: not available Type: image/gif Size: 4557 bytes Desc: not available URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: not available Type: image/jpeg Size: 11294 bytes Desc: not available URL: From stockf at us.ibm.com Fri Mar 16 12:05:29 2018 From: stockf at us.ibm.com (Frederick Stock) Date: Fri, 16 Mar 2018 07:05:29 -0500 Subject: [gpfsug-discuss] GPFS autoload - wait for IB portstobecomeactive In-Reply-To: References: <0081EB235765E14395278B9AE1DF341846510A@MBX214.d.ethz.ch><4AD44D34-5275-4ADB-8CC7-8E80170DDA7F@brown.edu> Message-ID: I have my doubts that mmdiag can be used in this script. In general the guidance is to avoid or be very careful with mm* commands in a callback due to the potential for deadlock. Fred __________________________________________________ Fred Stock | IBM Pittsburgh Lab | 720-430-8821 stockf at us.ibm.com From: Jan-Frode Myklebust To: gpfsug main discussion list Date: 03/16/2018 04:30 AM Subject: Re: [gpfsug-discuss] GPFS autoload - wait for IB ports tobecomeactive Sent by: gpfsug-discuss-bounces at spectrumscale.org Thanks Olaf, but we don't use NetworkManager on this cluster.. I now created this simple script: ------------------------------------------------------------------------------------------------------------------------------------------------------------- #! /bin/bash - # # Fail mmstartup if not all configured IB ports are active. # # Install with: # # mmaddcallback fail-if-ibfail --command /var/mmfs/etc/fail-if-ibfail --event preStartup --sync --onerror shutdown # for port in $(/usr/lpp/mmfs/bin/mmdiag --config|grep verbsPorts | cut -f 4- -d " ") do grep -q ACTIVE /sys/class/infiniband/${port%/*}/ports/${port##*/}/state || exit 1 done ------------------------------------------------------------------------------------------------------------------------------------------------------------- which I haven't tested, but assume should work. Suggestions for improvements would be much appreciated! -jf On Thu, Mar 15, 2018 at 6:30 PM, Olaf Weiser wrote: you can try : systemctl enable NetworkManager-wait-online ln -s '/usr/lib/systemd/system/NetworkManager-wait-online.service' '/etc/systemd/system/multi-user.target.wants/NetworkManager-wait-online.service' in many cases .. it helps .. From: Jan-Frode Myklebust To: gpfsug main discussion list Date: 03/15/2018 06:18 PM Subject: Re: [gpfsug-discuss] GPFS autoload - wait for IB ports to becomeactive Sent by: gpfsug-discuss-bounces at spectrumscale.org I found some discussion on this at https://www.ibm.com/developerworks/community/forums/html/threadTopic?id=77777777-0000-0000-0000-000014471957&ps=25 and there it's claimed that none of the callback events are early enough to resolve this. That we need a pre-preStartup trigger. Any idea if this has changed -- or is the callback option then only to do a "--onerror shutdown" if it has failed to connect IB ? On Thu, Mar 8, 2018 at 1:42 PM, Frederick Stock wrote: You could also use the GPFS prestartup callback (mmaddcallback) to execute a script synchronously that waits for the IB ports to become available before returning and allowing GPFS to continue. Not systemd integrated but it should work. 
Fred __________________________________________________ Fred Stock | IBM Pittsburgh Lab | 720-430-8821 stockf at us.ibm.com From: david_johnson at brown.edu To: gpfsug main discussion list Date: 03/08/2018 07:34 AM Subject: Re: [gpfsug-discuss] GPFS autoload - wait for IB ports to become active Sent by: gpfsug-discuss-bounces at spectrumscale.org Until IBM provides a solution, here is my workaround. Add it so it runs before the gpfs script, I call it from our custom xcat diskless boot scripts. Based on rhel7, not fully systemd integrated. YMMV! Regards, ? ddj ??- [ddj at storage041 ~]$ cat /etc/init.d/ibready #! /bin/bash # # chkconfig: 2345 06 94 # /etc/rc.d/init.d/ibready # written in 2016 David D Johnson (ddj brown.edu) # ### BEGIN INIT INFO # Provides: ibready # Required-Start: # Required-Stop: # Default-Stop: # Description: Block until infiniband is ready # Short-Description: Block until infiniband is ready ### END INIT INFO RETVAL=0 if [[ -d /sys/class/infiniband ]] then IBDEVICE=$(dirname $(grep -il infiniband /sys/class/infiniband/*/ports/1/link* | head -n 1)) fi # See how we were called. case "$1" in start) if [[ -n $IBDEVICE && -f $IBDEVICE/state ]] then echo -n "Polling for InfiniBand link up: " for (( count = 60; count > 0; count-- )) do if grep -q ACTIVE $IBDEVICE/state then echo ACTIVE break fi echo -n "." sleep 5 done if (( count <= 0 )) then echo DOWN - $0 timed out fi fi ;; stop|restart|reload|force-reload|condrestart|try-restart) ;; status) if [[ -n $IBDEVICE && -f $IBDEVICE/state ]] then echo "$IBDEVICE is $(< $IBDEVICE/state) $(< $IBDEVICE/rate)" else echo "No IBDEVICE found" fi ;; *) echo "Usage: ibready {start|stop|status|restart|reload|force-reload|condrestart|try-restart}" exit 2 esac exit ${RETVAL} ???? -- ddj Dave Johnson On Mar 8, 2018, at 6:10 AM, Caubet Serrabou Marc (PSI) wrote: Hi all, with autoload = yes we do not ensure that GPFS will be started after the IB link becomes up. Is there a way to force GPFS waiting to start until IB ports are up? This can be probably done by adding something like After=network-online.target and Wants=network-online.target in the systemd file but I would like to know if this is natively possible from the GPFS configuration. 
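For illustration, that kind of override would be a small drop-in rather than an edit of the shipped unit file -- the gpfs.service unit name and path here are assumptions, so check what your installation actually uses:

# /etc/systemd/system/gpfs.service.d/ib-wait.conf   (assumed path/unit name)
[Unit]
After=network-online.target
Wants=network-online.target

followed by "systemctl daemon-reload". Note that network-online.target only guarantees that IP networking is up, not that the IB ports have reached ACTIVE, which is why the callback and init-script approaches above poll /sys/class/infiniband directly.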
Thanks a lot, Marc _________________________________________ Paul Scherrer Institut High Performance Computing Marc Caubet Serrabou WHGA/036 5232 Villigen PSI Switzerland Telephone: +41 56 310 46 67 E-Mail: marc.caubet at psi.ch _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org https://urldefense.proofpoint.com/v2/url?u=http-3A__gpfsug.org_mailman_listinfo_gpfsug-2Ddiscuss&d=DwICAg&c=jf_iaSHvJObTbx-siA1ZOg&r=p_1XEUyoJ7-VJxF_w8h9gJh8_Wj0Pey73LCLLoxodpw&m=u-EMob09-dkE6jZbD3dTjBi3vWhmDXtxiOK3nqFyIgY&s=JCfJgq6pZnKUI6d-rIgJXVcdZh7vmA5ypB1_goP_FFA&e= _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org https://urldefense.proofpoint.com/v2/url?u=http-3A__gpfsug.org_mailman_listinfo_gpfsug-2Ddiscuss&d=DwICAg&c=jf_iaSHvJObTbx-siA1ZOg&r=p_1XEUyoJ7-VJxF_w8h9gJh8_Wj0Pey73LCLLoxodpw&m=xImYTxt4pm1o5znVn5Vdoka2uxgsTRpmlCGdEWhB9vw&s=veOZZz80aBzoCTKusx6WOpVlYs64eNkp5pM9kbHgvic&e= -------------- next part -------------- An HTML attachment was scrubbed... URL: From Tomasz.Wolski at ts.fujitsu.com Fri Mar 16 14:25:52 2018 From: Tomasz.Wolski at ts.fujitsu.com (Tomasz.Wolski at ts.fujitsu.com) Date: Fri, 16 Mar 2018 14:25:52 +0000 Subject: [gpfsug-discuss] [item 435]: GPFS fails to start - NSD thread configuration needs more threads Message-ID: <1843bf34db594a0b8f7c85d52b1e0e28@R01UKEXCASM223.r01.fujitsu.local> Hello GPFS Team, We are observing strange behavior of GPFS during startup on SLES12 node. In our test cluster, we reinstalled VLP1 node with SLES 12 SP3 as a base and when GPFS starts for the first time on this node, it complains about too little NSD threads: .. 2018-03-16_13:11:28.947+0100: GPFS: 6027-310 [I] mmfsd initializing. {Version: 4.2.3.7 Built: Feb 15 2018 11:38:38} ... 2018-03-16_13:11:28.947+0100: [I] Cleaning old shared memory ... 2018-03-16_13:11:28.947+0100: [I] First pass parsing mmfs.cfg ... .. 2018-03-16_13:11:29.375+0100: [I] Initializing the cluster manager ... 2018-03-16_13:11:29.523+0100: [I] Initializing the token manager ... 2018-03-16_13:11:29.524+0100: [I] Initializing network shared disks ... 2018-03-16_13:11:29.626+0100: [E] NSD thread configuration needs 413 more threads, exceeds max thread count 1024 2018-03-16_13:11:29.628+0100: GPFS: 6027-311 [N] mmfsd is shutting down. 
2018-03-16_13:11:29.628+0100: [N] Reason for shutdown: Could not initialize network shared disks 2018-03-16_13:11:29.633+0100: [E] processStart: fork: err 11 2018-03-16_13:11:30.701+0100: runmmfs starting Removing old /var/adm/ras/mmfs.log.* files: 2018-03-16_13:11:30.713+0100 runmmfs: respawn 32 waiting 336 seconds before restarting mmfsd 2018-03-16_13:13:13.298+0100: [I] Calling user exit script mmSdrBackup: event mmSdrBackup, async command /var/mmfs/etc/mmsdrbackup GPFS starts loop and tries to respawn mmfsd periodically: 2018-03-16_13:11:30.713+0100 runmmfs: respawn 32 waiting 336 seconds before restarting mmfsd It seems that this issue can be resolved by doing mmshutdown. Later, when we manually perform mmstartup the problem is gone. We are running GPFS 4.2.3.7 and all nodes except VLP1 are running SLES11 SP4. Only on VLP1 we installed SLES12 SP3. The test cluster looks as below: Node Daemon node name IP address Admin node name Designation ----------------------------------------------------------------------- 1 VLP0.cs-intern 192.168.101.210 VLP0.cs-intern quorum-manager-snmp_collector 2 VLP1.cs-intern 192.168.101.211 VLP1.cs-intern quorum-manager 3 TBP0.cs-intern 192.168.101.215 TBP0.cs-intern quorum 4 IDP0.cs-intern 192.168.101.110 IDP0.cs-intern 5 IDP1.cs-intern 192.168.101.111 IDP1.cs-intern 6 IDP2.cs-intern 192.168.101.112 IDP2.cs-intern 7 IDP3.cs-intern 192.168.101.113 IDP3.cs-intern 8 ICP0.cs-intern 192.168.101.10 ICP0.cs-intern 9 ICP1.cs-intern 192.168.101.11 ICP1.cs-intern 10 ICP2.cs-intern 192.168.101.12 ICP2.cs-intern 11 ICP3.cs-intern 192.168.101.13 ICP3.cs-intern 12 ICP4.cs-intern 192.168.101.14 ICP4.cs-intern 13 ICP5.cs-intern 192.168.101.15 ICP5.cs-intern We have enabled traces and reproduced the issue as follows: 1. When GPFS daemon was in a respawn loop, we have started traces, all files from this period you can find in uploaded archive under 1_nsd_threads_problem directory 2. We have manually stopped the "respawn" loop on VLP1 by executing mmshutdown and start GPFS manually by mmstartup. All traces from this execution can be found in archive file under 2_mmshutdown_mmstartup directory All data related to this problem is uploaded to our ftp to file: ftp.ts.fujitsu.com/CS-Diagnose/IBM, (fe_cs_oem, 12Monkeys) item435_nsd_threads.tar.gz Could you please have a look at this problem? Best regards, Tomasz Wolski -------------- next part -------------- An HTML attachment was scrubbed... URL: From aaron.s.knister at nasa.gov Fri Mar 16 14:52:11 2018 From: aaron.s.knister at nasa.gov (Aaron Knister) Date: Fri, 16 Mar 2018 10:52:11 -0400 Subject: [gpfsug-discuss] [item 435]: GPFS fails to start - NSD thread configuration needs more threads In-Reply-To: <1843bf34db594a0b8f7c85d52b1e0e28@R01UKEXCASM223.r01.fujitsu.local> References: <1843bf34db594a0b8f7c85d52b1e0e28@R01UKEXCASM223.r01.fujitsu.local> Message-ID: <46b43ff1-a8b4-649f-0304-ccae73d5851b@nasa.gov> Ah. You, my friend, have been struck by a smooth criminal. And by smooth criminal I mean systemd. I ran into this last week and spent many hours banging my head against the wall trying to figure it out. systemd by default limits cgroups to I think 512 tasks and since a thread counts as a task that's likely what you're running into. Try setting DefaultTasksMax=infinity in /etc/systemd/system.conf and then reboot (and yes, I mean reboot. changing it live doesn't seem possible because of the infinite wisdom of the systemd developers). 
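A quick way to check whether this limit is what you are hitting -- the gpfs.service unit name and the cgroup v1 paths below are assumptions, adjust them for however mmfsd is started on your system:

# task limit systemd applies to the unit
systemctl show gpfs.service -p TasksMax
# limit and current usage in the pids cgroup for that unit
cat /sys/fs/cgroup/pids/system.slice/gpfs.service/pids.max
cat /sys/fs/cgroup/pids/system.slice/gpfs.service/pids.current
# the system-wide default (only takes effect after a reboot, as above)
grep -i '^DefaultTasksMax' /etc/systemd/system.conf

If pids.current is sitting near pids.max just before the daemon dies, then raising DefaultTasksMax (or adding a per-unit TasksMax= drop-in) is the fix.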
The pid limit of a given slice/unit cgroup may already be overriden to something more reasonable than the 512 default so if, for example, you were logging in and startng it via ssh the limit may be different than if its started from the gpfs.service unit because mmfsd effectively is running in different cgroups in each case. Hope that helps! -Aaron On 3/16/18 10:25 AM, Tomasz.Wolski at ts.fujitsu.com wrote: > Hello GPFS Team, > > We are observing strange behavior of GPFS during startup on SLES12 node. > > In our test cluster, we reinstalled VLP1 node with SLES 12 SP3 as a base > and when GPFS starts for the first time on this node, it complains about > > too little NSD threads: > > .. > > 2018-03-16_13:11:28.947+0100: GPFS: 6027-310 [I] mmfsd initializing. > {Version: 4.2.3.7?? Built: Feb 15 2018 11:38:38} ... > > 2018-03-16_13:11:28.947+0100: [I] Cleaning old shared memory ... > > 2018-03-16_13:11:28.947+0100: [I] First pass parsing mmfs.cfg ... > > .. > > 2018-03-16_13:11:29.375+0100: [I] Initializing the cluster manager ... > > 2018-03-16_13:11:29.523+0100: [I] Initializing the token manager ... > > 2018-03-16_13:11:29.524+0100: [I] Initializing network shared disks ... > > *_2018-03-16_13:11:29.626+0100: [E] NSD thread configuration needs 413 > more threads, exceeds max thread count 1024_* > > 2018-03-16_13:11:29.628+0100: GPFS: 6027-311 [N] mmfsd is shutting down. > > 2018-03-16_13:11:29.628+0100: [N] Reason for shutdown: Could not > initialize network shared disks > > 2018-03-16_13:11:29.633+0100: [E] processStart: fork: err 11 > > 2018-03-16_13:11:30.701+0100: runmmfs starting > > Removing old /var/adm/ras/mmfs.log.* files: > > 2018-03-16_13:11:30.713+0100 runmmfs: respawn 32 waiting 336 seconds > before restarting mmfsd > > 2018-03-16_13:13:13.298+0100: [I] Calling user exit script mmSdrBackup: > event mmSdrBackup, async command /var/mmfs/etc/mmsdrbackup > > GPFS starts loop and tries to respawn mmfsd periodically: > > *_2018-03-16_13:11:30.713+0100 runmmfs: respawn 32 waiting 336 seconds > before restarting mmfsd_* > > It seems that this issue can be resolved by doing mmshutdown. Later, > when we manually perform mmstartup the problem is gone. > > We are running GPFS 4.2.3.7 and all nodes except VLP1 are running SLES11 > SP4. Only on VLP1 we installed SLES12 SP3. > > The test cluster looks as below: > > Node? Daemon node name? IP address?????? Admin node name? Designation > > ----------------------------------------------------------------------- > > ?? 1?? VLP0.cs-intern??? 192.168.101.210? VLP0.cs-intern > quorum-manager-snmp_collector > > ?? 2?? VLP1.cs-intern??? 192.168.101.211? VLP1.cs-intern?? quorum-manager > > ?? 3?? TBP0.cs-intern??? 192.168.101.215? TBP0.cs-intern?? quorum > > ?? 4?? IDP0.cs-intern??? 192.168.101.110? IDP0.cs-intern > > ?? 5?? IDP1.cs-intern??? 192.168.101.111? IDP1.cs-intern > > ?? 6?? IDP2.cs-intern??? 192.168.101.112? IDP2.cs-intern > > ?? 7?? IDP3.cs-intern??? 192.168.101.113? IDP3.cs-intern > > ?? 8?? ICP0.cs-intern??? 192.168.101.10?? ICP0.cs-intern > > ?? 9?? ICP1.cs-intern??? 192.168.101.11?? ICP1.cs-intern > > ? 10?? ICP2.cs-intern??? 192.168.101.12?? ICP2.cs-intern > > ? 11?? ICP3.cs-intern??? 192.168.101.13?? ICP3.cs-intern > > ? 12?? ICP4.cs-intern??? 192.168.101.14?? ICP4.cs-intern > > ? 13?? ICP5.cs-intern??? 192.168.101.15?? 
ICP5.cs-intern > > We have enabled traces and reproduced the issue as follows: > > 1.When GPFS daemon was in a respawn loop, we have started traces, all > files from this period you can find in uploaded archive under > *_1_nsd_threads_problem_* directory > > 2.We have manually stopped the ?respawn? loop on VLP1 by executing > mmshutdown and start GPFS manually by mmstartup. All traces from this > execution can be found in archive file under *_2_mmshutdown_mmstartup > _*directory > > All data related to this problem is uploaded to our ftp to file: > > ftp.ts.fujitsu.com/CS-Diagnose/IBM > , (fe_cs_oem, 12Monkeys) > item435_nsd_threads.tar.gz > > Could you please have a look at this problem? > > Best regards, > > Tomasz Wolski > > > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > -- Aaron Knister NASA Center for Climate Simulation (Code 606.2) Goddard Space Flight Center (301) 286-2776 From Tomasz.Wolski at ts.fujitsu.com Fri Mar 16 15:01:08 2018 From: Tomasz.Wolski at ts.fujitsu.com (Tomasz.Wolski at ts.fujitsu.com) Date: Fri, 16 Mar 2018 15:01:08 +0000 Subject: [gpfsug-discuss] [item 435]: GPFS fails to start - NSD thread configuration needs more threads In-Reply-To: <46b43ff1-a8b4-649f-0304-ccae73d5851b@nasa.gov> References: <1843bf34db594a0b8f7c85d52b1e0e28@R01UKEXCASM223.r01.fujitsu.local> <46b43ff1-a8b4-649f-0304-ccae73d5851b@nasa.gov> Message-ID: <679be18ca4ea4a29b0ba8cb5f49d0f1b@R01UKEXCASM223.r01.fujitsu.local> Hi Aaron, Thanks for the hint! :) Best regards, Tomasz Wolski > -----Original Message----- > From: gpfsug-discuss-bounces at spectrumscale.org [mailto:gpfsug-discuss- > bounces at spectrumscale.org] On Behalf Of Aaron Knister > Sent: Friday, March 16, 2018 3:52 PM > To: gpfsug-discuss at spectrumscale.org > Subject: Re: [gpfsug-discuss] [item 435]: GPFS fails to start - NSD thread > configuration needs more threads > > Ah. You, my friend, have been struck by a smooth criminal. And by smooth > criminal I mean systemd. I ran into this last week and spent many hours > banging my head against the wall trying to figure it out. > > systemd by default limits cgroups to I think 512 tasks and since a thread > counts as a task that's likely what you're running into. > > Try setting DefaultTasksMax=infinity in /etc/systemd/system.conf and then > reboot (and yes, I mean reboot. changing it live doesn't seem possible > because of the infinite wisdom of the systemd developers). > > The pid limit of a given slice/unit cgroup may already be overriden to > something more reasonable than the 512 default so if, for example, you > were logging in and startng it via ssh the limit may be different than if its > started from the gpfs.service unit because mmfsd effectively is running in > different cgroups in each case. > > Hope that helps! > > -Aaron > > On 3/16/18 10:25 AM, Tomasz.Wolski at ts.fujitsu.com wrote: > > Hello GPFS Team, > > > > We are observing strange behavior of GPFS during startup on SLES12 node. > > > > In our test cluster, we reinstalled VLP1 node with SLES 12 SP3 as a > > base and when GPFS starts for the first time on this node, it > > complains about > > > > too little NSD threads: > > > > .. > > > > 2018-03-16_13:11:28.947+0100: GPFS: 6027-310 [I] mmfsd initializing. > > {Version: 4.2.3.7?? Built: Feb 15 2018 11:38:38} ... > > > > 2018-03-16_13:11:28.947+0100: [I] Cleaning old shared memory ... 
> > > > 2018-03-16_13:11:28.947+0100: [I] First pass parsing mmfs.cfg ... > > > > .. > > > > 2018-03-16_13:11:29.375+0100: [I] Initializing the cluster manager ... > > > > 2018-03-16_13:11:29.523+0100: [I] Initializing the token manager ... > > > > 2018-03-16_13:11:29.524+0100: [I] Initializing network shared disks ... > > > > *_2018-03-16_13:11:29.626+0100: [E] NSD thread configuration needs 413 > > more threads, exceeds max thread count 1024_* > > > > 2018-03-16_13:11:29.628+0100: GPFS: 6027-311 [N] mmfsd is shutting > down. > > > > 2018-03-16_13:11:29.628+0100: [N] Reason for shutdown: Could not > > initialize network shared disks > > > > 2018-03-16_13:11:29.633+0100: [E] processStart: fork: err 11 > > > > 2018-03-16_13:11:30.701+0100: runmmfs starting > > > > Removing old /var/adm/ras/mmfs.log.* files: > > > > 2018-03-16_13:11:30.713+0100 runmmfs: respawn 32 waiting 336 seconds > > before restarting mmfsd > > > > 2018-03-16_13:13:13.298+0100: [I] Calling user exit script mmSdrBackup: > > event mmSdrBackup, async command /var/mmfs/etc/mmsdrbackup > > > > GPFS starts loop and tries to respawn mmfsd periodically: > > > > *_2018-03-16_13:11:30.713+0100 runmmfs: respawn 32 waiting 336 > seconds > > before restarting mmfsd_* > > > > It seems that this issue can be resolved by doing mmshutdown. Later, > > when we manually perform mmstartup the problem is gone. > > > > We are running GPFS 4.2.3.7 and all nodes except VLP1 are running > > SLES11 SP4. Only on VLP1 we installed SLES12 SP3. > > > > The test cluster looks as below: > > > > Node? Daemon node name? IP address?????? Admin node name? Designation > > > > ---------------------------------------------------------------------- > > - > > > > ?? 1?? VLP0.cs-intern??? 192.168.101.210? VLP0.cs-intern > > quorum-manager-snmp_collector > > > > ?? 2?? VLP1.cs-intern??? 192.168.101.211? VLP1.cs-intern > > quorum-manager > > > > ?? 3?? TBP0.cs-intern??? 192.168.101.215? TBP0.cs-intern?? quorum > > > > ?? 4?? IDP0.cs-intern??? 192.168.101.110? IDP0.cs-intern > > > > ?? 5?? IDP1.cs-intern??? 192.168.101.111? IDP1.cs-intern > > > > ?? 6?? IDP2.cs-intern??? 192.168.101.112? IDP2.cs-intern > > > > ?? 7?? IDP3.cs-intern??? 192.168.101.113? IDP3.cs-intern > > > > ?? 8?? ICP0.cs-intern??? 192.168.101.10?? ICP0.cs-intern > > > > ?? 9?? ICP1.cs-intern??? 192.168.101.11?? ICP1.cs-intern > > > > ? 10?? ICP2.cs-intern??? 192.168.101.12?? ICP2.cs-intern > > > > ? 11?? ICP3.cs-intern??? 192.168.101.13?? ICP3.cs-intern > > > > ? 12?? ICP4.cs-intern??? 192.168.101.14?? ICP4.cs-intern > > > > ? 13?? ICP5.cs-intern??? 192.168.101.15?? ICP5.cs-intern > > > > We have enabled traces and reproduced the issue as follows: > > > > 1.When GPFS daemon was in a respawn loop, we have started traces, all > > files from this period you can find in uploaded archive under > > *_1_nsd_threads_problem_* directory > > > > 2.We have manually stopped the ?respawn? loop on VLP1 by executing > > mmshutdown and start GPFS manually by mmstartup. All traces from this > > execution can be found in archive file under > *_2_mmshutdown_mmstartup > > _*directory > > > > All data related to this problem is uploaded to our ftp to file: > > > > ftp.ts.fujitsu.com/CS-Diagnose/IBM > > , (fe_cs_oem, 12Monkeys) > > item435_nsd_threads.tar.gz > > > > Could you please have a look at this problem? 
> > > > Best regards, > > > > Tomasz Wolski > > > > > > > > _______________________________________________ > > gpfsug-discuss mailing list > > gpfsug-discuss at spectrumscale.org > > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > > > -- > Aaron Knister > NASA Center for Climate Simulation (Code 606.2) Goddard Space Flight > Center > (301) 286-2776 > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss From secretary at gpfsug.org Tue Mar 20 08:48:19 2018 From: secretary at gpfsug.org (Secretary GPFS UG) Date: Tue, 20 Mar 2018 08:48:19 +0000 Subject: [gpfsug-discuss] Upcoming meetings Message-ID: <785558aa15b26dbd44c9e22de3b13ef9@webmail.gpfsug.org> Dear members, There are a number of opportunities over the coming weeks for you to meet face to face with other group members and hear from Spectrum Scale experts. We'd love to see you at one of the events! If you plan to attend, please register: Spectrum Scale Usergroup, Singapore, March 26, https://www.eventbrite.com/e/spectrum-scale-user-group-singapore-march-2018-tickets-40429354287 [1] UK 2018 User Group Event, London, April 18 - April 19, https://www.eventbrite.com/e/spectrum-scale-gpfs-user-group-2018-registration-41489952565?aff=MailingList [2] IBM Technical University: Spectrum Scale Meet Up, London, May 14 Please email Par Hettinga par at nl.ibm.com USA 2018 Spectrum Scale User Group, Boston, May 16 - May 17, https://www.eventbrite.com/e/spectrum-scale-gpfs-user-group-us-spring-2018-meeting-tickets-43662759489?aff=mailinglist [3] Thanks for your support, Claire -- Claire O'Toole Spectrum Scale/GPFS User Group Secretary +44 (0)7508 033896 www.spectrumscaleug.org Links: ------ [1] https://www.eventbrite.com/e/spectrum-scale-user-group-singapore-march-2018-tickets-40429354287 [2] https://www.eventbrite.com/e/spectrum-scale-gpfs-user-group-2018-registration-41489952565?aff=MailingList [3] https://www.eventbrite.com/e/spectrum-scale-gpfs-user-group-us-spring-2018-meeting-tickets-43662759489?aff=mailinglist -------------- next part -------------- An HTML attachment was scrubbed... URL: From willi.engeli at id.ethz.ch Wed Mar 21 16:04:10 2018 From: willi.engeli at id.ethz.ch (Engeli Willi (ID SD)) Date: Wed, 21 Mar 2018 16:04:10 +0000 Subject: [gpfsug-discuss] CTDB RFE opened @ IBM Would like to ask for your votes Message-ID: Dear Collegues, [WE] I have missed the discussion on the CTDB upgradeability with interruption free methods. However, I hit this topic as well and some of our users where hit by the short interruption badly because of the kind of work they had running. This motivated me to open an Request for Enhancement for CTDB to support in a future release the interruption-less Upgrade. Here is the Link for the RFE: http://www.ibm.com/developerworks/rfe/execute?use_case=viewRfe&CR_ID=117919 I hope this time it works at 1. Place...... Thanks in advance Willi -------------- next part -------------- A non-text attachment was scrubbed... Name: smime.p7s Type: application/pkcs7-signature Size: 5461 bytes Desc: not available URL: From puthuppu at iu.edu Wed Mar 21 17:30:19 2018 From: puthuppu at iu.edu (Uthuppuru, Peter K) Date: Wed, 21 Mar 2018 17:30:19 +0000 Subject: [gpfsug-discuss] Hello Message-ID: <857be7f3815441c0a8e55816e61b6735@BL-CCI-D2S08.ads.iu.edu> Hello all, My name is Peter Uthuppuru and I work at Indiana University on the Research Storage team. I'm new to GPFS, HPC, etc. so I'm excited to learn more. 
Thanks, Peter -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: smime.p7s Type: application/pkcs7-signature Size: 5615 bytes Desc: not available URL: From Robert.Oesterlin at nuance.com Fri Mar 23 12:59:51 2018 From: Robert.Oesterlin at nuance.com (Oesterlin, Robert) Date: Fri, 23 Mar 2018 12:59:51 +0000 Subject: [gpfsug-discuss] Reminder - SSUG-US Spring meeting - Call for Speakers and Registration Message-ID: <28FF6725-9698-4F68-BC4E-BBCD6164BB3D@nuance.com> Reminder: The registration for the Spring meeting of the SSUG-USA is now open. This is a Free two-day and will include a large number of Spectrum Scale updates and breakout tracks. Please note that we have limited meeting space so please register early if you plan on attending. If you are interested in presenting, please contact me. We do have a few more slots for user presentations ? these do not need to be long. You can register here: https://www.eventbrite.com/e/spectrum-scale-gpfs-user-group-us-spring-2018-meeting-tickets-43662759489 DATE AND TIME Wed, May 16, 2018, 9:00 AM ? Thu, May 17, 2018, 5:00 PM EDT LOCATION IBM Cambridge Innovation Center One Rogers Street Cambridge, MA 02142-1203 Bob Oesterlin Sr Principal Storage Engineer, Nuance -------------- next part -------------- An HTML attachment was scrubbed... URL: From Paul.Caron at sig.com Fri Mar 23 20:10:05 2018 From: Paul.Caron at sig.com (Caron, Paul) Date: Fri, 23 Mar 2018 20:10:05 +0000 Subject: [gpfsug-discuss] Pool layoutMap option changes following GPFS upgrades Message-ID: <980f9a27290342e28197154b924d0abf@msx.bala.susq.com> Hi, Has anyone run into a situation where the layoutMap option for a pool changes from "scatter" to "cluster" following a GPFS software upgrade? We recently upgraded a file system from 3.5.0.21, to 4.1.1.17, and finally to 4.2.3.6. We noticed that the layoutMap option for two of our pools changed following the upgrades. We didn't recreate the file system or any of the pools. Further lab testing has revealed that the layoutMap option change actually occurred during the first upgrade to 4.1.1.17, and it was simply carried forward to 4.2.3.6. We have a PMR open with IBM on this problem, but they have told us that layoutMap option changes are impossible for existing pools, and that a software upgrade couldn't do this. I sent the results of my lab testing today, so I'm hoping to get a better response. We would rather not have to recreate all the pools, but it is starting to look like that may be the only option to fix this. Also, it's unclear if this could happen again during future upgrades. Here's some additional background. * The "-j" option for the file system is "cluster" * We have a pretty small cluster; just 13 nodes * When reproducing the problem, we noted that the layoutMap option didn't change until the final node was upgraded * The layoutMap option changed before running the "mmchconfig release=LATEST" and "mmchfs -V full" commands, so those don't seem to be related to the problem Thanks, Paul C. SIG ________________________________ IMPORTANT: The information contained in this email and/or its attachments is confidential. If you are not the intended recipient, please notify the sender immediately by reply and immediately delete this message and all its attachments. Any review, use, reproduction, disclosure or dissemination of this message or any attachment by an unintended recipient is strictly prohibited. 
Neither this message nor any attachment is intended as or should be construed as an offer, solicitation or recommendation to buy or sell any security or other financial instrument. Neither the sender, his or her employer nor any of their respective affiliates makes any warranties as to the completeness or accuracy of any of the information contained herein or that this message or any of its attachments is free of viruses. -------------- next part -------------- An HTML attachment was scrubbed... URL: From G.Horton at bham.ac.uk Mon Mar 26 12:25:26 2018 From: G.Horton at bham.ac.uk (Gareth Horton) Date: Mon, 26 Mar 2018 11:25:26 +0000 Subject: [gpfsug-discuss] GPFS Encryption Message-ID: <7F99F86D-9818-45FC-92AD-971CF44B0462@bham.ac.uk> Hi. All, I would be interested to hear if any members have experience implementing Encryption?, any gotchas, tips or any other information which may help with the preparation and implementation stages would be appreciated. I am currently reading through the documentation and reviewing the preparation steps, and with a scheduled maintenance window on the horizon it would be a good opportunity to carry out any preparatory steps requiring an outage. If there are any aspects of the configuration which in hindsight could have been done at the preparation stage this would be especially useful. Many Thanks Gareth ---------------------- On 24/03/2018, 12:00, "gpfsug-discuss-bounces at spectrumscale.org on behalf of gpfsug-discuss-request at spectrumscale.org" wrote: Send gpfsug-discuss mailing list submissions to gpfsug-discuss at spectrumscale.org To subscribe or unsubscribe via the World Wide Web, visit http://gpfsug.org/mailman/listinfo/gpfsug-discuss or, via email, send a message with subject or body 'help' to gpfsug-discuss-request at spectrumscale.org You can reach the person managing the list at gpfsug-discuss-owner at spectrumscale.org When replying, please edit your Subject line so it is more specific than "Re: Contents of gpfsug-discuss digest..." Today's Topics: 1. Reminder - SSUG-US Spring meeting - Call for Speakers and Registration (Oesterlin, Robert) 2. Pool layoutMap option changes following GPFS upgrades (Caron, Paul) ---------------------------------------------------------------------- Message: 1 Date: Fri, 23 Mar 2018 12:59:51 +0000 From: "Oesterlin, Robert" To: gpfsug main discussion list Subject: [gpfsug-discuss] Reminder - SSUG-US Spring meeting - Call for Speakers and Registration Message-ID: <28FF6725-9698-4F68-BC4E-BBCD6164BB3D at nuance.com> Content-Type: text/plain; charset="utf-8" Reminder: The registration for the Spring meeting of the SSUG-USA is now open. This is a Free two-day and will include a large number of Spectrum Scale updates and breakout tracks. Please note that we have limited meeting space so please register early if you plan on attending. If you are interested in presenting, please contact me. We do have a few more slots for user presentations ? these do not need to be long. You can register here: https://www.eventbrite.com/e/spectrum-scale-gpfs-user-group-us-spring-2018-meeting-tickets-43662759489 DATE AND TIME Wed, May 16, 2018, 9:00 AM ? Thu, May 17, 2018, 5:00 PM EDT LOCATION IBM Cambridge Innovation Center One Rogers Street Cambridge, MA 02142-1203 Bob Oesterlin Sr Principal Storage Engineer, Nuance -------------- next part -------------- An HTML attachment was scrubbed... 
URL: ------------------------------ Message: 2 Date: Fri, 23 Mar 2018 20:10:05 +0000 From: "Caron, Paul" To: "gpfsug-discuss at spectrumscale.org" Subject: [gpfsug-discuss] Pool layoutMap option changes following GPFS upgrades Message-ID: <980f9a27290342e28197154b924d0abf at msx.bala.susq.com> Content-Type: text/plain; charset="us-ascii" Hi, Has anyone run into a situation where the layoutMap option for a pool changes from "scatter" to "cluster" following a GPFS software upgrade? We recently upgraded a file system from 3.5.0.21, to 4.1.1.17, and finally to 4.2.3.6. We noticed that the layoutMap option for two of our pools changed following the upgrades. We didn't recreate the file system or any of the pools. Further lab testing has revealed that the layoutMap option change actually occurred during the first upgrade to 4.1.1.17, and it was simply carried forward to 4.2.3.6. We have a PMR open with IBM on this problem, but they have told us that layoutMap option changes are impossible for existing pools, and that a software upgrade couldn't do this. I sent the results of my lab testing today, so I'm hoping to get a better response. We would rather not have to recreate all the pools, but it is starting to look like that may be the only option to fix this. Also, it's unclear if this could happen again during future upgrades. Here's some additional background. * The "-j" option for the file system is "cluster" * We have a pretty small cluster; just 13 nodes * When reproducing the problem, we noted that the layoutMap option didn't change until the final node was upgraded * The layoutMap option changed before running the "mmchconfig release=LATEST" and "mmchfs -V full" commands, so those don't seem to be related to the problem Thanks, Paul C. SIG ________________________________ IMPORTANT: The information contained in this email and/or its attachments is confidential. If you are not the intended recipient, please notify the sender immediately by reply and immediately delete this message and all its attachments. Any review, use, reproduction, disclosure or dissemination of this message or any attachment by an unintended recipient is strictly prohibited. Neither this message nor any attachment is intended as or should be construed as an offer, solicitation or recommendation to buy or sell any security or other financial instrument. Neither the sender, his or her employer nor any of their respective affiliates makes any warranties as to the completeness or accuracy of any of the information contained herein or that this message or any of its attachments is free of viruses. -------------- next part -------------- An HTML attachment was scrubbed... URL: ------------------------------ _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss End of gpfsug-discuss Digest, Vol 74, Issue 45 ********************************************** From chair at spectrumscale.org Mon Mar 26 12:52:26 2018 From: chair at spectrumscale.org (Simon Thompson (Spectrum Scale User Group Chair)) Date: Mon, 26 Mar 2018 12:52:26 +0100 Subject: [gpfsug-discuss] RFE Process ... Burning Issues Message-ID: <563267E8-EAE7-4C73-BA54-266DDE94AB02@spectrumscale.org> Hi All, We?ve been talking with product management about the RFE process and have agreed that we?ll try out a community-voting process. First up, we are piloting this idea, hopefully it will work out, but it may also need tweaks as we move forward. 
One of the things we?ve been asking for is for a better way for the Spectrum Scale user group community to vote on RFEs. Sure we get people posting to the list, but we?re looking at if we can make it a better/more formal process to support this. Talking with IBM, we also recognise that with a large number of RFEs, it can be difficult for them to track work tasks being completed, but with the community RFEs, there is a commitment to try and track them closely and report back on progress later in the year. To submit an RFE using this process, you must complete the form available at: https://ibm.box.com/v/EnhBlitz (Enhancement Blitz template v1.pptx) The form provides some guidance on a good and bad RFE. Sure a lot of us are techie/engineers, so please try to explain what problem you are solving rather than trying to provide a solution. (i.e. leave the technical implementation details to those with the source code). Each site is limited to 2 submissions and they will be looked over by the Spectrum Scale community leaders, we may ask people to merge requests, send back for more info etc, or there may be some that we know will just never be progressed for various reasons. At the April user group in the UK, we have an RFE (Burning issues) session planned. Submitters of the RFE will be expected to provide a 1-3 minute pitch for their RFE. We?ve placed the session at the end of the day (UK time) to try and ensure USA people can participate. Remote presentation of your RFE is fine and we plan to live-stream the session. Each person will have 3 votes to choose what they think are their highest priority requests. Again remote voting is perfectly fine but only 3 votes per person. The requests with the highest number of votes will then be given a higher chance of being implemented. There?s a possibility that some may even make the winter release cycle. Either way, we plan to track the ?chosen? RFEs more closely and provide an update at the November USA meeting (likely the SC18 one). The submission and voting process is also planned to be run again in time for the November meeting. Anyone wanting to submit an RFE for consideration should submit the form by email to rfe at spectrumscaleug.org *before* 13th April. We?ll be posting the submitted RFEs up at the box site as well, you are encouraged to visit the site regularly and check the submissions as you may want to contact the author of an RFE to provide more information/support the RFE. Anything received after this date will be held over to the November cycle. The earlier you submit, the better chance it has of being included (we plan to limit the number to be considered) and will give us time to review the RFE and come back for more information/clarification if needed. You must also be prepared to provide a 1-3 minute pitch for your RFE (in person or remote) for the UK user group meeting. You are welcome to submit any RFE you have already put into the RFE portal for this process to garner community votes for it. There is space on the form to provide the existing RFE number. If you have any comments on the process, you can also email them to rfe at spectrumscaleug.org as well. Thanks to Carl Zeite for supporting this plan? Get submitting! Simon (UK Group Chair) -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From john.hearns at asml.com Mon Mar 26 13:14:35 2018 From: john.hearns at asml.com (John Hearns) Date: Mon, 26 Mar 2018 12:14:35 +0000 Subject: [gpfsug-discuss] GPFS Encryption In-Reply-To: <7F99F86D-9818-45FC-92AD-971CF44B0462@bham.ac.uk> References: <7F99F86D-9818-45FC-92AD-971CF44B0462@bham.ac.uk> Message-ID: Fbeel Tnergu. Pnaabg nqq nalguvta hfrshy urer. -----Original Message----- From: gpfsug-discuss-bounces at spectrumscale.org [mailto:gpfsug-discuss-bounces at spectrumscale.org] On Behalf Of Gareth Horton Sent: Monday, March 26, 2018 1:25 PM To: gpfsug-discuss at spectrumscale.org Subject: Re: [gpfsug-discuss] GPFS Encryption Hi. All, I would be interested to hear if any members have experience implementing Encryption?, any gotchas, tips or any other information which may help with the preparation and implementation stages would be appreciated. -- The information contained in this communication and any attachments is confidential and may be privileged, and is for the sole use of the intended recipient(s). Any unauthorized review, use, disclosure or distribution is prohibited. Unless explicitly stated otherwise in the body of this communication or the attachment thereto (if any), the information is provided on an AS-IS basis without any express or implied warranties or liabilities. To the extent you are relying on this information, you are doing so at your own risk. If you are not the intended recipient, please notify the sender immediately by replying to this message and destroy all copies of this message and any attachments. Neither the sender nor the company/group of companies he or she represents shall be liable for the proper and complete transmission of the information contained in this communication, or for any delay in its receipt. From S.J.Thompson at bham.ac.uk Mon Mar 26 13:46:47 2018 From: S.J.Thompson at bham.ac.uk (Simon Thompson (IT Research Support)) Date: Mon, 26 Mar 2018 12:46:47 +0000 Subject: [gpfsug-discuss] GPFS Encryption Message-ID: <45854D94-EE43-4931-BF07-E1BD6773EAAE@bham.ac.uk> John, I think we might need the decrypt key ... Simon ?On 26/03/2018, 13:29, "gpfsug-discuss-bounces at spectrumscale.org on behalf of john.hearns at asml.com" wrote: Fbeel Tnergu. Pnaabg nqq nalguvta hfrshy urer. From jtucker at pixitmedia.com Mon Mar 26 13:48:56 2018 From: jtucker at pixitmedia.com (Jez Tucker) Date: Mon, 26 Mar 2018 13:48:56 +0100 Subject: [gpfsug-discuss] GPFS Encryption In-Reply-To: <45854D94-EE43-4931-BF07-E1BD6773EAAE@bham.ac.uk> References: <45854D94-EE43-4931-BF07-E1BD6773EAAE@bham.ac.uk> Message-ID: <3c977368-266f-1ec3-bafc-03adf872bb4d@pixitmedia.com> Try.... http://www.rot13.com/ On 26/03/18 13:46, Simon Thompson (IT Research Support) wrote: > John, > > I think we might need the decrypt key ... > > Simon > > ?On 26/03/2018, 13:29, "gpfsug-discuss-bounces at spectrumscale.org on behalf of john.hearns at asml.com" wrote: > > Fbeel Tnergu. Pnaabg nqq nalguvta hfrshy urer. > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss -- *Jez Tucker* Head of Research and Development, Pixit Media 07764193820 | jtucker at pixitmedia.com www.pixitmedia.com | Tw:@pixitmedia.com -- This email is confidential in that it is intended for the exclusive attention of the addressee(s) indicated. If you are not the intended recipient, this email should not be read or disclosed to any other person. 
Please notify the sender immediately and delete this email from your computer system. Any opinions expressed are not necessarily those of the company from which this email was sent and, whilst to the best of our knowledge no viruses or defects exist, no responsibility can be accepted for any loss or damage arising from its receipt or subsequent use of this email. -------------- next part -------------- An HTML attachment was scrubbed... URL: From eric.wonderley at vt.edu Mon Mar 26 13:19:11 2018 From: eric.wonderley at vt.edu (J. Eric Wonderley) Date: Mon, 26 Mar 2018 08:19:11 -0400 Subject: [gpfsug-discuss] GPFS Encryption In-Reply-To: <7F99F86D-9818-45FC-92AD-971CF44B0462@bham.ac.uk> References: <7F99F86D-9818-45FC-92AD-971CF44B0462@bham.ac.uk> Message-ID: Hi Gareth: We have the spectrum archive product with encryption. It encrypts data on disk and tape...but not metadata. We originally had hoped to write small files with metadata...that does not happen with encryption. My guess is that the system pool(where metadata lives) cannot be encrypted. So you may pay a performance penalty for small files using encryption depending on what backends your data write policy. Eric On Mon, Mar 26, 2018 at 7:25 AM, Gareth Horton wrote: > Hi. All, > > I would be interested to hear if any members have experience implementing > Encryption?, any gotchas, tips or any other information which may help with > the preparation and implementation stages would be appreciated. > > I am currently reading through the documentation and reviewing the > preparation steps, and with a scheduled maintenance window on the horizon > it would be a good opportunity to carry out any preparatory steps requiring > an outage. > > If there are any aspects of the configuration which in hindsight could > have been done at the preparation stage this would be especially useful. > > Many Thanks > > Gareth > > ---------------------- > > On 24/03/2018, 12:00, "gpfsug-discuss-bounces at spectrumscale.org on behalf > of gpfsug-discuss-request at spectrumscale.org" spectrumscale.org on behalf of gpfsug-discuss-request at spectrumscale.org> > wrote: > > Send gpfsug-discuss mailing list submissions to > gpfsug-discuss at spectrumscale.org > > To subscribe or unsubscribe via the World Wide Web, visit > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > or, via email, send a message with subject or body 'help' to > gpfsug-discuss-request at spectrumscale.org > > You can reach the person managing the list at > gpfsug-discuss-owner at spectrumscale.org > > When replying, please edit your Subject line so it is more specific > than "Re: Contents of gpfsug-discuss digest..." > > > Today's Topics: > > 1. Reminder - SSUG-US Spring meeting - Call for Speakers and > Registration (Oesterlin, Robert) > 2. Pool layoutMap option changes following GPFS upgrades > (Caron, Paul) > > > ---------------------------------------------------------------------- > > Message: 1 > Date: Fri, 23 Mar 2018 12:59:51 +0000 > From: "Oesterlin, Robert" > To: gpfsug main discussion list > Subject: [gpfsug-discuss] Reminder - SSUG-US Spring meeting - Call for > Speakers and Registration > Message-ID: <28FF6725-9698-4F68-BC4E-BBCD6164BB3D at nuance.com> > Content-Type: text/plain; charset="utf-8" > > Reminder: The registration for the Spring meeting of the SSUG-USA is > now open. This is a Free two-day and will include a large number of > Spectrum Scale updates and breakout tracks. 
> > Please note that we have limited meeting space so please register > early if you plan on attending. If you are interested in presenting, please > contact me. We do have a few more slots for user presentations ? these do > not need to be long. > > You can register here: > > https://www.eventbrite.com/e/spectrum-scale-gpfs-user- > group-us-spring-2018-meeting-tickets-43662759489 > > DATE AND TIME > Wed, May 16, 2018, 9:00 AM ? > Thu, May 17, 2018, 5:00 PM EDT > > LOCATION > IBM Cambridge Innovation Center > One Rogers Street > Cambridge, MA 02142-1203 > > > Bob Oesterlin > Sr Principal Storage Engineer, Nuance > > -------------- next part -------------- > An HTML attachment was scrubbed... > URL: 20180323/824dbcdc/attachment-0001.html> > > ------------------------------ > > Message: 2 > Date: Fri, 23 Mar 2018 20:10:05 +0000 > From: "Caron, Paul" > To: "gpfsug-discuss at spectrumscale.org" > > Subject: [gpfsug-discuss] Pool layoutMap option changes following GPFS > upgrades > Message-ID: <980f9a27290342e28197154b924d0abf at msx.bala.susq.com> > Content-Type: text/plain; charset="us-ascii" > > Hi, > > Has anyone run into a situation where the layoutMap option for a pool > changes from "scatter" to "cluster" following a GPFS software upgrade? We > recently upgraded a file system from 3.5.0.21, to 4.1.1.17, and finally to > 4.2.3.6. We noticed that the layoutMap option for two of our pools changed > following the upgrades. We didn't recreate the file system or any of the > pools. Further lab testing has revealed that the layoutMap option change > actually occurred during the first upgrade to 4.1.1.17, and it was simply > carried forward to 4.2.3.6. We have a PMR open with IBM on this problem, > but they have told us that layoutMap option changes are impossible for > existing pools, and that a software upgrade couldn't do this. I sent the > results of my lab testing today, so I'm hoping to get a better response. > > We would rather not have to recreate all the pools, but it is starting > to look like that may be the only option to fix this. Also, it's unclear > if this could happen again during future upgrades. > > Here's some additional background. > > * The "-j" option for the file system is "cluster" > > * We have a pretty small cluster; just 13 nodes > > * When reproducing the problem, we noted that the layoutMap > option didn't change until the final node was upgraded > > * The layoutMap option changed before running the "mmchconfig > release=LATEST" and "mmchfs -V full" commands, so those don't seem to > be related to the problem > > Thanks, > > Paul C. > SIG > > > ________________________________ > > IMPORTANT: The information contained in this email and/or its > attachments is confidential. If you are not the intended recipient, please > notify the sender immediately by reply and immediately delete this message > and all its attachments. Any review, use, reproduction, disclosure or > dissemination of this message or any attachment by an unintended recipient > is strictly prohibited. Neither this message nor any attachment is intended > as or should be construed as an offer, solicitation or recommendation to > buy or sell any security or other financial instrument. Neither the sender, > his or her employer nor any of their respective affiliates makes any > warranties as to the completeness or accuracy of any of the information > contained herein or that this message or any of its attachments is free of > viruses. 
> -------------- next part -------------- > An HTML attachment was scrubbed... > URL: 20180323/181b0ac7/attachment-0001.html> > > ------------------------------ > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > > End of gpfsug-discuss Digest, Vol 74, Issue 45 > ********************************************** > > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > -------------- next part -------------- An HTML attachment was scrubbed... URL: From Paul.Caron at sig.com Mon Mar 26 16:43:24 2018 From: Paul.Caron at sig.com (Caron, Paul) Date: Mon, 26 Mar 2018 15:43:24 +0000 Subject: [gpfsug-discuss] Pool layoutMap option changes following GPFS upgrades Message-ID: <9b442159716e43f6a621c21f71067c0a@msx.bala.susq.com> By the way, the command to check the layoutMap option for your pools is "mmlspool all -L". Has anyone else noticed if this option changed during your GPFS software upgrades? Here's how our mmlspool output looked for our lab/test environment under GPFS Version 3.5.0-21: Pool: name = system poolID = 0 blockSize = 1024 KB usage = metadataOnly maxDiskSize = 16 TB layoutMap = cluster allowWriteAffinity = no writeAffinityDepth = 0 blockGroupFactor = 1 Pool: name = writecache poolID = 65537 blockSize = 1024 KB usage = dataOnly maxDiskSize = 8.0 TB layoutMap = scatter allowWriteAffinity = no writeAffinityDepth = 0 blockGroupFactor = 1 Pool: name = data poolID = 65538 blockSize = 1024 KB usage = dataOnly maxDiskSize = 8.2 TB layoutMap = scatter allowWriteAffinity = no writeAffinityDepth = 0 blockGroupFactor = 1 Here's the mmlspool output immediately after the upgrade to 4.1.1-17: Pool: name = system poolID = 0 blockSize = 1024 KB usage = metadataOnly maxDiskSize = 16 TB layoutMap = cluster allowWriteAffinity = no writeAffinityDepth = 0 blockGroupFactor = 1 Pool: name = writecache poolID = 65537 blockSize = 1024 KB usage = dataOnly maxDiskSize = 8.0 TB layoutMap = cluster allowWriteAffinity = no writeAffinityDepth = 0 blockGroupFactor = 1 Pool: name = data poolID = 65538 blockSize = 1024 KB usage = dataOnly maxDiskSize = 8.2 TB layoutMap = cluster allowWriteAffinity = no writeAffinityDepth = 0 blockGroupFactor = 1 We also determined the following: * The layoutMap option changes back to "scatter" if we revert back to 3.5.0.21. It only changes back after the last node is downgraded. * Restarting GPFS under 4.1.1-17 (via mmshutdown and mmstartup) has no effect on layoutMap in the lab (as expected). So, a simple restart doesn't fix the problem. Our production and lab deployments are using SLES 11, SP3 (3.0.101-0.47.71-default). Thanks, Paul C. SIG From: gpfsug-discuss-bounces at spectrumscale.org [mailto:gpfsug-discuss-bounces at spectrumscale.org] On Behalf Of Caron, Paul Sent: Friday, March 23, 2018 4:10 PM To: gpfsug-discuss at spectrumscale.org Subject: [gpfsug-discuss] Pool layoutMap option changes following GPFS upgrades Hi, Has anyone run into a situation where the layoutMap option for a pool changes from "scatter" to "cluster" following a GPFS software upgrade? We recently upgraded a file system from 3.5.0.21, to 4.1.1.17, and finally to 4.2.3.6. We noticed that the layoutMap option for two of our pools changed following the upgrades. We didn't recreate the file system or any of the pools. 
Further lab testing has revealed that the layoutMap option change actually occurred during the first upgrade to 4.1.1.17, and it was simply carried forward to 4.2.3.6. We have a PMR open with IBM on this problem, but they have told us that layoutMap option changes are impossible for existing pools, and that a software upgrade couldn't do this. I sent the results of my lab testing today, so I'm hoping to get a better response.

We would rather not have to recreate all the pools, but it is starting to look like that may be the only option to fix this. Also, it's unclear if this could happen again during future upgrades.

Here's some additional background.

* The "-j" option for the file system is "cluster"

* We have a pretty small cluster; just 13 nodes

* When reproducing the problem, we noted that the layoutMap option didn't change until the final node was upgraded

* The layoutMap option changed before running the "mmchconfig release=LATEST" and "mmchfs -V full" commands, so those don't seem to be related to the problem

Thanks,

Paul C.
SIG

________________________________

IMPORTANT: The information contained in this email and/or its attachments is confidential. If you are not the intended recipient, please notify the sender immediately by reply and immediately delete this message and all its attachments. Any review, use, reproduction, disclosure or dissemination of this message or any attachment by an unintended recipient is strictly prohibited. Neither this message nor any attachment is intended as or should be construed as an offer, solicitation or recommendation to buy or sell any security or other financial instrument. Neither the sender, his or her employer nor any of their respective affiliates makes any warranties as to the completeness or accuracy of any of the information contained herein or that this message or any of its attachments is free of viruses.

-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From JRLang at uwyo.edu Mon Mar 26 22:13:39 2018
From: JRLang at uwyo.edu (Jeffrey R. Lang)
Date: Mon, 26 Mar 2018 21:13:39 +0000
Subject: Re: [gpfsug-discuss] GPFS autoload - wait for IB portstobecomeactive
In-Reply-To: 
References: <0081EB235765E14395278B9AE1DF341846510A@MBX214.d.ethz.ch><4AD44D34-5275-4ADB-8CC7-8E80170DDA7F@brown.edu>
Message-ID: 

Can someone provide some clarification to this error message in my system logs:

mmfs: [E] The on-disk StripeGroup descriptor of dcs3800u31b_lun7 sgId 0x0B00620A:9C84DF56 is not valid because of bad checksum: Mar 26 12:25:50 mmmnsd2 mmfs: 'mmfsadm writeDesc sg 0B00620A:9C84DF56 2 /var/mmfs/tmp/sg_gscratch_dcs3800u31b_lun7', where device is the device name of that NSD.
I?ve been unable to find anything while googling that provides any details about the error. Anyone have any thoughts or commands? We are using GPFS 4.2.3.-6, under RedHat 6 and 7. The NSD nodes are all RHEL 6. Any help appreciated. Thanks Jeff -------------- next part -------------- An HTML attachment was scrubbed... URL: From Renar.Grunenberg at huk-coburg.de Tue Mar 27 07:29:06 2018 From: Renar.Grunenberg at huk-coburg.de (Grunenberg, Renar) Date: Tue, 27 Mar 2018 06:29:06 +0000 Subject: [gpfsug-discuss] GPFS autoload - wait for IB portstobecomeactive In-Reply-To: References: <0081EB235765E14395278B9AE1DF341846510A@MBX214.d.ethz.ch><4AD44D34-5275-4ADB-8CC7-8E80170DDA7F@brown.edu> Message-ID: <9a95b4b2c71748dfb4b39e23ffd4debf@SMXRF105.msg.hukrf.de> Hallo Jeff, you can check these with following cmd. mmfsadm dump nsdcksum Your in memory info is inconsistent with your descriptor structur on disk. The reason for this I had no idea. Renar Grunenberg Abteilung Informatik ? Betrieb HUK-COBURG Bahnhofsplatz 96444 Coburg Telefon: 09561 96-44110 Telefax: 09561 96-44104 E-Mail: Renar.Grunenberg at huk-coburg.de Internet: www.huk.de ________________________________ HUK-COBURG Haftpflicht-Unterst?tzungs-Kasse kraftfahrender Beamter Deutschlands a. G. in Coburg Reg.-Gericht Coburg HRB 100; St.-Nr. 9212/101/00021 Sitz der Gesellschaft: Bahnhofsplatz, 96444 Coburg Vorsitzender des Aufsichtsrats: Prof. Dr. Heinrich R. Schradin. Vorstand: Klaus-J?rgen Heitmann (Sprecher), Stefan Gronbach, Dr. Hans Olav Her?y, Dr. J?rg Rheinl?nder (stv.), Sarah R?ssler, Daniel Thomas. ________________________________ Diese Nachricht enth?lt vertrauliche und/oder rechtlich gesch?tzte Informationen. Wenn Sie nicht der richtige Adressat sind oder diese Nachricht irrt?mlich erhalten haben, informieren Sie bitte sofort den Absender und vernichten Sie diese Nachricht. Das unerlaubte Kopieren sowie die unbefugte Weitergabe dieser Nachricht ist nicht gestattet. This information may contain confidential and/or privileged information. If you are not the intended recipient (or have received this information in error) please notify the sender immediately and destroy this information. Any unauthorized copying, disclosure or distribution of the material in this information is strictly forbidden. ________________________________ Von: gpfsug-discuss-bounces at spectrumscale.org [mailto:gpfsug-discuss-bounces at spectrumscale.org] Im Auftrag von Jeffrey R. Lang Gesendet: Montag, 26. M?rz 2018 23:14 An: gpfsug main discussion list Betreff: Re: [gpfsug-discuss] GPFS autoload - wait for IB portstobecomeactive Can someone provide some clarification to this error message in my system logs: mmfs: [E] The on-disk StripeGroup descriptor of dcs3800u31b_lun7 sgId 0x0B00620A:9C84DF56 is not valid because of bad checksum: Mar 26 12:25:50 mmmnsd2 mmfs: 'mmfsadm writeDesc sg 0B00620A:9C84DF56 2 /var/mmfs/tmp/sg_gscratch_dcs3800u31b_lun7', where device is the device name of that NSD. I?ve been unable to find anything while googling that provides any details about the error. Anyone have any thoughts or commands? We are using GPFS 4.2.3.-6, under RedHat 6 and 7. The NSD nodes are all RHEL 6. Any help appreciated. Thanks Jeff -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From scale at us.ibm.com Tue Mar 27 07:44:29 2018 From: scale at us.ibm.com (IBM Spectrum Scale) Date: Tue, 27 Mar 2018 12:14:29 +0530 Subject: [gpfsug-discuss] GPFS autoload - wait for IB portstobecomeactive In-Reply-To: References: <0081EB235765E14395278B9AE1DF341846510A@MBX214.d.ethz.ch><4AD44D34-5275-4ADB-8CC7-8E80170DDA7F@brown.edu> Message-ID: This means that the stripe group descriptor on the disk dcs3800u31b_lun7 is corrupted. As we maintain copies of the stripe group descriptor on other disks as well we can copy the good descriptor from one of those disks to this one. Please open a PMR and work with IBM support to get this fixed. Regards, The Spectrum Scale (GPFS) team ------------------------------------------------------------------------------------------------------------------ If you feel that your question can benefit other users of Spectrum Scale (GPFS), then please post it to the public IBM developerWroks Forum at https://www.ibm.com/developerworks/community/forums/html/forum?id=11111111-0000-0000-0000-000000000479 . If your query concerns a potential software error in Spectrum Scale (GPFS) and you have an IBM software maintenance contract please contact 1-800-237-5511 in the United States or your local IBM Service Center in other countries. The forum is informally monitored as time permits and should not be used for priority messages to the Spectrum Scale (GPFS) team. From: "Jeffrey R. Lang" To: gpfsug main discussion list Date: 03/27/2018 04:15 AM Subject: Re: [gpfsug-discuss] GPFS autoload - wait for IB portstobecomeactive Sent by: gpfsug-discuss-bounces at spectrumscale.org Can someone provide some clarification to this error message in my system logs: mmfs: [E] The on-disk StripeGroup descriptor of dcs3800u31b_lun7 sgId 0x0B00620A:9C84DF56 is not valid because of bad checksum: Mar 26 12:25:50 mmmnsd2 mmfs: 'mmfsadm writeDesc sg 0B00620A:9C84DF56 2 /var/mmfs/tmp/sg_gscratch_dcs3800u31b_lun7', where device is the device name of that NSD. I?ve been unable to find anything while googling that provides any details about the error. Anyone have any thoughts or commands? We are using GPFS 4.2.3.-6, under RedHat 6 and 7. The NSD nodes are all RHEL 6. Any help appreciated. Thanks Jeff_______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org https://urldefense.proofpoint.com/v2/url?u=http-3A__gpfsug.org_mailman_listinfo_gpfsug-2Ddiscuss&d=DwICAg&c=jf_iaSHvJObTbx-siA1ZOg&r=IbxtjdkPAM2Sbon4Lbbi4w&m=3u8q7zs1oLvf23bMVLe5YO_0SFSILFiL1d85LRDp9aQ&s=lf2ivnySwvhLDS-AnJSbm6cWcpO2R-vdHOll5TvkBDU&e= -------------- next part -------------- An HTML attachment was scrubbed... URL: From sandeep.patil at in.ibm.com Tue Mar 27 12:53:50 2018 From: sandeep.patil at in.ibm.com (Sandeep Ramesh) Date: Tue, 27 Mar 2018 17:23:50 +0530 Subject: [gpfsug-discuss] Latest Technical Blogs on Spectrum Scale In-Reply-To: References: Message-ID: Dear User Group Members, In continuation , here are list of development blogs in the this quarter (Q1 2018). As discussed in User Groups, passing it along: GDPR Compliance and Unstructured Data Storage https://developer.ibm.com/storage/2018/03/27/gdpr-compliance-unstructure-data-storage/ IBM Spectrum Scale for Linux on IBM Z ? 
Release 5.0 features and highlights https://developer.ibm.com/storage/2018/03/09/ibm-spectrum-scale-linux-ibm-z-release-5-0-features-highlights/ Management GUI enhancements in IBM Spectrum Scale release 5.0.0 https://developer.ibm.com/storage/2018/01/18/gui-enhancements-in-spectrum-scale-release-5-0-0/ IBM Spectrum Scale 5.0.0 ? What?s new in NFS? https://developer.ibm.com/storage/2018/01/18/ibm-spectrum-scale-5-0-0-whats-new-nfs/ Benefits and implementation of Spectrum Scale sudo wrappers https://developer.ibm.com/storage/2018/01/15/benefits-implementation-spectrum-scale-sudo-wrappers/ IBM Spectrum Scale: Big Data and Analytics Solution Brief https://developer.ibm.com/storage/2018/01/15/ibm-spectrum-scale-big-data-analytics-solution-brief/ Variant Sub-blocks in Spectrum Scale 5.0 https://developer.ibm.com/storage/2018/01/11/spectrum-scale-variant-sub-blocks/ Compression support in Spectrum Scale 5.0.0 https://developer.ibm.com/storage/2018/01/11/compression-support-spectrum-scale-5-0-0/ IBM Spectrum Scale Versus Apache Hadoop HDFS https://developer.ibm.com/storage/2018/01/10/spectrumscale_vs_hdfs/ ESS Fault Tolerance https://developer.ibm.com/storage/2018/01/09/ess-fault-tolerance/ Genomic Workloads ? How To Get it Right From Infrastructure Point Of View. https://developer.ibm.com/storage/2018/01/06/genomic-workloads-get-right-infrastructure-point-view/ IBM Spectrum Scale On AWS Cloud : This video explains how to deploy IBM Spectrum Scale on AWS. This solution helps the users who require highly available access to a shared name space across multiple instances with good performance, without requiring an in-depth knowledge of IBM Spectrum Scale. Detailed Demo : https://www.youtube.com/watch?v=6j5Xj_d0bh4 Brief Demo : https://www.youtube.com/watch?v=-aMQKPW_RfY. For more : Search /browse here: https://developer.ibm.com/storage/blog Consolidation list: https://www.ibm.com/developerworks/community/wikis/home?lang=en#!/wiki/General%20Parallel%20File%20System%20(GPFS)/page/White%20Papers%20%26%20Media From: Sandeep Ramesh/India/IBM To: gpfsug-discuss at spectrumscale.org Cc: Doris Conti/Poughkeepsie/IBM at IBMUS Date: 01/10/2018 12:13 PM Subject: Re: Latest Technical Blogs on Spectrum Scale Dear User Group Members, Here are list of development blogs in the last quarter. Passing it to this email group as Doris had got a feedback in the UG meetings to notify the members with the latest updates periodically. Genomic Workloads ? How To Get it Right From Infrastructure Point Of View. https://developer.ibm.com/storage/2018/01/06/genomic-workloads-get-right-infrastructure-point-view/ IBM Spectrum Scale Versus Apache Hadoop HDFS https://developer.ibm.com/storage/2018/01/10/spectrumscale_vs_hdfs/ ESS Fault Tolerance https://developer.ibm.com/storage/2018/01/09/ess-fault-tolerance/ IBM Spectrum Scale MMFSCK ? Savvy Enhancements https://developer.ibm.com/storage/2018/01/05/ibm-spectrum-scale-mmfsck-savvy-enhancements/ ESS Disk Management https://developer.ibm.com/storage/2018/01/02/ess-disk-management/ IBM Spectrum Scale Object Protocol On Ubuntu https://developer.ibm.com/storage/2018/01/01/ibm-spectrum-scale-object-protocol-ubuntu/ IBM Spectrum Scale 5.0 ? Whats new in Unified File and Object https://developer.ibm.com/storage/2017/12/20/ibm-spectrum-scale-5-0-whats-new-object/ A Complete Guide to ? Protocol Problem Determination Guide for IBM Spectrum Scale? ? 
Part 1 https://developer.ibm.com/storage/2017/12/19/complete-guide-protocol-problem-determination-guide-ibm-spectrum-scale-1/ IBM Spectrum Scale installation toolkit ? enhancements over releases https://developer.ibm.com/storage/2017/12/15/ibm-spectrum-scale-installation-toolkit-enhancements-releases/ Network requirements in an Elastic Storage Server Setup https://developer.ibm.com/storage/2017/12/13/network-requirements-in-an-elastic-storage-server-setup/ Co-resident migration with Transparent cloud tierin https://developer.ibm.com/storage/2017/12/05/co-resident-migration-transparent-cloud-tierin/ IBM Spectrum Scale on Hortonworks HDP Hadoop clusters : A Complete Big Data Solution https://developer.ibm.com/storage/2017/12/05/ibm-spectrum-scale-hortonworks-hdp-hadoop-clusters-complete-big-data-solution/ Big data analytics with Spectrum Scale using remote cluster mount & multi-filesystem support https://developer.ibm.com/storage/2017/11/28/big-data-analytics-spectrum-scale-using-remote-cluster-mount-multi-filesystem-support/ IBM Spectrum Scale HDFS Transparency Short Circuit Write Support https://developer.ibm.com/storage/2017/11/28/ibm-spectrum-scale-hdfs-transparency-short-circuit-write-support/ IBM Spectrum Scale HDFS Transparency Federation Support https://developer.ibm.com/storage/2017/11/27/ibm-spectrum-scale-hdfs-transparency-federation-support/ How to configure and performance tuning different system workloads on IBM Spectrum Scale Sharing Nothing Cluster https://developer.ibm.com/storage/2017/11/27/configure-performance-tuning-different-system-workloads-ibm-spectrum-scale-sharing-nothing-cluster/ How to configure and performance tuning Spark workloads on IBM Spectrum Scale Sharing Nothing Cluster https://developer.ibm.com/storage/2017/11/27/configure-performance-tuning-spark-workloads-ibm-spectrum-scale-sharing-nothing-cluster/ How to configure and performance tuning database workloads on IBM Spectrum Scale Sharing Nothing Cluster https://developer.ibm.com/storage/2017/11/27/configure-performance-tuning-database-workloads-ibm-spectrum-scale-sharing-nothing-cluster/ How to configure and performance tuning Hadoop workloads on IBM Spectrum Scale Sharing Nothing Cluster https://developer.ibm.com/storage/2017/11/24/configure-performance-tuning-hadoop-workloads-ibm-spectrum-scale-sharing-nothing-cluster/ IBM Spectrum Scale Sharing Nothing Cluster Performance Tuning https://developer.ibm.com/storage/2017/11/24/ibm-spectrum-scale-sharing-nothing-cluster-performance-tuning/ How to Configure IBM Spectrum Scale? with NIS based Authentication. https://developer.ibm.com/storage/2017/11/21/configure-ibm-spectrum-scale-nis-based-authentication/ For more : Search /browse here: https://developer.ibm.com/storage/blog Consolidation list: https://www.ibm.com/developerworks/community/wikis/home?lang=en#!/wiki/General%20Parallel%20File%20System%20(GPFS)/page/White%20Papers%20%26%20Media From: Sandeep Ramesh/India/IBM To: gpfsug-discuss at spectrumscale.org Cc: Doris Conti/Poughkeepsie/IBM at IBMUS Date: 11/16/2017 08:15 PM Subject: Latest Technical Blogs on Spectrum Scale Dear User Group members, Here are the Development Blogs in last 3 months on Spectrum Scale Technical Topics. Spectrum Scale Monitoring ? Know More ? https://developer.ibm.com/storage/2017/11/16/spectrum-scale-monitoring-know/ IBM Spectrum Scale 5.0 Release ? What?s coming ! 
https://developer.ibm.com/storage/2017/11/14/ibm-spectrum-scale-5-0-release-whats-coming/ Four Essentials things to know for managing data ACLs on IBM Spectrum Scale? from Windows https://developer.ibm.com/storage/2017/11/13/four-essentials-things-know-managing-data-acls-ibm-spectrum-scale-windows/ GSSUTILS: A new way of running SSR, Deploying or Upgrading ESS Server https://developer.ibm.com/storage/2017/11/13/gssutils/ IBM Spectrum Scale Object Authentication https://developer.ibm.com/storage/2017/11/02/spectrum-scale-object-authentication/ Video Surveillance ? Choosing the right storage https://developer.ibm.com/storage/2017/11/02/video-surveillance-choosing-right-storage/ IBM Spectrum scale object deep dive training with problem determination https://www.slideshare.net/SmitaRaut/ibm-spectrum-scale-object-deep-dive-training Spectrum Scale as preferred software defined storage for Ubuntu OpenStack https://developer.ibm.com/storage/2017/09/29/spectrum-scale-preferred-software-defined-storage-ubuntu-openstack/ IBM Elastic Storage Server 2U24 Storage ? an All-Flash offering, a performance workhorse https://developer.ibm.com/storage/2017/10/06/ess-5-2-flash-storage/ A Complete Guide to Configure LDAP-based authentication with IBM Spectrum Scale? for File Access https://developer.ibm.com/storage/2017/09/21/complete-guide-configure-ldap-based-authentication-ibm-spectrum-scale-file-access/ Deploying IBM Spectrum Scale on AWS Quick Start https://developer.ibm.com/storage/2017/09/18/deploy-ibm-spectrum-scale-on-aws-quick-start/ Monitoring Spectrum Scale Object metrics https://developer.ibm.com/storage/2017/09/14/monitoring-spectrum-scale-object-metrics/ Tier your data with ease to Spectrum Scale Private Cloud(s) using Moonwalk Universal https://developer.ibm.com/storage/2017/09/14/tier-data-ease-spectrum-scale-private-clouds-using-moonwalk-universal/ Why do I see owner as ?Nobody? for my export mounted using NFSV4 Protocol on IBM Spectrum Scale?? https://developer.ibm.com/storage/2017/09/08/see-owner-nobody-export-mounted-using-nfsv4-protocol-ibm-spectrum-scale/ IBM Spectrum Scale? Authentication using Active Directory and LDAP https://developer.ibm.com/storage/2017/09/01/ibm-spectrum-scale-authentication-using-active-directory-ldap/ IBM Spectrum Scale? Authentication using Active Directory and RFC2307 https://developer.ibm.com/storage/2017/09/01/ibm-spectrum-scale-authentication-using-active-directory-rfc2307/ High Availability Implementation with IBM Spectrum Virtualize and IBM Spectrum Scale https://developer.ibm.com/storage/2017/08/30/high-availability-implementation-ibm-spectrum-virtualize-ibm-spectrum-scale/ 10 Frequently asked Questions on configuring Authentication using AD + AUTO ID mapping on IBM Spectrum Scale?. https://developer.ibm.com/storage/2017/08/04/10-frequently-asked-questions-configuring-authentication-using-ad-auto-id-mapping-ibm-spectrum-scale/ IBM Spectrum Scale? Authentication using Active Directory https://developer.ibm.com/storage/2017/07/30/ibm-spectrum-scale-auth-using-active-directory/ Five cool things that you didn?t know Transparent Cloud Tiering on Spectrum Scale can do https://developer.ibm.com/storage/2017/07/29/five-cool-things-didnt-know-transparent-cloud-tiering-spectrum-scale-can/ IBM Spectrum Scale GUI videos https://developer.ibm.com/storage/2017/07/25/ibm-spectrum-scale-gui-videos/ IBM Spectrum Scale? Authentication ? 
Planning for NFS Access https://developer.ibm.com/storage/2017/07/24/ibm-spectrum-scale-planning-nfs-access/ For more : Search /browse here: https://developer.ibm.com/storage/blog Consolidation list: https://www.ibm.com/developerworks/community/wikis/home?lang=en#!/wiki/General%20Parallel%20File%20System%20(GPFS)/page/White%20Papers%20%26%20Media -------------- next part -------------- An HTML attachment was scrubbed... URL: From bipcuds at gmail.com Tue Mar 27 23:26:16 2018 From: bipcuds at gmail.com (Keith Ball) Date: Tue, 27 Mar 2018 18:26:16 -0400 Subject: [gpfsug-discuss] Control of where mmcrsnapshot runs? Message-ID: Hi All, Two questions on snapshots: 1.) I note that neither "mmcrsnapshot" nor "mmdelsnapshot" appear to have an "-N" option as "PIT" commands typically do. Is there any way to control where threads for snapshot creation/deletion run? (I assume the filesystem manager will always be involved regardless). 2.) When mmdelsnapshot hangs or times out, the error messages tend to appear on client nodes, and not necessarily the node where mmdelsnapshot is run from, not the FS manager. Besides telling all users "don't use any I/O" when runnign these commands, are there ways that folks have found to avoid hangs and timeouts of mmdelsnapshot? FWIW our filesystem manager is probably overextended (replication factor 2 on data+MD, 30 daily snapshots kept, a number of client clusters served, plus the FS manager is also an NSD server). Many Thanks, Keith RedLine Performance Solutions LLC -------------- next part -------------- An HTML attachment was scrubbed... URL: From bbanister at jumptrading.com Wed Mar 28 00:44:33 2018 From: bbanister at jumptrading.com (Bryan Banister) Date: Tue, 27 Mar 2018 23:44:33 +0000 Subject: [gpfsug-discuss] Control of where mmcrsnapshot runs? In-Reply-To: References: Message-ID: <7ae89940fa234b79b3538be339109cba@jumptrading.com> What version of GPFS are you running Keith? All nodes mounting the file system must briefly quiesce I/O operations during the snapshot create operations, hence the ?Quiescing all file system operations.? message in the output. So don?t really see a way to specify a specific set of nodes for these operations. They have made updates in newer releases of GPFS to combine operations (e.g. create and delete snapshots at the same time) which IBM says ?system performance is increased by batching operations and reducing overhead.?. Trying to isolate GPFS resources (e.g. cgroups) on the clients (e.g. CPU and memory resources dedicated to GPFS/SSH/kernel/networking/etc) can help them respond more quickly to quiesce I/O requests. HTH, -Bryan From: gpfsug-discuss-bounces at spectrumscale.org [mailto:gpfsug-discuss-bounces at spectrumscale.org] On Behalf Of Keith Ball Sent: Tuesday, March 27, 2018 5:26 PM To: gpfsug-discuss at spectrumscale.org Subject: [gpfsug-discuss] Control of where mmcrsnapshot runs? Note: External Email ________________________________ Hi All, Two questions on snapshots: 1.) I note that neither "mmcrsnapshot" nor "mmdelsnapshot" appear to have an "-N" option as "PIT" commands typically do. Is there any way to control where threads for snapshot creation/deletion run? (I assume the filesystem manager will always be involved regardless). 2.) When mmdelsnapshot hangs or times out, the error messages tend to appear on client nodes, and not necessarily the node where mmdelsnapshot is run from, not the FS manager. 
Besides telling all users "don't use any I/O" when runnign these commands, are there ways that folks have found to avoid hangs and timeouts of mmdelsnapshot? FWIW our filesystem manager is probably overextended (replication factor 2 on data+MD, 30 daily snapshots kept, a number of client clusters served, plus the FS manager is also an NSD server). Many Thanks, Keith RedLine Performance Solutions LLC ________________________________ Note: This email is for the confidential use of the named addressee(s) only and may contain proprietary, confidential or privileged information. If you are not the intended recipient, you are hereby notified that any review, dissemination or copying of this email is strictly prohibited, and to please notify the sender immediately and destroy this email and any attachments. Email transmission cannot be guaranteed to be secure or error-free. The Company, therefore, does not make any guarantees as to the completeness or accuracy of this email or any attachments. This email is for informational purposes only and does not constitute a recommendation, offer, request or solicitation of any kind to buy, sell, subscribe, redeem or perform any type of transaction of a financial product. -------------- next part -------------- An HTML attachment was scrubbed... URL: From Dwayne.Hart at med.mun.ca Wed Mar 28 15:56:55 2018 From: Dwayne.Hart at med.mun.ca (Dwayne.Hart at med.mun.ca) Date: Wed, 28 Mar 2018 14:56:55 +0000 Subject: [gpfsug-discuss] Introduction to the "gpfsug-discuss" mailing list Message-ID: Hi, My name is Dwayne Hart. I currently work for the Center for Health Informatics & Analytics (CHIA), Faculty of Medicine at Memorial University of Newfoundland as a Systems/Network Security Administrator. In this role I am responsible for several HPC (Intel and Power) instances, OpenStack cloud environment and research data. We leverage IBM Spectrum Scale Storage as our primary storage solution. I have been working with GPFS since 2015. Best, Dwayne --- Systems Administrator Center for Health Informatics & Analytics (CHIA) Craig L. Dobbin Center for Genetics Room 4M409 300 Prince Philip Dr. St. John?s, NL Canada A1B 3V6 Tel: (709) 864-6631 E Mail: dwayne.hart at med.mun.ca -------------- next part -------------- An HTML attachment was scrubbed... URL: From ingo.altenburger at id.ethz.ch Thu Mar 29 13:20:45 2018 From: ingo.altenburger at id.ethz.ch (Altenburger Ingo (ID SD)) Date: Thu, 29 Mar 2018 12:20:45 +0000 Subject: [gpfsug-discuss] REST API function for 'mmsmb exportacl list' Message-ID: We were very hopeful to replace our storage provisioning automation based on cli commands with the new functions provided in REST API. Since it seems that almost all protocol related commands are already implemented with 5.0.0.1 REST interface, we have still not found an equivalent for mmsmb exportacl list to get the share permissions of a share. Does anybody know that this is already in but not yet documented or is it for sure still not under consideration? Thanks Ingo -------------- next part -------------- An HTML attachment was scrubbed... URL: From delmard at br.ibm.com Thu Mar 29 14:41:53 2018 From: delmard at br.ibm.com (Delmar Demarchi) Date: Thu, 29 Mar 2018 10:41:53 -0300 Subject: [gpfsug-discuss] AFM-DR Questions Message-ID: Hello experts. We have a Scale project with AFM-DR to be implemented and after read the KC documentation, we have some questions about. 
- Do you know any reason why we changed the Recovery point objective (RPO) snapshots by 15 to 720 minutes in the version 5.0.0 of IBM Spectrum Scale AFM-DR? - Can we use additional Independent Peer-snapshots to reduce the RPO interval (720 minutes) of IBM Spectrum Scale AFM-DR? - In addition to the above question, can we use these snapshots to update the new primary site after a failover occur for the most up to date snapshot? - According to the documentation, we are not able to replicate Dependent filesets, but if these dependents filesets are part of an existing Independent fileset. Do you see any issues/concerns with this? Thank you in advance. Delmar Demarchi .'. (delmard at br.ibm.com) -------------- next part -------------- An HTML attachment was scrubbed... URL: From daniel.kidger at uk.ibm.com Thu Mar 29 17:00:57 2018 From: daniel.kidger at uk.ibm.com (Daniel Kidger) Date: Thu, 29 Mar 2018 16:00:57 +0000 Subject: [gpfsug-discuss] GPFS Encryption In-Reply-To: <3c977368-266f-1ec3-bafc-03adf872bb4d@pixitmedia.com> Message-ID: I tried a dictionary attack, but ?nalguvta? was a typo. Should have been: ?Fbeel Tnergu. Pnaabg nqq nalguvat hfrshy urer? ? John: anythign (sic) to add? :-) Daniel Dr Daniel Kidger IBM Technical Sales Specialist Software Defined Solution Sales +44-(0)7818 522 266 daniel.kidger at uk.ibm.com > On 26 Mar 2018, at 14:49, Jez Tucker wrote: > > Try.... http://www.rot13.com/ > >> On 26/03/18 13:46, Simon Thompson (IT Research Support) wrote: >> John, >> >> I think we might need the decrypt key ... >> >> Simon >> >> ?On 26/03/2018, 13:29, "gpfsug-discuss-bounces at spectrumscale.org on behalf of john.hearns at asml.com" wrote: >> >> Fbeel Tnergu. Pnaabg nqq nalguvta hfrshy urer. >> >> _______________________________________________ >> gpfsug-discuss mailing list >> gpfsug-discuss at spectrumscale.org >> http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > > -- > Jez Tucker > Head of Research and Development, Pixit Media > 07764193820 | jtucker at pixitmedia.com > www.pixitmedia.com | Tw:@pixitmedia.com > > > This email is confidential in that it is intended for the exclusive attention of the addressee(s) indicated. If you are not the intended recipient, this email should not be read or disclosed to any other person. Please notify the sender immediately and delete this email from your computer system. Any opinions expressed are not necessarily those of the company from which this email was sent and, whilst to the best of our knowledge no viruses or defects exist, no responsibility can be accepted for any loss or damage arising from its receipt or subsequent use of this email. Unless stated otherwise above: IBM United Kingdom Limited - Registered in England and Wales with number 741598. Registered office: PO Box 41, North Harbour, Portsmouth, Hampshire PO6 3AU -------------- next part -------------- An HTML attachment was scrubbed... URL: From bipcuds at gmail.com Thu Mar 29 17:15:19 2018 From: bipcuds at gmail.com (Keith Ball) Date: Thu, 29 Mar 2018 12:15:19 -0400 Subject: [gpfsug-discuss] Control of where mmcrsnapshot runs? Message-ID: You're right, Brian, the key load will be on the filesystem manager in any case, and as you say, all nodes nodes must quiesce - it's not really an issue of where to run the command, like it would be for mmfsck, etc. GPFS version is 3.5.0.26. We'll investigate upgrade to later version that accommodates combined operations. 
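For reference, the batching that newer releases advertise is exposed straight through the snapshot commands: later levels document comma-separated snapshot lists (so several snapshots share a single quiesce) and mmdelsnapshot gains a -N option to choose the nodes that participate in the deletion. The lines below are only a sketch with placeholder names; the exact syntax should be verified against the man pages of whatever level the upgrade lands on:

# create two snapshots in one batched operation (one quiesce instead of two)
mmcrsnapshot gpfs1 snap_20180330a,snap_20180330b
# delete several snapshots in one pass; -N (where supported) restricts the participating nodes
mmdelsnapshot gpfs1 snap_20180301,snap_20180302 -N nsdnodes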
I will also look into the cgroups approach - is this a documented thing, or just something that people have tinkered with/hand rolled? Thanks, Keith On Wed, Mar 28, 2018 at 7:00 AM, > > What version of GPFS are you running Keith? > > All nodes mounting the file system must briefly quiesce I/O operations > during the snapshot create operations, hence the ?Quiescing all file system > operations.? message in the output. So don?t really see a way to specify a > specific set of nodes for these operations. They have made updates in > newer releases of GPFS to combine operations (e.g. create and delete > snapshots at the same time) which IBM says ?system performance is increased > by batching operations and reducing overhead.?. > > Trying to isolate GPFS resources (e.g. cgroups) on the clients (e.g. CPU > and memory resources dedicated to GPFS/SSH/kernel/networking/etc) can > help them respond more quickly to quiesce I/O requests. > > HTH, > -Bryan > > From: gpfsug-discuss-bounces at spectrumscale.org [mailto:gpfsug-discuss- > bounces at spectrumscale.org] On Behalf Of Keith Ball > Sent: Tuesday, March 27, 2018 5:26 PM > To: gpfsug-discuss at spectrumscale.org > Subject: [gpfsug-discuss] Control of where mmcrsnapshot runs? > > Note: External Email > ________________________________ > Hi All, > Two questions on snapshots: > 1.) I note that neither "mmcrsnapshot" nor "mmdelsnapshot" appear to have > an "-N" option as "PIT" commands typically do. Is there any way to control > where threads for snapshot creation/deletion run? (I assume the filesystem > manager will always be involved regardless). > > 2.) When mmdelsnapshot hangs or times out, the error messages tend to > appear on client nodes, and not necessarily the node where mmdelsnapshot is > run from, not the FS manager. Besides telling all users "don't use any I/O" > when runnign these commands, are there ways that folks have found to avoid > hangs and timeouts of mmdelsnapshot? > FWIW our filesystem manager is probably overextended (replication factor 2 > on data+MD, 30 daily snapshots kept, a number of client clusters served, > plus the FS manager is also an NSD server). > > Many Thanks, > Keith > RedLine Performance Solutions LLC > > ________________________________ > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From bbanister at jumptrading.com Thu Mar 29 18:33:30 2018 From: bbanister at jumptrading.com (Bryan Banister) Date: Thu, 29 Mar 2018 17:33:30 +0000 Subject: [gpfsug-discuss] Control of where mmcrsnapshot runs? In-Reply-To: References: Message-ID: The cgroups are something we moved onto, which has helped a lot with GPFS Clients responding to necessary GPFS commands demanding a low latency response (e.g. mmcrsnapshots, byte range locks, quota reporting, etc). Good luck! -Bryan From: gpfsug-discuss-bounces at spectrumscale.org [mailto:gpfsug-discuss-bounces at spectrumscale.org] On Behalf Of Keith Ball Sent: Thursday, March 29, 2018 11:15 AM To: gpfsug-discuss at spectrumscale.org Subject: Re: [gpfsug-discuss] Control of where mmcrsnapshot runs? Note: External Email ________________________________ You're right, Brian, the key load will be on the filesystem manager in any case, and as you say, all nodes nodes must quiesce - it's not really an issue of where to run the command, like it would be for mmfsck, etc. GPFS version is 3.5.0.26. We'll investigate upgrade to later version that accommodates combined operations. 
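To make the cgroup isolation mentioned above a bit more concrete, one hand-rolled variant looks roughly like the sketch below. This is purely illustrative (cgroup v1 cpuset, /sys/fs/cgroup paths as on RHEL 7; RHEL 6 typically mounts controllers under /cgroup via cgconfig) and is not a description of any particular production setup:

# reserve a couple of cores for the GPFS daemon and sshd so quiesce requests are serviced promptly
mkdir -p /sys/fs/cgroup/cpuset/gpfs
echo 0-1 > /sys/fs/cgroup/cpuset/gpfs/cpuset.cpus    # pick cores appropriate for your nodes
echo 0   > /sys/fs/cgroup/cpuset/gpfs/cpuset.mems
for pid in $(pgrep -x mmfsd) $(pgrep -x sshd); do
    echo $pid > /sys/fs/cgroup/cpuset/gpfs/cgroup.procs
done
# user/batch workloads are then confined to the remaining cores, e.g. via the scheduler's cgroup support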
I will also look into the cgroups approach - is this a documented thing, or just something that people have tinkered with/hand rolled? Thanks, Keith On Wed, Mar 28, 2018 at 7:00 AM, What version of GPFS are you running Keith? All nodes mounting the file system must briefly quiesce I/O operations during the snapshot create operations, hence the ?Quiescing all file system operations.? message in the output. So don?t really see a way to specify a specific set of nodes for these operations. They have made updates in newer releases of GPFS to combine operations (e.g. create and delete snapshots at the same time) which IBM says ?system performance is increased by batching operations and reducing overhead.?. Trying to isolate GPFS resources (e.g. cgroups) on the clients (e.g. CPU and memory resources dedicated to GPFS/SSH/kernel/networking/etc) can help them respond more quickly to quiesce I/O requests. HTH, -Bryan From: gpfsug-discuss-bounces at spectrumscale.org [mailto:gpfsug-discuss-bounces at spectrumscale.org] On Behalf Of Keith Ball Sent: Tuesday, March 27, 2018 5:26 PM To: gpfsug-discuss at spectrumscale.org Subject: [gpfsug-discuss] Control of where mmcrsnapshot runs? Note: External Email ________________________________ Hi All, Two questions on snapshots: 1.) I note that neither "mmcrsnapshot" nor "mmdelsnapshot" appear to have an "-N" option as "PIT" commands typically do. Is there any way to control where threads for snapshot creation/deletion run? (I assume the filesystem manager will always be involved regardless). 2.) When mmdelsnapshot hangs or times out, the error messages tend to appear on client nodes, and not necessarily the node where mmdelsnapshot is run from, not the FS manager. Besides telling all users "don't use any I/O" when runnign these commands, are there ways that folks have found to avoid hangs and timeouts of mmdelsnapshot? FWIW our filesystem manager is probably overextended (replication factor 2 on data+MD, 30 daily snapshots kept, a number of client clusters served, plus the FS manager is also an NSD server). Many Thanks, Keith RedLine Performance Solutions LLC ________________________________ ________________________________ Note: This email is for the confidential use of the named addressee(s) only and may contain proprietary, confidential or privileged information. If you are not the intended recipient, you are hereby notified that any review, dissemination or copying of this email is strictly prohibited, and to please notify the sender immediately and destroy this email and any attachments. Email transmission cannot be guaranteed to be secure or error-free. The Company, therefore, does not make any guarantees as to the completeness or accuracy of this email or any attachments. This email is for informational purposes only and does not constitute a recommendation, offer, request or solicitation of any kind to buy, sell, subscribe, redeem or perform any type of transaction of a financial product. -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From scale at us.ibm.com Fri Mar 30 08:35:33 2018 From: scale at us.ibm.com (IBM Spectrum Scale) Date: Fri, 30 Mar 2018 13:05:33 +0530 Subject: [gpfsug-discuss] AFM-DR Questions In-Reply-To: References: Message-ID: + Venkat to provide answers on AFM queries Regards, The Spectrum Scale (GPFS) team ------------------------------------------------------------------------------------------------------------------ If you feel that your question can benefit other users of Spectrum Scale (GPFS), then please post it to the public IBM developerWroks Forum at https://www.ibm.com/developerworks/community/forums/html/forum?id=11111111-0000-0000-0000-000000000479 . If your query concerns a potential software error in Spectrum Scale (GPFS) and you have an IBM software maintenance contract please contact 1-800-237-5511 in the United States or your local IBM Service Center in other countries. The forum is informally monitored as time permits and should not be used for priority messages to the Spectrum Scale (GPFS) team. From: "Delmar Demarchi" To: gpfsug-discuss at spectrumscale.org Date: 03/29/2018 07:12 PM Subject: [gpfsug-discuss] AFM-DR Questions Sent by: gpfsug-discuss-bounces at spectrumscale.org Hello experts. We have a Scale project with AFM-DR to be implemented and after read the KC documentation, we have some questions about. - Do you know any reason why we changed the Recovery point objective (RPO) snapshots by 15 to 720 minutes in the version 5.0.0 of IBM Spectrum Scale AFM-DR? - Can we use additional Independent Peer-snapshots to reduce the RPO interval (720 minutes) of IBM Spectrum Scale AFM-DR? - In addition to the above question, can we use these snapshots to update the new primary site after a failover occur for the most up to date snapshot? - According to the documentation, we are not able to replicate Dependent filesets, but if these dependents filesets are part of an existing Independent fileset. Do you see any issues/concerns with this? Thank you in advance. Delmar Demarchi .'. (delmard at br.ibm.com)_______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org https://urldefense.proofpoint.com/v2/url?u=http-3A__gpfsug.org_mailman_listinfo_gpfsug-2Ddiscuss&d=DwICAg&c=jf_iaSHvJObTbx-siA1ZOg&r=IbxtjdkPAM2Sbon4Lbbi4w&m=nBTENLroUhlIPgOEVV1rqTmcYxRh7ErhZ7jLWdpprlY&s=V0Xb_-yxttxff7X31CfkaegWKSGc-1ehsXrDpdO5dTI&e= -------------- next part -------------- An HTML attachment was scrubbed... URL: From Robert.Oesterlin at nuance.com Fri Mar 30 14:54:01 2018 From: Robert.Oesterlin at nuance.com (Oesterlin, Robert) Date: Fri, 30 Mar 2018 13:54:01 +0000 Subject: [gpfsug-discuss] Tentative Agenda - SSUG-US Spring Meeting - May 16/17, Cambridge MA Message-ID: Here is the Tentative Agenda for the upcoming SSUG-US meeting. It?s close to final. I do have one (possibly two) spots for customer talks still open. This is a fantastic agenda, and a big thanks to Ulf Troppens at IBM for pulling together all the IBM speakers. 
Register here: https://www.eventbrite.com/e/spectrum-scale-gpfs-user-group-us-spring-2018-meeting-tickets-43662759489?aff=mailinglist Wednesday, May 16th 8:30 9:00 Registration and Networking 9:00 9:20 Welcome 9:20 9:45 Keynote: Cognitive Computing and Spectrum Scale 9:45 10:10 Spectrum Scale Big Data & Analytics Initiative 10:10 10:30 Customer Talk 10:30 10:45 Break 10:45 11:10 Spectrum Scale Cloud Initiative 11:10 11:35 Composable Infrastructure for Technical Computing 11:35 11:55 Customer Talk 11:55 12:00 Agenda 12:00 13:00 Lunch and Networking 13:00 13:30 What is new in Spectrum Scale 13:30 13:45 What is new in ESS? 13:45 14:15 File System Audit Log 14:15 14:45 Coffee and Networking 14:45 15:15 Lifting the 32 subblock limit 15:15 15:35 Customer Talk 15:35 16:05 Spectrum Scale CCR Internals 16:05 16:20 Break 16:20 16:40 Customer Talk 16:40 17:25 Field Update 17:25 18:15 Meet the Devs - Ask us Anything Evening Networking Event - TBD Thursday, May 17th 8:30 9:00 Kaffee und Networking 9:00 10:00 1) Life Science Track 2) System Health, Performance Monitoring & Call Home 3) Policy Engine Best Practices 10:00 11:00 1) Life Science Track 2) Big Data & Analytics 3) Multi-cloud with Transparent Cloud Tiering 11:00 12:00 1) Life Science Track 2) Cloud Deployments 3) Installation Best Practices 12:00 13:00 Lunch and Networking 13:00 13:20 Customer Talk 13:20 14:10 Network Best Practices 14:10 14:30 Customer Talk 14:30 15:00 Kaffee und Networking 15:00 16:00 Enhancements for CORAL Bob Oesterlin Sr Principal Storage Engineer, Nuance -------------- next part -------------- An HTML attachment was scrubbed... URL: From valleru at cbio.mskcc.org Fri Mar 30 17:15:13 2018 From: valleru at cbio.mskcc.org (valleru at cbio.mskcc.org) Date: Fri, 30 Mar 2018 12:15:13 -0400 Subject: [gpfsug-discuss] sublocks per block in GPFS 5.0 Message-ID: <68905b2c-8b1a-4a3d-8ded-c5aa56b765aa@Spark> Hello Everyone, I am a little bit confused with the number of sub-blocks per block-size of 16M in GPFS 5.0. In the below documentation, it mentions that the number of sub-blocks per block is 16K, but "only for Spectrum Scale RAID" https://developer.ibm.com/storage/2018/01/11/spectrum-scale-variant-sub-blocks/ However, when i created the filesystem ?without? spectrum scale RAID. I still see that the number of sub-blocks per block is 1024. mmlsfs --subblocks-per-full-block flag ? ? ? ? ? ? ? ?value ? ? ? ? ? ? ? ? ? ?description ------------------- ------------------------ ----------------------------------- ?--subblocks-per-full-block 1024 ? ? ? ? ? ? Number of subblocks per full block So May i know if the sub-blocks per block-size really 16K? or am i missing something? Regards, Lohit -------------- next part -------------- An HTML attachment was scrubbed... URL: From makaplan at us.ibm.com Fri Mar 30 17:45:41 2018 From: makaplan at us.ibm.com (Marc A Kaplan) Date: Fri, 30 Mar 2018 11:45:41 -0500 Subject: [gpfsug-discuss] sublocks per block in GPFS 5.0 In-Reply-To: <68905b2c-8b1a-4a3d-8ded-c5aa56b765aa@Spark> References: <68905b2c-8b1a-4a3d-8ded-c5aa56b765aa@Spark> Message-ID: Apparently, a small mistake in that developer works post. I always advise testing of new features on a scratchable system... Here's what I see on my test system: #mmcrfs mak -F /vds/mn.nsd -A no -T /mak -B 16M -f 1K -i 1K Value '1024' for option '-f' is out of range. Valid values are 4096 through 524288. 
# mmcrfs mak -F /vds/mn.nsd -A no -T /mak -B 16M -f 4K -i 1K (runs okay) # mmlsfs mak flag value description ------------------- ------------------------ ----------------------------------- -f 4096 Minimum fragment (subblock) size in bytes -i 1024 Inode size in bytes -I 32768 Indirect block size in bytes ... -B 16777216 Block size ... -V 18.00 (5.0.0.0) File system version ... --subblocks-per-full-block 4096 Number of subblocks per full block ... From: valleru at cbio.mskcc.org To: gpfsug main discussion list Date: 03/30/2018 12:21 PM Subject: [gpfsug-discuss] sublocks per block in GPFS 5.0 Sent by: gpfsug-discuss-bounces at spectrumscale.org Hello Everyone, I am a little bit confused with the number of sub-blocks per block-size of 16M in GPFS 5.0. In the below documentation, it mentions that the number of sub-blocks per block is 16K, but "only for Spectrum Scale RAID" https://developer.ibm.com/storage/2018/01/11/spectrum-scale-variant-sub-blocks/ However, when i created the filesystem ?without? spectrum scale RAID. I still see that the number of sub-blocks per block is 1024. mmlsfs --subblocks-per-full-block flag value description ------------------- ------------------------ ----------------------------------- --subblocks-per-full-block 1024 Number of subblocks per full block So May i know if the sub-blocks per block-size really 16K? or am i missing something? Regards, Lohit_______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org https://urldefense.proofpoint.com/v2/url?u=http-3A__gpfsug.org_mailman_listinfo_gpfsug-2Ddiscuss&d=DwICAg&c=jf_iaSHvJObTbx-siA1ZOg&r=cvpnBBH0j41aQy0RPiG2xRL_M8mTc1izuQD3_PmtjZ8&m=HNrrMTazEN37eiIyxj9LWFMt2v1vCWeYuAGeHXXgIN8&s=Q6RUpDte4cePcCa_VU9ClyOvHMwhOWg8H1sRVLv9ocU&e= -------------- next part -------------- An HTML attachment was scrubbed... URL: From valleru at cbio.mskcc.org Fri Mar 30 18:47:27 2018 From: valleru at cbio.mskcc.org (valleru at cbio.mskcc.org) Date: Fri, 30 Mar 2018 13:47:27 -0400 Subject: [gpfsug-discuss] sublocks per block in GPFS 5.0 In-Reply-To: References: <68905b2c-8b1a-4a3d-8ded-c5aa56b765aa@Spark> Message-ID: <18518530-0d1f-4937-b2ec-9c16c6c80995@Spark> Thanks Mark, I did not know, we could explicitly mention sub-block size when creating File system. It is no-where mentioned in the ?man mmcrfs?. Is this a new GPFS 5.0 feature? Also, i see from the ?man mmcrfs? that the default sub-block size for 8M and 16M is 16K. +???????????????????????????????+???????????????????????????????+ | Block size ? ? ? ? ? ? ? ? ? ?| Subblock size ? ? ? ? ? ? ? ? | +???????????????????????????????+???????????????????????????????+ | 64 KiB ? ? ? ? ? ? ? ? ? ? ? ?| 2 KiB ? ? ? ? ? ? ? ? ? ? ? ? | +???????????????????????????????+???????????????????????????????+ | 128 KiB ? ? ? ? ? ? ? ? ? ? ? | 4 KiB ? ? ? ? ? ? ? ? ? ? ? ? | +???????????????????????????????+???????????????????????????????+ | 256 KiB, 512 KiB, 1 MiB, 2 ? ?| 8 KiB ? ? ? ? ? ? ? ? ? ? ? ? | | MiB, 4 MiB ? ? ? ? ? ? ? ? ? ?| ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? | +???????????????????????????????+???????????????????????????????+ | 8 MiB, 16 MiB ? ? ? ? ? ? ? ? | 16 KiB ? ? ? ? ? ? ? ? ? ? ? ?| +???????????????????????????????+???????????????????????????????+ And you could create more than 1024 sub-blocks per block? and 4k is size of sub-block for 16M? That is great, since 4K files will go into data pool, and anything less than 4K will go to system (metadata) pool? 
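One way to sanity-check the numbers in this thread: the subblocks-per-full-block value is simply the block size divided by the subblock (fragment) size. A 16 MiB block with the default 16 KiB subblock therefore gives 1024 subblocks (exactly what mmlsfs reported above), while Marc's -f 4K file system gives 4096. A quick back-of-the-envelope check, illustrative only:

echo $(( 16*1024*1024 / (16*1024) ))   # = 1024 subblocks: 16 MiB block, default 16 KiB fragment
echo $(( 16*1024*1024 / (4*1024) ))    # = 4096 subblocks: 16 MiB block created with -f 4K
mmlsfs mak -B -f                       # compare with the block and fragment sizes the file system reports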
Do you think - there would be any performance degradation for reducing the sub-blocks to 4K - 8K, from the default 16K for 16M filesystem? If we are not loosing any blocks by choosing a bigger block-size (16M) for filesystem, why would we want to choose a smaller block-size for filesystem (4M)? What advantage would smaller block-size (4M) give, compared to 16M with performance since 16M filesystem could store small files and read small files too at the respective sizes? And Near Line Rotating disks would be happy with bigger block-size than smaller block-size i guess? Regards, Lohit On Mar 30, 2018, 12:45 PM -0400, Marc A Kaplan , wrote: > > subblock -------------- next part -------------- An HTML attachment was scrubbed... URL: From makaplan at us.ibm.com Fri Mar 30 19:47:47 2018 From: makaplan at us.ibm.com (Marc A Kaplan) Date: Fri, 30 Mar 2018 13:47:47 -0500 Subject: [gpfsug-discuss] sublocks per block in GPFS 5.0 In-Reply-To: <18518530-0d1f-4937-b2ec-9c16c6c80995@Spark> References: <68905b2c-8b1a-4a3d-8ded-c5aa56b765aa@Spark> <18518530-0d1f-4937-b2ec-9c16c6c80995@Spark> Message-ID: Look at my example, again, closely. I chose the blocksize as 16M and subblock size as 4K and the inodesize as 1K.... Developer works is a good resource, but articles you read there may be incomplete or contain mistakes. The official IBM Spectrum Scale cmd and admin guide documents, are "trustworthy" but may not be perfect in all respects. "Trust but Verify" and YMMV. ;-) As for why/how to choose "good sizes", that depends what objectives you want to achieve, and "optimal" may depend on what hardware you are running. Run your own trials and/or ask performance experts. There are usually "tradeoffs" and OTOH when you get down to it, some choices may not be all-that-important in actual deployment and usage. That's why we have defaults values - try those first and leave the details and tweaking aside until you have good reason ;-) -------------- next part -------------- An HTML attachment was scrubbed... URL: From chair at spectrumscale.org Thu Mar 1 11:26:12 2018 From: chair at spectrumscale.org (Simon Thompson) Date: Thu, 01 Mar 2018 11:26:12 +0000 Subject: [gpfsug-discuss] UK April meeting Message-ID: <26357FF0-F04B-4A37-A8A5-062CB0160D19@spectrumscale.org> Hi All, We?ve just posted the draft agenda for the UK meeting in April at: http://www.spectrumscaleug.org/event/uk-2018-user-group-event/ So far, we?ve issued over 50% of the available places, so if you are planning to attend, please do register now! Please register at: https://www.eventbrite.com/e/spectrum-scale-gpfs-user-group-2018-registration-41489952565?aff=MailingList We?ve also confirmed our evening networking/social event between days 1 and 2 with thanks to our sponsors for supporting this. Please remember that we are currently limiting to two registrations per organisation. We?d like to thank our sponsors from DDN, E8, Ellexus, IBM, Lenovo, NEC and OCF for supporting the event. Simon -------------- next part -------------- An HTML attachment was scrubbed... URL: From john.hearns at asml.com Thu Mar 1 08:41:59 2018 From: john.hearns at asml.com (John Hearns) Date: Thu, 1 Mar 2018 08:41:59 +0000 Subject: [gpfsug-discuss] Problems with remote mount via routed IB In-Reply-To: References: <471B111F-5DAA-4912-829C-9AA75DCB76FA@kit.edu> Message-ID: In reply to Stuart, our setup is entirely Infiniband. We boot and install over IB, and rely heavily on IP over Infiniband. 
As for users being 'confused' due to multiple IPs, I would appreciate some more depth on that one. Sure, all batch systems are sensitive to hostnames (as I know to my cost!) but once you get that straightened out why should users care? I am not being aggressive, just keen to find out more. -----Original Message----- From: gpfsug-discuss-bounces at spectrumscale.org [mailto:gpfsug-discuss-bounces at spectrumscale.org] On Behalf Of Stuart Barkley Sent: Wednesday, February 28, 2018 6:50 PM To: gpfsug main discussion list Subject: Re: [gpfsug-discuss] Problems with remote mount via routed IB The problem with CM is that it seems to require configuring IP over Infiniband. I'm rather strongly opposed to IP over IB. We did run IPoIB years ago, but pulled it out of our environment as adding unneeded complexity. It requires provisioning IP addresses across the Infiniband infrastructure and possibly adding routers to other portions of the IP infrastructure. It was also confusing some users due to multiple IPs on the compute infrastructure. We have recently been in discussions with a vendor about their support for GPFS over IB and they kept directing us to using CM (which still didn't work). CM wasn't necessary once we found out about the actual problem (we needed the undocumented verbsRdmaUseGidIndexZero configuration option among other things due to their use of SR-IOV based virtual IB interfaces). We don't use routed Infiniband and it might be that CM and IPoIB is required for IB routing, but I doubt it. It sounds like the OP is keeping IB and IP infrastructure separate. Stuart Barkley On Mon, 26 Feb 2018 at 14:16 -0000, Aaron Knister wrote: > Date: Mon, 26 Feb 2018 14:16:34 > From: Aaron Knister > Reply-To: gpfsug main discussion list > > To: gpfsug-discuss at spectrumscale.org > Subject: Re: [gpfsug-discuss] Problems with remote mount via routed IB > > Hi Jan Erik, > > It was my understanding that the IB hardware router required RDMA CM to work. > By default GPFS doesn't use the RDMA Connection Manager but it can be > enabled (e.g. verbsRdmaCm=enable). I think this requires a restart on > clients/servers (in both clusters) to take effect. Maybe someone else > on the list can comment in more detail-- I've been told folks have > successfully deployed IB routers with GPFS. > > -Aaron > > On 2/26/18 11:38 AM, Sundermann, Jan Erik (SCC) wrote: > > > > Dear all > > > > we are currently trying to remote mount a file system in a routed > > Infiniband test setup and face problems with dropped RDMA > > connections. The setup is the > > following: > > > > - Spectrum Scale Cluster 1 is setup on four servers which are > > connected to the same infiniband network. Additionally they are > > connected to a fast ethernet providing ip communication in the network 192.168.11.0/24. > > > > - Spectrum Scale Cluster 2 is setup on four additional servers which > > are connected to a second infiniband network. These servers have IPs > > on their IB interfaces in the network 192.168.12.0/24. > > > > - IP is routed between 192.168.11.0/24 and 192.168.12.0/24 on a > > dedicated machine. > > > > - We have a dedicated IB hardware router connected to both IB subnets. > > > > > > We tested that the routing, both IP and IB, is working between the > > two clusters without problems and that RDMA is working fine both for > > internal communication inside cluster 1 and cluster 2 > > > > When trying to remote mount a file system from cluster 1 in cluster > > 2, RDMA communication is not working as expected. 
Instead we see > > error messages on the remote host (cluster 2) > > > > > > 2018-02-23_13:48:47.037+0100: [I] VERBS RDMA connecting to > > 192.168.11.4 (iccn004-gpfs in gpfsstorage.localdomain) on mlx4_0 > > port 1 fabnum 0 index 2 > > 2018-02-23_13:48:49.890+0100: [I] VERBS RDMA connected to > > 192.168.11.4 (iccn004-gpfs in gpfsstorage.localdomain) on mlx4_0 > > port 1 fabnum 0 sl 0 index 2 > > 2018-02-23_13:48:53.138+0100: [E] VERBS RDMA closed connection to > > 192.168.11.1 (iccn001-gpfs in gpfsstorage.localdomain) on mlx4_0 > > port 1 fabnum 0 error 733 index 3 > > 2018-02-23_13:48:53.854+0100: [I] VERBS RDMA connecting to > > 192.168.11.1 (iccn001-gpfs in gpfsstorage.localdomain) on mlx4_0 > > port 1 fabnum 0 index 3 > > 2018-02-23_13:48:54.954+0100: [E] VERBS RDMA closed connection to > > 192.168.11.3 (iccn003-gpfs in gpfsstorage.localdomain) on mlx4_0 > > port 1 fabnum 0 error 733 index 1 > > 2018-02-23_13:48:55.601+0100: [I] VERBS RDMA connected to > > 192.168.11.1 (iccn001-gpfs in gpfsstorage.localdomain) on mlx4_0 > > port 1 fabnum 0 sl 0 index 3 > > 2018-02-23_13:48:57.775+0100: [I] VERBS RDMA connecting to > > 192.168.11.3 (iccn003-gpfs in gpfsstorage.localdomain) on mlx4_0 > > port 1 fabnum 0 index 1 > > 2018-02-23_13:48:59.557+0100: [I] VERBS RDMA connected to > > 192.168.11.3 (iccn003-gpfs in gpfsstorage.localdomain) on mlx4_0 > > port 1 fabnum 0 sl 0 index 1 > > 2018-02-23_13:48:59.876+0100: [E] VERBS RDMA closed connection to > > 192.168.11.2 (iccn002-gpfs in gpfsstorage.localdomain) on mlx4_0 > > port 1 fabnum 0 error 733 index 0 > > 2018-02-23_13:49:02.020+0100: [I] VERBS RDMA connecting to > > 192.168.11.2 (iccn002-gpfs in gpfsstorage.localdomain) on mlx4_0 > > port 1 fabnum 0 index 0 > > 2018-02-23_13:49:03.477+0100: [I] VERBS RDMA connected to > > 192.168.11.2 (iccn002-gpfs in gpfsstorage.localdomain) on mlx4_0 > > port 1 fabnum 0 sl 0 index 0 > > 2018-02-23_13:49:05.119+0100: [E] VERBS RDMA closed connection to > > 192.168.11.4 (iccn004-gpfs in gpfsstorage.localdomain) on mlx4_0 > > port 1 fabnum 0 error 733 index 2 > > 2018-02-23_13:49:06.191+0100: [I] VERBS RDMA connecting to > > 192.168.11.4 (iccn004-gpfs in gpfsstorage.localdomain) on mlx4_0 > > port 1 fabnum 0 index 2 > > 2018-02-23_13:49:06.548+0100: [I] VERBS RDMA connected to > > 192.168.11.4 (iccn004-gpfs in gpfsstorage.localdomain) on mlx4_0 > > port 1 fabnum 0 sl 0 index 2 > > 2018-02-23_13:49:11.578+0100: [E] VERBS RDMA closed connection to > > 192.168.11.1 (iccn001-gpfs in gpfsstorage.localdomain) on mlx4_0 > > port 1 fabnum 0 error 733 index 3 > > 2018-02-23_13:49:11.937+0100: [I] VERBS RDMA connecting to > > 192.168.11.1 (iccn001-gpfs in gpfsstorage.localdomain) on mlx4_0 > > port 1 fabnum 0 index 3 > > 2018-02-23_13:49:11.939+0100: [I] VERBS RDMA connected to > > 192.168.11.1 (iccn001-gpfs in gpfsstorage.localdomain) on mlx4_0 > > port 1 fabnum 0 sl 0 index 3 > > > > > > and in the cluster with the file system (cluster 1) > > > > 2018-02-23_13:47:36.112+0100: [E] VERBS RDMA rdma read error > > IBV_WC_RETRY_EXC_ERR to 192.168.12.5 (iccn005-ib in > > gpfsremoteclients.localdomain) on mlx4_0 port 1 fabnum 0 vendor_err > > 129 > > 2018-02-23_13:47:36.112+0100: [E] VERBS RDMA closed connection to > > 192.168.12.5 (iccn005-ib in gpfsremoteclients.localdomain) on mlx4_0 > > port 1 fabnum 0 due to RDMA read error IBV_WC_RETRY_EXC_ERR index 3 > > 2018-02-23_13:47:47.161+0100: [I] VERBS RDMA accepted and connected > > to > > 192.168.12.5 (iccn005-ib in gpfsremoteclients.localdomain) on mlx4_0 > > port 1 
fabnum 0 sl 0 index 3 > > 2018-02-23_13:48:04.317+0100: [E] VERBS RDMA rdma read error > > IBV_WC_RETRY_EXC_ERR to 192.168.12.5 (iccn005-ib in > > gpfsremoteclients.localdomain) on mlx4_0 port 1 fabnum 0 vendor_err > > 129 > > 2018-02-23_13:48:04.317+0100: [E] VERBS RDMA closed connection to > > 192.168.12.5 (iccn005-ib in gpfsremoteclients.localdomain) on mlx4_0 > > port 1 fabnum 0 due to RDMA read error IBV_WC_RETRY_EXC_ERR index 3 > > 2018-02-23_13:48:11.560+0100: [I] VERBS RDMA accepted and connected > > to > > 192.168.12.5 (iccn005-ib in gpfsremoteclients.localdomain) on mlx4_0 > > port 1 fabnum 0 sl 0 index 3 > > 2018-02-23_13:48:32.523+0100: [E] VERBS RDMA rdma read error > > IBV_WC_RETRY_EXC_ERR to 192.168.12.5 (iccn005-ib in > > gpfsremoteclients.localdomain) on mlx4_0 port 1 fabnum 0 vendor_err > > 129 > > 2018-02-23_13:48:32.523+0100: [E] VERBS RDMA closed connection to > > 192.168.12.5 (iccn005-ib in gpfsremoteclients.localdomain) on mlx4_0 > > port 1 fabnum 0 due to RDMA read error IBV_WC_RETRY_EXC_ERR index 3 > > 2018-02-23_13:48:35.398+0100: [I] VERBS RDMA accepted and connected > > to > > 192.168.12.5 (iccn005-ib in gpfsremoteclients.localdomain) on mlx4_0 > > port 1 fabnum 0 sl 0 index 3 > > 2018-02-23_13:48:53.135+0100: [E] VERBS RDMA rdma read error > > IBV_WC_RETRY_EXC_ERR to 192.168.12.5 (iccn005-ib in > > gpfsremoteclients.localdomain) on mlx4_0 port 1 fabnum 0 vendor_err > > 129 > > 2018-02-23_13:48:53.135+0100: [E] VERBS RDMA closed connection to > > 192.168.12.5 (iccn005-ib in gpfsremoteclients.localdomain) on mlx4_0 > > port 1 fabnum 0 due to RDMA read error IBV_WC_RETRY_EXC_ERR index 3 > > 2018-02-23_13:48:55.600+0100: [I] VERBS RDMA accepted and connected > > to > > 192.168.12.5 (iccn005-ib in gpfsremoteclients.localdomain) on mlx4_0 > > port 1 fabnum 0 sl 0 index 3 > > 2018-02-23_13:49:11.577+0100: [E] VERBS RDMA rdma read error > > IBV_WC_RETRY_EXC_ERR to 192.168.12.5 (iccn005-ib in > > gpfsremoteclients.localdomain) on mlx4_0 port 1 fabnum 0 vendor_err > > 129 > > 2018-02-23_13:49:11.577+0100: [E] VERBS RDMA closed connection to > > 192.168.12.5 (iccn005-ib in gpfsremoteclients.localdomain) on mlx4_0 > > port 1 fabnum 0 due to RDMA read error IBV_WC_RETRY_EXC_ERR index 3 > > 2018-02-23_13:49:11.939+0100: [I] VERBS RDMA accepted and connected > > to > > 192.168.12.5 (iccn005-ib in gpfsremoteclients.localdomain) on mlx4_0 > > port 1 fabnum 0 sl 0 index 3 > > > > > > > > Any advice on how to configure the setup in a way that would allow > > the remote mount via routed IB would be very appreciated. 
> > > > > > Thank you and best regards > > Jan Erik > > > > > > > > > > _______________________________________________ > > gpfsug-discuss mailing list > > gpfsug-discuss at spectrumscale.org > > https://emea01.safelinks.protection.outlook.com/?url=http%3A%2F%2Fgp > > fsug.org%2Fmailman%2Flistinfo%2Fgpfsug-discuss&data=01%7C01%7Cjohn.h > > earns%40asml.com%7Ce40045550fc3467dd62808d57ed4d0d9%7Caf73baa8f5944e > > b2a39d93e96cad61fc%7C1&sdata=v%2F35G6ZnlHFBm%2BfVddvcuraFd9FRChyOSRE > > YpqcNNP8%3D&reserved=0 > > > > -- > Aaron Knister > NASA Center for Climate Simulation (Code 606.2) Goddard Space Flight > Center > (301) 286-2776 > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > https://emea01.safelinks.protection.outlook.com/?url=http%3A%2F%2Fgpfs > ug.org%2Fmailman%2Flistinfo%2Fgpfsug-discuss&data=01%7C01%7Cjohn.hearn > s%40asml.com%7Ce40045550fc3467dd62808d57ed4d0d9%7Caf73baa8f5944eb2a39d > 93e96cad61fc%7C1&sdata=v%2F35G6ZnlHFBm%2BfVddvcuraFd9FRChyOSREYpqcNNP8 > %3D&reserved=0 > -- I've never been lost; I was once bewildered for three days, but never lost! -- Daniel Boone _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org https://emea01.safelinks.protection.outlook.com/?url=http%3A%2F%2Fgpfsug.org%2Fmailman%2Flistinfo%2Fgpfsug-discuss&data=01%7C01%7Cjohn.hearns%40asml.com%7Ce40045550fc3467dd62808d57ed4d0d9%7Caf73baa8f5944eb2a39d93e96cad61fc%7C1&sdata=v%2F35G6ZnlHFBm%2BfVddvcuraFd9FRChyOSREYpqcNNP8%3D&reserved=0 -- The information contained in this communication and any attachments is confidential and may be privileged, and is for the sole use of the intended recipient(s). Any unauthorized review, use, disclosure or distribution is prohibited. Unless explicitly stated otherwise in the body of this communication or the attachment thereto (if any), the information is provided on an AS-IS basis without any express or implied warranties or liabilities. To the extent you are relying on this information, you are doing so at your own risk. If you are not the intended recipient, please notify the sender immediately by replying to this message and destroy all copies of this message and any attachments. Neither the sender nor the company/group of companies he or she represents shall be liable for the proper and complete transmission of the information contained in this communication, or for any delay in its receipt. From lavila at illinois.edu Thu Mar 1 15:02:24 2018 From: lavila at illinois.edu (Avila-Diaz, Leandro) Date: Thu, 1 Mar 2018 15:02:24 +0000 Subject: [gpfsug-discuss] Meltdown, Spectre, and impacts on GPFS In-Reply-To: References: <5D655862-7F60-47F6-8BD2-A5298F73F70F@vanderbilt.edu> Message-ID: Good morning, Does anyone know if IBM has an official statement and/or perhaps a FAQ document about the Spectre/Meltdown impact on GPFS? Thank you From: on behalf of IBM Spectrum Scale Reply-To: gpfsug main discussion list Date: Thursday, January 4, 2018 at 20:36 To: gpfsug main discussion list Subject: Re: [gpfsug-discuss] Meltdown, Spectre, and impacts on GPFS Kevin, The team is aware of Meltdown and Spectre. Due to the late availability of production-ready test patches (they became available today) we started today working on evaluating the impact of applying these patches. The focus would be both on any potential functional impacts (especially to the kernel modules shipped with GPFS) and on the performance degradation which affects user/kernel mode transitions. 
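On kernels that already carry the fixes, the mitigation state is exposed through sysfs, which makes it easy to confirm on each NSD server and client whether the patches are actually active while comparing GPFS performance numbers. Note that these paths and messages only exist on patched kernels and their exact form varies by distribution:

grep -H . /sys/devices/system/cpu/vulnerabilities/* 2>/dev/null   # meltdown/spectre status, if the kernel reports it
dmesg | grep -i 'page table'   # page table isolation (KPTI) messages; wording varies by kernel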
Performance characterization will be complex, as some system calls which may get invoked often by the mmfsd daemon will suddenly become significantly more expensive because of the kernel changes. Depending on the main areas affected, code changes might be possible to alleviate the impact, by reducing frequency of certain calls, etc. Any such changes will be deployed over time. At this point, we can't say what impact this will have on stability or Performance on systems running GPFS ? until IBM issues an official statement on this topic. We hope to have some basic answers soon. Regards, The Spectrum Scale (GPFS) team ------------------------------------------------------------------------------------------------------------------ If you feel that your question can benefit other users of Spectrum Scale (GPFS), then please post it to the public IBM developerWroks Forum at https://www.ibm.com/developerworks/community/forums/html/forum?id=11111111-0000-0000-0000-000000000479. If your query concerns a potential software error in Spectrum Scale (GPFS) and you have an IBM software maintenance contract please contact 1-800-237-5511 in the United States or your local IBM Service Center in other countries. The forum is informally monitored as time permits and should not be used for priority messages to the Spectrum Scale (GPFS) team. [Inactive hide details for "Buterbaugh, Kevin L" ---01/04/2018 01:11:59 PM---Happy New Year everyone, I?m sure that everyone is]"Buterbaugh, Kevin L" ---01/04/2018 01:11:59 PM---Happy New Year everyone, I?m sure that everyone is aware of Meltdown and Spectre by now ? we, like m From: "Buterbaugh, Kevin L" To: gpfsug main discussion list Date: 01/04/2018 01:11 PM Subject: [gpfsug-discuss] Meltdown, Spectre, and impacts on GPFS Sent by: gpfsug-discuss-bounces at spectrumscale.org ________________________________ Happy New Year everyone, I?m sure that everyone is aware of Meltdown and Spectre by now ? we, like many other institutions, will be patching for it at the earliest possible opportunity. Our understanding is that the most serious of the negative performance impacts of these patches will be for things like I/O (disk / network) ? given that, we are curious if IBM has any plans for a GPFS update that could help mitigate those impacts? Or is there simply nothing that can be done? If there is a GPFS update planned for this we?d be interested in knowing so that we could coordinate the kernel and GPFS upgrades on our cluster. Thanks? Kevin P.S. The ?Happy New Year? wasn?t intended as sarcasm ? I hope it is a good year for everyone despite how it?s starting out. :-O ? Kevin Buterbaugh - Senior System Administrator Vanderbilt University - Advanced Computing Center for Research and Education Kevin.Buterbaugh at vanderbilt.edu - (615)875-9633 _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org https://urldefense.proofpoint.com/v2/url?u=http-3A__gpfsug.org_mailman_listinfo_gpfsug-2Ddiscuss&d=DwICAg&c=jf_iaSHvJObTbx-siA1ZOg&r=IbxtjdkPAM2Sbon4Lbbi4w&m=m7Pdt9KL82CJT_AT-PwkmO3PbHg88-IQ7Jq-dwhDOdY&s=5i66Rx3vse5LcaN4-WlyCwi_TDTOQGQR2-X_XyjbBpw&e= -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... 
Name: image001.gif Type: image/gif Size: 106 bytes Desc: image001.gif URL: From bzhang at ca.ibm.com Thu Mar 1 22:47:57 2018 From: bzhang at ca.ibm.com (Bohai Zhang) Date: Thu, 1 Mar 2018 17:47:57 -0500 Subject: [gpfsug-discuss] Spectrum Scale Support Webinar - File Audit Logging Message-ID: You are receiving this message because you are an IBM Spectrum Scale Client and in GPFS User Group. IBM Spectrum Scale Support Webinar File Audit Logging About this Webinar IBM Spectrum Scale Webinars are hosted by IBM Spectrum Scale Support to share expertise and knowledge of the Spectrum Scale product, as well as product updates and best practices based on various use cases. This webinar will discuss fundamentals of the new File Audit Logging function including configuration and key best practices that will aid you in successful deployment and use of File Audit Logging within Spectrum Scale. Please note that our webinars are free of charge and will be held online via WebEx. Agenda: ? Overview of File Audit Logging ? Installation and deployment of File Audit Logging ? Using File Audit Logging ? Monitoring and troubleshooting File Audit Logging ? Q&A NA/EU Session Date: March 14, 2018 Time: 11 AM ? 12PM EDT (4PM GMT) Registration: https://ibm.biz/BdZsZz Audience: Spectrum Scale Administrators AP/JP Session Date: March 15, 2018 Time: 10AM ? 11AM Beijing Time (11AM Tokyo Time) Registration: https://ibm.biz/BdZsZf Audience: Spectrum Scale Administrators If you have any questions, please contact Robert Simon, Jun Hui Bu, Vlad Spoiala and Bohai Zhang. Regards, IBM Spectrum Scale Support Team Regards, IBM Spectrum Computing Bohai Zhang Critical Senior Technical Leader, IBM Systems Situation Tel: 1-905-316-2727 Resolver Mobile: 1-416-897-7488 Expert Badge Email: bzhang at ca.ibm.com 3600 STEELES AVE EAST, MARKHAM, ON, L3R 9Z7, Canada Live Chat at IBMStorageSuptMobile Apps Support Portal | Fix Central | Knowledge Center | Request for Enhancement | Product SMC IBM | dWA We meet our service commitment only when you are very satisfied and EXTREMELY LIKELY to recommend IBM. -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: 71100731.jpg Type: image/jpeg Size: 21904 bytes Desc: not available URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: 71151195.jpg Type: image/jpeg Size: 17787 bytes Desc: not available URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: 71943442.gif Type: image/gif Size: 2665 bytes Desc: not available URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: ecblank.gif Type: image/gif Size: 45 bytes Desc: not available URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: 71224521.gif Type: image/gif Size: 275 bytes Desc: not available URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: 71350284.gif Type: image/gif Size: 305 bytes Desc: not available URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: 71371859.gif Type: image/gif Size: 331 bytes Desc: not available URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: 71584384.gif Type: image/gif Size: 3621 bytes Desc: not available URL: -------------- next part -------------- A non-text attachment was scrubbed... 
Name: 71592777.gif Type: image/gif Size: 1243 bytes Desc: not available URL: From Greg.Lehmann at csiro.au Fri Mar 2 03:48:44 2018 From: Greg.Lehmann at csiro.au (Greg.Lehmann at csiro.au) Date: Fri, 2 Mar 2018 03:48:44 +0000 Subject: [gpfsug-discuss] wondering about outage free protocols upgrades Message-ID: <5912dbafbb044700ba723195a0e5e2f9@exch1-cdc.nexus.csiro.au> Hi All, It appears a rolling node by node upgrade of a protocols cluster is not possible. Ctdb is the sticking point as it won't run with 2 different versions at the same time. Are there any plans to address this and make it a real Enterprise product? Cheers, Greg -------------- next part -------------- An HTML attachment was scrubbed... URL: From scale at us.ibm.com Fri Mar 2 05:15:21 2018 From: scale at us.ibm.com (IBM Spectrum Scale) Date: Fri, 2 Mar 2018 13:15:21 +0800 Subject: [gpfsug-discuss] Meltdown, Spectre, and impacts on GPFS In-Reply-To: References: <5D655862-7F60-47F6-8BD2-A5298F73F70F@vanderbilt.edu> Message-ID: Hi, The verification/test work is still ongoing. Hopefully GPFS will publish statement soon. I think it would be available through several channels, such as FAQ. Regards, The Spectrum Scale (GPFS) team ------------------------------------------------------------------------------------------------------------------ If you feel that your question can benefit other users of Spectrum Scale (GPFS), then please post it to the public IBM developerWroks Forum at https://www.ibm.com/developerworks/community/forums/html/forum?id=11111111-0000-0000-0000-000000000479. If your query concerns a potential software error in Spectrum Scale (GPFS) and you have an IBM software maintenance contract please contact 1-800-237-5511 in the United States or your local IBM Service Center in other countries. The forum is informally monitored as time permits and should not be used for priority messages to the Spectrum Scale (GPFS) team. From: "Avila-Diaz, Leandro" To: gpfsug main discussion list Date: 03/01/2018 11:17 PM Subject: Re: [gpfsug-discuss] Meltdown, Spectre, and impacts on GPFS Sent by: gpfsug-discuss-bounces at spectrumscale.org Good morning, Does anyone know if IBM has an official statement and/or perhaps a FAQ document about the Spectre/Meltdown impact on GPFS? Thank you From: on behalf of IBM Spectrum Scale Reply-To: gpfsug main discussion list Date: Thursday, January 4, 2018 at 20:36 To: gpfsug main discussion list Subject: Re: [gpfsug-discuss] Meltdown, Spectre, and impacts on GPFS Kevin, The team is aware of Meltdown and Spectre. Due to the late availability of production-ready test patches (they became available today) we started today working on evaluating the impact of applying these patches. The focus would be both on any potential functional impacts (especially to the kernel modules shipped with GPFS) and on the performance degradation which affects user/kernel mode transitions. Performance characterization will be complex, as some system calls which may get invoked often by the mmfsd daemon will suddenly become significantly more expensive because of the kernel changes. Depending on the main areas affected, code changes might be possible to alleviate the impact, by reducing frequency of certain calls, etc. Any such changes will be deployed over time. At this point, we can't say what impact this will have on stability or Performance on systems running GPFS ? until IBM issues an official statement on this topic. We hope to have some basic answers soon. 
Regards, The Spectrum Scale (GPFS) team ------------------------------------------------------------------------------------------------------------------ If you feel that your question can benefit other users of Spectrum Scale (GPFS), then please post it to the public IBM developerWroks Forum at https://www.ibm.com/developerworks/community/forums/html/forum?id=11111111-0000-0000-0000-000000000479 . If your query concerns a potential software error in Spectrum Scale (GPFS) and you have an IBM software maintenance contract please contact 1-800-237-5511 in the United States or your local IBM Service Center in other countries. The forum is informally monitored as time permits and should not be used for priority messages to the Spectrum Scale (GPFS) team. Inactive hide details for "Buterbaugh, Kevin L" ---01/04/2018 01:11:59 PM---Happy New Year everyone, I?m sure that everyone is"Buterbaugh, Kevin L" ---01/04/2018 01:11:59 PM---Happy New Year everyone, I?m sure that everyone is aware of Meltdown and Spectre by now ? we, like m From: "Buterbaugh, Kevin L" To: gpfsug main discussion list Date: 01/04/2018 01:11 PM Subject: [gpfsug-discuss] Meltdown, Spectre, and impacts on GPFS Sent by: gpfsug-discuss-bounces at spectrumscale.org Happy New Year everyone, I?m sure that everyone is aware of Meltdown and Spectre by now ? we, like many other institutions, will be patching for it at the earliest possible opportunity. Our understanding is that the most serious of the negative performance impacts of these patches will be for things like I/O (disk / network) ? given that, we are curious if IBM has any plans for a GPFS update that could help mitigate those impacts? Or is there simply nothing that can be done? If there is a GPFS update planned for this we?d be interested in knowing so that we could coordinate the kernel and GPFS upgrades on our cluster. Thanks? Kevin P.S. The ?Happy New Year? wasn?t intended as sarcasm ? I hope it is a good year for everyone despite how it?s starting out. :-O ? Kevin Buterbaugh - Senior System Administrator Vanderbilt University - Advanced Computing Center for Research and Education Kevin.Buterbaugh at vanderbilt.edu - (615)875-9633 _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org https://urldefense.proofpoint.com/v2/url?u=http-3A__gpfsug.org_mailman_listinfo_gpfsug-2Ddiscuss&d=DwICAg&c=jf_iaSHvJObTbx-siA1ZOg&r=IbxtjdkPAM2Sbon4Lbbi4w&m=m7Pdt9KL82CJT_AT-PwkmO3PbHg88-IQ7Jq-dwhDOdY&s=5i66Rx3vse5LcaN4-WlyCwi_TDTOQGQR2-X_XyjbBpw&e= _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org https://urldefense.proofpoint.com/v2/url?u=http-3A__gpfsug.org_mailman_listinfo_gpfsug-2Ddiscuss&d=DwICAg&c=jf_iaSHvJObTbx-siA1ZOg&r=IbxtjdkPAM2Sbon4Lbbi4w&m=qFtjLJBRsEewfEfVZBW__Xk8CD9w04bJZpK0sJiCze0&s=LyDrwavwKGQHDl4DVW6-vpW2bjmJBtXrGGcFfDYyI4o&e= -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: graycol.gif Type: image/gif Size: 105 bytes Desc: not available URL: -------------- next part -------------- A non-text attachment was scrubbed... 
Name: 19119307.gif Type: image/gif Size: 106 bytes Desc: not available URL: From Kevin.Buterbaugh at Vanderbilt.Edu Fri Mar 2 16:33:46 2018 From: Kevin.Buterbaugh at Vanderbilt.Edu (Buterbaugh, Kevin L) Date: Fri, 2 Mar 2018 16:33:46 +0000 Subject: [gpfsug-discuss] Meltdown, Spectre, and impacts on GPFS In-Reply-To: References: <5D655862-7F60-47F6-8BD2-A5298F73F70F@vanderbilt.edu> Message-ID: <6BBDFC67-D61F-4477-BF8A-1551925AF955@vanderbilt.edu> Hi Leandro, I think the silence in response to your question says a lot, don?t you? :-O IBM has said (on this list, I believe) that the Meltdown / Spectre patches do not impact GPFS functionality. They?ve been silent as to performance impacts, which can and will be taken various ways. In the absence of information from IBM, the approach we have chosen to take is to patch everything except our GPFS servers ? only we (the SysAdmins, oh, and the NSA, of course!) can log in to them, so we feel that the risk of not patching them is minimal. HTHAL? Kevin On Mar 1, 2018, at 9:02 AM, Avila-Diaz, Leandro > wrote: Good morning, Does anyone know if IBM has an official statement and/or perhaps a FAQ document about the Spectre/Meltdown impact on GPFS? Thank you From: > on behalf of IBM Spectrum Scale > Reply-To: gpfsug main discussion list > Date: Thursday, January 4, 2018 at 20:36 To: gpfsug main discussion list > Subject: Re: [gpfsug-discuss] Meltdown, Spectre, and impacts on GPFS Kevin, The team is aware of Meltdown and Spectre. Due to the late availability of production-ready test patches (they became available today) we started today working on evaluating the impact of applying these patches. The focus would be both on any potential functional impacts (especially to the kernel modules shipped with GPFS) and on the performance degradation which affects user/kernel mode transitions. Performance characterization will be complex, as some system calls which may get invoked often by the mmfsd daemon will suddenly become significantly more expensive because of the kernel changes. Depending on the main areas affected, code changes might be possible to alleviate the impact, by reducing frequency of certain calls, etc. Any such changes will be deployed over time. At this point, we can't say what impact this will have on stability or Performance on systems running GPFS ? until IBM issues an official statement on this topic. We hope to have some basic answers soon. Regards, The Spectrum Scale (GPFS) team ------------------------------------------------------------------------------------------------------------------ If you feel that your question can benefit other users of Spectrum Scale (GPFS), then please post it to the public IBM developerWroks Forum athttps://www.ibm.com/developerworks/community/forums/html/forum?id=11111111-0000-0000-0000-000000000479. If your query concerns a potential software error in Spectrum Scale (GPFS) and you have an IBM software maintenance contract please contact 1-800-237-5511 in the United States or your local IBM Service Center in other countries. The forum is informally monitored as time permits and should not be used for priority messages to the Spectrum Scale (GPFS) team. "Buterbaugh, Kevin L" ---01/04/2018 01:11:59 PM---Happy New Year everyone, I?m sure that everyone is aware of Meltdown and Spectre by now ? 
we, like m From: "Buterbaugh, Kevin L" > To: gpfsug main discussion list > Date: 01/04/2018 01:11 PM Subject: [gpfsug-discuss] Meltdown, Spectre, and impacts on GPFS Sent by: gpfsug-discuss-bounces at spectrumscale.org ________________________________ Happy New Year everyone, I?m sure that everyone is aware of Meltdown and Spectre by now ? we, like many other institutions, will be patching for it at the earliest possible opportunity. Our understanding is that the most serious of the negative performance impacts of these patches will be for things like I/O (disk / network) ? given that, we are curious if IBM has any plans for a GPFS update that could help mitigate those impacts? Or is there simply nothing that can be done? If there is a GPFS update planned for this we?d be interested in knowing so that we could coordinate the kernel and GPFS upgrades on our cluster. Thanks? Kevin P.S. The ?Happy New Year? wasn?t intended as sarcasm ? I hope it is a good year for everyone despite how it?s starting out. :-O ? Kevin Buterbaugh - Senior System Administrator Vanderbilt University - Advanced Computing Center for Research and Education Kevin.Buterbaugh at vanderbilt.edu - (615)875-9633 _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org https://urldefense.proofpoint.com/v2/url?u=http-3A__gpfsug.org_mailman_listinfo_gpfsug-2Ddiscuss&d=DwICAg&c=jf_iaSHvJObTbx-siA1ZOg&r=IbxtjdkPAM2Sbon4Lbbi4w&m=m7Pdt9KL82CJT_AT-PwkmO3PbHg88-IQ7Jq-dwhDOdY&s=5i66Rx3vse5LcaN4-WlyCwi_TDTOQGQR2-X_XyjbBpw&e= _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org https://na01.safelinks.protection.outlook.com/?url=http%3A%2F%2Fgpfsug.org%2Fmailman%2Flistinfo%2Fgpfsug-discuss&data=02%7C01%7CKevin.Buterbaugh%40vanderbilt.edu%7Ceec49ab3ce144a81db3d08d57f86b59d%7Cba5a7f39e3be4ab3b45067fa80faecad%7C0%7C0%7C636555138937139546&sdata=%2FFS%2FQzdMP4d%2Bgf4wCUPR7KOQxIIV6OABoaNrc0ySHdI%3D&reserved=0 ? Kevin Buterbaugh - Senior System Administrator Vanderbilt University - Advanced Computing Center for Research and Education Kevin.Buterbaugh at vanderbilt.edu - (615)875-9633 -------------- next part -------------- An HTML attachment was scrubbed... URL: From Robert.Oesterlin at nuance.com Mon Mar 5 15:01:28 2018 From: Robert.Oesterlin at nuance.com (Oesterlin, Robert) Date: Mon, 5 Mar 2018 15:01:28 +0000 Subject: [gpfsug-discuss] More Details: US Spring Meeting - May 16-17th, Boston Message-ID: A few more details on the Spectrum Scale User Group US meeting. We are still finalizing the agenda, but expect two full days on presentations by IBM, users, and breakout sessions. We?re still looking for user presentations ? please contact me if you would like to present! Or if you have any topics that you?d like to see covered. Dates: Wednesday May 16th and Thursday May 17th Location: IBM Cambridge Innovation Center, One Rogers St , Cambridge, MA 02142-1203 (Near MIT and Boston) https://goo.gl/5oHSKo There are a number of nearby hotels. If you are considering coming, please book early. Boston has good public transport options, so if you book a bit farther out you may get a better price. More details on the agenda and a link to the sign-up coming in a few weeks. Bob Oesterlin Sr Principal Storage Engineer, Nuance -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From kkr at lbl.gov Mon Mar 5 23:49:04 2018 From: kkr at lbl.gov (Kristy Kallback-Rose) Date: Mon, 5 Mar 2018 15:49:04 -0800 Subject: [gpfsug-discuss] RDMA data from Zimon In-Reply-To: References: Message-ID: <8EB2B774-1640-4AEA-A4ED-2D6DBEC3324E@lbl.gov> Thanks Eric. No one who is a ZIMon developer has jumped up to contradict this, so I?ll go with it :-) Many thanks. This is helpful to understand where the data is coming from and would be a welcome addition to the documentation. Cheers, Kristy > On Feb 15, 2018, at 9:08 AM, Eric Agar wrote: > > Kristy, > > I experimented a bit with this some months ago and looked at the ZIMon source code. I came to the conclusion that ZIMon is reporting values obtained from the IB counters (actually, delta values adjusted for time) and that yes, for port_xmit_data and port_rcv_data, one would need to multiply the values by 4 to make sense of them. > > To obtain a port_xmit_data value, the ZIMon sensor first looks for /sys/class/infiniband//ports//counters_ext/port_xmit_data_64, and if that is not found then looks for /sys/class/infiniband//ports//counters/port_xmit_data. Similarly for other counters/metrics. > > Full disclosure: I am not an IB expert nor a ZIMon developer. > > I hope this helps. > > > Eric M. Agar > agar at us.ibm.com > > > Kristy Kallback-Rose ---02/14/2018 08:47:59 PM---Hi, Can one of the IBMers tell me if port_xmit_data and port_rcv_data from Zimon can be interpreted > > From: Kristy Kallback-Rose > To: gpfsug main discussion list > Date: 02/14/2018 08:47 PM > Subject: [gpfsug-discuss] RDMA data from Zimon > Sent by: gpfsug-discuss-bounces at spectrumscale.org > > > > > Hi, > > Can one of the IBMers tell me if port_xmit_data and port_rcv_data from Zimon can be interpreted as RDMA Bytes/sec? Ideally, also how this data is being collected? I?m looking here: https://www.ibm.com/support/knowledgecenter/en/STXKQY_5.0.0/com.ibm.spectrum.scale.v5r00.doc/bl1hlp_monnetworksmetrics.htm > > But then I also look here: https://community.mellanox.com/docs/DOC-2751 > > and see "Total number of data octets, divided by 4 (lanes), received on all VLs. This is 64 bit counter.? So I wasn?t sure if some multiplication by 4 was in order. > > Please advise. > > Cheers, > Kristy_______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > https://urldefense.proofpoint.com/v2/url?u=http-3A__gpfsug.org_mailman_listinfo_gpfsug-2Ddiscuss&d=DwICAg&c=jf_iaSHvJObTbx-siA1ZOg&r=IbxtjdkPAM2Sbon4Lbbi4w&m=zIRb70L9sx_FvvC9IcWVKLOSOOFnx-hIGfjw0kUN7bw&s=D1g4YTG5WeUiHI3rCPr_kkPxbG9V9E-18UGXBeCvfB8&e= > > > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss -------------- next part -------------- An HTML attachment was scrubbed... URL: From S.J.Thompson at bham.ac.uk Tue Mar 6 12:49:26 2018 From: S.J.Thompson at bham.ac.uk (Simon Thompson (IT Research Support)) Date: Tue, 6 Mar 2018 12:49:26 +0000 Subject: [gpfsug-discuss] tscCmdPortRange question Message-ID: <95B22F2C-F59C-4271-9528-16BEBCA179C8@bham.ac.uk> We are looking at setting a value for tscCmdPortRange so that we can apply firewalls to a small number of GPFS nodes in one of our clusters. The docs don?t give an indication on the number of ports that are required to be in the range. Could anyone make a suggestion on this? 
It doesn?t appear as a parameter for ?mmchconfig -i?, so I assume that it requires the nodes to be restarted, however I?m not clear if we could do a rolling restart on this? Thanks Simon -------------- next part -------------- An HTML attachment was scrubbed... URL: From r.sobey at imperial.ac.uk Tue Mar 6 18:48:40 2018 From: r.sobey at imperial.ac.uk (Sobey, Richard A) Date: Tue, 6 Mar 2018 18:48:40 +0000 Subject: [gpfsug-discuss] wondering about outage free protocols upgrades In-Reply-To: <5912dbafbb044700ba723195a0e5e2f9@exch1-cdc.nexus.csiro.au> References: <5912dbafbb044700ba723195a0e5e2f9@exch1-cdc.nexus.csiro.au> Message-ID: Thanks for raising this, I was going to ask. The last I heard it was baked into the 5.0 release of Scale but the release notes are eerily quiet on the matter. Would be good to get some input from IBM on this. Richard Get Outlook for Android ________________________________ From: gpfsug-discuss-bounces at spectrumscale.org on behalf of Greg.Lehmann at csiro.au Sent: Friday, March 2, 2018 3:48:44 AM To: gpfsug-discuss at spectrumscale.org Subject: [gpfsug-discuss] wondering about outage free protocols upgrades Hi All, It appears a rolling node by node upgrade of a protocols cluster is not possible. Ctdb is the sticking point as it won?t run with 2 different versions at the same time. Are there any plans to address this and make it a real Enterprise product? Cheers, Greg -------------- next part -------------- An HTML attachment was scrubbed... URL: From christof.schmitt at us.ibm.com Tue Mar 6 18:50:00 2018 From: christof.schmitt at us.ibm.com (Christof Schmitt) Date: Tue, 6 Mar 2018 18:50:00 +0000 Subject: [gpfsug-discuss] wondering about outage free protocols upgrades In-Reply-To: <5912dbafbb044700ba723195a0e5e2f9@exch1-cdc.nexus.csiro.au> References: <5912dbafbb044700ba723195a0e5e2f9@exch1-cdc.nexus.csiro.au> Message-ID: An HTML attachment was scrubbed... URL: From Kevin.Buterbaugh at Vanderbilt.Edu Tue Mar 6 17:17:59 2018 From: Kevin.Buterbaugh at Vanderbilt.Edu (Buterbaugh, Kevin L) Date: Tue, 6 Mar 2018 17:17:59 +0000 Subject: [gpfsug-discuss] mmfind performance Message-ID: Hi All, In the README for the mmfind command it says: mmfind A highly efficient file system traversal tool, designed to serve as a drop-in replacement for the 'find' command as used against GPFS FSes. And: mmfind is expected to be slower than find on file systems with relatively few inodes. This is due to the overhead of using mmapplypolicy. However, if you make use of the -exec flag to carry out a relatively expensive operation on each file (e.g. compute a checksum), using mmfind should yield a significant performance improvement, even on a file system with relatively few inodes. 
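The -exec point in that README excerpt is easier to see with a concrete invocation. The one-liner below is purely illustrative -- the path, file pattern and checksum tool are made-up examples, not taken from this thread -- and it assumes the sample mmfind supports the usual find predicates (-type, -name, -exec), but it is the shape of workload where the parallel mmapplypolicy engine behind mmfind should win over plain find:

cd /usr/lpp/mmfs/samples/ilm
./mmfind /gpfs23/somefileset -type f -name '*.tar' -exec md5sum {} \;   # expensive per-file work, fanned out by the policy engine

For a metadata-only query such as the inode search below, any benefit comes mainly from the parallel scan itself rather than from -exec.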
I have a list of just shy of 50 inode numbers that I need to figure out what file they correspond to, so I decided to give mmfind a try: + cd /usr/lpp/mmfs/samples/ilm + ./mmfind /gpfs23 -inum 113769917 -o -inum 132539418 -o -inum 135584191 -o -inum 136471839 -o -inum 137009371 -o -inum 137314798 -o -inum 137939675 -o -inum 137997971 -o -inum 138013736 -o -inum 138029061 -o -inum 138029065 -o -inum 138029076 -o -inum 138029086 -o -inum 138029093 -o -inum 138029099 -o -inum 138029101 -o -inum 138029102 -o -inum 138029106 -o -inum 138029112 -o -inum 138029113 -o -inum 138029114 -o -inum 138029119 -o -inum 138029120 -o -inum 138029121 -o -inum 138029130 -o -inum 138029131 -o -inum 138029132 -o -inum 138029141 -o -inum 138029146 -o -inum 138029147 -o -inum 138029152 -o -inum 138029153 -o -inum 138029154 -o -inum 138029163 -o -inum 138029164 -o -inum 138029165 -o -inum 138029174 -o -inum 138029175 -o -inum 138029176 -o -inum 138083075 -o -inum 138083148 -o -inum 138083149 -o -inum 138083155 -o -inum 138216465 -o -inum 138216483 -o -inum 138216507 -o -inum 138216535 -o -inum 138235320 -ls I kicked that off last Friday and it is _still_ running. By comparison, I have a Perl script that I have run in the past that simple traverses the entire filesystem tree and stat?s each file and outputs that to a log file. That script would ?only? run ~24 hours. Clearly mmfind as I invoked it is much slower than the corresponding Perl script, so what am I doing wrong? Thanks? Kevin ? Kevin Buterbaugh - Senior System Administrator Vanderbilt University - Advanced Computing Center for Research and Education Kevin.Buterbaugh at vanderbilt.edu - (615)875-9633 -------------- next part -------------- An HTML attachment was scrubbed... URL: From r.sobey at imperial.ac.uk Tue Mar 6 18:54:47 2018 From: r.sobey at imperial.ac.uk (Sobey, Richard A) Date: Tue, 6 Mar 2018 18:54:47 +0000 Subject: [gpfsug-discuss] wondering about outage free protocols upgrades In-Reply-To: References: <5912dbafbb044700ba723195a0e5e2f9@exch1-cdc.nexus.csiro.au>, Message-ID: The sales pitch my colleagues heard suggested it was already in v5.. That's a big shame to hear that we all misunderstood. Get Outlook for Android ________________________________ From: gpfsug-discuss-bounces at spectrumscale.org on behalf of Christof Schmitt Sent: Tuesday, March 6, 2018 6:50:00 PM To: gpfsug-discuss at spectrumscale.org Cc: gpfsug-discuss at spectrumscale.org Subject: Re: [gpfsug-discuss] wondering about outage free protocols upgrades Hi, at this point there are no plans to support "node by node" upgrade for SMB. Some background: The technical reason for this restriction is that the records shared between protocol nodes for the SMB service (ctdb and Samba) are not versioned and no mechanism is in place to handle different versions. Changing this would be a large development task that has not been included in any current plans. Note that this only affects the SMB service and that the knowledge center outlines a procedure to minimize the outage, by getting half of the protocol nodes ready with the new Samba version and then only taking a brief outage when switching from the "old" to the "new" Samba version: https://www.ibm.com/support/knowledgecenter/en/STXKQY_4.2.2/com.ibm.spectrum.scale.v4r22.doc/bl1ins_updatingsmb.htm The toolkit follows the same approach during an upgrade to minimize the outage. 
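To give a feel for that procedure, here is a rough, hedged outline of the node-batching approach only -- the knowledge center page linked above remains the authoritative reference, and the node names are placeholders; the mmces subcommands and the gpfs.smb package handling should be verified against your release before use:

mmces node suspend -N ces1,ces2          # take the first half of the protocol nodes out of service
mmces service stop SMB -N ces1,ces2
  (install the new gpfs.smb package on ces1 and ces2, via the toolkit or the package manager)
mmces service stop SMB -N ces3,ces4      # start of the brief SMB outage
mmces service start SMB -N ces1,ces2     # upgraded nodes take over; outage ends
mmces node resume -N ces1,ces2
  (repeat the package update on ces3 and ces4, then start SMB there and resume them)

The outage window is only the gap between stopping SMB on the remaining old-level nodes and starting it on the already-upgraded ones.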
We know that this is not ideal, but as mentioned above this is limited by the large effort that would be required which has to be weighed against other requirements and priorities. Regards, Christof Schmitt || IBM || Spectrum Scale Development || Tucson, AZ christof.schmitt at us.ibm.com || +1-520-799-2469 (T/L: 321-2469) ----- Original message ----- From: Sent by: gpfsug-discuss-bounces at spectrumscale.org To: Cc: Subject: [gpfsug-discuss] wondering about outage free protocols upgrades Date: Tue, Mar 6, 2018 10:19 AM Hi All, It appears a rolling node by node upgrade of a protocols cluster is not possible. Ctdb is the sticking point as it won?t run with 2 different versions at the same time. Are there any plans to address this and make it a real Enterprise product? Cheers, Greg _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org https://urldefense.proofpoint.com/v2/url?u=http-3A__gpfsug.org_mailman_listinfo_gpfsug-2Ddiscuss&d=DwICAg&c=jf_iaSHvJObTbx-siA1ZOg&r=5Nn7eUPeYe291x8f39jKybESLKv_W_XtkTkS8fTR-NI&m=p5fg7X1tKGwi1BsYiw-wHTxmaG-PLihwHV0yTBQNaUs&s=3ZHS5vAoxeC6ikuOpTRLWNTpvgKEC3thI-qUgyU_hYo&e= -------------- next part -------------- An HTML attachment was scrubbed... URL: From christof.schmitt at us.ibm.com Tue Mar 6 18:57:32 2018 From: christof.schmitt at us.ibm.com (Christof Schmitt) Date: Tue, 6 Mar 2018 18:57:32 +0000 Subject: [gpfsug-discuss] wondering about outage free protocols upgrades In-Reply-To: References: , <5912dbafbb044700ba723195a0e5e2f9@exch1-cdc.nexus.csiro.au> Message-ID: An HTML attachment was scrubbed... URL: From dod2014 at med.cornell.edu Tue Mar 6 18:23:41 2018 From: dod2014 at med.cornell.edu (Douglas Duckworth) Date: Tue, 6 Mar 2018 13:23:41 -0500 Subject: [gpfsug-discuss] 100G RoCEE and Spectrum Scale Performance Message-ID: Hi We are currently running Spectrum Scale over FDR Infiniband. We plan on upgrading to EDR since I have not really encountered documentation saying to abandon the lower-latency advantage found in Infiniband. Our workloads generally benefit from lower latency. It looks like, ignoring GPFS, EDR still has higher throughput and lower latency when compared to 100G RoCEE. http://sc16.supercomputing.org/sc-archive/tech_poster/poster_files/post149s2-file3.pdf However, I wanted to get feedback on how GPFS performs with 100G Ethernet instead of FDR. Thanks very much! Doug Thanks, Douglas Duckworth, MSc, LFCS HPC System Administrator Scientific Computing Unit Physiology and Biophysics Weill Cornell Medicine E: doug at med.cornell.edu O: 212-746-6305 F: 212-746-8690 -------------- next part -------------- An HTML attachment was scrubbed... URL: From olaf.weiser at de.ibm.com Tue Mar 6 19:46:59 2018 From: olaf.weiser at de.ibm.com (Olaf Weiser) Date: Tue, 6 Mar 2018 20:46:59 +0100 Subject: [gpfsug-discuss] tscCmdPortRange question In-Reply-To: <95B22F2C-F59C-4271-9528-16BEBCA179C8@bham.ac.uk> References: <95B22F2C-F59C-4271-9528-16BEBCA179C8@bham.ac.uk> Message-ID: An HTML attachment was scrubbed... URL: From knop at us.ibm.com Tue Mar 6 23:11:38 2018 From: knop at us.ibm.com (Felipe Knop) Date: Tue, 6 Mar 2018 18:11:38 -0500 Subject: [gpfsug-discuss] tscCmdPortRange question In-Reply-To: References: <95B22F2C-F59C-4271-9528-16BEBCA179C8@bham.ac.uk> Message-ID: Olaf, Correct. mmchconfig -i is accepted for tscCmdPortRange . The change should take place immediately, upon invocation of the next command. 
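Putting Felipe's answer into practice, a minimal sketch of what the change might look like follows. The 50000-50010 range is just an example choice, and the firewall commands assume a RHEL-style node running firewalld; adjust both to your environment:

mmchconfig tscCmdPortRange=50000-50010 -i        # -i applies the change immediately, no daemon restart
firewall-cmd --permanent --add-port=1191/tcp     # daemon-to-daemon traffic stays on port 1191
firewall-cmd --permanent --add-port=50000-50010/tcp
firewall-cmd --reload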
Felipe ---- Felipe Knop knop at us.ibm.com GPFS Development and Security IBM Systems IBM Building 008 2455 South Rd, Poughkeepsie, NY 12601 (845) 433-9314 T/L 293-9314 From: "Olaf Weiser" To: gpfsug main discussion list Date: 03/06/2018 02:47 PM Subject: Re: [gpfsug-discuss] tscCmdPortRange question Sent by: gpfsug-discuss-bounces at spectrumscale.org this parameter is just for administrative commands.. "where" to send the output of a command... and for those admin ports .. so called ephemeral ports... it depends , how much admin commands ( = sessions = sockets) you want to run in parallel in my experience.. 10 ports is more than enough we use those in a range from 50000-50010 to be clear .. demon - to - demon .. communication always uses 1191 cheers From: "Simon Thompson (IT Research Support)" To: "gpfsug-discuss at spectrumscale.org" Date: 03/06/2018 06:55 PM Subject: [gpfsug-discuss] tscCmdPortRange question Sent by: gpfsug-discuss-bounces at spectrumscale.org We are looking at setting a value for tscCmdPortRange so that we can apply firewalls to a small number of GPFS nodes in one of our clusters. The docs don?t give an indication on the number of ports that are required to be in the range. Could anyone make a suggestion on this? It doesn?t appear as a parameter for ?mmchconfig -i?, so I assume that it requires the nodes to be restarted, however I?m not clear if we could do a rolling restart on this? Thanks Simon_______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org https://urldefense.proofpoint.com/v2/url?u=http-3A__gpfsug.org_mailman_listinfo_gpfsug-2Ddiscuss&d=DwICAg&c=jf_iaSHvJObTbx-siA1ZOg&r=oNT2koCZX0xmWlSlLblR9Q&m=pezsJOeWDWSnEkh5d3dp175Vx4opvikABgoTzUt-9pQ&s=S_Qe62jYseR2Y2yjiovXwvVz3d2SFW-jCf0Pw5VB_f4&e= -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: graycol.gif Type: image/gif Size: 105 bytes Desc: not available URL: From makaplan at us.ibm.com Tue Mar 6 22:27:34 2018 From: makaplan at us.ibm.com (Marc A Kaplan) Date: Tue, 6 Mar 2018 17:27:34 -0500 Subject: [gpfsug-discuss] mmfind performance In-Reply-To: References: Message-ID: Please try: mmfind --polFlags '-N a_node_list -g /gpfs23/tmp' directory find-flags ... Where a_node_list is a node list of your choice and /gpfs23/tmp is a temp directory of your choice... And let us know how that goes. Also, you have chosen a special case, just looking for some inode numbers -- so find can skip stating the other inodes... whereas mmfind is not smart enough to do that -- but still with parallelism, I'd guess mmapplypolicy might still beat find in elapsed time to complete, even for this special case. -- Marc K of GPFS From: "Buterbaugh, Kevin L" To: gpfsug main discussion list Date: 03/06/2018 01:52 PM Subject: [gpfsug-discuss] mmfind performance Sent by: gpfsug-discuss-bounces at spectrumscale.org Hi All, In the README for the mmfind command it says: mmfind A highly efficient file system traversal tool, designed to serve as a drop-in replacement for the 'find' command as used against GPFS FSes. And: mmfind is expected to be slower than find on file systems with relatively few inodes. This is due to the overhead of using mmapplypolicy. 
However, if you make use of the -exec flag to carry out a relatively expensive operation on each file (e.g. compute a checksum), using mmfind should yield a significant performance improvement, even on a file system with relatively few inodes. I have a list of just shy of 50 inode numbers that I need to figure out what file they correspond to, so I decided to give mmfind a try: + cd /usr/lpp/mmfs/samples/ilm + ./mmfind /gpfs23 -inum 113769917 -o -inum 132539418 -o -inum 135584191 -o -inum 136471839 -o -inum 137009371 -o -inum 137314798 -o -inum 137939675 -o -inum 137997971 -o -inum 138013736 -o -inum 138029061 -o -inum 138029065 -o -inum 138029076 -o -inum 138029086 -o -inum 138029093 -o -inum 138029099 -o -inum 138029101 -o -inum 138029102 -o -inum 138029106 -o -inum 138029112 -o -inum 138029113 -o -inum 138029114 -o -inum 138029119 -o -inum 138029120 -o -inum 138029121 -o -inum 138029130 -o -inum 138029131 -o -inum 138029132 -o -inum 138029141 -o -inum 138029146 -o -inum 138029147 -o -inum 138029152 -o -inum 138029153 -o -inum 138029154 -o -inum 138029163 -o -inum 138029164 -o -inum 138029165 -o -inum 138029174 -o -inum 138029175 -o -inum 138029176 -o -inum 138083075 -o -inum 138083148 -o -inum 138083149 -o -inum 138083155 -o -inum 138216465 -o -inum 138216483 -o -inum 138216507 -o -inum 138216535 -o -inum 138235320 -ls I kicked that off last Friday and it is _still_ running. By comparison, I have a Perl script that I have run in the past that simple traverses the entire filesystem tree and stat?s each file and outputs that to a log file. That script would ?only? run ~24 hours. Clearly mmfind as I invoked it is much slower than the corresponding Perl script, so what am I doing wrong? Thanks? Kevin ? Kevin Buterbaugh - Senior System Administrator Vanderbilt University - Advanced Computing Center for Research and Education Kevin.Buterbaugh at vanderbilt.edu - (615)875-9633 _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org https://urldefense.proofpoint.com/v2/url?u=http-3A__gpfsug.org_mailman_listinfo_gpfsug-2Ddiscuss&d=DwICAg&c=jf_iaSHvJObTbx-siA1ZOg&r=cvpnBBH0j41aQy0RPiG2xRL_M8mTc1izuQD3_PmtjZ8&m=48WYhVkWI1kr_BM-Wg_VaXEOi7xfGusnZcJtkiA98zg&s=IXUhEC_thuGAVwGJ02oazCCnKEuAdGeg890fBelP4kE&e= -------------- next part -------------- An HTML attachment was scrubbed... URL: From aaron.s.knister at nasa.gov Wed Mar 7 01:30:14 2018 From: aaron.s.knister at nasa.gov (Aaron Knister) Date: Tue, 6 Mar 2018 20:30:14 -0500 Subject: [gpfsug-discuss] [non-nasa source] Re: pagepool shrink doesn't release all memory In-Reply-To: <65453649-77df-2efa-8776-eb2775ca9efa@nasa.gov> References: <65453649-77df-2efa-8776-eb2775ca9efa@nasa.gov> Message-ID: Following up on this... On one of the nodes on which I'd bounced the pagepool around I managed to cause what appeared to that node as filesystem corruption (i/o errors and fsstruct errors) on every single fs. Thankfully none of the other nodes in the cluster seemed to agree that the fs was corrupt. I'll open a PMR on that but I thought it was interesting none the less. I haven't run an fsck on any of the filesystems but my belief is that they're OK since so far none of the other nodes in the cluster have complained. Secondly, I can see the pagepool allocations that align with registered verbs mr's (looking at mmfsadm dump verbs). 
In theory one can free an ib mr after registration as long as it's not in use but one has to track that and I could see that being a tricky thing (although in theory given the fact that GPFS has its own page allocator it might be relatively trivial to figure it out but it might also require re-establishing RDMA connections depending on whether or not a given QP is associated with a PD that uses the MR trying to be freed...I think that makes sense). Anyway, I'm wondering if the need to free the ib MR on pagepool shrink could be avoided all together by limiting the amount of memory that gets allocated to verbs MR's (e.g. something like verbsPagePoolMaxMB) so that those regions never need to be freed but the amount of memory available for user caching could grow and shrink as required. It's probably not that simple, though :) Another thought I had was doing something like creating a file in /dev/shm, registering it as a loopback device, and using that as an LROC device. I just don't think that's feasible at scale given the current method of LROC device registration (e.g. via the mmsdrfs file). I think there's much to be gained from the ability to dynamically change the memory-based file cache size on a per-job basis so I'm really hopeful we can find a way to make this work. -Aaron On 2/25/18 11:45 AM, Aaron Knister wrote: > Hmm...interesting. It sure seems to try :) > > The pmap command was this: > > pmap $(pidof mmfsd) | sort -n -k3 | tail > > -Aaron > > On 2/23/18 9:35 AM, IBM Spectrum Scale wrote: >> AFAIK you can increase the pagepool size dynamically but you cannot >> shrink it dynamically. ?To shrink it you must restart the GPFS daemon. >> Also, could you please provide the actual pmap commands you executed? >> >> Regards, The Spectrum Scale (GPFS) team >> >> ------------------------------------------------------------------------------------------------------------------ >> >> If you feel that your question can benefit other users of ?Spectrum >> Scale (GPFS), then please post it to the public IBM developerWroks >> Forum at >> https://www.ibm.com/developerworks/community/forums/html/forum?id=11111111-0000-0000-0000-000000000479. >> >> >> If your query concerns a potential software error in Spectrum Scale >> (GPFS) and you have an IBM software maintenance contract please >> contact ??1-800-237-5511 in the United States or your local IBM >> Service Center in other countries. >> >> The forum is informally monitored as time permits and should not be >> used for priority messages to the Spectrum Scale (GPFS) team. >> >> >> >> From: Aaron Knister >> To: >> Date: 02/22/2018 10:30 PM >> Subject: Re: [gpfsug-discuss] pagepool shrink doesn't release all memory >> Sent by: gpfsug-discuss-bounces at spectrumscale.org >> ------------------------------------------------------------------------ >> >> >> >> This is also interesting (although I don't know what it really means). >> Looking at pmap run against mmfsd I can see what happens after each step: >> >> # baseline >> 00007fffe4639000 ?59164K ? ? ?0K ? ? ?0K ? ? ?0K ? ? ?0K ---p [anon] >> 00007fffd837e000 ?61960K ? ? ?0K ? ? ?0K ? ? ?0K ? ? ?0K ---p [anon] >> 0000020000000000 1048576K 1048576K 1048576K 1048576K ? ? ?0K rwxp [anon] >> Total: ? ? ? ? ? 1613580K 1191020K 1189650K 1171836K ? ? ?0K >> >> # tschpool 64G >> 00007fffe4639000 ?59164K ? ? ?0K ? ? ?0K ? ? ?0K ? ? ?0K ---p [anon] >> 00007fffd837e000 ?61960K ? ? ?0K ? ? ?0K ? ? ?0K ? ? ?0K ---p [anon] >> 0000020000000000 67108864K 67108864K 67108864K 67108864K ?0K rwxp [anon] >> Total: ? ? ? ? 
? 67706636K 67284108K 67282625K 67264920K ? ? ?0K >> >> # tschpool 1G >> 00007fffe4639000 ?59164K ? ? ?0K ? ? ?0K ? ? ?0K ? ? ?0K ---p [anon] >> 00007fffd837e000 ?61960K ? ? ?0K ? ? ?0K ? ? ?0K ? ? ?0K ---p [anon] >> 0000020001400000 139264K 139264K 139264K 139264K ? ? ?0K rwxp [anon] >> 0000020fc9400000 897024K 897024K 897024K 897024K ? ? ?0K rwxp [anon] >> 0000020009c00000 66052096K ? ? ?0K ? ? ?0K ? ? ?0K ? ? ?0K rwxp [anon] >> Total: ? ? ? ? ? 67706636K 1223820K 1222451K 1204632K ? ? ?0K >> >> Even though mmfsd has that 64G chunk allocated there's none of it >> *used*. I wonder why Linux seems to be accounting it as allocated. >> >> -Aaron >> >> On 2/22/18 10:17 PM, Aaron Knister wrote: >> ?> I've been exploring the idea for a while of writing a SLURM SPANK >> plugin >> ?> to allow users to dynamically change the pagepool size on a node. >> Every >> ?> now and then we have some users who would benefit significantly from a >> ?> much larger pagepool on compute nodes but by default keep it on the >> ?> smaller side to make as much physmem available as possible to batch >> work. >> ?> >> ?> In testing, though, it seems as though reducing the pagepool doesn't >> ?> quite release all of the memory. I don't really understand it because >> ?> I've never before seen memory that was previously resident become >> ?> un-resident but still maintain the virtual memory allocation. >> ?> >> ?> Here's what I mean. Let's take a node with 128G and a 1G pagepool. >> ?> >> ?> If I do the following to simulate what might happen as various jobs >> ?> tweak the pagepool: >> ?> >> ?> - tschpool 64G >> ?> - tschpool 1G >> ?> - tschpool 32G >> ?> - tschpool 1G >> ?> - tschpool 32G >> ?> >> ?> I end up with this: >> ?> >> ?> mmfsd thinks there's 32G resident but 64G virt >> ?> # ps -o vsz,rss,comm -p 24397 >> ?> ??? VSZ?? RSS COMMAND >> ?> 67589400 33723236 mmfsd >> ?> >> ?> however, linux thinks there's ~100G used >> ?> >> ?> # free -g >> ?> total?????? used free???? shared??? buffers cached >> ?> Mem:?????????? 125 100???????? 25 0????????? 0 0 >> ?> -/+ buffers/cache: 98???????? 26 >> ?> Swap: 7????????? 0 7 >> ?> >> ?> I can jump back and forth between 1G and 32G *after* allocating 64G >> ?> pagepool and the overall amount of memory in use doesn't balloon but I >> ?> can't seem to shed that original 64G. >> ?> >> ?> I don't understand what's going on... :) Any ideas? This is with Scale >> ?> 4.2.3.6. 
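A compact way to reproduce the sequence Aaron describes is to loop over the pagepool sizes and snapshot mmfsd's memory after each change. This is just a consolidation of the manual steps shown above into one script; it assumes tschpool is on the PATH and accepts the same size arguments used in the transcript, and the 10-second settle time is arbitrary:

#!/bin/bash
for size in 64G 1G 32G 1G 32G; do
    tschpool "$size"                                  # same internal command used in the test above
    sleep 10                                          # let the daemon settle (arbitrary)
    echo "== pagepool set to $size =="
    ps -o vsz,rss,comm -p "$(pidof mmfsd)"            # daemon virtual vs resident size
    free -g                                           # what the OS thinks is in use
    pmap "$(pidof mmfsd)" | sort -n -k3 | tail -5     # largest mappings, same pipeline as in the thread
done

Comparing the ps and pmap output across iterations shows the effect reported above: the virtual allocation keeps the high-water mark from the 64G step even after shrinking back to 1G.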
>> ?> >> ?> -Aaron >> ?> >> >> -- >> Aaron Knister >> NASA Center for Climate Simulation (Code 606.2) >> Goddard Space Flight Center >> (301) 286-2776 >> _______________________________________________ >> gpfsug-discuss mailing list >> gpfsug-discuss at spectrumscale.org >> https://urldefense.proofpoint.com/v2/url?u=http-3A__gpfsug.org_mailman_listinfo_gpfsug-2Ddiscuss&d=DwIGaQ&c=jf_iaSHvJObTbx-siA1ZOg&r=IbxtjdkPAM2Sbon4Lbbi4w&m=OrZQeEmI6chBdguG-h4YPHsxXZ4gTU3CtIuN4e3ijdY&s=hvVIRG5kB1zom2Iql2_TOagchsgl99juKiZfJt5S1tM&e= >> >> >> >> >> >> >> >> _______________________________________________ >> gpfsug-discuss mailing list >> gpfsug-discuss at spectrumscale.org >> http://gpfsug.org/mailman/listinfo/gpfsug-discuss >> > -- Aaron Knister NASA Center for Climate Simulation (Code 606.2) Goddard Space Flight Center (301) 286-2776 From Greg.Lehmann at csiro.au Tue Mar 6 23:36:12 2018 From: Greg.Lehmann at csiro.au (Greg.Lehmann at csiro.au) Date: Tue, 6 Mar 2018 23:36:12 +0000 Subject: [gpfsug-discuss] wondering about outage free protocols upgrades In-Reply-To: References: <5912dbafbb044700ba723195a0e5e2f9@exch1-cdc.nexus.csiro.au> Message-ID: <766f8a8ce5f84164af13f2361ef3b3c3@exch1-cdc.nexus.csiro.au> In theory it only affects SMB, but in practice if NFS depends on winbind for authorisation then it is affected too. I can understand the need for changes to happen every so often and that maybe outages will be required then. But, I would like to see some effort to avoid doing this unnecessarily. IBM, please consider my suggestion. The message I get from the ctdb service implies it is the sticking point. Can some consideration be given to keeping the ctdb version compatible between releases? Christof, you are saying something about the SMB service version compatibility. I am unclear as to whether you are talking about the Spectrum Scale Protocols SMB service or the default samba SMB over the wire protocol version being used to communicate between client and server. If the latter, is it possible to peg the version to the older version manually while doing the upgrade so that all nodes can be updated? You can then take an outage at a later time to update the over the wire version. From: gpfsug-discuss-bounces at spectrumscale.org [mailto:gpfsug-discuss-bounces at spectrumscale.org] On Behalf Of Christof Schmitt Sent: Wednesday, 7 March 2018 4:50 AM To: gpfsug-discuss at spectrumscale.org Cc: gpfsug-discuss at spectrumscale.org Subject: Re: [gpfsug-discuss] wondering about outage free protocols upgrades Hi, at this point there are no plans to support "node by node" upgrade for SMB. Some background: The technical reason for this restriction is that the records shared between protocol nodes for the SMB service (ctdb and Samba) are not versioned and no mechanism is in place to handle different versions. Changing this would be a large development task that has not been included in any current plans. Note that this only affects the SMB service and that the knowledge center outlines a procedure to minimize the outage, by getting half of the protocol nodes ready with the new Samba version and then only taking a brief outage when switching from the "old" to the "new" Samba version: https://www.ibm.com/support/knowledgecenter/en/STXKQY_4.2.2/com.ibm.spectrum.scale.v4r22.doc/bl1ins_updatingsmb.htm The toolkit follows the same approach during an upgrade to minimize the outage. 
We know that this is not ideal, but as mentioned above this is limited by the large effort that would be required which has to be weighed against other requirements and priorities. Regards, Christof Schmitt || IBM || Spectrum Scale Development || Tucson, AZ christof.schmitt at us.ibm.com || +1-520-799-2469 (T/L: 321-2469) ----- Original message ----- From: > Sent by: gpfsug-discuss-bounces at spectrumscale.org To: > Cc: Subject: [gpfsug-discuss] wondering about outage free protocols upgrades Date: Tue, Mar 6, 2018 10:19 AM Hi All, It appears a rolling node by node upgrade of a protocols cluster is not possible. Ctdb is the sticking point as it won?t run with 2 different versions at the same time. Are there any plans to address this and make it a real Enterprise product? Cheers, Greg _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org https://urldefense.proofpoint.com/v2/url?u=http-3A__gpfsug.org_mailman_listinfo_gpfsug-2Ddiscuss&d=DwICAg&c=jf_iaSHvJObTbx-siA1ZOg&r=5Nn7eUPeYe291x8f39jKybESLKv_W_XtkTkS8fTR-NI&m=p5fg7X1tKGwi1BsYiw-wHTxmaG-PLihwHV0yTBQNaUs&s=3ZHS5vAoxeC6ikuOpTRLWNTpvgKEC3thI-qUgyU_hYo&e= -------------- next part -------------- An HTML attachment was scrubbed... URL: From S.J.Thompson at bham.ac.uk Wed Mar 7 13:45:24 2018 From: S.J.Thompson at bham.ac.uk (Simon Thompson (IT Research Support)) Date: Wed, 7 Mar 2018 13:45:24 +0000 Subject: [gpfsug-discuss] mmfind performance Message-ID: <90F48570-7294-4032-8A6A-73DD51169A55@bham.ac.uk> I can?t comment on mmfind vs perl, but have you looked at trying ?tsfindinode? ? Simon From: on behalf of "Buterbaugh, Kevin L" Reply-To: "gpfsug-discuss at spectrumscale.org" Date: Tuesday, 6 March 2018 at 18:52 To: "gpfsug-discuss at spectrumscale.org" Subject: [gpfsug-discuss] mmfind performance Hi All, In the README for the mmfind command it says: mmfind A highly efficient file system traversal tool, designed to serve as a drop-in replacement for the 'find' command as used against GPFS FSes. And: mmfind is expected to be slower than find on file systems with relatively few inodes. This is due to the overhead of using mmapplypolicy. However, if you make use of the -exec flag to carry out a relatively expensive operation on each file (e.g. compute a checksum), using mmfind should yield a significant performance improvement, even on a file system with relatively few inodes. 
I have a list of just shy of 50 inode numbers that I need to figure out what file they correspond to, so I decided to give mmfind a try: + cd /usr/lpp/mmfs/samples/ilm + ./mmfind /gpfs23 -inum 113769917 -o -inum 132539418 -o -inum 135584191 -o -inum 136471839 -o -inum 137009371 -o -inum 137314798 -o -inum 137939675 -o -inum 137997971 -o -inum 138013736 -o -inum 138029061 -o -inum 138029065 -o -inum 138029076 -o -inum 138029086 -o -inum 138029093 -o -inum 138029099 -o -inum 138029101 -o -inum 138029102 -o -inum 138029106 -o -inum 138029112 -o -inum 138029113 -o -inum 138029114 -o -inum 138029119 -o -inum 138029120 -o -inum 138029121 -o -inum 138029130 -o -inum 138029131 -o -inum 138029132 -o -inum 138029141 -o -inum 138029146 -o -inum 138029147 -o -inum 138029152 -o -inum 138029153 -o -inum 138029154 -o -inum 138029163 -o -inum 138029164 -o -inum 138029165 -o -inum 138029174 -o -inum 138029175 -o -inum 138029176 -o -inum 138083075 -o -inum 138083148 -o -inum 138083149 -o -inum 138083155 -o -inum 138216465 -o -inum 138216483 -o -inum 138216507 -o -inum 138216535 -o -inum 138235320 -ls I kicked that off last Friday and it is _still_ running. By comparison, I have a Perl script that I have run in the past that simple traverses the entire filesystem tree and stat?s each file and outputs that to a log file. That script would ?only? run ~24 hours. Clearly mmfind as I invoked it is much slower than the corresponding Perl script, so what am I doing wrong? Thanks? Kevin ? Kevin Buterbaugh - Senior System Administrator Vanderbilt University - Advanced Computing Center for Research and Education Kevin.Buterbaugh at vanderbilt.edu - (615)875-9633 -------------- next part -------------- An HTML attachment was scrubbed... URL: From Kevin.Buterbaugh at Vanderbilt.Edu Wed Mar 7 15:18:24 2018 From: Kevin.Buterbaugh at Vanderbilt.Edu (Buterbaugh, Kevin L) Date: Wed, 7 Mar 2018 15:18:24 +0000 Subject: [gpfsug-discuss] mmfind performance In-Reply-To: References: Message-ID: Hi Marc, Thanks, I?m going to give this a try as the first mmfind finally finished overnight, but produced no output: /root root at gpfsmgrb# bash -x ~/bin/klb.sh + cd /usr/lpp/mmfs/samples/ilm + ./mmfind /gpfs23 -inum 113769917 -o -inum 132539418 -o -inum 135584191 -o -inum 136471839 -o -inum 137009371 -o -inum 137314798 -o -inum 137939675 -o -inum 137997971 -o -inum 138013736 -o -inum 138029061 -o -inum 138029065 -o -inum 138029076 -o -inum 138029086 -o -inum 138029093 -o -inum 138029099 -o -inum 138029101 -o -inum 138029102 -o -inum 138029106 -o -inum 138029112 -o -inum 138029113 -o -inum 138029114 -o -inum 138029119 -o -inum 138029120 -o -inum 138029121 -o -inum 138029130 -o -inum 138029131 -o -inum 138029132 -o -inum 138029141 -o -inum 138029146 -o -inum 138029147 -o -inum 138029152 -o -inum 138029153 -o -inum 138029154 -o -inum 138029163 -o -inum 138029164 -o -inum 138029165 -o -inum 138029174 -o -inum 138029175 -o -inum 138029176 -o -inum 138083075 -o -inum 138083148 -o -inum 138083149 -o -inum 138083155 -o -inum 138216465 -o -inum 138216483 -o -inum 138216507 -o -inum 138216535 -o -inum 138235320 -ls /root root at gpfsmgrb# BTW, I had put that in a simple script simply because I had a list of those inodes and it was easier for me to get that in the format I wanted via a script that I was editing than trying to do that on the command line. However, in the log file it was producing it ?hit? on 48 files: [I] Inodes scan: 978275821 files, 99448202 directories, 37189547 other objects, 1967508 'skipped' files and/or errors. 
[I] 2018-03-06 at 23:43:15.988 Policy evaluation. 1114913570 files scanned. [I] 2018-03-06 at 23:43:16.016 Sorting 48 candidate file list records. [I] 2018-03-06 at 23:43:16.040 Sorting 48 candidate file list records. [I] 2018-03-06 at 23:43:16.065 Choosing candidate files. 0 records scanned. [I] 2018-03-06 at 23:43:16.066 Choosing candidate files. 48 records scanned. [I] Summary of Rule Applicability and File Choices: Rule# Hit_Cnt KB_Hit Chosen KB_Chosen KB_Ill Rule 0 48 1274453504 48 1274453504 0 RULE 'mmfind' LIST 'mmfindList' DIRECTORIES_PLUS SHOW(.) WHERE(.) [I] Filesystem objects with no applicable rules: 1112946014. [I] GPFS Policy Decisions and File Choice Totals: Chose to list 1274453504KB: 48 of 48 candidates; Predicted Data Pool Utilization in KB and %: Pool_Name KB_Occupied KB_Total Percent_Occupied gpfs23capacity 564722407424 624917749760 90.367477583% gpfs23data 304797672448 531203506176 57.378701177% system 0 0 0.000000000% (no user data) [I] 2018-03-06 at 23:43:16.066 Policy execution. 0 files dispatched. [I] 2018-03-06 at 23:43:16.102 Policy execution. 0 files dispatched. [I] A total of 0 files have been migrated, deleted or processed by an EXTERNAL EXEC/script; 0 'skipped' files and/or errors. While I?m going to follow your suggestion next, if you (or anyone else on the list) can explain why the ?Hit_Cnt? is 48 but the ?-ls? I passed to mmfind didn?t result in anything being listed, my curiosity is piqued. And I?ll go ahead and say it before someone else does ? I haven?t just chosen a special case, I AM a special case? ;-) Kevin On Mar 6, 2018, at 4:27 PM, Marc A Kaplan > wrote: Please try: mmfind --polFlags '-N a_node_list -g /gpfs23/tmp' directory find-flags ... Where a_node_list is a node list of your choice and /gpfs23/tmp is a temp directory of your choice... And let us know how that goes. Also, you have chosen a special case, just looking for some inode numbers -- so find can skip stating the other inodes... whereas mmfind is not smart enough to do that -- but still with parallelism, I'd guess mmapplypolicy might still beat find in elapsed time to complete, even for this special case. -- Marc K of GPFS From: "Buterbaugh, Kevin L" > To: gpfsug main discussion list > Date: 03/06/2018 01:52 PM Subject: [gpfsug-discuss] mmfind performance Sent by: gpfsug-discuss-bounces at spectrumscale.org ________________________________ Hi All, In the README for the mmfind command it says: mmfind A highly efficient file system traversal tool, designed to serve as a drop-in replacement for the 'find' command as used against GPFS FSes. And: mmfind is expected to be slower than find on file systems with relatively few inodes. This is due to the overhead of using mmapplypolicy. However, if you make use of the -exec flag to carry out a relatively expensive operation on each file (e.g. compute a checksum), using mmfind should yield a significant performance improvement, even on a file system with relatively few inodes. 
I have a list of just shy of 50 inode numbers that I need to figure out what file they correspond to, so I decided to give mmfind a try: + cd /usr/lpp/mmfs/samples/ilm + ./mmfind /gpfs23 -inum 113769917 -o -inum 132539418 -o -inum 135584191 -o -inum 136471839 -o -inum 137009371 -o -inum 137314798 -o -inum 137939675 -o -inum 137997971 -o -inum 138013736 -o -inum 138029061 -o -inum 138029065 -o -inum 138029076 -o -inum 138029086 -o -inum 138029093 -o -inum 138029099 -o -inum 138029101 -o -inum 138029102 -o -inum 138029106 -o -inum 138029112 -o -inum 138029113 -o -inum 138029114 -o -inum 138029119 -o -inum 138029120 -o -inum 138029121 -o -inum 138029130 -o -inum 138029131 -o -inum 138029132 -o -inum 138029141 -o -inum 138029146 -o -inum 138029147 -o -inum 138029152 -o -inum 138029153 -o -inum 138029154 -o -inum 138029163 -o -inum 138029164 -o -inum 138029165 -o -inum 138029174 -o -inum 138029175 -o -inum 138029176 -o -inum 138083075 -o -inum 138083148 -o -inum 138083149 -o -inum 138083155 -o -inum 138216465 -o -inum 138216483 -o -inum 138216507 -o -inum 138216535 -o -inum 138235320 -ls I kicked that off last Friday and it is _still_ running. By comparison, I have a Perl script that I have run in the past that simple traverses the entire filesystem tree and stat?s each file and outputs that to a log file. That script would ?only? run ~24 hours. Clearly mmfind as I invoked it is much slower than the corresponding Perl script, so what am I doing wrong? Thanks? Kevin ? Kevin Buterbaugh - Senior System Administrator Vanderbilt University - Advanced Computing Center for Research and Education Kevin.Buterbaugh at vanderbilt.edu- (615)875-9633 _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org https://urldefense.proofpoint.com/v2/url?u=http-3A__gpfsug.org_mailman_listinfo_gpfsug-2Ddiscuss&d=DwICAg&c=jf_iaSHvJObTbx-siA1ZOg&r=cvpnBBH0j41aQy0RPiG2xRL_M8mTc1izuQD3_PmtjZ8&m=48WYhVkWI1kr_BM-Wg_VaXEOi7xfGusnZcJtkiA98zg&s=IXUhEC_thuGAVwGJ02oazCCnKEuAdGeg890fBelP4kE&e= _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org https://na01.safelinks.protection.outlook.com/?url=http%3A%2F%2Fgpfsug.org%2Fmailman%2Flistinfo%2Fgpfsug-discuss&data=02%7C01%7CKevin.Buterbaugh%40vanderbilt.edu%7C724521c8034241913d8508d58412dcf8%7Cba5a7f39e3be4ab3b45067fa80faecad%7C0%7C0%7C636560138922366489&sdata=faXozQ%2FGGDf8nARmk52%2B2W5eIEBfnYwNapJyH%2FagqIQ%3D&reserved=0 -------------- next part -------------- An HTML attachment was scrubbed... URL: From olaf.weiser at de.ibm.com Wed Mar 7 16:48:40 2018 From: olaf.weiser at de.ibm.com (Olaf Weiser) Date: Wed, 7 Mar 2018 17:48:40 +0100 Subject: [gpfsug-discuss] 100G RoCEE and Spectrum Scale Performance In-Reply-To: References: Message-ID: An HTML attachment was scrubbed... URL: From makaplan at us.ibm.com Wed Mar 7 19:15:59 2018 From: makaplan at us.ibm.com (Marc A Kaplan) Date: Wed, 7 Mar 2018 14:15:59 -0500 Subject: [gpfsug-discuss] mmfind -ls and so forth In-Reply-To: References: Message-ID: As always when dealing with computers and potentially long running jobs, run a test on a handful of files first, so you can rapidly debug. Did you read the mmfind.README ? It mentions...that this sample utility "some user assembly required..." ... 
mmfindUtil_processOutputFile.c A utility to parse the "list file" produced by mmapplypolicy and to print it in a find-compatible format mmfind invokes it once mmapplypolicy begins to populate the "list file" mmfindUtil_processOutputFile.sampleMakefile copy to 'makefile', modify as needed, and run 'make' to compile mmfindUtil_processOutputFile.c This should produce a binary called mmfindUtil_processOutputFile mmfind will not be able to run until this utility has been compiled on the node from which you launch mmfind. Works for me... [root at n2 ilm]# ./mmfind /goo/zdbig -ls 2463649 256 drwxr-xr-x 2 root root 262144 Feb 9 11:41 /goo/zdbig 6804497 0 -rw-r--r-- 1 root root 0 Feb 9 11:41 /goo/zdbig/xy -------------- next part -------------- An HTML attachment was scrubbed... URL: From christof.schmitt at us.ibm.com Wed Mar 7 21:53:34 2018 From: christof.schmitt at us.ibm.com (Christof Schmitt) Date: Wed, 7 Mar 2018 21:53:34 +0000 Subject: [gpfsug-discuss] wondering about outage free protocols upgrades In-Reply-To: <766f8a8ce5f84164af13f2361ef3b3c3@exch1-cdc.nexus.csiro.au> References: <766f8a8ce5f84164af13f2361ef3b3c3@exch1-cdc.nexus.csiro.au>, <5912dbafbb044700ba723195a0e5e2f9@exch1-cdc.nexus.csiro.au> Message-ID: An HTML attachment was scrubbed... URL: From r.sobey at imperial.ac.uk Thu Mar 8 09:41:56 2018 From: r.sobey at imperial.ac.uk (Sobey, Richard A) Date: Thu, 8 Mar 2018 09:41:56 +0000 Subject: [gpfsug-discuss] wondering about outage free protocols upgrades In-Reply-To: References: <766f8a8ce5f84164af13f2361ef3b3c3@exch1-cdc.nexus.csiro.au>, <5912dbafbb044700ba723195a0e5e2f9@exch1-cdc.nexus.csiro.au> Message-ID: Whether or not you meant it your words ?that is not available today.? Implies that something is coming in the future? Would you be reliant on the Samba/CTDB development team or would you roll your own.. supposing it?s possible in the first place. Thanks Richard From: gpfsug-discuss-bounces at spectrumscale.org [mailto:gpfsug-discuss-bounces at spectrumscale.org] On Behalf Of Christof Schmitt Sent: 07 March 2018 21:54 To: gpfsug-discuss at spectrumscale.org Subject: Re: [gpfsug-discuss] wondering about outage free protocols upgrades The problem with the SMB upgrade is with the data shared between the protocol nodes. It is not tied to the protocol version used between SMB clients and the protocol nodes. Samba stores internal data (e.g. for the SMB state of open files) in tdb database files. ctdb then makes these tdb databases available across all protocol nodes. A concurrent upgrade for SMB would require correct handling of ctdb communications and the tdb records across multiple versions; that is not available today. Regards, Christof Schmitt || IBM || Spectrum Scale Development || Tucson, AZ christof.schmitt at us.ibm.com || +1-520-799-2469 (T/L: 321-2469) ----- Original message ----- From: > Sent by: gpfsug-discuss-bounces at spectrumscale.org To: > Cc: Subject: Re: [gpfsug-discuss] wondering about outage free protocols upgrades Date: Wed, Mar 7, 2018 4:35 AM In theory it only affects SMB, but in practice if NFS depends on winbind for authorisation then it is affected too. I can understand the need for changes to happen every so often and that maybe outages will be required then. But, I would like to see some effort to avoid doing this unnecessarily. IBM, please consider my suggestion. The message I get from the ctdb service implies it is the sticking point. Can some consideration be given to keeping the ctdb version compatible between releases? 
Christof, you are saying something about the SMB service version compatibility. I am unclear as to whether you are talking about the Spectrum Scale Protocols SMB service or the default samba SMB over the wire protocol version being used to communicate between client and server. If the latter, is it possible to peg the version to the older version manually while doing the upgrade so that all nodes can be updated? You can then take an outage at a later time to update the over the wire version. From: gpfsug-discuss-bounces at spectrumscale.org [mailto:gpfsug-discuss-bounces at spectrumscale.org] On Behalf Of Christof Schmitt Sent: Wednesday, 7 March 2018 4:50 AM To: gpfsug-discuss at spectrumscale.org Cc: gpfsug-discuss at spectrumscale.org Subject: Re: [gpfsug-discuss] wondering about outage free protocols upgrades Hi, at this point there are no plans to support "node by node" upgrade for SMB. Some background: The technical reason for this restriction is that the records shared between protocol nodes for the SMB service (ctdb and Samba) are not versioned and no mechanism is in place to handle different versions. Changing this would be a large development task that has not been included in any current plans. Note that this only affects the SMB service and that the knowledge center outlines a procedure to minimize the outage, by getting half of the protocol nodes ready with the new Samba version and then only taking a brief outage when switching from the "old" to the "new" Samba version: https://www.ibm.com/support/knowledgecenter/en/STXKQY_4.2.2/com.ibm.spectrum.scale.v4r22.doc/bl1ins_updatingsmb.htm The toolkit follows the same approach during an upgrade to minimize the outage. We know that this is not ideal, but as mentioned above this is limited by the large effort that would be required which has to be weighed against other requirements and priorities. Regards, Christof Schmitt || IBM || Spectrum Scale Development || Tucson, AZ christof.schmitt at us.ibm.com || +1-520-799-2469 (T/L: 321-2469) ----- Original message ----- From: > Sent by: gpfsug-discuss-bounces at spectrumscale.org To: > Cc: Subject: [gpfsug-discuss] wondering about outage free protocols upgrades Date: Tue, Mar 6, 2018 10:19 AM Hi All, It appears a rolling node by node upgrade of a protocols cluster is not possible. Ctdb is the sticking point as it won?t run with 2 different versions at the same time. Are there any plans to address this and make it a real Enterprise product? Cheers, Greg _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org https://urldefense.proofpoint.com/v2/url?u=http-3A__gpfsug.org_mailman_listinfo_gpfsug-2Ddiscuss&d=DwICAg&c=jf_iaSHvJObTbx-siA1ZOg&r=5Nn7eUPeYe291x8f39jKybESLKv_W_XtkTkS8fTR-NI&m=p5fg7X1tKGwi1BsYiw-wHTxmaG-PLihwHV0yTBQNaUs&s=3ZHS5vAoxeC6ikuOpTRLWNTpvgKEC3thI-qUgyU_hYo&e= _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org https://urldefense.proofpoint.com/v2/url?u=http-3A__gpfsug.org_mailman_listinfo_gpfsug-2Ddiscuss&d=DwICAg&c=jf_iaSHvJObTbx-siA1ZOg&r=5Nn7eUPeYe291x8f39jKybESLKv_W_XtkTkS8fTR-NI&m=7bQRypv0JL7swEYobmepaynWFvV8HYtBa2_S1kAyirk&s=2bUeeDk-8VbtwU8KtV4RlENEcbpOr_GQJQvR_8gt-ug&e= -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From john.hearns at asml.com Thu Mar 8 08:29:56 2018 From: john.hearns at asml.com (John Hearns) Date: Thu, 8 Mar 2018 08:29:56 +0000 Subject: [gpfsug-discuss] mmfind -ls and so forth In-Reply-To: References: Message-ID: On the subject of mmfind, I would like to find files which have the misc attributes relevant to AFM. For instance files which have the attribute 'v' The file is newly created, not yet copied to home I can write a policy to do this, and I have a relevant policy written. However I would like to do this using mmfind, which seems a nice general utility. This syntax does not work: mmfind /hpc -eaWithValue MISC_ATTRIBUTES===v Before anyone says it, I am mixing up MISC_ATTRIBUTES and extended attributes! My question really is - has anyone done this sort of serch using mmfind? Thankyou From: gpfsug-discuss-bounces at spectrumscale.org [mailto:gpfsug-discuss-bounces at spectrumscale.org] On Behalf Of Marc A Kaplan Sent: Wednesday, March 07, 2018 8:16 PM To: gpfsug main discussion list Subject: Re: [gpfsug-discuss] mmfind -ls and so forth As always when dealing with computers and potentially long running jobs, run a test on a handful of files first, so you can rapidly debug. Did you read the mmfind.README ? It mentions...that this sample utility "some user assembly required..." ... mmfindUtil_processOutputFile.c A utility to parse the "list file" produced by mmapplypolicy and to print it in a find-compatible format mmfind invokes it once mmapplypolicy begins to populate the "list file" mmfindUtil_processOutputFile.sampleMakefile copy to 'makefile', modify as needed, and run 'make' to compile mmfindUtil_processOutputFile.c This should produce a binary called mmfindUtil_processOutputFile mmfind will not be able to run until this utility has been compiled on the node from which you launch mmfind. Works for me... [root at n2 ilm]# ./mmfind /goo/zdbig -ls 2463649 256 drwxr-xr-x 2 root root 262144 Feb 9 11:41 /goo/zdbig 6804497 0 -rw-r--r-- 1 root root 0 Feb 9 11:41 /goo/zdbig/xy -- The information contained in this communication and any attachments is confidential and may be privileged, and is for the sole use of the intended recipient(s). Any unauthorized review, use, disclosure or distribution is prohibited. Unless explicitly stated otherwise in the body of this communication or the attachment thereto (if any), the information is provided on an AS-IS basis without any express or implied warranties or liabilities. To the extent you are relying on this information, you are doing so at your own risk. If you are not the intended recipient, please notify the sender immediately by replying to this message and destroy all copies of this message and any attachments. Neither the sender nor the company/group of companies he or she represents shall be liable for the proper and complete transmission of the information contained in this communication, or for any delay in its receipt. -------------- next part -------------- An HTML attachment was scrubbed... URL: From marc.caubet at psi.ch Thu Mar 8 11:10:24 2018 From: marc.caubet at psi.ch (Caubet Serrabou Marc (PSI)) Date: Thu, 8 Mar 2018 11:10:24 +0000 Subject: [gpfsug-discuss] GPFS autoload - wait for IB ports to become active Message-ID: <0081EB235765E14395278B9AE1DF341846510A@MBX214.d.ethz.ch> Hi all, with autoload = yes we do not ensure that GPFS will be started after the IB link becomes up. Is there a way to force GPFS waiting to start until IB ports are up? 
This can be probably done by adding something like After=network-online.target and Wants=network-online.target in the systemd file but I would like to know if this is natively possible from the GPFS configuration. Thanks a lot, Marc _________________________________________ Paul Scherrer Institut High Performance Computing Marc Caubet Serrabou WHGA/036 5232 Villigen PSI Switzerland Telephone: +41 56 310 46 67 E-Mail: marc.caubet at psi.ch -------------- next part -------------- An HTML attachment was scrubbed... URL: From david_johnson at brown.edu Thu Mar 8 12:33:41 2018 From: david_johnson at brown.edu (david_johnson at brown.edu) Date: Thu, 8 Mar 2018 07:33:41 -0500 Subject: [gpfsug-discuss] GPFS autoload - wait for IB ports to become active In-Reply-To: <0081EB235765E14395278B9AE1DF341846510A@MBX214.d.ethz.ch> References: <0081EB235765E14395278B9AE1DF341846510A@MBX214.d.ethz.ch> Message-ID: <4AD44D34-5275-4ADB-8CC7-8E80170DDA7F@brown.edu> Until IBM provides a solution, here is my workaround. Add it so it runs before the gpfs script, I call it from our custom xcat diskless boot scripts. Based on rhel7, not fully systemd integrated. YMMV! Regards, ? ddj ??- [ddj at storage041 ~]$ cat /etc/init.d/ibready #! /bin/bash # # chkconfig: 2345 06 94 # /etc/rc.d/init.d/ibready # written in 2016 David D Johnson (ddj brown.edu) # ### BEGIN INIT INFO # Provides: ibready # Required-Start: # Required-Stop: # Default-Stop: # Description: Block until infiniband is ready # Short-Description: Block until infiniband is ready ### END INIT INFO RETVAL=0 if [[ -d /sys/class/infiniband ]] then IBDEVICE=$(dirname $(grep -il infiniband /sys/class/infiniband/*/ports/1/link* | head -n 1)) fi # See how we were called. case "$1" in start) if [[ -n $IBDEVICE && -f $IBDEVICE/state ]] then echo -n "Polling for InfiniBand link up: " for (( count = 60; count > 0; count-- )) do if grep -q ACTIVE $IBDEVICE/state then echo ACTIVE break fi echo -n "." sleep 5 done if (( count <= 0 )) then echo DOWN - $0 timed out fi fi ;; stop|restart|reload|force-reload|condrestart|try-restart) ;; status) if [[ -n $IBDEVICE && -f $IBDEVICE/state ]] then echo "$IBDEVICE is $(< $IBDEVICE/state) $(< $IBDEVICE/rate)" else echo "No IBDEVICE found" fi ;; *) echo "Usage: ibready {start|stop|status|restart|reload|force-reload|condrestart|try-restart}" exit 2 esac exit ${RETVAL} ???? -- ddj Dave Johnson > On Mar 8, 2018, at 6:10 AM, Caubet Serrabou Marc (PSI) wrote: > > Hi all, > > with autoload = yes we do not ensure that GPFS will be started after the IB link becomes up. Is there a way to force GPFS waiting to start until IB ports are up? This can be probably done by adding something like After=network-online.target and Wants=network-online.target in the systemd file but I would like to know if this is natively possible from the GPFS configuration. > > Thanks a lot, > Marc > _________________________________________ > Paul Scherrer Institut > High Performance Computing > Marc Caubet Serrabou > WHGA/036 > 5232 Villigen PSI > Switzerland > > Telephone: +41 56 310 46 67 > E-Mail: marc.caubet at psi.ch > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From stockf at us.ibm.com Thu Mar 8 12:42:47 2018 From: stockf at us.ibm.com (Frederick Stock) Date: Thu, 8 Mar 2018 07:42:47 -0500 Subject: [gpfsug-discuss] GPFS autoload - wait for IB ports to becomeactive In-Reply-To: <4AD44D34-5275-4ADB-8CC7-8E80170DDA7F@brown.edu> References: <0081EB235765E14395278B9AE1DF341846510A@MBX214.d.ethz.ch> <4AD44D34-5275-4ADB-8CC7-8E80170DDA7F@brown.edu> Message-ID: You could also use the GPFS prestartup callback (mmaddcallback) to execute a script synchronously that waits for the IB ports to become available before returning and allowing GPFS to continue. Not systemd integrated but it should work. Fred __________________________________________________ Fred Stock | IBM Pittsburgh Lab | 720-430-8821 stockf at us.ibm.com From: david_johnson at brown.edu To: gpfsug main discussion list Date: 03/08/2018 07:34 AM Subject: Re: [gpfsug-discuss] GPFS autoload - wait for IB ports to become active Sent by: gpfsug-discuss-bounces at spectrumscale.org Until IBM provides a solution, here is my workaround. Add it so it runs before the gpfs script, I call it from our custom xcat diskless boot scripts. Based on rhel7, not fully systemd integrated. YMMV! Regards, ? ddj ??- [ddj at storage041 ~]$ cat /etc/init.d/ibready #! /bin/bash # # chkconfig: 2345 06 94 # /etc/rc.d/init.d/ibready # written in 2016 David D Johnson (ddj brown.edu) # ### BEGIN INIT INFO # Provides: ibready # Required-Start: # Required-Stop: # Default-Stop: # Description: Block until infiniband is ready # Short-Description: Block until infiniband is ready ### END INIT INFO RETVAL=0 if [[ -d /sys/class/infiniband ]] then IBDEVICE=$(dirname $(grep -il infiniband /sys/class/infiniband/*/ports/1/link* | head -n 1)) fi # See how we were called. case "$1" in start) if [[ -n $IBDEVICE && -f $IBDEVICE/state ]] then echo -n "Polling for InfiniBand link up: " for (( count = 60; count > 0; count-- )) do if grep -q ACTIVE $IBDEVICE/state then echo ACTIVE break fi echo -n "." sleep 5 done if (( count <= 0 )) then echo DOWN - $0 timed out fi fi ;; stop|restart|reload|force-reload|condrestart|try-restart) ;; status) if [[ -n $IBDEVICE && -f $IBDEVICE/state ]] then echo "$IBDEVICE is $(< $IBDEVICE/state) $(< $IBDEVICE/rate)" else echo "No IBDEVICE found" fi ;; *) echo "Usage: ibready {start|stop|status|restart|reload|force-reload|condrestart|try-restart}" exit 2 esac exit ${RETVAL} ???? -- ddj Dave Johnson On Mar 8, 2018, at 6:10 AM, Caubet Serrabou Marc (PSI) wrote: Hi all, with autoload = yes we do not ensure that GPFS will be started after the IB link becomes up. Is there a way to force GPFS waiting to start until IB ports are up? This can be probably done by adding something like After=network-online.target and Wants=network-online.target in the systemd file but I would like to know if this is natively possible from the GPFS configuration. 
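Following up on Fred's mmaddcallback idea, a rough sketch of a synchronous preStartup callback is below. The script path and name, the ACTIVE test (borrowed from ddj's ibready script) and the roughly two-minute retry budget are assumptions to adapt, not a tested recipe:

#!/bin/bash
# /usr/local/sbin/wait-for-ib.sh (hypothetical path/name)
# Block until at least one InfiniBand port reports ACTIVE, then let GPFS continue.
for (( i = 0; i < 60; i++ )); do
    grep -q ACTIVE /sys/class/infiniband/*/ports/*/state 2>/dev/null && exit 0
    sleep 2
done
echo "wait-for-ib: no ACTIVE IB port found, giving up" >&2
exit 1

Registered once, it then runs on each node as the daemon starts:

mmaddcallback ibReady --command /usr/local/sbin/wait-for-ib.sh --event preStartup --sync

Unlike a systemd After=/Wants= dependency, this travels with the GPFS configuration, which is closer to the "natively possible" behaviour asked about above.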
Thanks a lot, Marc _________________________________________ Paul Scherrer Institut High Performance Computing Marc Caubet Serrabou WHGA/036 5232 Villigen PSI Switzerland Telephone: +41 56 310 46 67 E-Mail: marc.caubet at psi.ch _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org https://urldefense.proofpoint.com/v2/url?u=http-3A__gpfsug.org_mailman_listinfo_gpfsug-2Ddiscuss&d=DwICAg&c=jf_iaSHvJObTbx-siA1ZOg&r=p_1XEUyoJ7-VJxF_w8h9gJh8_Wj0Pey73LCLLoxodpw&m=u-EMob09-dkE6jZbD3dTjBi3vWhmDXtxiOK3nqFyIgY&s=JCfJgq6pZnKUI6d-rIgJXVcdZh7vmA5ypB1_goP_FFA&e= -------------- next part -------------- An HTML attachment was scrubbed... URL: From makaplan at us.ibm.com Thu Mar 8 13:59:27 2018 From: makaplan at us.ibm.com (Marc A Kaplan) Date: Thu, 8 Mar 2018 08:59:27 -0500 Subject: [gpfsug-discuss] mmfind -ls and so forth In-Reply-To: References: Message-ID: (John Hearns, et. al.) Some minor script hacking would be the easiest way add test(s) for other MISC_ATTRIBUTES Notice mmfind concentrates on providing the most popular classic(POSIX) and Linux predicates, BUT also adds a few gpfs specific predicates (mmfind --help show you these) -ea -eaWithValue -gpfsImmut -gpfsAppOnly Look at the implementation of -gpfsImmut in tr_findToPol.pl ... sub tr_gpfsImmut{ return "( /* -gpfsImmut */ MISC_ATTRIBUTES LIKE '%X%')"; } So easy to extend this for any or all the others.... True it's perl, but you don't have to be a perl expert to cut-paste-hack another predicate into the script. Let us know how you make out with this... Perhaps we shall add a general predicate -gpfsMiscAttrLike '...' to the next version... -- Marc K of GPFS From: John Hearns To: gpfsug main discussion list Date: 03/08/2018 04:59 AM Subject: Re: [gpfsug-discuss] mmfind -ls and so forth Sent by: gpfsug-discuss-bounces at spectrumscale.org On the subject of mmfind, I would like to find files which have the misc attributes relevant to AFM. For instance files which have the attribute ?v? The file is newly created, not yet copied to home I can write a policy to do this, and I have a relevant policy written. However I would like to do this using mmfind, which seems a nice general utility. This syntax does not work: mmfind /hpc -eaWithValue MISC_ATTRIBUTES===v Before anyone says it, I am mixing up MISC_ATTRIBUTES and extended attributes! My question really is ? has anyone done this sort of serch using mmfind? Thankyou From: gpfsug-discuss-bounces at spectrumscale.org [ mailto:gpfsug-discuss-bounces at spectrumscale.org] On Behalf Of Marc A Kaplan Sent: Wednesday, March 07, 2018 8:16 PM To: gpfsug main discussion list Subject: Re: [gpfsug-discuss] mmfind -ls and so forth As always when dealing with computers and potentially long running jobs, run a test on a handful of files first, so you can rapidly debug. Did you read the mmfind.README ? It mentions...that this sample utility "some user assembly required..." ... 
mmfindUtil_processOutputFile.c A utility to parse the "list file" produced by mmapplypolicy and to print it in a find-compatible format mmfind invokes it once mmapplypolicy begins to populate the "list file" mmfindUtil_processOutputFile.sampleMakefile copy to 'makefile', modify as needed, and run 'make' to compile mmfindUtil_processOutputFile.c This should produce a binary called mmfindUtil_processOutputFile mmfind will not be able to run until this utility has been compiled on the node from which you launch mmfind. Works for me... [root at n2 ilm]# ./mmfind /goo/zdbig -ls 2463649 256 drwxr-xr-x 2 root root 262144 Feb 9 11:41 /goo/zdbig 6804497 0 -rw-r--r-- 1 root root 0 Feb 9 11:41 /goo/zdbig/xy -- The information contained in this communication and any attachments is confidential and may be privileged, and is for the sole use of the intended recipient(s). Any unauthorized review, use, disclosure or distribution is prohibited. Unless explicitly stated otherwise in the body of this communication or the attachment thereto (if any), the information is provided on an AS-IS basis without any express or implied warranties or liabilities. To the extent you are relying on this information, you are doing so at your own risk. If you are not the intended recipient, please notify the sender immediately by replying to this message and destroy all copies of this message and any attachments. Neither the sender nor the company/group of companies he or she represents shall be liable for the proper and complete transmission of the information contained in this communication, or for any delay in its receipt. _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org https://urldefense.proofpoint.com/v2/url?u=http-3A__gpfsug.org_mailman_listinfo_gpfsug-2Ddiscuss&d=DwICAg&c=jf_iaSHvJObTbx-siA1ZOg&r=cvpnBBH0j41aQy0RPiG2xRL_M8mTc1izuQD3_PmtjZ8&m=LDC-t-w-jkuH2fJZ1lME_JUjzABDz3y90ptTlYWM3rc&s=xrFd1LD5dWq9GogfeOGs9ZCtqoptErjmGfJzD3eXhz4&e= -------------- next part -------------- An HTML attachment was scrubbed... URL: From Kevin.Buterbaugh at Vanderbilt.Edu Thu Mar 8 15:16:10 2018 From: Kevin.Buterbaugh at Vanderbilt.Edu (Buterbaugh, Kevin L) Date: Thu, 8 Mar 2018 15:16:10 +0000 Subject: [gpfsug-discuss] mmfind -ls and so forth In-Reply-To: References: Message-ID: <8D4EED0B-A9F8-46FB-8BA2-359A3CF1C630@vanderbilt.edu> Hi Marc, I test in production ? just kidding. But - not kidding - I did read the entire mmfind.README, compiled the binary as described therein, and read the output of ?mmfind -h?. But what I forgot was that when you run a bash shell script with ?bash -x? it doesn?t show you the redirection you did to a file ? and since the mmfind ran for ~5 days, including over a weekend, and including Monday which I took off from work to have our 16 1/2 year old Siberian Husky put to sleep, I simply forgot that in the script itself I had redirected the output to a file. Stupid of me, I know, but unlike Delusional Donald, I?ll admit my stupid mistakes. Thanks, and sorry. I will try the mmfind as you suggested in your previous response the next time I need to run one to see if that significantly improves the performance? Kevin On Mar 7, 2018, at 1:15 PM, Marc A Kaplan > wrote: As always when dealing with computers and potentially long running jobs, run a test on a handful of files first, so you can rapidly debug. Did you read the mmfind.README ? It mentions...that this sample utility "some user assembly required..." ... 
mmfindUtil_processOutputFile.c A utility to parse the "list file" produced by mmapplypolicy and to print it in a find-compatible format mmfind invokes it once mmapplypolicy begins to populate the "list file" mmfindUtil_processOutputFile.sampleMakefile copy to 'makefile', modify as needed, and run 'make' to compile mmfindUtil_processOutputFile.c This should produce a binary called mmfindUtil_processOutputFile mmfind will not be able to run until this utility has been compiled on the node from which you launch mmfind. Works for me... [root at n2 ilm]# ./mmfind /goo/zdbig -ls 2463649 256 drwxr-xr-x 2 root root 262144 Feb 9 11:41 /goo/zdbig 6804497 0 -rw-r--r-- 1 root root 0 Feb 9 11:41 /goo/zdbig/xy _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org https://na01.safelinks.protection.outlook.com/?url=http%3A%2F%2Fgpfsug.org%2Fmailman%2Flistinfo%2Fgpfsug-discuss&data=02%7C01%7CKevin.Buterbaugh%40vanderbilt.edu%7C7c170869f3294124be3608d5845fdecc%7Cba5a7f39e3be4ab3b45067fa80faecad%7C0%7C0%7C636560469687764985&sdata=yNvpm34DY0AtEm2Y4OIMll5IW1v5kP3X3vHx3sQ%2B8Rs%3D&reserved=0 -------------- next part -------------- An HTML attachment was scrubbed... URL: From oluwasijibomi.saula at ndsu.edu Thu Mar 8 15:06:03 2018 From: oluwasijibomi.saula at ndsu.edu (Saula, Oluwasijibomi) Date: Thu, 8 Mar 2018 15:06:03 +0000 Subject: [gpfsug-discuss] Thoughts on GPFS on IB & MTU sizes Message-ID: Hi Folks, As this is my first post to the group, let me start by saying I applaud the commentary from the user group as it has been a resource to those of us watching from the sidelines. That said, we have a GPFS layered on IPoIB, and recently, we started having some issues on our IB FDR fabric which manifested when GPFS began sending persistent expel messages to particular nodes. Shortly after, we embarked on a tuning exercise using IBM tuning recommendations but this page is quite old and we've run into some snags, specifically with setting 4k MTUs using mlx4_core/mlx4_en module options. While setting 4k MTUs as the guide recommends is our general inclination, I'd like to solicit some advice as to whether 4k MTUs are a good idea and any hitch-free steps to accomplishing this. I'm getting some conflicting remarks from Mellanox support asking why we'd want to use 4k MTUs with Unreliable Datagram mode. Also, any pointers to best practices or resources for network configurations for heavy I/O clusters would be much appreciated. Thanks, Siji Saula HPC System Administrator Center for Computationally Assisted Science & Technology NORTH DAKOTA STATE UNIVERSITY Research 2 Building ? Room 220B Dept 4100, PO Box 6050 / Fargo, ND 58108-6050 p:701.231.7749 www.ccast.ndsu.edu | www.ndsu.edu -------------- next part -------------- An HTML attachment was scrubbed... URL: From christof.schmitt at us.ibm.com Thu Mar 8 17:37:12 2018 From: christof.schmitt at us.ibm.com (Christof Schmitt) Date: Thu, 8 Mar 2018 17:37:12 +0000 Subject: [gpfsug-discuss] wondering about outage free protocols upgrades In-Reply-To: Message-ID: An HTML attachment was scrubbed... URL: From Wei1.Guo at UTSouthwestern.edu Thu Mar 8 21:50:11 2018 From: Wei1.Guo at UTSouthwestern.edu (Wei Guo) Date: Thu, 8 Mar 2018 21:50:11 +0000 Subject: [gpfsug-discuss] Thoughts on GPFS on IB & MTU sizes In-Reply-To: References: Message-ID: <1520545811808.33125@UTSouthwestern.edu> Hi, Saula, Can the expelled node and expelling node ping each other? 
We expanded our gpfs IB network from /24 to /20 but some clients still used /24, they cannot talk to the added new clients using /20 and expelled the new clients persistently. Changing the netmask all to /20 works out. FYI. Wei Guo HPC Administartor UT Southwestern Medical Center wei1.guo at utsouthwestern.edu ________________________________________ From: gpfsug-discuss-bounces at spectrumscale.org on behalf of gpfsug-discuss-request at spectrumscale.org Sent: Thursday, March 8, 2018 11:37 AM To: gpfsug-discuss at spectrumscale.org Subject: gpfsug-discuss Digest, Vol 74, Issue 17 Send gpfsug-discuss mailing list submissions to gpfsug-discuss at spectrumscale.org To subscribe or unsubscribe via the World Wide Web, visit http://gpfsug.org/mailman/listinfo/gpfsug-discuss or, via email, send a message with subject or body 'help' to gpfsug-discuss-request at spectrumscale.org You can reach the person managing the list at gpfsug-discuss-owner at spectrumscale.org When replying, please edit your Subject line so it is more specific than "Re: Contents of gpfsug-discuss digest..." Today's Topics: 1. Thoughts on GPFS on IB & MTU sizes (Saula, Oluwasijibomi) 2. Re: wondering about outage free protocols upgrades (Christof Schmitt) ---------------------------------------------------------------------- Message: 1 Date: Thu, 8 Mar 2018 15:06:03 +0000 From: "Saula, Oluwasijibomi" To: "gpfsug-discuss at spectrumscale.org" Subject: [gpfsug-discuss] Thoughts on GPFS on IB & MTU sizes Message-ID: Content-Type: text/plain; charset="windows-1252" Hi Folks, As this is my first post to the group, let me start by saying I applaud the commentary from the user group as it has been a resource to those of us watching from the sidelines. That said, we have a GPFS layered on IPoIB, and recently, we started having some issues on our IB FDR fabric which manifested when GPFS began sending persistent expel messages to particular nodes. Shortly after, we embarked on a tuning exercise using IBM tuning recommendations but this page is quite old and we've run into some snags, specifically with setting 4k MTUs using mlx4_core/mlx4_en module options. While setting 4k MTUs as the guide recommends is our general inclination, I'd like to solicit some advice as to whether 4k MTUs are a good idea and any hitch-free steps to accomplishing this. I'm getting some conflicting remarks from Mellanox support asking why we'd want to use 4k MTUs with Unreliable Datagram mode. Also, any pointers to best practices or resources for network configurations for heavy I/O clusters would be much appreciated. Thanks, Siji Saula HPC System Administrator Center for Computationally Assisted Science & Technology NORTH DAKOTA STATE UNIVERSITY Research 2 Building ? Room 220B Dept 4100, PO Box 6050 / Fargo, ND 58108-6050 p:701.231.7749 www.ccast.ndsu.edu | www.ndsu.edu -------------- next part -------------- An HTML attachment was scrubbed... URL: ------------------------------ Message: 2 Date: Thu, 8 Mar 2018 17:37:12 +0000 From: "Christof Schmitt" To: gpfsug-discuss at spectrumscale.org Subject: Re: [gpfsug-discuss] wondering about outage free protocols upgrades Message-ID: Content-Type: text/plain; charset="us-ascii" An HTML attachment was scrubbed... 
URL: ------------------------------ _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss End of gpfsug-discuss Digest, Vol 74, Issue 17 ********************************************** ________________________________ UT Southwestern Medical Center The future of medicine, today. From Greg.Lehmann at csiro.au Fri Mar 9 00:23:10 2018 From: Greg.Lehmann at csiro.au (Greg.Lehmann at csiro.au) Date: Fri, 9 Mar 2018 00:23:10 +0000 Subject: [gpfsug-discuss] wondering about outage free protocols upgrades In-Reply-To: References: <766f8a8ce5f84164af13f2361ef3b3c3@exch1-cdc.nexus.csiro.au>, <5912dbafbb044700ba723195a0e5e2f9@exch1-cdc.nexus.csiro.au> Message-ID: <2b7547fd8aec467a958d8e10e88bd1e6@exch1-cdc.nexus.csiro.au> That last little bit ?not available today? gives me hope. It would be nice to get there ?one day.? Our situation is we are using NFS for access to images that VMs run from. An outage means shutting down a lot of guests. An NFS outage of even short duration would result in the system disks of VMs going read only due to IO timeouts. Cheers, Greg From: gpfsug-discuss-bounces at spectrumscale.org [mailto:gpfsug-discuss-bounces at spectrumscale.org] On Behalf Of Christof Schmitt Sent: Thursday, 8 March 2018 7:54 AM To: gpfsug-discuss at spectrumscale.org Subject: Re: [gpfsug-discuss] wondering about outage free protocols upgrades The problem with the SMB upgrade is with the data shared between the protocol nodes. It is not tied to the protocol version used between SMB clients and the protocol nodes. Samba stores internal data (e.g. for the SMB state of open files) in tdb database files. ctdb then makes these tdb databases available across all protocol nodes. A concurrent upgrade for SMB would require correct handling of ctdb communications and the tdb records across multiple versions; that is not available today. Regards, Christof Schmitt || IBM || Spectrum Scale Development || Tucson, AZ christof.schmitt at us.ibm.com || +1-520-799-2469 (T/L: 321-2469) ----- Original message ----- From: > Sent by: gpfsug-discuss-bounces at spectrumscale.org To: > Cc: Subject: Re: [gpfsug-discuss] wondering about outage free protocols upgrades Date: Wed, Mar 7, 2018 4:35 AM In theory it only affects SMB, but in practice if NFS depends on winbind for authorisation then it is affected too. I can understand the need for changes to happen every so often and that maybe outages will be required then. But, I would like to see some effort to avoid doing this unnecessarily. IBM, please consider my suggestion. The message I get from the ctdb service implies it is the sticking point. Can some consideration be given to keeping the ctdb version compatible between releases? Christof, you are saying something about the SMB service version compatibility. I am unclear as to whether you are talking about the Spectrum Scale Protocols SMB service or the default samba SMB over the wire protocol version being used to communicate between client and server. If the latter, is it possible to peg the version to the older version manually while doing the upgrade so that all nodes can be updated? You can then take an outage at a later time to update the over the wire version. 
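To make the knowledge-center procedure Christof points at a little more concrete, switch-over day for SMB boils down to something like the sketch below, assuming half of the protocol nodes ('newnode*') already carry the new gpfs.smb packages while suspended. Node names are placeholders and the mmces spellings are from memory, so follow the documented procedure rather than this outline:

# Stop SMB on the half still running the old Samba level and suspend those nodes:
mmces service stop SMB -N oldnode1,oldnode2
mmces node suspend -N oldnode1,oldnode2

# Resume the already-upgraded half and start SMB there on the new level:
mmces node resume -N newnode1,newnode2
mmces service start SMB -N newnode1,newnode2

# Finally upgrade gpfs.smb on oldnode1/oldnode2 and bring them back the same way.

The client-visible outage is only the window between the stop and the start, but it is still an outage, which is Greg's point about VM guests on NFS.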
From: gpfsug-discuss-bounces at spectrumscale.org [mailto:gpfsug-discuss-bounces at spectrumscale.org] On Behalf Of Christof Schmitt Sent: Wednesday, 7 March 2018 4:50 AM To: gpfsug-discuss at spectrumscale.org Cc: gpfsug-discuss at spectrumscale.org Subject: Re: [gpfsug-discuss] wondering about outage free protocols upgrades Hi, at this point there are no plans to support "node by node" upgrade for SMB. Some background: The technical reason for this restriction is that the records shared between protocol nodes for the SMB service (ctdb and Samba) are not versioned and no mechanism is in place to handle different versions. Changing this would be a large development task that has not been included in any current plans. Note that this only affects the SMB service and that the knowledge center outlines a procedure to minimize the outage, by getting half of the protocol nodes ready with the new Samba version and then only taking a brief outage when switching from the "old" to the "new" Samba version: https://www.ibm.com/support/knowledgecenter/en/STXKQY_4.2.2/com.ibm.spectrum.scale.v4r22.doc/bl1ins_updatingsmb.htm The toolkit follows the same approach during an upgrade to minimize the outage. We know that this is not ideal, but as mentioned above this is limited by the large effort that would be required which has to be weighed against other requirements and priorities. Regards, Christof Schmitt || IBM || Spectrum Scale Development || Tucson, AZ christof.schmitt at us.ibm.com || +1-520-799-2469 (T/L: 321-2469) ----- Original message ----- From: > Sent by: gpfsug-discuss-bounces at spectrumscale.org To: > Cc: Subject: [gpfsug-discuss] wondering about outage free protocols upgrades Date: Tue, Mar 6, 2018 10:19 AM Hi All, It appears a rolling node by node upgrade of a protocols cluster is not possible. Ctdb is the sticking point as it won?t run with 2 different versions at the same time. Are there any plans to address this and make it a real Enterprise product? Cheers, Greg _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org https://urldefense.proofpoint.com/v2/url?u=http-3A__gpfsug.org_mailman_listinfo_gpfsug-2Ddiscuss&d=DwICAg&c=jf_iaSHvJObTbx-siA1ZOg&r=5Nn7eUPeYe291x8f39jKybESLKv_W_XtkTkS8fTR-NI&m=p5fg7X1tKGwi1BsYiw-wHTxmaG-PLihwHV0yTBQNaUs&s=3ZHS5vAoxeC6ikuOpTRLWNTpvgKEC3thI-qUgyU_hYo&e= _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org https://urldefense.proofpoint.com/v2/url?u=http-3A__gpfsug.org_mailman_listinfo_gpfsug-2Ddiscuss&d=DwICAg&c=jf_iaSHvJObTbx-siA1ZOg&r=5Nn7eUPeYe291x8f39jKybESLKv_W_XtkTkS8fTR-NI&m=7bQRypv0JL7swEYobmepaynWFvV8HYtBa2_S1kAyirk&s=2bUeeDk-8VbtwU8KtV4RlENEcbpOr_GQJQvR_8gt-ug&e= -------------- next part -------------- An HTML attachment was scrubbed... URL: From willi.engeli at id.ethz.ch Fri Mar 9 12:21:27 2018 From: willi.engeli at id.ethz.ch (Engeli Willi (ID SD)) Date: Fri, 9 Mar 2018 12:21:27 +0000 Subject: [gpfsug-discuss] asking for your vote for an RFE to support NFS V4.1 Message-ID: Hello Group, I?ve just created a request for enhancement (RFE) to have ganesha supporting NFS V4.1. It is important, to have this new Protocol version supported, since our Linux clients default support is more then 80% based in this version by default and Linux distributions are actively pushing this Protocol. The protocol also brings important corrections and enhancements with it. I would like to ask you all very kindly to vote for this RFE please. 
You find it here: https://www.ibm.com/developerworks/rfe/execute Headline:NFS V4.1 Support ID:117398 Freundliche Gr?sse Willi Engeli ETH Zuerich ID Speicherdienste Weinbergstrasse 11 WEC C 18 8092 Zuerich Tel: +41 44 632 02 69 -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: smime.p7s Type: application/pkcs7-signature Size: 5461 bytes Desc: not available URL: From jonathan.buzzard at strath.ac.uk Fri Mar 9 12:37:22 2018 From: jonathan.buzzard at strath.ac.uk (Jonathan Buzzard) Date: Fri, 09 Mar 2018 12:37:22 +0000 Subject: [gpfsug-discuss] wondering about outage free protocols upgrades In-Reply-To: References: <766f8a8ce5f84164af13f2361ef3b3c3@exch1-cdc.nexus.csiro.au> , <5912dbafbb044700ba723195a0e5e2f9@exch1-cdc.nexus.csiro.au> Message-ID: <1520599042.1554.1.camel@strath.ac.uk> On Thu, 2018-03-08 at 09:41 +0000, Sobey, Richard A wrote: > Whether or not you meant it your words ?that is not available today.? > Implies that something is coming in the future? Would you be reliant > on the Samba/CTDB development team or would you roll your own.. > supposing it?s possible in the first place. > ? Back in the day when one had to roll your own Samba for this stuff, rolling Samba upgrades worked. What changed or was it never supported? JAB. -- Jonathan A. Buzzard Tel: +44141-5483420 HPC System Administrator, ARCHIE-WeSt. University of Strathclyde, John Anderson Building, Glasgow. G4 0NG From stijn.deweirdt at ugent.be Fri Mar 9 12:42:50 2018 From: stijn.deweirdt at ugent.be (Stijn De Weirdt) Date: Fri, 9 Mar 2018 13:42:50 +0100 Subject: [gpfsug-discuss] asking for your vote for an RFE to support NFS V4.1 In-Reply-To: References: Message-ID: <8e5684b5-8981-0710-c5ca-693f6ad56890@ugent.be> hi all, i would second this request to upvote this. the fact that 4.1 support was dropped in a subsubminor update (4.2.3.5 to 4.3.26 afaik) was already pretty bad to discover, but at the very least there should be an option to reenable it. i'm also interested why this was removed (or actively prevented to enable). i can understand that eg pnfs is not support, bu basic protocol features wrt HA are a must have. only with 4.1 are we able to do ces+ganesha failover without IO error, something that should be basic feature nowadays. stijn On 03/09/2018 01:21 PM, Engeli Willi (ID SD) wrote: > Hello Group, > > I?ve just created a request for enhancement (RFE) to have ganesha supporting > NFS V4.1. > > It is important, to have this new Protocol version supported, since our > Linux clients default support is more then 80% based in this version by > default and Linux distributions are actively pushing this Protocol. > > The protocol also brings important corrections and enhancements with it. > > > > I would like to ask you all very kindly to vote for this RFE please. 
> > You find it here: https://www.ibm.com/developerworks/rfe/execute > > Headline:NFS V4.1 Support > > ID:117398 > > > > > > Freundliche Gr?sse > > > > Willi Engeli > > ETH Zuerich > > ID Speicherdienste > > Weinbergstrasse 11 > > WEC C 18 > > 8092 Zuerich > > > > Tel: +41 44 632 02 69 > > > > > > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > From Marcelo.Garcia at EMEA.NEC.COM Fri Mar 9 12:51:22 2018 From: Marcelo.Garcia at EMEA.NEC.COM (Marcelo Garcia) Date: Fri, 9 Mar 2018 12:51:22 +0000 Subject: [gpfsug-discuss] asking for your vote for an RFE to support NFS V4.1 In-Reply-To: References: Message-ID: Hi I got the following error when trying the URL below: {e: 'Exception usecase string is null'} Regards mg. From: gpfsug-discuss-bounces at spectrumscale.org [mailto:gpfsug-discuss-bounces at spectrumscale.org] On Behalf Of Engeli Willi (ID SD) Sent: Freitag, 9. M?rz 2018 13:21 To: gpfsug-discuss at spectrumscale.org Subject: [gpfsug-discuss] asking for your vote for an RFE to support NFS V4.1 Hello Group, I've just created a request for enhancement (RFE) to have ganesha supporting NFS V4.1. It is important, to have this new Protocol version supported, since our Linux clients default support is more then 80% based in this version by default and Linux distributions are actively pushing this Protocol. The protocol also brings important corrections and enhancements with it. I would like to ask you all very kindly to vote for this RFE please. You find it here: https://www.ibm.com/developerworks/rfe/execute Headline:NFS V4.1 Support ID:117398 Freundliche Gr?sse Willi Engeli ETH Zuerich ID Speicherdienste Weinbergstrasse 11 WEC C 18 8092 Zuerich Tel: +41 44 632 02 69 -------------- next part -------------- An HTML attachment was scrubbed... URL: From stijn.deweirdt at ugent.be Fri Mar 9 14:09:59 2018 From: stijn.deweirdt at ugent.be (Stijn De Weirdt) Date: Fri, 9 Mar 2018 15:09:59 +0100 Subject: [gpfsug-discuss] asking for your vote for an RFE to support NFS V4.1 In-Reply-To: References: Message-ID: <9091b9ef-b734-bfe0-fa19-57559b5866cf@ugent.be> hi marcelo, can you try https://www.ibm.com/developerworks/rfe/execute?use_case=viewRfe&CR_ID=117398 stijn On 03/09/2018 01:51 PM, Marcelo Garcia wrote: > Hi > > I got the following error when trying the URL below: > {e: 'Exception usecase string is null'} > > Regards > > mg. > > > From: gpfsug-discuss-bounces at spectrumscale.org [mailto:gpfsug-discuss-bounces at spectrumscale.org] On Behalf Of Engeli Willi (ID SD) > Sent: Freitag, 9. M?rz 2018 13:21 > To: gpfsug-discuss at spectrumscale.org > Subject: [gpfsug-discuss] asking for your vote for an RFE to support NFS V4.1 > > Hello Group, > I've just created a request for enhancement (RFE) to have ganesha supporting NFS V4.1. > It is important, to have this new Protocol version supported, since our Linux clients default support is more then 80% based in this version by default and Linux distributions are actively pushing this Protocol. > The protocol also brings important corrections and enhancements with it. > > I would like to ask you all very kindly to vote for this RFE please. 
> You find it here: https://www.ibm.com/developerworks/rfe/execute > > Headline:NFS V4.1 Support > > ID:117398 > > > Freundliche Gr?sse > > Willi Engeli > ETH Zuerich > ID Speicherdienste > Weinbergstrasse 11 > WEC C 18 > 8092 Zuerich > > Tel: +41 44 632 02 69 > > > > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > From Marcelo.Garcia at EMEA.NEC.COM Fri Mar 9 14:11:35 2018 From: Marcelo.Garcia at EMEA.NEC.COM (Marcelo Garcia) Date: Fri, 9 Mar 2018 14:11:35 +0000 Subject: [gpfsug-discuss] asking for your vote for an RFE to support NFS V4.1 In-Reply-To: <9091b9ef-b734-bfe0-fa19-57559b5866cf@ugent.be> References: <9091b9ef-b734-bfe0-fa19-57559b5866cf@ugent.be> Message-ID: Hi stijn Now it's working. Cheers m. -----Original Message----- From: gpfsug-discuss-bounces at spectrumscale.org [mailto:gpfsug-discuss-bounces at spectrumscale.org] On Behalf Of Stijn De Weirdt Sent: Freitag, 9. M?rz 2018 15:10 To: gpfsug-discuss at spectrumscale.org Subject: Re: [gpfsug-discuss] asking for your vote for an RFE to support NFS V4.1 hi marcelo, can you try https://www.ibm.com/developerworks/rfe/execute?use_case=viewRfe&CR_ID=117398 stijn On 03/09/2018 01:51 PM, Marcelo Garcia wrote: > Hi > > I got the following error when trying the URL below: > {e: 'Exception usecase string is null'} > > Regards > > mg. > > > From: gpfsug-discuss-bounces at spectrumscale.org [mailto:gpfsug-discuss-bounces at spectrumscale.org] On Behalf Of Engeli Willi (ID SD) > Sent: Freitag, 9. M?rz 2018 13:21 > To: gpfsug-discuss at spectrumscale.org > Subject: [gpfsug-discuss] asking for your vote for an RFE to support NFS V4.1 > > Hello Group, > I've just created a request for enhancement (RFE) to have ganesha supporting NFS V4.1. > It is important, to have this new Protocol version supported, since our Linux clients default support is more then 80% based in this version by default and Linux distributions are actively pushing this Protocol. > The protocol also brings important corrections and enhancements with it. > > I would like to ask you all very kindly to vote for this RFE please. > You find it here: https://www.ibm.com/developerworks/rfe/execute > > Headline:NFS V4.1 Support > > ID:117398 > > > Freundliche Gr?sse > > Willi Engeli > ETH Zuerich > ID Speicherdienste > Weinbergstrasse 11 > WEC C 18 > 8092 Zuerich > > Tel: +41 44 632 02 69 > > > > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss Click https://www.mailcontrol.com/sr/NavEVlEkpX3GX2PQPOmvUqrlA1!9RTN2ec8I4RU35plgh6Q4vQM4vfVPrCpIvwaSEkP!v72X8H9IWrzEXY2ZCw== to report this email as spam. From ewahl at osc.edu Fri Mar 9 14:19:10 2018 From: ewahl at osc.edu (Edward Wahl) Date: Fri, 9 Mar 2018 09:19:10 -0500 Subject: [gpfsug-discuss] Thoughts on GPFS on IB & MTU sizes In-Reply-To: References: Message-ID: <20180309091910.0334604a@osc.edu> Welcome to the list. If Hideaki Kikuchi is still around CCAST, say "Oh hisashiburi, des ne?" for me. Though I recall he may have left. A couple of questions as I, unfortunately, have a good deal of expel experience. -Are you set up to use verbs or only IPoIB? "mmlsconfig verbsRdma" -Are you using the IB as the administrative IP network? 
-As Wei asked, can nodes sending the expel requests ping the victim over whatever interface is being used administratively? Other interfaces do NOT matter for expels. Nodes that cannot even mount the file systems can still request expels. Many many things can cause issues here from routing and firewalls to bad switch software which will not update ARP tables, and you get nodes trying to expel each other. -are your NSDs logging the expels in /tmp/mmfs? You can mmchconfig expelDataCollectionDailyLimit if you need more captures to narrow down what is happening outside the mmfs.log.latest. Just be wary of the disk space if you have "expel storms". -That tuning page is very out of date and appears to be mostly focused on GPFS 3.5.x tuning. While there is also a Spectrum Scale wiki, it's Linux tuning page does not appear to be kernel and network focused and is dated even older. Ed On Thu, 8 Mar 2018 15:06:03 +0000 "Saula, Oluwasijibomi" wrote: > Hi Folks, > > > As this is my first post to the group, let me start by saying I applaud the > commentary from the user group as it has been a resource to those of us > watching from the sidelines. > > > That said, we have a GPFS layered on IPoIB, and recently, we started having > some issues on our IB FDR fabric which manifested when GPFS began sending > persistent expel messages to particular nodes. > > > Shortly after, we embarked on a tuning exercise using IBM tuning > recommendations > but this page is quite old and we've run into some snags, specifically with > setting 4k MTUs using mlx4_core/mlx4_en module options. > > > While setting 4k MTUs as the guide recommends is our general inclination, I'd > like to solicit some advice as to whether 4k MTUs are a good idea and any > hitch-free steps to accomplishing this. I'm getting some conflicting remarks > from Mellanox support asking why we'd want to use 4k MTUs with Unreliable > Datagram mode. > > > Also, any pointers to best practices or resources for network configurations > for heavy I/O clusters would be much appreciated. > > > Thanks, > > Siji Saula > HPC System Administrator > Center for Computationally Assisted Science & Technology > NORTH DAKOTA STATE UNIVERSITY > > > Research 2 > Building > ? Room 220B Dept 4100, PO Box 6050 / Fargo, ND 58108-6050 p:701.231.7749 > www.ccast.ndsu.edu | > www.ndsu.edu > -- Ed Wahl Ohio Supercomputer Center 614-292-9302 From christof.schmitt at us.ibm.com Fri Mar 9 16:16:41 2018 From: christof.schmitt at us.ibm.com (Christof Schmitt) Date: Fri, 9 Mar 2018 16:16:41 +0000 Subject: [gpfsug-discuss] wondering about outage free protocols upgrades In-Reply-To: <1520599042.1554.1.camel@strath.ac.uk> References: <1520599042.1554.1.camel@strath.ac.uk>, <766f8a8ce5f84164af13f2361ef3b3c3@exch1-cdc.nexus.csiro.au>, <5912dbafbb044700ba723195a0e5e2f9@exch1-cdc.nexus.csiro.au> Message-ID: An HTML attachment was scrubbed... URL: From oluwasijibomi.saula at ndsu.edu Sat Mar 10 14:29:33 2018 From: oluwasijibomi.saula at ndsu.edu (Saula, Oluwasijibomi) Date: Sat, 10 Mar 2018 14:29:33 +0000 Subject: [gpfsug-discuss] Thoughts on GPFS on IB & MTU sizes In-Reply-To: <20180309091910.0334604a@osc.edu> References: , <20180309091910.0334604a@osc.edu> Message-ID: Wei - So the expelled node could ping the rest of the cluster just fine. In fact, after adding this new node to the cluster I could traverse the filesystem for simple lookups, however, heavy data moves in or out of the filesystem seemed to trigger the expel messages to the new node. 
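Before tuning further, it is worth capturing what each node is actually doing at the verbs/IPoIB level, along the lines of Ed's checklist above. A minimal sketch, assuming the IPoIB interface is named ib0 (adjust interface and device names for your environment):

# Is verbs RDMA enabled, and over which ports/fabric?
mmlsconfig verbsRdma
mmlsconfig verbsPorts

# IPoIB transport mode ('datagram'/UD vs 'connected'/RC) and the MTU in use:
cat /sys/class/net/ib0/mode
ip link show ib0

# Link state and active IB MTU (2048 vs 4096) as seen by the HCA:
ibv_devinfo | grep -E 'state|active_mtu'

Comparing that output across the cluster makes it obvious which nodes sit at 2044 versus 65520 and whether the transport modes match, which is the mix described below.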
This experience prompted my tunning exercise on the node and has since resolved the expel messages to node even during times of high I/O activity. Nevertheless, I still have this nagging feeling that the IPoIB tuning for GPFS may not be optimal. To answer your questions, Ed - IB supports both administrative and daemon communications, and we have verbsRdma configured. Currently, we have both 2044 and 65520 MTU nodes on our IB network and I've been told this should not be the case. I'm hoping to settle on 4096 MTU nodes for the entire cluster but I fear there may be some caveats - any thoughts on this? (Oh, Ed - Hideaki was my mentor for a short while when I began my HPC career with NDSU but he left us shortly after. Maybe like you I can tune up my Japanese as well once my GPFS issues are put to rest! ? ) Thanks, Siji Saula HPC System Administrator Center for Computationally Assisted Science & Technology NORTH DAKOTA STATE UNIVERSITY Research 2 Building ? Room 220B Dept 4100, PO Box 6050 / Fargo, ND 58108-6050 p:701.231.7749 www.ccast.ndsu.edu | www.ndsu.edu ________________________________ From: Edward Wahl Sent: Friday, March 9, 2018 8:19:10 AM To: gpfsug-discuss at spectrumscale.org Cc: Saula, Oluwasijibomi Subject: Re: [gpfsug-discuss] Thoughts on GPFS on IB & MTU sizes Welcome to the list. If Hideaki Kikuchi is still around CCAST, say "Oh hisashiburi, des ne?" for me. Though I recall he may have left. A couple of questions as I, unfortunately, have a good deal of expel experience. -Are you set up to use verbs or only IPoIB? "mmlsconfig verbsRdma" -Are you using the IB as the administrative IP network? -As Wei asked, can nodes sending the expel requests ping the victim over whatever interface is being used administratively? Other interfaces do NOT matter for expels. Nodes that cannot even mount the file systems can still request expels. Many many things can cause issues here from routing and firewalls to bad switch software which will not update ARP tables, and you get nodes trying to expel each other. -are your NSDs logging the expels in /tmp/mmfs? You can mmchconfig expelDataCollectionDailyLimit if you need more captures to narrow down what is happening outside the mmfs.log.latest. Just be wary of the disk space if you have "expel storms". -That tuning page is very out of date and appears to be mostly focused on GPFS 3.5.x tuning. While there is also a Spectrum Scale wiki, it's Linux tuning page does not appear to be kernel and network focused and is dated even older. Ed On Thu, 8 Mar 2018 15:06:03 +0000 "Saula, Oluwasijibomi" wrote: > Hi Folks, > > > As this is my first post to the group, let me start by saying I applaud the > commentary from the user group as it has been a resource to those of us > watching from the sidelines. > > > That said, we have a GPFS layered on IPoIB, and recently, we started having > some issues on our IB FDR fabric which manifested when GPFS began sending > persistent expel messages to particular nodes. > > > Shortly after, we embarked on a tuning exercise using IBM tuning > recommendations > but this page is quite old and we've run into some snags, specifically with > setting 4k MTUs using mlx4_core/mlx4_en module options. > > > While setting 4k MTUs as the guide recommends is our general inclination, I'd > like to solicit some advice as to whether 4k MTUs are a good idea and any > hitch-free steps to accomplishing this. I'm getting some conflicting remarks > from Mellanox support asking why we'd want to use 4k MTUs with Unreliable > Datagram mode. 
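On the 4K-MTU-with-UD question quoted just above: in datagram mode the IPoIB MTU is capped by the InfiniBand path MTU, so moving the fabric from 2048 to 4096 is what lets IPoIB go from 2044 to 4092 bytes. A rough sketch of the pieces that usually have to agree is below; the opensm partition syntax, the file paths and the mlx4_core set_4k_mtu option are from memory for mlx4-era hardware, so check your OFED and subnet-manager documentation before relying on them:

# 1) Subnet manager: allow 4K MTU on the IPoIB partition (mtu=5 encodes 4096).
#    Path varies; /etc/opensm/partitions.conf is typical for opensm on RHEL.
echo 'Default=0x7fff, ipoib, mtu=5 : ALL=full;' >> /etc/opensm/partitions.conf

# 2) HCA driver (ConnectX-2/3, mlx4): enable 4K MTU support and reload the module.
echo 'options mlx4_core set_4k_mtu=1' > /etc/modprobe.d/mlx4-4k-mtu.conf

# 3) Once the fabric and driver agree, datagram-mode IPoIB can be raised to
#    4092 (4096 minus the 4-byte IPoIB header):
ip link set ib0 mtu 4092

Whether the cluster then standardises on 4092 in datagram mode or 65520 in connected mode matters less than every node on the IPoIB subnet agreeing on one of them.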
> > > Also, any pointers to best practices or resources for network configurations > for heavy I/O clusters would be much appreciated. > > > Thanks, > > Siji Saula > HPC System Administrator > Center for Computationally Assisted Science & Technology > NORTH DAKOTA STATE UNIVERSITY > > > Research 2 > Building > ? Room 220B Dept 4100, PO Box 6050 / Fargo, ND 58108-6050 p:701.231.7749 > www.ccast.ndsu.edu | > www.ndsu.edu > -- Ed Wahl Ohio Supercomputer Center 614-292-9302 -------------- next part -------------- An HTML attachment was scrubbed... URL: From Wei1.Guo at UTSouthwestern.edu Sat Mar 10 16:31:36 2018 From: Wei1.Guo at UTSouthwestern.edu (Wei Guo) Date: Sat, 10 Mar 2018 16:31:36 +0000 Subject: [gpfsug-discuss] Thoughts on GPFS on IB & MTU sizes In-Reply-To: References: , <20180309091910.0334604a@osc.edu>, Message-ID: <419134D7E7CA914C.fd9e46bf-5e64-4c46-9022-56a976185334@mail.outlook.com> Hi, Saula, This sounds like the problem with the jumbo frame. Ping or metadata query use small packets, so any time you can ping or ls file. However, data transferring use large packets, the MTU size. Your MTU 65536 nodes send out large packets, but they get dropped to the 2044 nodes, because the packet size cannot fit in 2044 size limit. The reverse is ok. I think the gpfs client nodes always communicate with each other to sync the sdr repo files, or other user job mpi communications if there are any. I think all the nodes should agree on a single MTU. I guess ipoib supports up to 4096. I might missed your Ethernet network switch part whether jumbo frame is enabled or not, if you are using any. Wei Guo On Sat, Mar 10, 2018 at 8:29 AM -0600, "Saula, Oluwasijibomi" > wrote: Wei - So the expelled node could ping the rest of the cluster just fine. In fact, after adding this new node to the cluster I could traverse the filesystem for simple lookups, however, heavy data moves in or out of the filesystem seemed to trigger the expel messages to the new node. This experience prompted my tunning exercise on the node and has since resolved the expel messages to node even during times of high I/O activity. Nevertheless, I still have this nagging feeling that the IPoIB tuning for GPFS may not be optimal. To answer your questions, Ed - IB supports both administrative and daemon communications, and we have verbsRdma configured. Currently, we have both 2044 and 65520 MTU nodes on our IB network and I've been told this should not be the case. I'm hoping to settle on 4096 MTU nodes for the entire cluster but I fear there may be some caveats - any thoughts on this? (Oh, Ed - Hideaki was my mentor for a short while when I began my HPC career with NDSU but he left us shortly after. Maybe like you I can tune up my Japanese as well once my GPFS issues are put to rest! ? ) Thanks, Siji Saula HPC System Administrator Center for Computationally Assisted Science & Technology NORTH DAKOTA STATE UNIVERSITY Research 2 Building ? Room 220B Dept 4100, PO Box 6050 / Fargo, ND 58108-6050 p:701.231.7749 www.ccast.ndsu.edu | www.ndsu.edu ________________________________ From: Edward Wahl Sent: Friday, March 9, 2018 8:19:10 AM To: gpfsug-discuss at spectrumscale.org Cc: Saula, Oluwasijibomi Subject: Re: [gpfsug-discuss] Thoughts on GPFS on IB & MTU sizes Welcome to the list. If Hideaki Kikuchi is still around CCAST, say "Oh hisashiburi, des ne?" for me. Though I recall he may have left. A couple of questions as I, unfortunately, have a good deal of expel experience. -Are you set up to use verbs or only IPoIB? 
"mmlsconfig verbsRdma" -Are you using the IB as the administrative IP network? -As Wei asked, can nodes sending the expel requests ping the victim over whatever interface is being used administratively? Other interfaces do NOT matter for expels. Nodes that cannot even mount the file systems can still request expels. Many many things can cause issues here from routing and firewalls to bad switch software which will not update ARP tables, and you get nodes trying to expel each other. -are your NSDs logging the expels in /tmp/mmfs? You can mmchconfig expelDataCollectionDailyLimit if you need more captures to narrow down what is happening outside the mmfs.log.latest. Just be wary of the disk space if you have "expel storms". -That tuning page is very out of date and appears to be mostly focused on GPFS 3.5.x tuning. While there is also a Spectrum Scale wiki, it's Linux tuning page does not appear to be kernel and network focused and is dated even older. Ed On Thu, 8 Mar 2018 15:06:03 +0000 "Saula, Oluwasijibomi" wrote: > Hi Folks, > > > As this is my first post to the group, let me start by saying I applaud the > commentary from the user group as it has been a resource to those of us > watching from the sidelines. > > > That said, we have a GPFS layered on IPoIB, and recently, we started having > some issues on our IB FDR fabric which manifested when GPFS began sending > persistent expel messages to particular nodes. > > > Shortly after, we embarked on a tuning exercise using IBM tuning > recommendations > but this page is quite old and we've run into some snags, specifically with > setting 4k MTUs using mlx4_core/mlx4_en module options. > > > While setting 4k MTUs as the guide recommends is our general inclination, I'd > like to solicit some advice as to whether 4k MTUs are a good idea and any > hitch-free steps to accomplishing this. I'm getting some conflicting remarks > from Mellanox support asking why we'd want to use 4k MTUs with Unreliable > Datagram mode. > > > Also, any pointers to best practices or resources for network configurations > for heavy I/O clusters would be much appreciated. > > > Thanks, > > Siji Saula > HPC System Administrator > Center for Computationally Assisted Science & Technology > NORTH DAKOTA STATE UNIVERSITY > > > Research 2 > Building > ? Room 220B Dept 4100, PO Box 6050 / Fargo, ND 58108-6050 p:701.231.7749 > www.ccast.ndsu.edu | > www.ndsu.edu > -- Ed Wahl Ohio Supercomputer Center 614-292-9302 ________________________________ UT Southwestern Medical Center The future of medicine, today. -------------- next part -------------- An HTML attachment was scrubbed... URL: From aaron.s.knister at nasa.gov Sat Mar 10 16:57:49 2018 From: aaron.s.knister at nasa.gov (Aaron Knister) Date: Sat, 10 Mar 2018 11:57:49 -0500 Subject: [gpfsug-discuss] Thoughts on GPFS on IB & MTU sizes In-Reply-To: <419134D7E7CA914C.fd9e46bf-5e64-4c46-9022-56a976185334@mail.outlook.com> References: <20180309091910.0334604a@osc.edu> <419134D7E7CA914C.fd9e46bf-5e64-4c46-9022-56a976185334@mail.outlook.com> Message-ID: <8fff8715-e67f-b048-f37d-2498c0cac2f7@nasa.gov> I, personally, haven't been burned by mixing UD and RC IPoIB clients on the same fabric but that doesn't mean it can't happen. What I *have* been bitten by a couple times is not having enough entries in the arp cache after bringing a bunch of new nodes online (that made for a long Christmas Eve one year...). You can toggle that via the gc_thresh settings. These settings work for ~3700 nodes (and technically could go much higher). 
net.ipv4.neigh.default.gc_thresh3 = 10240 net.ipv4.neigh.default.gc_thresh2 = 9216 net.ipv4.neigh.default.gc_thresh1 = 8192 It's the kind of thing that will bite you when you expand the cluster and it may make sense that it's exacerbated by metadata operations because those may require initiating connections to many nodes in the cluster which could blow your arp cache. -Aaron On 3/10/18 11:31 AM, Wei Guo wrote: > Hi, Saula, > > This sounds like the problem with the jumbo frame. > > Ping or metadata query use small packets, so any time you can ping or ls > file. > > However, data transferring use large packets, the MTU size. Your MTU > 65536 nodes send out large packets, but they get dropped to the 2044 > nodes, because the packet size cannot fit in 2044 size limit. The > reverse is ok. > > I think the gpfs client nodes always communicate with each other to sync > the sdr repo files, or other user job mpi communications if there are > any. I think all the nodes should agree on a single MTU. I guess ipoib > supports up to 4096. > > I might missed your Ethernet network switch part whether jumbo frame is > enabled or not, if you are using any. > > Wei Guo > > > > > > > On Sat, Mar 10, 2018 at 8:29 AM -0600, "Saula, Oluwasijibomi" > > wrote: > > Wei -? So the expelled node could ping the rest of the cluster just > fine. In fact, after adding this new node to the cluster I could > traverse the filesystem for simple lookups, however, heavy data > moves in or out of the filesystem seemed to trigger the expel > messages to the new node. > > > This experience prompted my?tunning exercise on the node and has > since resolved the expel messages to node even during times of high > I/O activity. > > > Nevertheless, I still have this nagging feeling that the IPoIB > tuning for GPFS may not be optimal. > > > To answer your questions,?Ed - IB supports both administrative and > daemon communications, and we have verbsRdma configured. > > > Currently, we have both 2044 and 65520 MTU nodes on our IB network > and I've been told this should not be the case. I'm hoping to settle > on 4096 MTU nodes for the entire cluster but I fear there may be > some caveats - any thoughts on this? > > > (Oh, Ed - Hideaki was my mentor for a short while when I began my > HPC career with NDSU but he left us shortly after. Maybe like you I > can tune up my Japanese as well once my GPFS issues are put to rest! > ? ) > > > Thanks, > Siji Saula > HPC System Administrator > Center for Computationally Assisted Science & Technology > *NORTH DAKOTA STATE UNIVERSITY* > > Research 2 > Building > ?? > Room 220B > Dept 4100, PO Box 6050? / Fargo, ND 58108-6050 > p:701.231.7749 > www.ccast.ndsu.edu > ?| > www.ndsu.edu > > ------------------------------------------------------------------------ > *From:* Edward Wahl > *Sent:* Friday, March 9, 2018 8:19:10 AM > *To:* gpfsug-discuss at spectrumscale.org > *Cc:* Saula, Oluwasijibomi > *Subject:* Re: [gpfsug-discuss] Thoughts on GPFS on IB & MTU sizes > > Welcome to the list. > > If Hideaki Kikuchi is still around CCAST, say "Oh hisashiburi, des > ne?" for me. > Though I recall he may have left. > > > A couple of questions as I, unfortunately, have a good deal of expel > experience. > > -Are you set up to use verbs or only IPoIB? "mmlsconfig verbsRdma" > > -Are you using the IB as the administrative IP network? > > -As Wei asked, can nodes sending the expel requests ping the victim over > whatever interface is being used administratively?? Other interfaces > do NOT > matter for expels. 
Nodes that cannot even mount the file systems can > still > request expels.? Many many things can cause issues here from routing and > firewalls to bad switch software which will not update ARP tables, > and you get > nodes trying to expel each other. > > -are your NSDs logging the expels in /tmp/mmfs?? You can mmchconfig > expelDataCollectionDailyLimit if you need more captures to narrow > down what is > happening outside the mmfs.log.latest.? Just be wary of the disk > space if you > have "expel storms". > > -That tuning page is very out of date and appears to be mostly > focused on GPFS > 3.5.x tuning.?? While there is also a Spectrum Scale wiki, it's > Linux tuning > page does not appear to be kernel and network focused and is dated > even older. > > > Ed > > > > On Thu, 8 Mar 2018 15:06:03 +0000 > "Saula, Oluwasijibomi" wrote: > > > Hi Folks, > > > > > > As this is my first post to the group, let me start by saying I applaud the > > commentary from the user group as it has been a resource to those of us > > watching from the sidelines. > > > > > > That said, we have a GPFS layered on IPoIB, and recently, we started having > > some issues on our IB FDR fabric which manifested when GPFS began sending > > persistent expel messages to particular nodes. > > > > > > Shortly after, we embarked on a tuning exercise using IBM tuning > > recommendations > > but this page is quite old and we've run into some snags, specifically with > > setting 4k MTUs using mlx4_core/mlx4_en module options. > > > > > > While setting 4k MTUs as the guide recommends is our general inclination, I'd > > like to solicit some advice as to whether 4k MTUs are a good idea and any > > hitch-free steps to accomplishing this. I'm getting some conflicting remarks > > from Mellanox support asking why we'd want to use 4k MTUs with Unreliable > > Datagram mode. > > > > > > Also, any pointers to best practices or resources for network configurations > > for heavy I/O clusters would be much appreciated. > > > > > > Thanks, > > > > Siji Saula > > HPC System Administrator > > Center for Computationally Assisted Science & Technology > > NORTH DAKOTA STATE UNIVERSITY > > > > > > Research 2 > > Building > > ? Room 220B Dept 4100, PO Box 6050? / Fargo, ND 58108-6050 p:701.231.7749 > > www.ccast.ndsu.edu | > > www.ndsu.edu > > > > > > -- > > Ed Wahl > Ohio Supercomputer Center > 614-292-9302 > > > ------------------------------------------------------------------------ > > UTSouthwestern > > Medical Center > > The future of medicine, today. > > > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > -- Aaron Knister NASA Center for Climate Simulation (Code 606.2) Goddard Space Flight Center (301) 286-2776 From aaron.s.knister at nasa.gov Sat Mar 10 21:39:28 2018 From: aaron.s.knister at nasa.gov (Aaron Knister) Date: Sat, 10 Mar 2018 16:39:28 -0500 Subject: [gpfsug-discuss] gpfs.snap taking a really long time (4.2.3.6 efix17) Message-ID: <96bf7c94-f5ee-c046-d835-de500bd20c51@nasa.gov> Hey All, I've noticed after upgrading to 4.1 to 4.2.3.6 efix17 that a gpfs.snap now takes a really long time as in... a *really* long time. Digging into it I can see that the snap command is actually done but the sshd child is left waiting on a sleep process on the clients (a sleep 600 at that). Trying to get 3500 nodes snapped in chunks of 64 nodes that each take 10 minutes looks like it'll take a good 10 hours. 
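If you want to check for the same thing on your own clients, a quick and dirty look for the leftover sleep agents is something like this (just a sketch, exact process names may differ between releases):

# any sleep processes still hanging off the gpfs.snap helper on this node
ps -eo pid,ppid,etime,cmd | grep -E '[g]pfs\.snap|[s]leep 600'
# or walk the process tree under the helper, if pstree is installed
pstree -pal $(pgrep -f '[g]pfs\.snap' | head -n1)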
It seems the trouble is in the runCommand function in gpfs.snap. The function creates a child process to act as a sort of alarm to kill the specified command if it exceeds the timeout. The problem while the alarm process gets killed the kill signal isn't passed to the sleep process (because the sleep command is run as a process inside the "alarm" child shell process). In gpfs.snap changing this: [[ -n $sleepingAgentPid ]] && $kill -9 $sleepingAgentPid > /dev/null 2>&1 to this: [[ -n $sleepingAgentPid ]] && $kill -9 $(findDescendants $sleepingAgentPid) $sleepingAgentPid > /dev/null 2>&1 seems to fix the behavior. I'll open a PMR for this shortly but I'm just wondering if anyone else has seen this. -Aaron -- Aaron Knister NASA Center for Climate Simulation (Code 606.2) Goddard Space Flight Center (301) 286-2776 From aaron.s.knister at nasa.gov Sat Mar 10 21:44:39 2018 From: aaron.s.knister at nasa.gov (Aaron Knister) Date: Sat, 10 Mar 2018 16:44:39 -0500 Subject: [gpfsug-discuss] spontaneous tracing? Message-ID: <3925216a-3146-357a-9aa0-0acc84bdb7b2@nasa.gov> I found myself with a little treat this morning to the tune of tracing running on the entire cluster of 3500 nodes. There were no logs I could find to indicate *why* the tracing had started but it was clear it was initiated by the cluster manager. Some sleuthing (thanks, collectl!) allowed me to figure out that the tracing started as the command: /usr/lpp/mmfs/bin/mmksh /usr/lpp/mmfs/bin/mmcommon notifyOverload _asmgr I thought that running "mmchocnfig deadlockOverloadThreshold=0 -i" would stop this from happening again but lo and behold tracing kicked off *again* (with the same caller) some time later even after setting that parameter. What's odd is there are no log events to indicate an overload occurred. Has anyone seen similar behavior? We're on 4.2.3.6 efix17. -Aaron -- Aaron Knister NASA Center for Climate Simulation (Code 606.2) Goddard Space Flight Center (301) 286-2776 From mnaineni at in.ibm.com Mon Mar 12 09:54:50 2018 From: mnaineni at in.ibm.com (Malahal R Naineni) Date: Mon, 12 Mar 2018 09:54:50 +0000 Subject: [gpfsug-discuss] asking for your vote for an RFE to support NFS V4.1 In-Reply-To: <8e5684b5-8981-0710-c5ca-693f6ad56890@ugent.be> References: <8e5684b5-8981-0710-c5ca-693f6ad56890@ugent.be>, Message-ID: An HTML attachment was scrubbed... URL: From stijn.deweirdt at ugent.be Mon Mar 12 10:01:15 2018 From: stijn.deweirdt at ugent.be (Stijn De Weirdt) Date: Mon, 12 Mar 2018 11:01:15 +0100 Subject: [gpfsug-discuss] asking for your vote for an RFE to support NFS V4.1 In-Reply-To: References: <8e5684b5-8981-0710-c5ca-693f6ad56890@ugent.be> Message-ID: <364c69aa-3964-0c7c-21ae-6f4fd36ab9e8@ugent.be> hi malahal, we already figured that out but were hesitant to share it in case ibm wanted to remove this loophole. but can we assume that manuanlly editing the ganesha.conf and pushing it to ccr is supported? the config file is heavily edited / rewritten when certain mm commands, so we want to make sure we can always do this. it would be even better if the main.conf that is generated/edited by the ccr commands just had an include statement so we can edit another file locally instead of doing mmccr magic. stijn On 03/12/2018 10:54 AM, Malahal R Naineni wrote: > Upstream Ganesha code allows all NFS versions including NFSv4.2. Most Linux > clients were defaulting to NFSv4.0, but now they started using NFS4.1 which IBM > doesn't support. To avoid people accidentally using NFSv4.1, we decided to > remove it by default. 
> We don't support NFSv4.1, so there is no spectrum command to enable NFSv4.1 > support with PTF6. Of course, if you are familiar with mmccr, you can change the > config and let it use NFSv4.1 but any issues with NFS4.1 will go to /dev/null. :-) > You need to add "minor_versions = 0,1;" to NFSv4{} block > in /var/mmfs/ces/nfs-config/gpfs.ganesha.main.conf to allow NFSv4.0 and NFsv4.1, > and make sure you use mmccr command to make this change permanent. > Regards, Malahal. > > ----- Original message ----- > From: Stijn De Weirdt > Sent by: gpfsug-discuss-bounces at spectrumscale.org > To: gpfsug-discuss at spectrumscale.org > Cc: > Subject: Re: [gpfsug-discuss] asking for your vote for an RFE to support NFS > V4.1 > Date: Fri, Mar 9, 2018 6:13 PM > hi all, > > i would second this request to upvote this. the fact that 4.1 support > was dropped in a subsubminor update (4.2.3.5 to 4.3.26 afaik) was > already pretty bad to discover, but at the very least there should be an > option to reenable it. > > i'm also interested why this was removed (or actively prevented to > enable). i can understand that eg pnfs is not support, bu basic protocol > features wrt HA are a must have. > only with 4.1 are we able to do ces+ganesha failover without IO error, > something that should be basic feature nowadays. > > stijn > > On 03/09/2018 01:21 PM, Engeli Willi (ID SD) wrote: > > Hello Group, > > > > I?ve just created a request for enhancement (RFE) to have ganesha supporting > > NFS V4.1. > > > > It is important, to have this new Protocol version supported, since our > > Linux clients default support is more then 80% based in this version by > > default and Linux distributions are actively pushing this Protocol. > > > > The protocol also brings important corrections and enhancements with it. > > > > > > > > I would like to ask you all very kindly to vote for this RFE please. 
> > > > You find it here: https://www.ibm.com/developerworks/rfe/execute > > > > Headline:NFS V4.1 Support > > > > ID:117398 > > > > > > > > > > > > Freundliche Gr?sse > > > > > > > > Willi Engeli > > > > ETH Zuerich > > > > ID Speicherdienste > > > > Weinbergstrasse 11 > > > > WEC C 18 > > > > 8092 Zuerich > > > > > > > > Tel: +41 44 632 02 69 > > > > > > > > > > > > > > _______________________________________________ > > gpfsug-discuss mailing list > > gpfsug-discuss at spectrumscale.org > > > https://urldefense.proofpoint.com/v2/url?u=http-3A__gpfsug.org_mailman_listinfo_gpfsug-2Ddiscuss&d=DwIF-g&c=jf_iaSHvJObTbx-siA1ZOg&r=oaQVLOYto6Ftb8wAbynvIiIdh2UEjHxQByDz70-6a_0&m=yq4xoVKCPWQTqZVp0BgG8fBpXrS2FehGlAua1Eixci4&s=9DJi6qkF4eRc81vv6SlC3gxKL9oJJ4efkktzNaZAnkA&e= > > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > https://urldefense.proofpoint.com/v2/url?u=http-3A__gpfsug.org_mailman_listinfo_gpfsug-2Ddiscuss&d=DwIF-g&c=jf_iaSHvJObTbx-siA1ZOg&r=oaQVLOYto6Ftb8wAbynvIiIdh2UEjHxQByDz70-6a_0&m=yq4xoVKCPWQTqZVp0BgG8fBpXrS2FehGlAua1Eixci4&s=9DJi6qkF4eRc81vv6SlC3gxKL9oJJ4efkktzNaZAnkA&e= > > > > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > From xhejtman at ics.muni.cz Mon Mar 12 14:51:05 2018 From: xhejtman at ics.muni.cz (Lukas Hejtmanek) Date: Mon, 12 Mar 2018 15:51:05 +0100 Subject: [gpfsug-discuss] Preferred NSD Message-ID: <20180312145105.yb24bzo6chlpthwm@ics.muni.cz> Hello, I'm thinking about the following setup: ~ 60 nodes, each with two enterprise NVMe SSDs, FDR IB interconnected I would like to setup shared scratch area using GPFS and those NVMe SSDs. Each SSDs as on NSD. I don't think like 5 or more data/metadata replicas are practical here. On the other hand, multiple node failures is something really expected. Is there a way to instrument that local NSD is strongly preferred to store data? I.e. node failure most probably does not result in unavailable data for the other nodes? Or is there any other recommendation/solution to build shared scratch with GPFS in such setup? (Do not do it including.) -- Luk?? Hejtm?nek From scale at us.ibm.com Mon Mar 12 15:13:00 2018 From: scale at us.ibm.com (IBM Spectrum Scale) Date: Mon, 12 Mar 2018 09:13:00 -0600 Subject: [gpfsug-discuss] spontaneous tracing? In-Reply-To: <3925216a-3146-357a-9aa0-0acc84bdb7b2@nasa.gov> References: <3925216a-3146-357a-9aa0-0acc84bdb7b2@nasa.gov> Message-ID: /usr/lpp/mmfs/bin/mmcommon notifyOverload will not cause tracing to be started. One can verify that using the underlying command being called as shown in the following example with /tmp/n containing node names one each line that will get the notification and the IP address being the file system manager from which the command is issued. /usr/lpp/mmfs/bin/mmsdrcli notifyOverload /tmp/n 1191 192.168.117.131 3 8 The only case that deadlock detection code will initiate tracing is that debugDataControl is set to "heavy" and tracing is not started. Then on deadlock detection tracing is turned on for 20 seconds and turned off. That can be tested using command like /usr/lpp/mmfs/bin/mmsdrcli notifyDeadlock /tmp/n 1191 192.168.117.131 3 8 And then mmfs.log will tell you what's going on. That's not a silent action. 
2018-03-12_10:16:11.243-0400: [N] sdrServ: Received deadlock notification from 192.168.117.131 2018-03-12_10:16:11.243-0400: [N] GPFS will attempt to collect debug data on this node. 2018-03-12_10:16:11.953-0400: [I] Tracing in overwrite mode <== tracing started Trace started: Wait 20 seconds before cut and stop trace 2018-03-12_10:16:37.147-0400: [I] Tracing disabled <== tracing stopped 20 seconds later mmtrace: move /tmp/mmfs/lxtrace.trc.c69bc2xn01.cpu0 /tmp/mmfs/trcfile.2018-03-12_10.16.11.2982.deadlock.c69bc2xn01.cpu0 mmtrace: formatting /tmp/mmfs/trcfile.2018-03-12_10.16.11.2982.deadlock.c69bc2xn01 to /tmp/mmfs/trcrpt.2018-03-12_10.16.11.2982.deadlock.c69bc2xn01.gz > What's odd is there are no log events to indicate an overload occurred. Overload msg is only seen in mmfs.log when debugDataControl is "heavy". mmdiag --deadlock shows overload related info starting from 4.2.3. # mmdiag --deadlock === mmdiag: deadlock === Effective deadlock detection threshold on c69bc2xn01 is 1800 seconds Effective deadlock detection threshold on c69bc2xn01 is 360 seconds for short waiters Cluster c69bc2xn01.gpfs.net is overloaded. The overload index on c69bc2xn01 is 0.01812 <== -------------- next part -------------- An HTML attachment was scrubbed... URL: From aaron.s.knister at nasa.gov Mon Mar 12 15:14:10 2018 From: aaron.s.knister at nasa.gov (Knister, Aaron S. (GSFC-606.2)[COMPUTER SCIENCE CORP]) Date: Mon, 12 Mar 2018 15:14:10 +0000 Subject: [gpfsug-discuss] Preferred NSD In-Reply-To: <20180312145105.yb24bzo6chlpthwm@ics.muni.cz> References: <20180312145105.yb24bzo6chlpthwm@ics.muni.cz> Message-ID: Hi Lukas, Check out FPO mode. That mimics Hadoop?s data placement features. You can have up to 3 replicas both data and metadata but still the downside, though, as you say is the wrong node failures will take your cluster down. You might want to check out something like Excelero?s NVMesh (note: not an endorsement since I can?t give such things) which can create logical volumes across all your NVMe drives. The product has erasure coding on their roadmap. I?m not sure if they?ve released that feature yet but in theory it will give better fault tolerance *and* you?ll get more efficient usage of your SSDs. I?m sure there are other ways to skin this cat too. -Aaron On March 12, 2018 at 10:59:35 EDT, Lukas Hejtmanek wrote: Hello, I'm thinking about the following setup: ~ 60 nodes, each with two enterprise NVMe SSDs, FDR IB interconnected I would like to setup shared scratch area using GPFS and those NVMe SSDs. Each SSDs as on NSD. I don't think like 5 or more data/metadata replicas are practical here. On the other hand, multiple node failures is something really expected. Is there a way to instrument that local NSD is strongly preferred to store data? I.e. node failure most probably does not result in unavailable data for the other nodes? Or is there any other recommendation/solution to build shared scratch with GPFS in such setup? (Do not do it including.) -- Luk?? Hejtm?nek _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From valdis.kletnieks at vt.edu Mon Mar 12 15:18:40 2018 From: valdis.kletnieks at vt.edu (valdis.kletnieks at vt.edu) Date: Mon, 12 Mar 2018 11:18:40 -0400 Subject: [gpfsug-discuss] Preferred NSD In-Reply-To: <20180312145105.yb24bzo6chlpthwm@ics.muni.cz> References: <20180312145105.yb24bzo6chlpthwm@ics.muni.cz> Message-ID: <188417.1520867920@turing-police.cc.vt.edu> On Mon, 12 Mar 2018 15:51:05 +0100, Lukas Hejtmanek said: > I don't think like 5 or more data/metadata replicas are practical here. On the > other hand, multiple node failures is something really expected. Umm.. do I want to ask *why*, out of only 60 nodes, multiple node failures are an expected event - to the point that you're thinking about needing 5 replicas to keep things running? From xhejtman at ics.muni.cz Mon Mar 12 15:23:17 2018 From: xhejtman at ics.muni.cz (Lukas Hejtmanek) Date: Mon, 12 Mar 2018 16:23:17 +0100 Subject: [gpfsug-discuss] Preferred NSD In-Reply-To: <188417.1520867920@turing-police.cc.vt.edu> References: <20180312145105.yb24bzo6chlpthwm@ics.muni.cz> <188417.1520867920@turing-police.cc.vt.edu> Message-ID: <20180312152317.mfc6xhsvthticrlh@ics.muni.cz> On Mon, Mar 12, 2018 at 11:18:40AM -0400, valdis.kletnieks at vt.edu wrote: > On Mon, 12 Mar 2018 15:51:05 +0100, Lukas Hejtmanek said: > > I don't think like 5 or more data/metadata replicas are practical here. On the > > other hand, multiple node failures is something really expected. > > Umm.. do I want to ask *why*, out of only 60 nodes, multiple node > failures are an expected event - to the point that you're thinking > about needing 5 replicas to keep things running? as of my experience with cluster management, we have multiple nodes down on regular basis. (HW failure, SW maintenance and so on.) I'm basically thinking that 2-3 replicas might not be enough while 5 or more are becoming too expensive (both disk space and required bandwidth being scratch space - high i/o load expected). -- Luk?? Hejtm?nek Linux Administrator only because Full Time Multitasking Ninja is not an official job title From mnaineni at in.ibm.com Mon Mar 12 17:41:41 2018 From: mnaineni at in.ibm.com (Malahal R Naineni) Date: Mon, 12 Mar 2018 17:41:41 +0000 Subject: [gpfsug-discuss] asking for your vote for an RFE to support NFS V4.1 In-Reply-To: <364c69aa-3964-0c7c-21ae-6f4fd36ab9e8@ugent.be> References: <364c69aa-3964-0c7c-21ae-6f4fd36ab9e8@ugent.be>, <8e5684b5-8981-0710-c5ca-693f6ad56890@ugent.be> Message-ID: An HTML attachment was scrubbed... URL: From Philipp.Rehs at uni-duesseldorf.de Mon Mar 12 20:09:14 2018 From: Philipp.Rehs at uni-duesseldorf.de (Philipp Helo Rehs) Date: Mon, 12 Mar 2018 21:09:14 +0100 Subject: [gpfsug-discuss] RDMA read error IBV_WC_RETRY_EXC_ERR Message-ID: <4c6f6655-5712-663e-d551-375a42d562d8@uni-duesseldorf.de> Hello, I am reading your mailing-list since some weeks and I am quiete impressed about the knowledge and shared information here. We have a gpfs cluster with 4 nsds and 120 clients on Infiniband. Our NSD-Server have two infiniband ports on seperate cards mlx5_0 and mlx5_1. We have RDMA-CM enabled and ipv6 enabled on all nodes. We have added an IPoIB IP to all interfaces. 
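As a basic sanity check (nothing GPFS-specific, just the libibverbs / infiniband-diags tools), something like the following shows whether the second port is actually active and which MTU it negotiated - the device and port below are only examples:

ibv_devinfo -d mlx5_1 | grep -E 'state|active_mtu|port_lid'
ibstat mlx5_1 1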
But when we enable the second interface we get the following error from all nodes: 2018-03-12_20:49:38.923+0100: [E] VERBS RDMA closed connection to 10.100.0.83 (hilbert83-ib) on mlx5_1 port 1 fabnum 0 due to RDMA read error IBV_WC_RETRY_EXC_ERR index 45 2018-03-12_20:49:38.923+0100: [E] VERBS RDMA rdma read error IBV_WC_RETRY_EXC_ERR to 10.100.0.129 (hilbert129-ib) on mlx5_1 port 1 fabnum 0 vendor_err 129 2018-03-12_20:49:38.923+0100: [E] VERBS RDMA closed connection to 10.100.0.129 (hilbert129-ib) on mlx5_1 port 1 fabnum 0 due to RDMA read error IBV_WC_RETRY_EXC_ERR index 31 2018-03-12_20:49:38.923+0100: [E] VERBS RDMA rdma read error IBV_WC_RETRY_EXC_ERR to 10.100.0.134 (hilbert134-ib) on mlx5_1 port 1 fabnum 0 vendor_err 129 I have read that this issue can happen when verbsRdmasPerConnection is to low. We tried to increase the value and it got better but the problem is not fixed. Current config: minReleaseLevel 4.2.3.0 maxblocksize 16m cipherList AUTHONLY cesSharedRoot /ces ccrEnabled yes failureDetectionTime 40 leaseRecoveryWait 40 [hilbert1-ib,hilbert2-ib] worker1Threads 256 maxReceiverThreads 256 [common] tiebreakerDisks vd3;vd5;vd7 minQuorumNodes 2 verbsLibName libibverbs.so.1 verbsRdma enable verbsRdmasPerNode 256 verbsRdmaSend no scatterBufferSize 262144 pagepool 16g verbsPorts mlx4_0/1 [nsdNodes] verbsPorts mlx5_0/1 mlx5_1/1 [hilbert200-ib,hilbert201-ib,hilbert202-ib,hilbert203-ib,hilbert204-ib,hilbert205-ib,hilbert206-ib] verbsPorts mlx4_0/1 mlx4_1/1 [common] maxMBpS 11200 [common] verbsRdmaCm enable verbsRdmasPerConnection 14 adminMode central Kind regards Philipp Rehs --------------------------- Zentrum f?r Informations- und Medientechnologie Kompetenzzentrum f?r wissenschaftliches Rechnen und Speichern Heinrich-Heine-Universit?t D?sseldorf Universit?tsstr. 1 Raum 25.41.00.51 40225 D?sseldorf / Germany Tel: +49-211-81-15557 From zmance at ucar.edu Mon Mar 12 22:10:06 2018 From: zmance at ucar.edu (Zachary Mance) Date: Mon, 12 Mar 2018 16:10:06 -0600 Subject: [gpfsug-discuss] Problems with remote mount via routed IB In-Reply-To: References: <471B111F-5DAA-4912-829C-9AA75DCB76FA@kit.edu> Message-ID: Since I am testing out remote mounting with EDR IB routers, I'll add to the discussion. In my lab environment I was seeing the same rdma connections being established and then disconnected shortly after. The remote filesystem would eventually mount on the clients, but it look a quite a while (~2mins). Even after mounting, accessing files or any metadata operations would take a while to execute, but eventually it happened. After enabling verbsRdmaCm, everything mounted just fine and in a timely manner. Spectrum Scale was using the librdmacm.so library. I would first double check that you have both clusters able to talk to each other on their IPoIB address, then make sure you enable verbsRdmaCm on both clusters. --------------------------------------------------------------------------------------------------------------- Zach Mance zmance at ucar.edu (303) 497-1883 HPC Data Infrastructure Group / CISL / NCAR --------------------------------------------------------------------------------------------------------------- On Thu, Mar 1, 2018 at 1:41 AM, John Hearns wrote: > In reply to Stuart, > our setup is entirely Infiniband. We boot and install over IB, and rely > heavily on IP over Infiniband. > > As for users being 'confused' due to multiple IPs, I would appreciate some > more depth on that one. > Sure, all batch systems are sensitive to hostnames (as I know to my cost!) 
> but once you get that straightened out why should users care? > I am not being aggressive, just keen to find out more. > > > > -----Original Message----- > From: gpfsug-discuss-bounces at spectrumscale.org [mailto:gpfsug-discuss- > bounces at spectrumscale.org] On Behalf Of Stuart Barkley > Sent: Wednesday, February 28, 2018 6:50 PM > To: gpfsug main discussion list > Subject: Re: [gpfsug-discuss] Problems with remote mount via routed IB > > The problem with CM is that it seems to require configuring IP over > Infiniband. > > I'm rather strongly opposed to IP over IB. We did run IPoIB years ago, > but pulled it out of our environment as adding unneeded complexity. It > requires provisioning IP addresses across the Infiniband infrastructure and > possibly adding routers to other portions of the IP infrastructure. It was > also confusing some users due to multiple IPs on the compute infrastructure. > > We have recently been in discussions with a vendor about their support for > GPFS over IB and they kept directing us to using CM (which still didn't > work). CM wasn't necessary once we found out about the actual problem (we > needed the undocumented verbsRdmaUseGidIndexZero configuration option among > other things due to their use of SR-IOV based virtual IB interfaces). > > We don't use routed Infiniband and it might be that CM and IPoIB is > required for IB routing, but I doubt it. It sounds like the OP is keeping > IB and IP infrastructure separate. > > Stuart Barkley > > On Mon, 26 Feb 2018 at 14:16 -0000, Aaron Knister wrote: > > > Date: Mon, 26 Feb 2018 14:16:34 > > From: Aaron Knister > > Reply-To: gpfsug main discussion list > > > > To: gpfsug-discuss at spectrumscale.org > > Subject: Re: [gpfsug-discuss] Problems with remote mount via routed IB > > > > Hi Jan Erik, > > > > It was my understanding that the IB hardware router required RDMA CM to > work. > > By default GPFS doesn't use the RDMA Connection Manager but it can be > > enabled (e.g. verbsRdmaCm=enable). I think this requires a restart on > > clients/servers (in both clusters) to take effect. Maybe someone else > > on the list can comment in more detail-- I've been told folks have > > successfully deployed IB routers with GPFS. > > > > -Aaron > > > > On 2/26/18 11:38 AM, Sundermann, Jan Erik (SCC) wrote: > > > > > > Dear all > > > > > > we are currently trying to remote mount a file system in a routed > > > Infiniband test setup and face problems with dropped RDMA > > > connections. The setup is the > > > following: > > > > > > - Spectrum Scale Cluster 1 is setup on four servers which are > > > connected to the same infiniband network. Additionally they are > > > connected to a fast ethernet providing ip communication in the network > 192.168.11.0/24. > > > > > > - Spectrum Scale Cluster 2 is setup on four additional servers which > > > are connected to a second infiniband network. These servers have IPs > > > on their IB interfaces in the network 192.168.12.0/24. > > > > > > - IP is routed between 192.168.11.0/24 and 192.168.12.0/24 on a > > > dedicated machine. > > > > > > - We have a dedicated IB hardware router connected to both IB subnets. > > > > > > > > > We tested that the routing, both IP and IB, is working between the > > > two clusters without problems and that RDMA is working fine both for > > > internal communication inside cluster 1 and cluster 2 > > > > > > When trying to remote mount a file system from cluster 1 in cluster > > > 2, RDMA communication is not working as expected. 
Instead we see > > > error messages on the remote host (cluster 2) > > > > > > > > > 2018-02-23_13:48:47.037+0100: [I] VERBS RDMA connecting to > > > 192.168.11.4 (iccn004-gpfs in gpfsstorage.localdomain) on mlx4_0 > > > port 1 fabnum 0 index 2 > > > 2018-02-23_13:48:49.890+0100: [I] VERBS RDMA connected to > > > 192.168.11.4 (iccn004-gpfs in gpfsstorage.localdomain) on mlx4_0 > > > port 1 fabnum 0 sl 0 index 2 > > > 2018-02-23_13:48:53.138+0100: [E] VERBS RDMA closed connection to > > > 192.168.11.1 (iccn001-gpfs in gpfsstorage.localdomain) on mlx4_0 > > > port 1 fabnum 0 error 733 index 3 > > > 2018-02-23_13:48:53.854+0100: [I] VERBS RDMA connecting to > > > 192.168.11.1 (iccn001-gpfs in gpfsstorage.localdomain) on mlx4_0 > > > port 1 fabnum 0 index 3 > > > 2018-02-23_13:48:54.954+0100: [E] VERBS RDMA closed connection to > > > 192.168.11.3 (iccn003-gpfs in gpfsstorage.localdomain) on mlx4_0 > > > port 1 fabnum 0 error 733 index 1 > > > 2018-02-23_13:48:55.601+0100: [I] VERBS RDMA connected to > > > 192.168.11.1 (iccn001-gpfs in gpfsstorage.localdomain) on mlx4_0 > > > port 1 fabnum 0 sl 0 index 3 > > > 2018-02-23_13:48:57.775+0100: [I] VERBS RDMA connecting to > > > 192.168.11.3 (iccn003-gpfs in gpfsstorage.localdomain) on mlx4_0 > > > port 1 fabnum 0 index 1 > > > 2018-02-23_13:48:59.557+0100: [I] VERBS RDMA connected to > > > 192.168.11.3 (iccn003-gpfs in gpfsstorage.localdomain) on mlx4_0 > > > port 1 fabnum 0 sl 0 index 1 > > > 2018-02-23_13:48:59.876+0100: [E] VERBS RDMA closed connection to > > > 192.168.11.2 (iccn002-gpfs in gpfsstorage.localdomain) on mlx4_0 > > > port 1 fabnum 0 error 733 index 0 > > > 2018-02-23_13:49:02.020+0100: [I] VERBS RDMA connecting to > > > 192.168.11.2 (iccn002-gpfs in gpfsstorage.localdomain) on mlx4_0 > > > port 1 fabnum 0 index 0 > > > 2018-02-23_13:49:03.477+0100: [I] VERBS RDMA connected to > > > 192.168.11.2 (iccn002-gpfs in gpfsstorage.localdomain) on mlx4_0 > > > port 1 fabnum 0 sl 0 index 0 > > > 2018-02-23_13:49:05.119+0100: [E] VERBS RDMA closed connection to > > > 192.168.11.4 (iccn004-gpfs in gpfsstorage.localdomain) on mlx4_0 > > > port 1 fabnum 0 error 733 index 2 > > > 2018-02-23_13:49:06.191+0100: [I] VERBS RDMA connecting to > > > 192.168.11.4 (iccn004-gpfs in gpfsstorage.localdomain) on mlx4_0 > > > port 1 fabnum 0 index 2 > > > 2018-02-23_13:49:06.548+0100: [I] VERBS RDMA connected to > > > 192.168.11.4 (iccn004-gpfs in gpfsstorage.localdomain) on mlx4_0 > > > port 1 fabnum 0 sl 0 index 2 > > > 2018-02-23_13:49:11.578+0100: [E] VERBS RDMA closed connection to > > > 192.168.11.1 (iccn001-gpfs in gpfsstorage.localdomain) on mlx4_0 > > > port 1 fabnum 0 error 733 index 3 > > > 2018-02-23_13:49:11.937+0100: [I] VERBS RDMA connecting to > > > 192.168.11.1 (iccn001-gpfs in gpfsstorage.localdomain) on mlx4_0 > > > port 1 fabnum 0 index 3 > > > 2018-02-23_13:49:11.939+0100: [I] VERBS RDMA connected to > > > 192.168.11.1 (iccn001-gpfs in gpfsstorage.localdomain) on mlx4_0 > > > port 1 fabnum 0 sl 0 index 3 > > > > > > > > > and in the cluster with the file system (cluster 1) > > > > > > 2018-02-23_13:47:36.112+0100: [E] VERBS RDMA rdma read error > > > IBV_WC_RETRY_EXC_ERR to 192.168.12.5 (iccn005-ib in > > > gpfsremoteclients.localdomain) on mlx4_0 port 1 fabnum 0 vendor_err > > > 129 > > > 2018-02-23_13:47:36.112+0100: [E] VERBS RDMA closed connection to > > > 192.168.12.5 (iccn005-ib in gpfsremoteclients.localdomain) on mlx4_0 > > > port 1 fabnum 0 due to RDMA read error IBV_WC_RETRY_EXC_ERR index 3 > > > 2018-02-23_13:47:47.161+0100: 
[I] VERBS RDMA accepted and connected > > > to > > > 192.168.12.5 (iccn005-ib in gpfsremoteclients.localdomain) on mlx4_0 > > > port 1 fabnum 0 sl 0 index 3 > > > 2018-02-23_13:48:04.317+0100: [E] VERBS RDMA rdma read error > > > IBV_WC_RETRY_EXC_ERR to 192.168.12.5 (iccn005-ib in > > > gpfsremoteclients.localdomain) on mlx4_0 port 1 fabnum 0 vendor_err > > > 129 > > > 2018-02-23_13:48:04.317+0100: [E] VERBS RDMA closed connection to > > > 192.168.12.5 (iccn005-ib in gpfsremoteclients.localdomain) on mlx4_0 > > > port 1 fabnum 0 due to RDMA read error IBV_WC_RETRY_EXC_ERR index 3 > > > 2018-02-23_13:48:11.560+0100: [I] VERBS RDMA accepted and connected > > > to > > > 192.168.12.5 (iccn005-ib in gpfsremoteclients.localdomain) on mlx4_0 > > > port 1 fabnum 0 sl 0 index 3 > > > 2018-02-23_13:48:32.523+0100: [E] VERBS RDMA rdma read error > > > IBV_WC_RETRY_EXC_ERR to 192.168.12.5 (iccn005-ib in > > > gpfsremoteclients.localdomain) on mlx4_0 port 1 fabnum 0 vendor_err > > > 129 > > > 2018-02-23_13:48:32.523+0100: [E] VERBS RDMA closed connection to > > > 192.168.12.5 (iccn005-ib in gpfsremoteclients.localdomain) on mlx4_0 > > > port 1 fabnum 0 due to RDMA read error IBV_WC_RETRY_EXC_ERR index 3 > > > 2018-02-23_13:48:35.398+0100: [I] VERBS RDMA accepted and connected > > > to > > > 192.168.12.5 (iccn005-ib in gpfsremoteclients.localdomain) on mlx4_0 > > > port 1 fabnum 0 sl 0 index 3 > > > 2018-02-23_13:48:53.135+0100: [E] VERBS RDMA rdma read error > > > IBV_WC_RETRY_EXC_ERR to 192.168.12.5 (iccn005-ib in > > > gpfsremoteclients.localdomain) on mlx4_0 port 1 fabnum 0 vendor_err > > > 129 > > > 2018-02-23_13:48:53.135+0100: [E] VERBS RDMA closed connection to > > > 192.168.12.5 (iccn005-ib in gpfsremoteclients.localdomain) on mlx4_0 > > > port 1 fabnum 0 due to RDMA read error IBV_WC_RETRY_EXC_ERR index 3 > > > 2018-02-23_13:48:55.600+0100: [I] VERBS RDMA accepted and connected > > > to > > > 192.168.12.5 (iccn005-ib in gpfsremoteclients.localdomain) on mlx4_0 > > > port 1 fabnum 0 sl 0 index 3 > > > 2018-02-23_13:49:11.577+0100: [E] VERBS RDMA rdma read error > > > IBV_WC_RETRY_EXC_ERR to 192.168.12.5 (iccn005-ib in > > > gpfsremoteclients.localdomain) on mlx4_0 port 1 fabnum 0 vendor_err > > > 129 > > > 2018-02-23_13:49:11.577+0100: [E] VERBS RDMA closed connection to > > > 192.168.12.5 (iccn005-ib in gpfsremoteclients.localdomain) on mlx4_0 > > > port 1 fabnum 0 due to RDMA read error IBV_WC_RETRY_EXC_ERR index 3 > > > 2018-02-23_13:49:11.939+0100: [I] VERBS RDMA accepted and connected > > > to > > > 192.168.12.5 (iccn005-ib in gpfsremoteclients.localdomain) on mlx4_0 > > > port 1 fabnum 0 sl 0 index 3 > > > > > > > > > > > > Any advice on how to configure the setup in a way that would allow > > > the remote mount via routed IB would be very appreciated. 
> > > > > > > > > Thank you and best regards > > > Jan Erik > > > > > > > > > > > > > > > _______________________________________________ > > > gpfsug-discuss mailing list > > > gpfsug-discuss at spectrumscale.org > > > https://emea01.safelinks.protection.outlook.com/?url=http%3A%2F%2Fgp > > > fsug.org%2Fmailman%2Flistinfo%2Fgpfsug-discuss&data=01%7C01%7Cjohn.h > > > earns%40asml.com%7Ce40045550fc3467dd62808d57ed4d0d9%7Caf73baa8f5944e > > > b2a39d93e96cad61fc%7C1&sdata=v%2F35G6ZnlHFBm%2BfVddvcuraFd9FRChyOSRE > > > YpqcNNP8%3D&reserved=0 > > > > > > > -- > > Aaron Knister > > NASA Center for Climate Simulation (Code 606.2) Goddard Space Flight > > Center > > (301) 286-2776 > > _______________________________________________ > > gpfsug-discuss mailing list > > gpfsug-discuss at spectrumscale.org > > https://emea01.safelinks.protection.outlook.com/?url=http%3A%2F%2Fgpfs > > ug.org%2Fmailman%2Flistinfo%2Fgpfsug-discuss&data=01%7C01%7Cjohn.hearn > > s%40asml.com%7Ce40045550fc3467dd62808d57ed4d0d9%7Caf73baa8f5944eb2a39d > > 93e96cad61fc%7C1&sdata=v%2F35G6ZnlHFBm%2BfVddvcuraFd9FRChyOSREYpqcNNP8 > > %3D&reserved=0 > > > > -- > I've never been lost; I was once bewildered for three days, but never lost! > -- Daniel Boone > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > https://emea01.safelinks.protection.outlook.com/?url= > http%3A%2F%2Fgpfsug.org%2Fmailman%2Flistinfo%2Fgpfsug- > discuss&data=01%7C01%7Cjohn.hearns%40asml.com% > 7Ce40045550fc3467dd62808d57ed4d0d9%7Caf73baa8f5944eb2a39d93e96cad > 61fc%7C1&sdata=v%2F35G6ZnlHFBm%2BfVddvcuraFd9FRChyOSREYpqcNNP > 8%3D&reserved=0 > -- The information contained in this communication and any attachments is > confidential and may be privileged, and is for the sole use of the intended > recipient(s). Any unauthorized review, use, disclosure or distribution is > prohibited. Unless explicitly stated otherwise in the body of this > communication or the attachment thereto (if any), the information is > provided on an AS-IS basis without any express or implied warranties or > liabilities. To the extent you are relying on this information, you are > doing so at your own risk. If you are not the intended recipient, please > notify the sender immediately by replying to this message and destroy all > copies of this message and any attachments. Neither the sender nor the > company/group of companies he or she represents shall be liable for the > proper and complete transmission of the information contained in this > communication, or for any delay in its receipt. > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > -------------- next part -------------- An HTML attachment was scrubbed... URL: From Wei1.Guo at UTSouthwestern.edu Tue Mar 13 03:06:34 2018 From: Wei1.Guo at UTSouthwestern.edu (Wei Guo) Date: Tue, 13 Mar 2018 03:06:34 +0000 Subject: [gpfsug-discuss] RDMA read error IBV_WC_RETRY_EXC_ERR (Philipp Helo Rehs) Message-ID: <7b8dd0540c4542668f24c1a20c7aee76@SWMS13MAIL10.swmed.org> Hi, Philipp, FYI. We had exactly the same IBV_WC_RETRY_EXC_ERR error message in our gpfs client log along with other client error kernel: ib0: ipoib_cm_handle_tx_wc_rss: failed cm send event (status=12, wrid=83 vend_err 81) in the syslog. The root cause was a bad IB cable connecting a leaf switch to the core switch where the client used as route. 
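For anyone chasing something similar, the per-port error counters from infiniband-diags are worth a look - roughly along these lines, where the LID and port number are just placeholders:

# scan the fabric for links accumulating symbol/retry errors
ibqueryerrors
# or query one suspect link directly with extended counters
perfquery -x 42 1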
When we changed a new cable, the problem was solved and no more errors. We don't really have ipoib setup. The problem might be different from yours, but does the error message suggest that when your gpfs daemon tries to use mlx5_1, the packets are discarded so no connection? Did you do an IB bonding? Wei Guo HPC Administrator UTSW Message: 1 Date: Mon, 12 Mar 2018 21:09:14 +0100 From: Philipp Helo Rehs To: gpfsug-discuss at spectrumscale.org Subject: [gpfsug-discuss] RDMA read error IBV_WC_RETRY_EXC_ERR Message-ID: <4c6f6655-5712-663e-d551-375a42d562d8 at uni-duesseldorf.de> Content-Type: text/plain; charset=utf-8 Hello, I am reading your mailing-list since some weeks and I am quiete impressed about the knowledge and shared information here. We have a gpfs cluster with 4 nsds and 120 clients on Infiniband. Our NSD-Server have two infiniband ports on seperate cards mlx5_0 and mlx5_1. We have RDMA-CM enabled and ipv6 enabled on all nodes. We have added an IPoIB IP to all interfaces. But when we enable the second interface we get the following error from all nodes: 2018-03-12_20:49:38.923+0100: [E] VERBS RDMA closed connection to 10.100.0.83 (hilbert83-ib) on mlx5_1 port 1 fabnum 0 due to RDMA read error IBV_WC_RETRY_EXC_ERR index 45 2018-03-12_20:49:38.923+0100: [E] VERBS RDMA rdma read error IBV_WC_RETRY_EXC_ERR to 10.100.0.129 (hilbert129-ib) on mlx5_1 port 1 fabnum 0 vendor_err 129 2018-03-12_20:49:38.923+0100: [E] VERBS RDMA closed connection to 10.100.0.129 (hilbert129-ib) on mlx5_1 port 1 fabnum 0 due to RDMA read error IBV_WC_RETRY_EXC_ERR index 31 2018-03-12_20:49:38.923+0100: [E] VERBS RDMA rdma read error IBV_WC_RETRY_EXC_ERR to 10.100.0.134 (hilbert134-ib) on mlx5_1 port 1 fabnum 0 vendor_err 129 I have read that this issue can happen when verbsRdmasPerConnection is to low. We tried to increase the value and it got better but the problem is not fixed. Current config: minReleaseLevel 4.2.3.0 maxblocksize 16m cipherList AUTHONLY cesSharedRoot /ces ccrEnabled yes failureDetectionTime 40 leaseRecoveryWait 40 [hilbert1-ib,hilbert2-ib] worker1Threads 256 maxReceiverThreads 256 [common] tiebreakerDisks vd3;vd5;vd7 minQuorumNodes 2 verbsLibName libibverbs.so.1 verbsRdma enable verbsRdmasPerNode 256 verbsRdmaSend no scatterBufferSize 262144 pagepool 16g verbsPorts mlx4_0/1 [nsdNodes] verbsPorts mlx5_0/1 mlx5_1/1 [hilbert200-ib,hilbert201-ib,hilbert202-ib,hilbert203-ib,hilbert204-ib,hilbert205-ib,hilbert206-ib] verbsPorts mlx4_0/1 mlx4_1/1 [common] maxMBpS 11200 [common] verbsRdmaCm enable verbsRdmasPerConnection 14 adminMode central Kind regards Philipp Rehs --------------------------- Zentrum f?r Informations- und Medientechnologie Kompetenzzentrum f?r wissenschaftliches Rechnen und Speichern Heinrich-Heine-Universit?t D?sseldorf Universit?tsstr. 1 Raum 25.41.00.51 40225 D?sseldorf / Germany Tel: +49-211-81-15557 ________________________________ UT Southwestern Medical Center The future of medicine, today. From aaron.s.knister at nasa.gov Tue Mar 13 04:49:33 2018 From: aaron.s.knister at nasa.gov (Aaron Knister) Date: Tue, 13 Mar 2018 00:49:33 -0400 Subject: [gpfsug-discuss] spontaneous tracing? In-Reply-To: References: <3925216a-3146-357a-9aa0-0acc84bdb7b2@nasa.gov> Message-ID: Thanks! I admit, I'm confused because "/usr/lpp/mmfs/bin/mmcommon notifyOverload" does in fact start tracing for me on one of our clusters (technically 2, one in dev, one in prod). It did *not* start it on another test cluster. It looks to me like the difference is the mmsdrservport settings. 
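(That is, just comparing the value on each cluster, e.g.:

mmlsconfig mmsdrservport

nothing fancier than that.)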
On clusters where it's set to 0 tracing *does* start. On clusters where it's set to the default of 1191 (didn't try any other value) tracing *does not* start. I can toggle the behavior by changing the value of mmsdrservport back and forth. I do have a PMR open for this so I'll follow up there too. Thanks again for the help. -Aaron On 3/12/18 11:13 AM, IBM Spectrum Scale wrote: > /usr/lpp/mmfs/bin/mmcommon notifyOverload will not cause tracing to be > started. ?One can verify that using the underlying command being called > as shown in the following example with /tmp/n containing node names one > each line that will get the notification and the IP address being the > file system manager from which the command is issued. > > */usr/lpp/mmfs/bin/mmsdrcli notifyOverload /tmp/n 1191 192.168.117.131 3 8* > > The only case that deadlock detection code will initiate tracing is that > debugDataControl is set to "heavy" and tracing is not started. Then on > deadlock detection tracing is turned on for 20 seconds and turned off. > > That can be tested using command like > */usr/lpp/mmfs/bin/mmsdrcli notifyDeadlock /tmp/n 1191 192.168.117.131 3 8* > > And then mmfs.log will tell you what's going on. That's not a silent action. > > *2018-03-12_10:16:11.243-0400: [N] sdrServ: Received deadlock > notification from 192.168.117.131* > *2018-03-12_10:16:11.243-0400: [N] GPFS will attempt to collect debug > data on this node.* > *2018-03-12_10:16:11.953-0400: [I] Tracing in overwrite mode <== tracing > started* > *Trace started: Wait 20 seconds before cut and stop trace* > *2018-03-12_10:16:37.147-0400: [I] Tracing disabled <== tracing stopped > 20 seconds later* > *mmtrace: move /tmp/mmfs/lxtrace.trc.c69bc2xn01.cpu0 > /tmp/mmfs/trcfile.2018-03-12_10.16.11.2982.deadlock.c69bc2xn01.cpu0* > *mmtrace: formatting > /tmp/mmfs/trcfile.2018-03-12_10.16.11.2982.deadlock.c69bc2xn01 to > /tmp/mmfs/trcrpt.2018-03-12_10.16.11.2982.deadlock.c69bc2xn01.gz* > > > What's odd is there are no log events to indicate an overload occurred. > > Overload msg is only seen in mmfs.log when debugDataControl is "heavy". > mmdiag --deadlock shows overload related info starting from 4.2.3. > > *# mmdiag --deadlock* > > *=== mmdiag: deadlock ===* > > *Effective deadlock detection threshold on c69bc2xn01 is 1800 seconds* > *Effective deadlock detection threshold on c69bc2xn01 is 360 seconds for > short waiters* > > *Cluster c69bc2xn01.gpfs.net is overloaded. The overload index on > c69bc2xn01 is 0.01812 <==* > > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > -- Aaron Knister NASA Center for Climate Simulation (Code 606.2) Goddard Space Flight Center (301) 286-2776 From john.hearns at asml.com Tue Mar 13 10:37:43 2018 From: john.hearns at asml.com (John Hearns) Date: Tue, 13 Mar 2018 10:37:43 +0000 Subject: [gpfsug-discuss] Preferred NSD In-Reply-To: References: <20180312145105.yb24bzo6chlpthwm@ics.muni.cz> Message-ID: Lukas, It looks like you are proposing a setup which uses your compute servers as storage servers also? * I'm thinking about the following setup: ~ 60 nodes, each with two enterprise NVMe SSDs, FDR IB interconnected There is nothing wrong with this concept, for instance see https://www.beegfs.io/wiki/BeeOND I have an NVMe filesystem which uses 60 drives, but there are 10 servers. You should look at "failure zones" also. 
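By failure zones I mean laying out the NSDs so that the replicas of a block can never end up on the same host. In stanza terms that is roughly the following (hostnames and devices are invented here, adjust to your naming):

# stanza file sketch for mmcrnsd - every compute node gets its own failure group
%nsd: device=/dev/nvme0n1 nsd=node01_nvme0 servers=node01 usage=dataAndMetadata failureGroup=1
%nsd: device=/dev/nvme1n1 nsd=node01_nvme1 servers=node01 usage=dataAndMetadata failureGroup=1
%nsd: device=/dev/nvme0n1 nsd=node02_nvme0 servers=node02 usage=dataAndMetadata failureGroup=2
%nsd: device=/dev/nvme1n1 nsd=node02_nvme1 servers=node02 usage=dataAndMetadata failureGroup=2
# ...and so on, then create the filesystem with two copies of data and metadata:
# mmcrfs scratch -F nsd.stanza -m 2 -M 2 -r 2 -R 2

With one failure group per node a single node loss still leaves a good copy of everything; it is the multiple simultaneous node failures you mention that replication alone cannot cover.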
From: gpfsug-discuss-bounces at spectrumscale.org [mailto:gpfsug-discuss-bounces at spectrumscale.org] On Behalf Of Knister, Aaron S. (GSFC-606.2)[COMPUTER SCIENCE CORP] Sent: Monday, March 12, 2018 4:14 PM To: gpfsug main discussion list Subject: Re: [gpfsug-discuss] Preferred NSD Hi Lukas, Check out FPO mode. That mimics Hadoop's data placement features. You can have up to 3 replicas both data and metadata but still the downside, though, as you say is the wrong node failures will take your cluster down. You might want to check out something like Excelero's NVMesh (note: not an endorsement since I can't give such things) which can create logical volumes across all your NVMe drives. The product has erasure coding on their roadmap. I'm not sure if they've released that feature yet but in theory it will give better fault tolerance *and* you'll get more efficient usage of your SSDs. I'm sure there are other ways to skin this cat too. -Aaron On March 12, 2018 at 10:59:35 EDT, Lukas Hejtmanek > wrote: Hello, I'm thinking about the following setup: ~ 60 nodes, each with two enterprise NVMe SSDs, FDR IB interconnected I would like to setup shared scratch area using GPFS and those NVMe SSDs. Each SSDs as on NSD. I don't think like 5 or more data/metadata replicas are practical here. On the other hand, multiple node failures is something really expected. Is there a way to instrument that local NSD is strongly preferred to store data? I.e. node failure most probably does not result in unavailable data for the other nodes? Or is there any other recommendation/solution to build shared scratch with GPFS in such setup? (Do not do it including.) -- Luk?? Hejtm?nek _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss -- The information contained in this communication and any attachments is confidential and may be privileged, and is for the sole use of the intended recipient(s). Any unauthorized review, use, disclosure or distribution is prohibited. Unless explicitly stated otherwise in the body of this communication or the attachment thereto (if any), the information is provided on an AS-IS basis without any express or implied warranties or liabilities. To the extent you are relying on this information, you are doing so at your own risk. If you are not the intended recipient, please notify the sender immediately by replying to this message and destroy all copies of this message and any attachments. Neither the sender nor the company/group of companies he or she represents shall be liable for the proper and complete transmission of the information contained in this communication, or for any delay in its receipt. -------------- next part -------------- An HTML attachment was scrubbed... URL: From xhejtman at ics.muni.cz Tue Mar 13 14:16:30 2018 From: xhejtman at ics.muni.cz (Lukas Hejtmanek) Date: Tue, 13 Mar 2018 15:16:30 +0100 Subject: [gpfsug-discuss] Preferred NSD In-Reply-To: References: <20180312145105.yb24bzo6chlpthwm@ics.muni.cz> Message-ID: <20180313141630.k2lqvcndz7mrakgs@ics.muni.cz> On Tue, Mar 13, 2018 at 10:37:43AM +0000, John Hearns wrote: > Lukas, > It looks like you are proposing a setup which uses your compute servers as storage servers also? yes, exactly. I would like to utilise NVMe SSDs that are in every compute servers.. Using them as a shared scratch area with GPFS is one of the options. 
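What I still do not see is how to tell GPFS to prefer the local copy. From what I have read the FPO write-affinity attributes on the data pool plus readReplicaPolicy should give that behaviour - below is only an untested sketch, the parameter values are guesses:

# have each node read from its own replica when one exists
mmchconfig readReplicaPolicy=local -i
# write affinity is a property of the data pool, set at creation time, e.g.:
# %pool: pool=data blockSize=2M layoutMap=cluster allowWriteAffinity=yes writeAffinityDepth=1 blockGroupFactor=128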
> > * I'm thinking about the following setup: > ~ 60 nodes, each with two enterprise NVMe SSDs, FDR IB interconnected > > There is nothing wrong with this concept, for instance see > https://www.beegfs.io/wiki/BeeOND > > I have an NVMe filesystem which uses 60 drives, but there are 10 servers. > You should look at "failure zones" also. you still need the storage servers and local SSDs to use only for caching, do I understand correctly? > > From: gpfsug-discuss-bounces at spectrumscale.org [mailto:gpfsug-discuss-bounces at spectrumscale.org] On Behalf Of Knister, Aaron S. (GSFC-606.2)[COMPUTER SCIENCE CORP] > Sent: Monday, March 12, 2018 4:14 PM > To: gpfsug main discussion list > Subject: Re: [gpfsug-discuss] Preferred NSD > > Hi Lukas, > > Check out FPO mode. That mimics Hadoop's data placement features. You can have up to 3 replicas both data and metadata but still the downside, though, as you say is the wrong node failures will take your cluster down. > > You might want to check out something like Excelero's NVMesh (note: not an endorsement since I can't give such things) which can create logical volumes across all your NVMe drives. The product has erasure coding on their roadmap. I'm not sure if they've released that feature yet but in theory it will give better fault tolerance *and* you'll get more efficient usage of your SSDs. > > I'm sure there are other ways to skin this cat too. > > -Aaron > > > > On March 12, 2018 at 10:59:35 EDT, Lukas Hejtmanek > wrote: > Hello, > > I'm thinking about the following setup: > ~ 60 nodes, each with two enterprise NVMe SSDs, FDR IB interconnected > > I would like to setup shared scratch area using GPFS and those NVMe SSDs. Each > SSDs as on NSD. > > I don't think like 5 or more data/metadata replicas are practical here. On the > other hand, multiple node failures is something really expected. > > Is there a way to instrument that local NSD is strongly preferred to store > data? I.e. node failure most probably does not result in unavailable data for > the other nodes? > > Or is there any other recommendation/solution to build shared scratch with > GPFS in such setup? (Do not do it including.) > > -- > Luk?? Hejtm?nek > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > -- The information contained in this communication and any attachments is confidential and may be privileged, and is for the sole use of the intended recipient(s). Any unauthorized review, use, disclosure or distribution is prohibited. Unless explicitly stated otherwise in the body of this communication or the attachment thereto (if any), the information is provided on an AS-IS basis without any express or implied warranties or liabilities. To the extent you are relying on this information, you are doing so at your own risk. If you are not the intended recipient, please notify the sender immediately by replying to this message and destroy all copies of this message and any attachments. Neither the sender nor the company/group of companies he or she represents shall be liable for the proper and complete transmission of the information contained in this communication, or for any delay in its receipt. > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss -- Luk?? 
Hejtm?nek Linux Administrator only because Full Time Multitasking Ninja is not an official job title From jan.sundermann at kit.edu Tue Mar 13 14:35:36 2018 From: jan.sundermann at kit.edu (Jan Erik Sundermann) Date: Tue, 13 Mar 2018 15:35:36 +0100 Subject: [gpfsug-discuss] Problems with remote mount via routed IB In-Reply-To: References: <471B111F-5DAA-4912-829C-9AA75DCB76FA@kit.edu> Message-ID: Hi John We try to route infiniband traffic. The IP traffic is routed separately. The two clusters we try to connect are configured differently, one with IP over IB the other one with dedicated ethernet adapters. Jan Erik On 02/27/2018 10:17 AM, John Hearns wrote: > Jan Erik, > Can you clarify if you are routing IP traffic between the two Infiniband networks. > Or are you routing Infiniband traffic? > > > If I can be of help I manage an Infiniband network which connects to other IP networks using Mellanox VPI gateways, which proxy arp between IB and Ethernet. But I am not running GPFS traffic over these. > > > > -----Original Message----- > From: gpfsug-discuss-bounces at spectrumscale.org [mailto:gpfsug-discuss-bounces at spectrumscale.org] On Behalf Of Sundermann, Jan Erik (SCC) > Sent: Monday, February 26, 2018 5:39 PM > To: gpfsug-discuss at spectrumscale.org > Subject: [gpfsug-discuss] Problems with remote mount via routed IB > > > Dear all > > we are currently trying to remote mount a file system in a routed Infiniband test setup and face problems with dropped RDMA connections. The setup is the following: > > - Spectrum Scale Cluster 1 is setup on four servers which are connected to the same infiniband network. Additionally they are connected to a fast ethernet providing ip communication in the network 192.168.11.0/24. > > - Spectrum Scale Cluster 2 is setup on four additional servers which are connected to a second infiniband network. These servers have IPs on their IB interfaces in the network 192.168.12.0/24. > > - IP is routed between 192.168.11.0/24 and 192.168.12.0/24 on a dedicated machine. > > - We have a dedicated IB hardware router connected to both IB subnets. > > > We tested that the routing, both IP and IB, is working between the two clusters without problems and that RDMA is working fine both for internal communication inside cluster 1 and cluster 2 > > When trying to remote mount a file system from cluster 1 in cluster 2, RDMA communication is not working as expected. 
Instead we see error messages on the remote host (cluster 2) > > > 2018-02-23_13:48:47.037+0100: [I] VERBS RDMA connecting to 192.168.11.4 (iccn004-gpfs in gpfsstorage.localdomain) on mlx4_0 port 1 fabnum 0 index 2 > 2018-02-23_13:48:49.890+0100: [I] VERBS RDMA connected to 192.168.11.4 (iccn004-gpfs in gpfsstorage.localdomain) on mlx4_0 port 1 fabnum 0 sl 0 index 2 > 2018-02-23_13:48:53.138+0100: [E] VERBS RDMA closed connection to 192.168.11.1 (iccn001-gpfs in gpfsstorage.localdomain) on mlx4_0 port 1 fabnum 0 error 733 index 3 > 2018-02-23_13:48:53.854+0100: [I] VERBS RDMA connecting to 192.168.11.1 (iccn001-gpfs in gpfsstorage.localdomain) on mlx4_0 port 1 fabnum 0 index 3 > 2018-02-23_13:48:54.954+0100: [E] VERBS RDMA closed connection to 192.168.11.3 (iccn003-gpfs in gpfsstorage.localdomain) on mlx4_0 port 1 fabnum 0 error 733 index 1 > 2018-02-23_13:48:55.601+0100: [I] VERBS RDMA connected to 192.168.11.1 (iccn001-gpfs in gpfsstorage.localdomain) on mlx4_0 port 1 fabnum 0 sl 0 index 3 > 2018-02-23_13:48:57.775+0100: [I] VERBS RDMA connecting to 192.168.11.3 (iccn003-gpfs in gpfsstorage.localdomain) on mlx4_0 port 1 fabnum 0 index 1 > 2018-02-23_13:48:59.557+0100: [I] VERBS RDMA connected to 192.168.11.3 (iccn003-gpfs in gpfsstorage.localdomain) on mlx4_0 port 1 fabnum 0 sl 0 index 1 > 2018-02-23_13:48:59.876+0100: [E] VERBS RDMA closed connection to 192.168.11.2 (iccn002-gpfs in gpfsstorage.localdomain) on mlx4_0 port 1 fabnum 0 error 733 index 0 > 2018-02-23_13:49:02.020+0100: [I] VERBS RDMA connecting to 192.168.11.2 (iccn002-gpfs in gpfsstorage.localdomain) on mlx4_0 port 1 fabnum 0 index 0 > 2018-02-23_13:49:03.477+0100: [I] VERBS RDMA connected to 192.168.11.2 (iccn002-gpfs in gpfsstorage.localdomain) on mlx4_0 port 1 fabnum 0 sl 0 index 0 > 2018-02-23_13:49:05.119+0100: [E] VERBS RDMA closed connection to 192.168.11.4 (iccn004-gpfs in gpfsstorage.localdomain) on mlx4_0 port 1 fabnum 0 error 733 index 2 > 2018-02-23_13:49:06.191+0100: [I] VERBS RDMA connecting to 192.168.11.4 (iccn004-gpfs in gpfsstorage.localdomain) on mlx4_0 port 1 fabnum 0 index 2 > 2018-02-23_13:49:06.548+0100: [I] VERBS RDMA connected to 192.168.11.4 (iccn004-gpfs in gpfsstorage.localdomain) on mlx4_0 port 1 fabnum 0 sl 0 index 2 > 2018-02-23_13:49:11.578+0100: [E] VERBS RDMA closed connection to 192.168.11.1 (iccn001-gpfs in gpfsstorage.localdomain) on mlx4_0 port 1 fabnum 0 error 733 index 3 > 2018-02-23_13:49:11.937+0100: [I] VERBS RDMA connecting to 192.168.11.1 (iccn001-gpfs in gpfsstorage.localdomain) on mlx4_0 port 1 fabnum 0 index 3 > 2018-02-23_13:49:11.939+0100: [I] VERBS RDMA connected to 192.168.11.1 (iccn001-gpfs in gpfsstorage.localdomain) on mlx4_0 port 1 fabnum 0 sl 0 index 3 > > > and in the cluster with the file system (cluster 1) > > 2018-02-23_13:47:36.112+0100: [E] VERBS RDMA rdma read error IBV_WC_RETRY_EXC_ERR to 192.168.12.5 (iccn005-ib in gpfsremoteclients.localdomain) on mlx4_0 port 1 fabnum 0 vendor_err 129 > 2018-02-23_13:47:36.112+0100: [E] VERBS RDMA closed connection to 192.168.12.5 (iccn005-ib in gpfsremoteclients.localdomain) on mlx4_0 port 1 fabnum 0 due to RDMA read error IBV_WC_RETRY_EXC_ERR index 3 > 2018-02-23_13:47:47.161+0100: [I] VERBS RDMA accepted and connected to 192.168.12.5 (iccn005-ib in gpfsremoteclients.localdomain) on mlx4_0 port 1 fabnum 0 sl 0 index 3 > 2018-02-23_13:48:04.317+0100: [E] VERBS RDMA rdma read error IBV_WC_RETRY_EXC_ERR to 192.168.12.5 (iccn005-ib in gpfsremoteclients.localdomain) on mlx4_0 port 1 fabnum 0 vendor_err 129 > 
2018-02-23_13:48:04.317+0100: [E] VERBS RDMA closed connection to 192.168.12.5 (iccn005-ib in gpfsremoteclients.localdomain) on mlx4_0 port 1 fabnum 0 due to RDMA read error IBV_WC_RETRY_EXC_ERR index 3 > 2018-02-23_13:48:11.560+0100: [I] VERBS RDMA accepted and connected to 192.168.12.5 (iccn005-ib in gpfsremoteclients.localdomain) on mlx4_0 port 1 fabnum 0 sl 0 index 3 > 2018-02-23_13:48:32.523+0100: [E] VERBS RDMA rdma read error IBV_WC_RETRY_EXC_ERR to 192.168.12.5 (iccn005-ib in gpfsremoteclients.localdomain) on mlx4_0 port 1 fabnum 0 vendor_err 129 > 2018-02-23_13:48:32.523+0100: [E] VERBS RDMA closed connection to 192.168.12.5 (iccn005-ib in gpfsremoteclients.localdomain) on mlx4_0 port 1 fabnum 0 due to RDMA read error IBV_WC_RETRY_EXC_ERR index 3 > 2018-02-23_13:48:35.398+0100: [I] VERBS RDMA accepted and connected to 192.168.12.5 (iccn005-ib in gpfsremoteclients.localdomain) on mlx4_0 port 1 fabnum 0 sl 0 index 3 > 2018-02-23_13:48:53.135+0100: [E] VERBS RDMA rdma read error IBV_WC_RETRY_EXC_ERR to 192.168.12.5 (iccn005-ib in gpfsremoteclients.localdomain) on mlx4_0 port 1 fabnum 0 vendor_err 129 > 2018-02-23_13:48:53.135+0100: [E] VERBS RDMA closed connection to 192.168.12.5 (iccn005-ib in gpfsremoteclients.localdomain) on mlx4_0 port 1 fabnum 0 due to RDMA read error IBV_WC_RETRY_EXC_ERR index 3 > 2018-02-23_13:48:55.600+0100: [I] VERBS RDMA accepted and connected to 192.168.12.5 (iccn005-ib in gpfsremoteclients.localdomain) on mlx4_0 port 1 fabnum 0 sl 0 index 3 > 2018-02-23_13:49:11.577+0100: [E] VERBS RDMA rdma read error IBV_WC_RETRY_EXC_ERR to 192.168.12.5 (iccn005-ib in gpfsremoteclients.localdomain) on mlx4_0 port 1 fabnum 0 vendor_err 129 > 2018-02-23_13:49:11.577+0100: [E] VERBS RDMA closed connection to 192.168.12.5 (iccn005-ib in gpfsremoteclients.localdomain) on mlx4_0 port 1 fabnum 0 due to RDMA read error IBV_WC_RETRY_EXC_ERR index 3 > 2018-02-23_13:49:11.939+0100: [I] VERBS RDMA accepted and connected to 192.168.12.5 (iccn005-ib in gpfsremoteclients.localdomain) on mlx4_0 port 1 fabnum 0 sl 0 index 3 > > > > Any advice on how to configure the setup in a way that would allow the remote mount via routed IB would be very appreciated. > > > Thank you and best regards > Jan Erik > > > -- The information contained in this communication and any attachments is confidential and may be privileged, and is for the sole use of the intended recipient(s). Any unauthorized review, use, disclosure or distribution is prohibited. Unless explicitly stated otherwise in the body of this communication or the attachment thereto (if any), the information is provided on an AS-IS basis without any express or implied warranties or liabilities. To the extent you are relying on this information, you are doing so at your own risk. If you are not the intended recipient, please notify the sender immediately by replying to this message and destroy all copies of this message and any attachments. Neither the sender nor the company/group of companies he or she represents shall be liable for the proper and complete transmission of the information contained in this communication, or for any delay in its receipt. 
> _______________________________________________
> gpfsug-discuss mailing list
> gpfsug-discuss at spectrumscale.org
> http://gpfsug.org/mailman/listinfo/gpfsug-discuss
>

--
Karlsruhe Institute of Technology (KIT)
Steinbuch Centre for Computing (SCC)

Jan Erik Sundermann

Hermann-von-Helmholtz-Platz 1, Building 449, Room 226
D-76344 Eggenstein-Leopoldshafen

Tel: +49 721 608 26191
Email: jan.sundermann at kit.edu
www.scc.kit.edu

KIT – The Research University in the Helmholtz Association

Since 2010, KIT has been certified as a family-friendly university.

From Robert.Oesterlin at nuance.com Tue Mar 13 14:42:24 2018
From: Robert.Oesterlin at nuance.com (Oesterlin, Robert)
Date: Tue, 13 Mar 2018 14:42:24 +0000
Subject: [gpfsug-discuss] SSUG USA Spring Meeting - Registration and call for speakers is now open!
Message-ID: <1289B944-B4F5-40E8-861C-33423B318457@nuance.com>

The registration for the Spring meeting of the SSUG-USA is now open. You can register here:
https://www.eventbrite.com/e/spectrum-scale-gpfs-user-group-us-spring-2018-meeting-tickets-43662759489

DATE AND TIME
Wed, May 16, 2018, 9:00 AM – Thu, May 17, 2018, 5:00 PM EDT

LOCATION
IBM Cambridge Innovation Center
One Rogers Street
Cambridge, MA 02142-1203

Please note that we have limited meeting space, so please register only if you're sure you can attend. A detailed agenda will be published in the coming weeks. If you are interested in presenting, please contact me. I do have several speakers lined up already, but we can use a few more.

Bob Oesterlin
Sr Principal Storage Engineer, Nuance

From jan.sundermann at kit.edu Tue Mar 13 15:24:13 2018
From: jan.sundermann at kit.edu (Jan Erik Sundermann)
Date: Tue, 13 Mar 2018 16:24:13 +0100
Subject: [gpfsug-discuss] Problems with remote mount via routed IB
In-Reply-To:
References: <471B111F-5DAA-4912-829C-9AA75DCB76FA@kit.edu>
Message-ID:

Hello Zachary

We are currently changing our setup to have IP over IB on all machines to be able to enable verbsRdmaCm.

According to Mellanox (https://community.mellanox.com/docs/DOC-2384), ibacm requires pre-populated caches to be distributed to all end hosts with the mapping of IP to the routable GIDs (of both IB subnets). Was this also required in your successful deployment?

Best
Jan Erik

On 03/12/2018 11:10 PM, Zachary Mance wrote:
> Since I am testing out remote mounting with EDR IB routers, I'll add to
> the discussion.
>
> In my lab environment I was seeing the same RDMA connections being
> established and then disconnected shortly after. The remote filesystem
> would eventually mount on the clients, but it took quite a while
> (~2 mins). Even after mounting, accessing files or any metadata
> operations would take a while to execute, but eventually they happened.
>
> After enabling verbsRdmaCm, everything mounted just fine and in a timely
> manner. Spectrum Scale was using the librdmacm.so library.
>
> I would first double check that you have both clusters able to talk to
> each other on their IPoIB address, then make sure you enable verbsRdmaCm
> on both clusters.
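For readers picking this thread up later, the switch being discussed is an ordinary cluster configuration change. The lines below are only a rough sketch, assuming RDMA itself (verbsRdma and verbsPorts) is already configured in both clusters; the restart step and the verification command reflect general practice rather than anything confirmed in this thread:

    # run in each cluster (owning and remote); the setting is cluster-wide
    mmchconfig verbsRdmaCm=enable
    # confirm the value that will be used
    mmlsconfig verbsRdmaCm
    # the change is normally picked up only after the daemons restart
    mmshutdown -a && mmstartup -a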
> > > --------------------------------------------------------------------------------------------------------------- > Zach Mance zmance at ucar.edu ?(303) 497-1883 > HPC Data Infrastructure Group?/ CISL / NCAR > --------------------------------------------------------------------------------------------------------------- > > > On Thu, Mar 1, 2018 at 1:41 AM, John Hearns > wrote: > > In reply to Stuart, > our setup is entirely Infiniband. We boot and install over IB, and > rely heavily on IP over Infiniband. > > As for users being 'confused' due to multiple IPs, I would > appreciate some more depth on that one. > Sure, all batch systems are sensitive to hostnames (as I know to my > cost!) but once you get that straightened out why should users care? > I am not being aggressive, just keen to find out more. > > > > -----Original Message----- > From: gpfsug-discuss-bounces at spectrumscale.org > > [mailto:gpfsug-discuss-bounces at spectrumscale.org > ] On Behalf Of > Stuart Barkley > Sent: Wednesday, February 28, 2018 6:50 PM > To: gpfsug main discussion list > > Subject: Re: [gpfsug-discuss] Problems with remote mount via routed IB > > The problem with CM is that it seems to require configuring IP over > Infiniband. > > I'm rather strongly opposed to IP over IB.? We did run IPoIB years > ago, but pulled it out of our environment as adding unneeded > complexity.? It requires provisioning IP addresses across the > Infiniband infrastructure and possibly adding routers to other > portions of the IP infrastructure.? It was also confusing some users > due to multiple IPs on the compute infrastructure. > > We have recently been in discussions with a vendor about their > support for GPFS over IB and they kept directing us to using CM > (which still didn't work).? CM wasn't necessary once we found out > about the actual problem (we needed the undocumented > verbsRdmaUseGidIndexZero configuration option among other things due > to their use of SR-IOV based virtual IB interfaces). > > We don't use routed Infiniband and it might be that CM and IPoIB is > required for IB routing, but I doubt it.? It sounds like the OP is > keeping IB and IP infrastructure separate. > > Stuart Barkley > > On Mon, 26 Feb 2018 at 14:16 -0000, Aaron Knister wrote: > > > Date: Mon, 26 Feb 2018 14:16:34 > > From: Aaron Knister > > > Reply-To: gpfsug main discussion list > > > > > To: gpfsug-discuss at spectrumscale.org > > > Subject: Re: [gpfsug-discuss] Problems with remote mount via > routed IB > > > > Hi Jan Erik, > > > > It was my understanding that the IB hardware router required RDMA > CM to work. > > By default GPFS doesn't use the RDMA Connection Manager but it can be > > enabled (e.g. verbsRdmaCm=enable). I think this requires a restart on > > clients/servers (in both clusters) to take effect. Maybe someone else > > on the list can comment in more detail-- I've been told folks have > > successfully deployed IB routers with GPFS. > > > > -Aaron > > > > On 2/26/18 11:38 AM, Sundermann, Jan Erik (SCC) wrote: > > > > > > Dear all > > > > > > we are currently trying to remote mount a file system in a routed > > > Infiniband test setup and face problems with dropped RDMA > > > connections. The setup is the > > > following: > > > > > > - Spectrum Scale Cluster 1 is setup on four servers which are > > > connected to the same infiniband network. Additionally they are > > > connected to a fast ethernet providing ip communication in the > network 192.168.11.0/24 . 
> > > > > > - Spectrum Scale Cluster 2 is setup on four additional servers > which > > > are connected to a second infiniband network. These servers > have IPs > > > on their IB interfaces in the network 192.168.12.0/24 > . > > > > > > - IP is routed between 192.168.11.0/24 > and 192.168.12.0/24 on a > > > dedicated machine. > > > > > > - We have a dedicated IB hardware router connected to both IB > subnets. > > > > > > > > > We tested that the routing, both IP and IB, is working between the > > > two clusters without problems and that RDMA is working fine > both for > > > internal communication inside cluster 1 and cluster 2 > > > > > > When trying to remote mount a file system from cluster 1 in cluster > > > 2, RDMA communication is not working as expected. Instead we see > > > error messages on the remote host (cluster 2) > > > > > > > > > 2018-02-23_13:48:47.037+0100: [I] VERBS RDMA connecting to > > > 192.168.11.4 (iccn004-gpfs in gpfsstorage.localdomain) on mlx4_0 > > > port 1 fabnum 0 index 2 > > > 2018-02-23_13:48:49.890+0100: [I] VERBS RDMA connected to > > > 192.168.11.4 (iccn004-gpfs in gpfsstorage.localdomain) on mlx4_0 > > > port 1 fabnum 0 sl 0 index 2 > > > 2018-02-23_13:48:53.138+0100: [E] VERBS RDMA closed connection to > > > 192.168.11.1 (iccn001-gpfs in gpfsstorage.localdomain) on mlx4_0 > > > port 1 fabnum 0 error 733 index 3 > > > 2018-02-23_13:48:53.854+0100: [I] VERBS RDMA connecting to > > > 192.168.11.1 (iccn001-gpfs in gpfsstorage.localdomain) on mlx4_0 > > > port 1 fabnum 0 index 3 > > > 2018-02-23_13:48:54.954+0100: [E] VERBS RDMA closed connection to > > > 192.168.11.3 (iccn003-gpfs in gpfsstorage.localdomain) on mlx4_0 > > > port 1 fabnum 0 error 733 index 1 > > > 2018-02-23_13:48:55.601+0100: [I] VERBS RDMA connected to > > > 192.168.11.1 (iccn001-gpfs in gpfsstorage.localdomain) on mlx4_0 > > > port 1 fabnum 0 sl 0 index 3 > > > 2018-02-23_13:48:57.775+0100: [I] VERBS RDMA connecting to > > > 192.168.11.3 (iccn003-gpfs in gpfsstorage.localdomain) on mlx4_0 > > > port 1 fabnum 0 index 1 > > > 2018-02-23_13:48:59.557+0100: [I] VERBS RDMA connected to > > > 192.168.11.3 (iccn003-gpfs in gpfsstorage.localdomain) on mlx4_0 > > > port 1 fabnum 0 sl 0 index 1 > > > 2018-02-23_13:48:59.876+0100: [E] VERBS RDMA closed connection to > > > 192.168.11.2 (iccn002-gpfs in gpfsstorage.localdomain) on mlx4_0 > > > port 1 fabnum 0 error 733 index 0 > > > 2018-02-23_13:49:02.020+0100: [I] VERBS RDMA connecting to > > > 192.168.11.2 (iccn002-gpfs in gpfsstorage.localdomain) on mlx4_0 > > > port 1 fabnum 0 index 0 > > > 2018-02-23_13:49:03.477+0100: [I] VERBS RDMA connected to > > > 192.168.11.2 (iccn002-gpfs in gpfsstorage.localdomain) on mlx4_0 > > > port 1 fabnum 0 sl 0 index 0 > > > 2018-02-23_13:49:05.119+0100: [E] VERBS RDMA closed connection to > > > 192.168.11.4 (iccn004-gpfs in gpfsstorage.localdomain) on mlx4_0 > > > port 1 fabnum 0 error 733 index 2 > > > 2018-02-23_13:49:06.191+0100: [I] VERBS RDMA connecting to > > > 192.168.11.4 (iccn004-gpfs in gpfsstorage.localdomain) on mlx4_0 > > > port 1 fabnum 0 index 2 > > > 2018-02-23_13:49:06.548+0100: [I] VERBS RDMA connected to > > > 192.168.11.4 (iccn004-gpfs in gpfsstorage.localdomain) on mlx4_0 > > > port 1 fabnum 0 sl 0 index 2 > > > 2018-02-23_13:49:11.578+0100: [E] VERBS RDMA closed connection to > > > 192.168.11.1 (iccn001-gpfs in gpfsstorage.localdomain) on mlx4_0 > > > port 1 fabnum 0 error 733 index 3 > > > 2018-02-23_13:49:11.937+0100: [I] VERBS RDMA connecting to > > > 192.168.11.1 (iccn001-gpfs in 
gpfsstorage.localdomain) on mlx4_0 > > > port 1 fabnum 0 index 3 > > > 2018-02-23_13:49:11.939+0100: [I] VERBS RDMA connected to > > > 192.168.11.1 (iccn001-gpfs in gpfsstorage.localdomain) on mlx4_0 > > > port 1 fabnum 0 sl 0 index 3 > > > > > > > > > and in the cluster with the file system (cluster 1) > > > > > > 2018-02-23_13:47:36.112+0100: [E] VERBS RDMA rdma read error > > > IBV_WC_RETRY_EXC_ERR to 192.168.12.5 (iccn005-ib in > > > gpfsremoteclients.localdomain) on mlx4_0 port 1 fabnum 0 vendor_err > > > 129 > > > 2018-02-23_13:47:36.112+0100: [E] VERBS RDMA closed connection to > > > 192.168.12.5 (iccn005-ib in gpfsremoteclients.localdomain) on > mlx4_0 > > > port 1 fabnum 0 due to RDMA read error IBV_WC_RETRY_EXC_ERR index 3 > > > 2018-02-23_13:47:47.161+0100: [I] VERBS RDMA accepted and connected > > > to > > > 192.168.12.5 (iccn005-ib in gpfsremoteclients.localdomain) on > mlx4_0 > > > port 1 fabnum 0 sl 0 index 3 > > > 2018-02-23_13:48:04.317+0100: [E] VERBS RDMA rdma read error > > > IBV_WC_RETRY_EXC_ERR to 192.168.12.5 (iccn005-ib in > > > gpfsremoteclients.localdomain) on mlx4_0 port 1 fabnum 0 vendor_err > > > 129 > > > 2018-02-23_13:48:04.317+0100: [E] VERBS RDMA closed connection to > > > 192.168.12.5 (iccn005-ib in gpfsremoteclients.localdomain) on > mlx4_0 > > > port 1 fabnum 0 due to RDMA read error IBV_WC_RETRY_EXC_ERR index 3 > > > 2018-02-23_13:48:11.560+0100: [I] VERBS RDMA accepted and connected > > > to > > > 192.168.12.5 (iccn005-ib in gpfsremoteclients.localdomain) on > mlx4_0 > > > port 1 fabnum 0 sl 0 index 3 > > > 2018-02-23_13:48:32.523+0100: [E] VERBS RDMA rdma read error > > > IBV_WC_RETRY_EXC_ERR to 192.168.12.5 (iccn005-ib in > > > gpfsremoteclients.localdomain) on mlx4_0 port 1 fabnum 0 vendor_err > > > 129 > > > 2018-02-23_13:48:32.523+0100: [E] VERBS RDMA closed connection to > > > 192.168.12.5 (iccn005-ib in gpfsremoteclients.localdomain) on > mlx4_0 > > > port 1 fabnum 0 due to RDMA read error IBV_WC_RETRY_EXC_ERR index 3 > > > 2018-02-23_13:48:35.398+0100: [I] VERBS RDMA accepted and connected > > > to > > > 192.168.12.5 (iccn005-ib in gpfsremoteclients.localdomain) on > mlx4_0 > > > port 1 fabnum 0 sl 0 index 3 > > > 2018-02-23_13:48:53.135+0100: [E] VERBS RDMA rdma read error > > > IBV_WC_RETRY_EXC_ERR to 192.168.12.5 (iccn005-ib in > > > gpfsremoteclients.localdomain) on mlx4_0 port 1 fabnum 0 vendor_err > > > 129 > > > 2018-02-23_13:48:53.135+0100: [E] VERBS RDMA closed connection to > > > 192.168.12.5 (iccn005-ib in gpfsremoteclients.localdomain) on > mlx4_0 > > > port 1 fabnum 0 due to RDMA read error IBV_WC_RETRY_EXC_ERR index 3 > > > 2018-02-23_13:48:55.600+0100: [I] VERBS RDMA accepted and connected > > > to > > > 192.168.12.5 (iccn005-ib in gpfsremoteclients.localdomain) on > mlx4_0 > > > port 1 fabnum 0 sl 0 index 3 > > > 2018-02-23_13:49:11.577+0100: [E] VERBS RDMA rdma read error > > > IBV_WC_RETRY_EXC_ERR to 192.168.12.5 (iccn005-ib in > > > gpfsremoteclients.localdomain) on mlx4_0 port 1 fabnum 0 vendor_err > > > 129 > > > 2018-02-23_13:49:11.577+0100: [E] VERBS RDMA closed connection to > > > 192.168.12.5 (iccn005-ib in gpfsremoteclients.localdomain) on > mlx4_0 > > > port 1 fabnum 0 due to RDMA read error IBV_WC_RETRY_EXC_ERR index 3 > > > 2018-02-23_13:49:11.939+0100: [I] VERBS RDMA accepted and connected > > > to > > > 192.168.12.5 (iccn005-ib in gpfsremoteclients.localdomain) on > mlx4_0 > > > port 1 fabnum 0 sl 0 index 3 > > > > > > > > > > > > Any advice on how to configure the setup in a way that would allow > > > the remote 
mount via routed IB would be very appreciated.
> > >
> > > Thank you and best regards
> > > Jan Erik
> > >
> > > _______________________________________________
> > > gpfsug-discuss mailing list
> > > gpfsug-discuss at spectrumscale.org
> > > http://gpfsug.org/mailman/listinfo/gpfsug-discuss
> >
> > --
> > Aaron Knister
> > NASA Center for Climate Simulation (Code 606.2) Goddard Space Flight
> > Center
> > (301) 286-2776
> > _______________________________________________
> > gpfsug-discuss mailing list
> > gpfsug-discuss at spectrumscale.org
> > http://gpfsug.org/mailman/listinfo/gpfsug-discuss
>
> --
> I've never been lost; I was once bewildered for three days, but never lost!
>                                         -- Daniel Boone
> _______________________________________________
> gpfsug-discuss mailing list
> gpfsug-discuss at spectrumscale.org
> http://gpfsug.org/mailman/listinfo/gpfsug-discuss
>
> _______________________________________________
> gpfsug-discuss mailing list
> gpfsug-discuss at spectrumscale.org
> http://gpfsug.org/mailman/listinfo/gpfsug-discuss

--
Karlsruhe Institute of Technology (KIT)
Steinbuch Centre for Computing (SCC)

Jan Erik Sundermann

Hermann-von-Helmholtz-Platz 1, Building 449, Room 226
D-76344 Eggenstein-Leopoldshafen

Tel: +49 721 608 26191
Email: jan.sundermann at kit.edu
www.scc.kit.edu

KIT – The Research University in the Helmholtz Association

Since 2010, KIT has been certified as a family-friendly university.

-------------- next part --------------
A non-text attachment was scrubbed...
Name: smime.p7s Type: application/pkcs7-signature Size: 5382 bytes Desc: S/MIME Cryptographic Signature URL: From alex at calicolabs.com Tue Mar 13 17:48:21 2018 From: alex at calicolabs.com (Alex Chekholko) Date: Tue, 13 Mar 2018 10:48:21 -0700 Subject: [gpfsug-discuss] Preferred NSD In-Reply-To: <20180313141630.k2lqvcndz7mrakgs@ics.muni.cz> References: <20180312145105.yb24bzo6chlpthwm@ics.muni.cz> <20180313141630.k2lqvcndz7mrakgs@ics.muni.cz> Message-ID: Hi Lukas, I would like to discourage you from building a large distributed clustered filesystem made of many unreliable components. You will need to overprovision your interconnect and will also spend a lot of time in "healing" or "degraded" state. It is typically cheaper to centralize the storage into a subset of nodes and configure those to be more highly available. E.g. of your 60 nodes, take 8 and put all the storage into those and make that a dedicated GPFS cluster with no compute jobs on those nodes. Again, you'll still need really beefy and reliable interconnect to make this work. Stepping back; what is the actual problem you're trying to solve? I have certainly been in that situation before, where the problem is more like: "I have a fixed hardware configuration that I can't change, and I want to try to shoehorn a parallel filesystem onto that." I would recommend looking closer at your actual workloads. If this is a "scratch" filesystem and file access is mostly from one node at a time, it's not very useful to make two additional copies of that data on other nodes, and it will only slow you down. Regards, Alex On Tue, Mar 13, 2018 at 7:16 AM, Lukas Hejtmanek wrote: > On Tue, Mar 13, 2018 at 10:37:43AM +0000, John Hearns wrote: > > Lukas, > > It looks like you are proposing a setup which uses your compute servers > as storage servers also? > > yes, exactly. I would like to utilise NVMe SSDs that are in every compute > servers.. Using them as a shared scratch area with GPFS is one of the > options. > > > > > * I'm thinking about the following setup: > > ~ 60 nodes, each with two enterprise NVMe SSDs, FDR IB interconnected > > > > There is nothing wrong with this concept, for instance see > > https://www.beegfs.io/wiki/BeeOND > > > > I have an NVMe filesystem which uses 60 drives, but there are 10 servers. > > You should look at "failure zones" also. > > you still need the storage servers and local SSDs to use only for caching, > do > I understand correctly? > > > > > From: gpfsug-discuss-bounces at spectrumscale.org [mailto:gpfsug-discuss- > bounces at spectrumscale.org] On Behalf Of Knister, Aaron S. > (GSFC-606.2)[COMPUTER SCIENCE CORP] > > Sent: Monday, March 12, 2018 4:14 PM > > To: gpfsug main discussion list > > Subject: Re: [gpfsug-discuss] Preferred NSD > > > > Hi Lukas, > > > > Check out FPO mode. That mimics Hadoop's data placement features. You > can have up to 3 replicas both data and metadata but still the downside, > though, as you say is the wrong node failures will take your cluster down. > > > > You might want to check out something like Excelero's NVMesh (note: not > an endorsement since I can't give such things) which can create logical > volumes across all your NVMe drives. The product has erasure coding on > their roadmap. I'm not sure if they've released that feature yet but in > theory it will give better fault tolerance *and* you'll get more efficient > usage of your SSDs. > > > > I'm sure there are other ways to skin this cat too. 
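For concreteness, the FPO-style data placement mentioned above is normally expressed through pool and NSD stanzas at file system creation time. The sketch below is illustrative only; the device paths, node names, block size and failure-group topology vectors are assumptions, and metadata would still need NSDs in the system pool, which are not shown:

    # fpo.stanza -- each compute node serves its own NVMe device; the data
    # pool is created with write affinity so the first replica lands on the
    # node doing the writing
    %pool: pool=fpodata blockSize=2M layoutMap=cluster allowWriteAffinity=yes writeAffinityDepth=1 blockGroupFactor=128
    %nsd: nsd=node01_nvme0 device=/dev/nvme0n1 servers=node01 usage=dataOnly failureGroup=1,0,1 pool=fpodata
    %nsd: nsd=node02_nvme0 device=/dev/nvme0n1 servers=node02 usage=dataOnly failureGroup=2,0,1 pool=fpodata
    # ...one %nsd line per node and device, then something like:
    # mmcrfs scratch -F fpo.stanza -m 2 -r 2 -A yes -T /scratch

With two data replicas and writeAffinityDepth=1, the first copy stays on the writing node and the second goes to another failure group, which is roughly the "prefer the local NSD" behaviour asked about at the start of the thread.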
> > > > -Aaron > > > > > > > > On March 12, 2018 at 10:59:35 EDT, Lukas Hejtmanek > wrote: > > Hello, > > > > I'm thinking about the following setup: > > ~ 60 nodes, each with two enterprise NVMe SSDs, FDR IB interconnected > > > > I would like to setup shared scratch area using GPFS and those NVMe > SSDs. Each > > SSDs as on NSD. > > > > I don't think like 5 or more data/metadata replicas are practical here. > On the > > other hand, multiple node failures is something really expected. > > > > Is there a way to instrument that local NSD is strongly preferred to > store > > data? I.e. node failure most probably does not result in unavailable > data for > > the other nodes? > > > > Or is there any other recommendation/solution to build shared scratch > with > > GPFS in such setup? (Do not do it including.) > > > > -- > > Luk?? Hejtm?nek > > _______________________________________________ > > gpfsug-discuss mailing list > > gpfsug-discuss at spectrumscale.org > > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > -- The information contained in this communication and any attachments > is confidential and may be privileged, and is for the sole use of the > intended recipient(s). Any unauthorized review, use, disclosure or > distribution is prohibited. Unless explicitly stated otherwise in the body > of this communication or the attachment thereto (if any), the information > is provided on an AS-IS basis without any express or implied warranties or > liabilities. To the extent you are relying on this information, you are > doing so at your own risk. If you are not the intended recipient, please > notify the sender immediately by replying to this message and destroy all > copies of this message and any attachments. Neither the sender nor the > company/group of companies he or she represents shall be liable for the > proper and complete transmission of the information contained in this > communication, or for any delay in its receipt. > > > _______________________________________________ > > gpfsug-discuss mailing list > > gpfsug-discuss at spectrumscale.org > > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > > -- > Luk?? Hejtm?nek > > Linux Administrator only because > Full Time Multitasking Ninja > is not an official job title > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > -------------- next part -------------- An HTML attachment was scrubbed... URL: From zmance at ucar.edu Tue Mar 13 19:38:48 2018 From: zmance at ucar.edu (Zachary Mance) Date: Tue, 13 Mar 2018 13:38:48 -0600 Subject: [gpfsug-discuss] Problems with remote mount via routed IB In-Reply-To: References: <471B111F-5DAA-4912-829C-9AA75DCB76FA@kit.edu> Message-ID: Hi Jan, I am NOT using the pre-populated cache that mellanox refers to in it's documentation. After chatting with support, I don't believe that's necessary anymore (I didn't get a straight answer out of them). For the subnet prefix, make sure to use one from the range 0xfec0000000000000-0xfec000000000001f. 
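The prefix Zach refers to is set in the subnet manager, not in Spectrum Scale. As a hedged illustration, assuming each fabric runs its own OpenSM instance configured through opensm.conf:

    # /etc/opensm/opensm.conf on the subnet manager node of fabric A
    subnet_prefix 0xfec0000000000000
    # /etc/opensm/opensm.conf on the subnet manager node of fabric B
    # (must differ from fabric A so the IB router can tell the subnets apart)
    subnet_prefix 0xfec0000000000001

Both example values sit inside the 0xfec0000000000000-0xfec000000000001f range quoted above.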
--------------------------------------------------------------------------------------------------------------- Zach Mance zmance at ucar.edu (303) 497-1883 HPC Data Infrastructure Group / CISL / NCAR --------------------------------------------------------------------------------------------------------------- On Tue, Mar 13, 2018 at 9:24 AM, Jan Erik Sundermann wrote: > Hello Zachary > > We are currently changing out setup to have IP over IB on all machines to > be able to enable verbsRdmaCm. > > According to Mellanox (https://community.mellanox.com/docs/DOC-2384) > ibacm requires pre-populated caches to be distributed to all end hosts with > the mapping of IP to the routable GIDs (of both IB subnets). Was this also > required in your successful deployment? > > Best > Jan Erik > > > > On 03/12/2018 11:10 PM, Zachary Mance wrote: > >> Since I am testing out remote mounting with EDR IB routers, I'll add to >> the discussion. >> >> In my lab environment I was seeing the same rdma connections being >> established and then disconnected shortly after. The remote filesystem >> would eventually mount on the clients, but it look a quite a while >> (~2mins). Even after mounting, accessing files or any metadata operations >> would take a while to execute, but eventually it happened. >> >> After enabling verbsRdmaCm, everything mounted just fine and in a timely >> manner. Spectrum Scale was using the librdmacm.so library. >> >> I would first double check that you have both clusters able to talk to >> each other on their IPoIB address, then make sure you enable verbsRdmaCm on >> both clusters. >> >> >> ------------------------------------------------------------ >> --------------------------------------------------- >> Zach Mance zmance at ucar.edu (303) 497-1883 >> HPC Data Infrastructure Group / CISL / NCAR >> ------------------------------------------------------------ >> --------------------------------------------------- >> >> On Thu, Mar 1, 2018 at 1:41 AM, John Hearns > > wrote: >> >> In reply to Stuart, >> our setup is entirely Infiniband. We boot and install over IB, and >> rely heavily on IP over Infiniband. >> >> As for users being 'confused' due to multiple IPs, I would >> appreciate some more depth on that one. >> Sure, all batch systems are sensitive to hostnames (as I know to my >> cost!) but once you get that straightened out why should users care? >> I am not being aggressive, just keen to find out more. >> >> >> >> -----Original Message----- >> From: gpfsug-discuss-bounces at spectrumscale.org >> >> [mailto:gpfsug-discuss-bounces at spectrumscale.org >> ] On Behalf Of >> Stuart Barkley >> Sent: Wednesday, February 28, 2018 6:50 PM >> To: gpfsug main discussion list > > >> Subject: Re: [gpfsug-discuss] Problems with remote mount via routed IB >> >> The problem with CM is that it seems to require configuring IP over >> Infiniband. >> >> I'm rather strongly opposed to IP over IB. We did run IPoIB years >> ago, but pulled it out of our environment as adding unneeded >> complexity. It requires provisioning IP addresses across the >> Infiniband infrastructure and possibly adding routers to other >> portions of the IP infrastructure. It was also confusing some users >> due to multiple IPs on the compute infrastructure. >> >> We have recently been in discussions with a vendor about their >> support for GPFS over IB and they kept directing us to using CM >> (which still didn't work). 
CM wasn't necessary once we found out >> about the actual problem (we needed the undocumented >> verbsRdmaUseGidIndexZero configuration option among other things due >> to their use of SR-IOV based virtual IB interfaces). >> >> We don't use routed Infiniband and it might be that CM and IPoIB is >> required for IB routing, but I doubt it. It sounds like the OP is >> keeping IB and IP infrastructure separate. >> >> Stuart Barkley >> >> On Mon, 26 Feb 2018 at 14:16 -0000, Aaron Knister wrote: >> >> > Date: Mon, 26 Feb 2018 14:16:34 >> > From: Aaron Knister > > >> > Reply-To: gpfsug main discussion list >> > > > >> > To: gpfsug-discuss at spectrumscale.org >> >> > Subject: Re: [gpfsug-discuss] Problems with remote mount via >> routed IB >> > >> > Hi Jan Erik, >> > >> > It was my understanding that the IB hardware router required RDMA >> CM to work. >> > By default GPFS doesn't use the RDMA Connection Manager but it can >> be >> > enabled (e.g. verbsRdmaCm=enable). I think this requires a restart >> on >> > clients/servers (in both clusters) to take effect. Maybe someone >> else >> > on the list can comment in more detail-- I've been told folks have >> > successfully deployed IB routers with GPFS. >> > >> > -Aaron >> > >> > On 2/26/18 11:38 AM, Sundermann, Jan Erik (SCC) wrote: >> > > >> > > Dear all >> > > >> > > we are currently trying to remote mount a file system in a routed >> > > Infiniband test setup and face problems with dropped RDMA >> > > connections. The setup is the >> > > following: >> > > >> > > - Spectrum Scale Cluster 1 is setup on four servers which are >> > > connected to the same infiniband network. Additionally they are >> > > connected to a fast ethernet providing ip communication in the >> network 192.168.11.0/24 . >> > > >> > > - Spectrum Scale Cluster 2 is setup on four additional servers >> which >> > > are connected to a second infiniband network. These servers >> have IPs >> > > on their IB interfaces in the network 192.168.12.0/24 >> . >> > > >> > > - IP is routed between 192.168.11.0/24 >> and 192.168.12.0/24 on a >> >> > > dedicated machine. >> > > >> > > - We have a dedicated IB hardware router connected to both IB >> subnets. >> > > >> > > >> > > We tested that the routing, both IP and IB, is working between >> the >> > > two clusters without problems and that RDMA is working fine >> both for >> > > internal communication inside cluster 1 and cluster 2 >> > > >> > > When trying to remote mount a file system from cluster 1 in >> cluster >> > > 2, RDMA communication is not working as expected. 
Instead we see >> > > error messages on the remote host (cluster 2) >> > > >> > > >> > > 2018-02-23_13:48:47.037+0100: [I] VERBS RDMA connecting to >> > > 192.168.11.4 (iccn004-gpfs in gpfsstorage.localdomain) on mlx4_0 >> > > port 1 fabnum 0 index 2 >> > > 2018-02-23_13:48:49.890+0100: [I] VERBS RDMA connected to >> > > 192.168.11.4 (iccn004-gpfs in gpfsstorage.localdomain) on mlx4_0 >> > > port 1 fabnum 0 sl 0 index 2 >> > > 2018-02-23_13:48:53.138+0100: [E] VERBS RDMA closed connection to >> > > 192.168.11.1 (iccn001-gpfs in gpfsstorage.localdomain) on mlx4_0 >> > > port 1 fabnum 0 error 733 index 3 >> > > 2018-02-23_13:48:53.854+0100: [I] VERBS RDMA connecting to >> > > 192.168.11.1 (iccn001-gpfs in gpfsstorage.localdomain) on mlx4_0 >> > > port 1 fabnum 0 index 3 >> > > 2018-02-23_13:48:54.954+0100: [E] VERBS RDMA closed connection to >> > > 192.168.11.3 (iccn003-gpfs in gpfsstorage.localdomain) on mlx4_0 >> > > port 1 fabnum 0 error 733 index 1 >> > > 2018-02-23_13:48:55.601+0100: [I] VERBS RDMA connected to >> > > 192.168.11.1 (iccn001-gpfs in gpfsstorage.localdomain) on mlx4_0 >> > > port 1 fabnum 0 sl 0 index 3 >> > > 2018-02-23_13:48:57.775+0100: [I] VERBS RDMA connecting to >> > > 192.168.11.3 (iccn003-gpfs in gpfsstorage.localdomain) on mlx4_0 >> > > port 1 fabnum 0 index 1 >> > > 2018-02-23_13:48:59.557+0100: [I] VERBS RDMA connected to >> > > 192.168.11.3 (iccn003-gpfs in gpfsstorage.localdomain) on mlx4_0 >> > > port 1 fabnum 0 sl 0 index 1 >> > > 2018-02-23_13:48:59.876+0100: [E] VERBS RDMA closed connection to >> > > 192.168.11.2 (iccn002-gpfs in gpfsstorage.localdomain) on mlx4_0 >> > > port 1 fabnum 0 error 733 index 0 >> > > 2018-02-23_13:49:02.020+0100: [I] VERBS RDMA connecting to >> > > 192.168.11.2 (iccn002-gpfs in gpfsstorage.localdomain) on mlx4_0 >> > > port 1 fabnum 0 index 0 >> > > 2018-02-23_13:49:03.477+0100: [I] VERBS RDMA connected to >> > > 192.168.11.2 (iccn002-gpfs in gpfsstorage.localdomain) on mlx4_0 >> > > port 1 fabnum 0 sl 0 index 0 >> > > 2018-02-23_13:49:05.119+0100: [E] VERBS RDMA closed connection to >> > > 192.168.11.4 (iccn004-gpfs in gpfsstorage.localdomain) on mlx4_0 >> > > port 1 fabnum 0 error 733 index 2 >> > > 2018-02-23_13:49:06.191+0100: [I] VERBS RDMA connecting to >> > > 192.168.11.4 (iccn004-gpfs in gpfsstorage.localdomain) on mlx4_0 >> > > port 1 fabnum 0 index 2 >> > > 2018-02-23_13:49:06.548+0100: [I] VERBS RDMA connected to >> > > 192.168.11.4 (iccn004-gpfs in gpfsstorage.localdomain) on mlx4_0 >> > > port 1 fabnum 0 sl 0 index 2 >> > > 2018-02-23_13:49:11.578+0100: [E] VERBS RDMA closed connection to >> > > 192.168.11.1 (iccn001-gpfs in gpfsstorage.localdomain) on mlx4_0 >> > > port 1 fabnum 0 error 733 index 3 >> > > 2018-02-23_13:49:11.937+0100: [I] VERBS RDMA connecting to >> > > 192.168.11.1 (iccn001-gpfs in gpfsstorage.localdomain) on mlx4_0 >> > > port 1 fabnum 0 index 3 >> > > 2018-02-23_13:49:11.939+0100: [I] VERBS RDMA connected to >> > > 192.168.11.1 (iccn001-gpfs in gpfsstorage.localdomain) on mlx4_0 >> > > port 1 fabnum 0 sl 0 index 3 >> > > >> > > >> > > and in the cluster with the file system (cluster 1) >> > > >> > > 2018-02-23_13:47:36.112+0100: [E] VERBS RDMA rdma read error >> > > IBV_WC_RETRY_EXC_ERR to 192.168.12.5 (iccn005-ib in >> > > gpfsremoteclients.localdomain) on mlx4_0 port 1 fabnum 0 >> vendor_err >> > > 129 >> > > 2018-02-23_13:47:36.112+0100: [E] VERBS RDMA closed connection to >> > > 192.168.12.5 (iccn005-ib in gpfsremoteclients.localdomain) on >> mlx4_0 >> > > port 1 fabnum 0 due to RDMA read 
error IBV_WC_RETRY_EXC_ERR >> index 3 >> > > 2018-02-23_13:47:47.161+0100: [I] VERBS RDMA accepted and >> connected >> > > to >> > > 192.168.12.5 (iccn005-ib in gpfsremoteclients.localdomain) on >> mlx4_0 >> > > port 1 fabnum 0 sl 0 index 3 >> > > 2018-02-23_13:48:04.317+0100: [E] VERBS RDMA rdma read error >> > > IBV_WC_RETRY_EXC_ERR to 192.168.12.5 (iccn005-ib in >> > > gpfsremoteclients.localdomain) on mlx4_0 port 1 fabnum 0 >> vendor_err >> > > 129 >> > > 2018-02-23_13:48:04.317+0100: [E] VERBS RDMA closed connection to >> > > 192.168.12.5 (iccn005-ib in gpfsremoteclients.localdomain) on >> mlx4_0 >> > > port 1 fabnum 0 due to RDMA read error IBV_WC_RETRY_EXC_ERR >> index 3 >> > > 2018-02-23_13:48:11.560+0100: [I] VERBS RDMA accepted and >> connected >> > > to >> > > 192.168.12.5 (iccn005-ib in gpfsremoteclients.localdomain) on >> mlx4_0 >> > > port 1 fabnum 0 sl 0 index 3 >> > > 2018-02-23_13:48:32.523+0100: [E] VERBS RDMA rdma read error >> > > IBV_WC_RETRY_EXC_ERR to 192.168.12.5 (iccn005-ib in >> > > gpfsremoteclients.localdomain) on mlx4_0 port 1 fabnum 0 >> vendor_err >> > > 129 >> > > 2018-02-23_13:48:32.523+0100: [E] VERBS RDMA closed connection to >> > > 192.168.12.5 (iccn005-ib in gpfsremoteclients.localdomain) on >> mlx4_0 >> > > port 1 fabnum 0 due to RDMA read error IBV_WC_RETRY_EXC_ERR >> index 3 >> > > 2018-02-23_13:48:35.398+0100: [I] VERBS RDMA accepted and >> connected >> > > to >> > > 192.168.12.5 (iccn005-ib in gpfsremoteclients.localdomain) on >> mlx4_0 >> > > port 1 fabnum 0 sl 0 index 3 >> > > 2018-02-23_13:48:53.135+0100: [E] VERBS RDMA rdma read error >> > > IBV_WC_RETRY_EXC_ERR to 192.168.12.5 (iccn005-ib in >> > > gpfsremoteclients.localdomain) on mlx4_0 port 1 fabnum 0 >> vendor_err >> > > 129 >> > > 2018-02-23_13:48:53.135+0100: [E] VERBS RDMA closed connection to >> > > 192.168.12.5 (iccn005-ib in gpfsremoteclients.localdomain) on >> mlx4_0 >> > > port 1 fabnum 0 due to RDMA read error IBV_WC_RETRY_EXC_ERR >> index 3 >> > > 2018-02-23_13:48:55.600+0100: [I] VERBS RDMA accepted and >> connected >> > > to >> > > 192.168.12.5 (iccn005-ib in gpfsremoteclients.localdomain) on >> mlx4_0 >> > > port 1 fabnum 0 sl 0 index 3 >> > > 2018-02-23_13:49:11.577+0100: [E] VERBS RDMA rdma read error >> > > IBV_WC_RETRY_EXC_ERR to 192.168.12.5 (iccn005-ib in >> > > gpfsremoteclients.localdomain) on mlx4_0 port 1 fabnum 0 >> vendor_err >> > > 129 >> > > 2018-02-23_13:49:11.577+0100: [E] VERBS RDMA closed connection to >> > > 192.168.12.5 (iccn005-ib in gpfsremoteclients.localdomain) on >> mlx4_0 >> > > port 1 fabnum 0 due to RDMA read error IBV_WC_RETRY_EXC_ERR >> index 3 >> > > 2018-02-23_13:49:11.939+0100: [I] VERBS RDMA accepted and >> connected >> > > to >> > > 192.168.12.5 (iccn005-ib in gpfsremoteclients.localdomain) on >> mlx4_0 >> > > port 1 fabnum 0 sl 0 index 3 >> > > >> > > >> > > >> > > Any advice on how to configure the setup in a way that would >> allow >> > > the remote mount via routed IB would be very appreciated. 
>> > > >> > > >> > > Thank you and best regards >> > > Jan Erik >> > > >> > > >> > > >> > > >> > > _______________________________________________ >> > > gpfsug-discuss mailing list >> > > gpfsug-discuss at spectrumscale.org >> > > >> https://emea01.safelinks.protection.outlook.com/?url=http%3A%2F%2Fgp >> > > >> > > fsug.org >> %2Fmailman%2Flistinfo%2Fgpfsug-discuss&data >> =01%7C01%7Cjohn.h >> > > earns%40asml.com >> %7Ce40045550fc3467dd62808d57ed4d0d9% >> 7Caf73baa8f5944e >> > > >> b2a39d93e96cad61fc%7C1&sdata=v%2F35G6ZnlHFBm%2BfVddvcuraFd9FRChyOSRE >> > > YpqcNNP8%3D&reserved=0 >> > > >> > >> > -- >> > Aaron Knister >> > NASA Center for Climate Simulation (Code 606.2) Goddard Space Flight >> > Center >> > (301) 286-2776 >> > _______________________________________________ >> > gpfsug-discuss mailing list >> > gpfsug-discuss at spectrumscale.org >> > >> https://emea01.safelinks.protection.outlook.com/?url=http% >> 3A%2F%2Fgpfs >> > 3A%2F%2Fgpfs> >> > ug.org >> %2Fmailman%2Flistinfo%2Fgpfsug-discuss&data= >> 01%7C01%7Cjohn.hearn >> > s%40asml.com >> %7Ce40045550fc3467dd62808d57ed4d0d9% >> 7Caf73baa8f5944eb2a39d >> > >> 93e96cad61fc%7C1&sdata=v%2F35G6ZnlHFBm%2BfVddvcuraFd9FRChyOS >> REYpqcNNP8 >> > %3D&reserved=0 >> > >> >> -- >> I've never been lost; I was once bewildered for three days, but >> never lost! >> -- Daniel Boone >> _______________________________________________ >> gpfsug-discuss mailing list >> gpfsug-discuss at spectrumscale.org >> https://emea01.safelinks.protection.outlook.com/?url=http% >> 3A%2F%2Fgpfsug.org%2Fmailman%2Flistinfo%2Fgpfsug-discuss& >> data=01%7C01%7Cjohn.hearns%40asml.com%7Ce40045550fc3467dd >> 62808d57ed4d0d9%7Caf73baa8f5944eb2a39d93e96cad61fc%7C1& >> sdata=v%2F35G6ZnlHFBm%2BfVddvcuraFd9FRChyOSREYpqcNNP8%3D&reserved=0 >> > 3A%2F%2Fgpfsug.org%2Fmailman%2Flistinfo%2Fgpfsug-discuss& >> data=01%7C01%7Cjohn.hearns%40asml.com%7Ce40045550fc3467dd >> 62808d57ed4d0d9%7Caf73baa8f5944eb2a39d93e96cad61fc%7C1& >> sdata=v%2F35G6ZnlHFBm%2BfVddvcuraFd9FRChyOSREYpqcNNP8%3D&reserved=0> >> -- The information contained in this communication and any >> attachments is confidential and may be privileged, and is for the >> sole use of the intended recipient(s). Any unauthorized review, use, >> disclosure or distribution is prohibited. Unless explicitly stated >> otherwise in the body of this communication or the attachment >> thereto (if any), the information is provided on an AS-IS basis >> without any express or implied warranties or liabilities. To the >> extent you are relying on this information, you are doing so at your >> own risk. If you are not the intended recipient, please notify the >> sender immediately by replying to this message and destroy all >> copies of this message and any attachments. Neither the sender nor >> the company/group of companies he or she represents shall be liable >> for the proper and complete transmission of the information >> contained in this communication, or for any delay in its receipt. 
>> _______________________________________________ >> gpfsug-discuss mailing list >> gpfsug-discuss at spectrumscale.org >> http://gpfsug.org/mailman/listinfo/gpfsug-discuss >> >> >> >> >> >> _______________________________________________ >> gpfsug-discuss mailing list >> gpfsug-discuss at spectrumscale.org >> http://gpfsug.org/mailman/listinfo/gpfsug-discuss >> >> > -- > > Karlsruhe Institute of Technology (KIT) > Steinbuch Centre for Computing (SCC) > > Jan Erik Sundermann > > Hermann-von-Helmholtz-Platz 1, Building 449, Room 226 > D-76344 Eggenstein-Leopoldshafen > > Tel: +49 721 608 26191 > Email: jan.sundermann at kit.edu > www.scc.kit.edu > > KIT ? The Research University in the Helmholtz Association > > Since 2010, KIT has been certified as a family-friendly university. > > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From xhejtman at ics.muni.cz Wed Mar 14 09:28:15 2018 From: xhejtman at ics.muni.cz (Lukas Hejtmanek) Date: Wed, 14 Mar 2018 10:28:15 +0100 Subject: [gpfsug-discuss] Preferred NSD In-Reply-To: References: <20180312145105.yb24bzo6chlpthwm@ics.muni.cz> <20180313141630.k2lqvcndz7mrakgs@ics.muni.cz> Message-ID: <20180314092815.k7thtymu33wg65xt@ics.muni.cz> Hello, thank you for insight. Well, the point is, that I will get ~60 with 120 NVMe disks in it, each about 2TB size. It means that I will have 240TB in NVMe SSD that could build nice shared scratch. Moreover, I have no different HW or place to put these SSDs into. They have to be in the compute nodes. On Tue, Mar 13, 2018 at 10:48:21AM -0700, Alex Chekholko wrote: > I would like to discourage you from building a large distributed clustered > filesystem made of many unreliable components. You will need to > overprovision your interconnect and will also spend a lot of time in > "healing" or "degraded" state. > > It is typically cheaper to centralize the storage into a subset of nodes > and configure those to be more highly available. E.g. of your 60 nodes, > take 8 and put all the storage into those and make that a dedicated GPFS > cluster with no compute jobs on those nodes. Again, you'll still need > really beefy and reliable interconnect to make this work. > > Stepping back; what is the actual problem you're trying to solve? I have > certainly been in that situation before, where the problem is more like: "I > have a fixed hardware configuration that I can't change, and I want to try > to shoehorn a parallel filesystem onto that." > > I would recommend looking closer at your actual workloads. If this is a > "scratch" filesystem and file access is mostly from one node at a time, > it's not very useful to make two additional copies of that data on other > nodes, and it will only slow you down. > > Regards, > Alex > > On Tue, Mar 13, 2018 at 7:16 AM, Lukas Hejtmanek > wrote: > > > On Tue, Mar 13, 2018 at 10:37:43AM +0000, John Hearns wrote: > > > Lukas, > > > It looks like you are proposing a setup which uses your compute servers > > as storage servers also? > > > > yes, exactly. I would like to utilise NVMe SSDs that are in every compute > > servers.. Using them as a shared scratch area with GPFS is one of the > > options. 
> > > > > > > > * I'm thinking about the following setup: > > > ~ 60 nodes, each with two enterprise NVMe SSDs, FDR IB interconnected > > > > > > There is nothing wrong with this concept, for instance see > > > https://www.beegfs.io/wiki/BeeOND > > > > > > I have an NVMe filesystem which uses 60 drives, but there are 10 servers. > > > You should look at "failure zones" also. > > > > you still need the storage servers and local SSDs to use only for caching, > > do > > I understand correctly? > > > > > > > > From: gpfsug-discuss-bounces at spectrumscale.org [mailto:gpfsug-discuss- > > bounces at spectrumscale.org] On Behalf Of Knister, Aaron S. > > (GSFC-606.2)[COMPUTER SCIENCE CORP] > > > Sent: Monday, March 12, 2018 4:14 PM > > > To: gpfsug main discussion list > > > Subject: Re: [gpfsug-discuss] Preferred NSD > > > > > > Hi Lukas, > > > > > > Check out FPO mode. That mimics Hadoop's data placement features. You > > can have up to 3 replicas both data and metadata but still the downside, > > though, as you say is the wrong node failures will take your cluster down. > > > > > > You might want to check out something like Excelero's NVMesh (note: not > > an endorsement since I can't give such things) which can create logical > > volumes across all your NVMe drives. The product has erasure coding on > > their roadmap. I'm not sure if they've released that feature yet but in > > theory it will give better fault tolerance *and* you'll get more efficient > > usage of your SSDs. > > > > > > I'm sure there are other ways to skin this cat too. > > > > > > -Aaron > > > > > > > > > > > > On March 12, 2018 at 10:59:35 EDT, Lukas Hejtmanek > > wrote: > > > Hello, > > > > > > I'm thinking about the following setup: > > > ~ 60 nodes, each with two enterprise NVMe SSDs, FDR IB interconnected > > > > > > I would like to setup shared scratch area using GPFS and those NVMe > > SSDs. Each > > > SSDs as on NSD. > > > > > > I don't think like 5 or more data/metadata replicas are practical here. > > On the > > > other hand, multiple node failures is something really expected. > > > > > > Is there a way to instrument that local NSD is strongly preferred to > > store > > > data? I.e. node failure most probably does not result in unavailable > > data for > > > the other nodes? > > > > > > Or is there any other recommendation/solution to build shared scratch > > with > > > GPFS in such setup? (Do not do it including.) > > > > > > -- > > > Luk?? Hejtm?nek > > > _______________________________________________ > > > gpfsug-discuss mailing list > > > gpfsug-discuss at spectrumscale.org > > > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > > -- The information contained in this communication and any attachments > > is confidential and may be privileged, and is for the sole use of the > > intended recipient(s). Any unauthorized review, use, disclosure or > > distribution is prohibited. Unless explicitly stated otherwise in the body > > of this communication or the attachment thereto (if any), the information > > is provided on an AS-IS basis without any express or implied warranties or > > liabilities. To the extent you are relying on this information, you are > > doing so at your own risk. If you are not the intended recipient, please > > notify the sender immediately by replying to this message and destroy all > > copies of this message and any attachments. 
Neither the sender nor the
> > company/group of companies he or she represents shall be liable for the
> > proper and complete transmission of the information contained in this
> > communication, or for any delay in its receipt.
> >
> > _______________________________________________
> > gpfsug-discuss mailing list
> > gpfsug-discuss at spectrumscale.org
> > http://gpfsug.org/mailman/listinfo/gpfsug-discuss
>
> --
> Lukáš Hejtmánek
>
> Linux Administrator only because
> Full Time Multitasking Ninja
> is not an official job title
> _______________________________________________
> gpfsug-discuss mailing list
> gpfsug-discuss at spectrumscale.org
> http://gpfsug.org/mailman/listinfo/gpfsug-discuss

--
Lukáš Hejtmánek

Linux Administrator only because
Full Time Multitasking Ninja
is not an official job title

From luis.bolinches at fi.ibm.com Wed Mar 14 10:11:31 2018
From: luis.bolinches at fi.ibm.com (Luis Bolinches)
Date: Wed, 14 Mar 2018 10:11:31 +0000
Subject: [gpfsug-discuss] Preferred NSD
In-Reply-To: <20180314092815.k7thtymu33wg65xt@ics.muni.cz>
Message-ID:

Hi,

For reads only, have you looked at the possibility of using LROC?

For writes, in the setup you mention you are down to a maximum of half your network speed (best case), assuming no restripes and no reboots going on at any given time.

--
Ystävällisin terveisin / Kind regards / Saludos cordiales / Salutations

Luis Bolinches
Consultant IT Specialist
Mobile Phone: +358503112585
https://www.youracclaim.com/user/luis-bolinches

"If you always give you will always have" -- Anonymous

> On 14 Mar 2018, at 5.28, Lukas Hejtmanek wrote:
>
> Hello,
>
> thank you for insight. Well, the point is, that I will get ~60 with 120 NVMe
> disks in it, each about 2TB size. It means that I will have 240TB in NVMe SSD
> that could build nice shared scratch. Moreover, I have no different HW or place
> to put these SSDs into. They have to be in the compute nodes.
>
>> On Tue, Mar 13, 2018 at 10:48:21AM -0700, Alex Chekholko wrote:
>> I would like to discourage you from building a large distributed clustered
>> filesystem made of many unreliable components. You will need to
>> overprovision your interconnect and will also spend a lot of time in
>> "healing" or "degraded" state.
>>
>> It is typically cheaper to centralize the storage into a subset of nodes
>> and configure those to be more highly available. E.g. of your 60 nodes,
>> take 8 and put all the storage into those and make that a dedicated GPFS
>> cluster with no compute jobs on those nodes. Again, you'll still need
>> really beefy and reliable interconnect to make this work.
>>
>> Stepping back; what is the actual problem you're trying to solve? I have
>> certainly been in that situation before, where the problem is more like: "I
>> have a fixed hardware configuration that I can't change, and I want to try
>> to shoehorn a parallel filesystem onto that."
>>
>> I would recommend looking closer at your actual workloads. If this is a
>> "scratch" filesystem and file access is mostly from one node at a time,
>> it's not very useful to make two additional copies of that data on other
>> nodes, and it will only slow you down.
>> >> Regards, >> Alex >> >> On Tue, Mar 13, 2018 at 7:16 AM, Lukas Hejtmanek >> wrote: >> >>>> On Tue, Mar 13, 2018 at 10:37:43AM +0000, John Hearns wrote: >>>> Lukas, >>>> It looks like you are proposing a setup which uses your compute servers >>> as storage servers also? >>> >>> yes, exactly. I would like to utilise NVMe SSDs that are in every compute >>> servers.. Using them as a shared scratch area with GPFS is one of the >>> options. >>> >>>> >>>> * I'm thinking about the following setup: >>>> ~ 60 nodes, each with two enterprise NVMe SSDs, FDR IB interconnected >>>> >>>> There is nothing wrong with this concept, for instance see >>>> https://urldefense.proofpoint.com/v2/url?u=https-3A__www.beegfs.io_wiki_BeeOND&d=DwIFBA&c=jf_iaSHvJObTbx-siA1ZOg&r=1mZ896psa5caYzBeaugTlc7TtRejJp3uvKYxas3S7Xc&m=clDRNGIKhf6SYQ2ZZpZvBniUiqx1GU1bYEUbcbCunuo&s=ZUDwVonh6dmGRFw0n9p9QPC2-DFuVyY75gOuD02c07I&e= >>>> >>>> I have an NVMe filesystem which uses 60 drives, but there are 10 servers. >>>> You should look at "failure zones" also. >>> >>> you still need the storage servers and local SSDs to use only for caching, >>> do >>> I understand correctly? >>> >>>> >>>> From: gpfsug-discuss-bounces at spectrumscale.org [mailto:gpfsug-discuss- >>> bounces at spectrumscale.org] On Behalf Of Knister, Aaron S. >>> (GSFC-606.2)[COMPUTER SCIENCE CORP] >>>> Sent: Monday, March 12, 2018 4:14 PM >>>> To: gpfsug main discussion list >>>> Subject: Re: [gpfsug-discuss] Preferred NSD >>>> >>>> Hi Lukas, >>>> >>>> Check out FPO mode. That mimics Hadoop's data placement features. You >>> can have up to 3 replicas both data and metadata but still the downside, >>> though, as you say is the wrong node failures will take your cluster down. >>>> >>>> You might want to check out something like Excelero's NVMesh (note: not >>> an endorsement since I can't give such things) which can create logical >>> volumes across all your NVMe drives. The product has erasure coding on >>> their roadmap. I'm not sure if they've released that feature yet but in >>> theory it will give better fault tolerance *and* you'll get more efficient >>> usage of your SSDs. >>>> >>>> I'm sure there are other ways to skin this cat too. >>>> >>>> -Aaron >>>> >>>> >>>> >>>> On March 12, 2018 at 10:59:35 EDT, Lukas Hejtmanek >> > wrote: >>>> Hello, >>>> >>>> I'm thinking about the following setup: >>>> ~ 60 nodes, each with two enterprise NVMe SSDs, FDR IB interconnected >>>> >>>> I would like to setup shared scratch area using GPFS and those NVMe >>> SSDs. Each >>>> SSDs as on NSD. >>>> >>>> I don't think like 5 or more data/metadata replicas are practical here. >>> On the >>>> other hand, multiple node failures is something really expected. >>>> >>>> Is there a way to instrument that local NSD is strongly preferred to >>> store >>>> data? I.e. node failure most probably does not result in unavailable >>> data for >>>> the other nodes? >>>> >>>> Or is there any other recommendation/solution to build shared scratch >>> with >>>> GPFS in such setup? (Do not do it including.) >>>> >>>> -- >>>> Luk?? 
Hejtm?nek >>>> _______________________________________________ >>>> gpfsug-discuss mailing list >>>> gpfsug-discuss at spectrumscale.org >>>> https://urldefense.proofpoint.com/v2/url?u=http-3A__gpfsug.org_mailman_listinfo_gpfsug-2Ddiscuss&d=DwIFBA&c=jf_iaSHvJObTbx-siA1ZOg&r=1mZ896psa5caYzBeaugTlc7TtRejJp3uvKYxas3S7Xc&m=clDRNGIKhf6SYQ2ZZpZvBniUiqx1GU1bYEUbcbCunuo&s=ZLEoHFOFkjfuvNw57WqNn6-EVjHbASRHgnmRc2YYXpM&e= >>>> -- The information contained in this communication and any attachments >>> is confidential and may be privileged, and is for the sole use of the >>> intended recipient(s). Any unauthorized review, use, disclosure or >>> distribution is prohibited. Unless explicitly stated otherwise in the body >>> of this communication or the attachment thereto (if any), the information >>> is provided on an AS-IS basis without any express or implied warranties or >>> liabilities. To the extent you are relying on this information, you are >>> doing so at your own risk. If you are not the intended recipient, please >>> notify the sender immediately by replying to this message and destroy all >>> copies of this message and any attachments. Neither the sender nor the >>> company/group of companies he or she represents shall be liable for the >>> proper and complete transmission of the information contained in this >>> communication, or for any delay in its receipt. >>> >>>> _______________________________________________ >>>> gpfsug-discuss mailing list >>>> gpfsug-discuss at spectrumscale.org >>>> https://urldefense.proofpoint.com/v2/url?u=http-3A__gpfsug.org_mailman_listinfo_gpfsug-2Ddiscuss&d=DwIFBA&c=jf_iaSHvJObTbx-siA1ZOg&r=1mZ896psa5caYzBeaugTlc7TtRejJp3uvKYxas3S7Xc&m=clDRNGIKhf6SYQ2ZZpZvBniUiqx1GU1bYEUbcbCunuo&s=ZLEoHFOFkjfuvNw57WqNn6-EVjHbASRHgnmRc2YYXpM&e= >>> >>> >>> -- >>> Luk?? Hejtm?nek >>> >>> Linux Administrator only because >>> Full Time Multitasking Ninja >>> is not an official job title >>> _______________________________________________ >>> gpfsug-discuss mailing list >>> gpfsug-discuss at spectrumscale.org >>> https://urldefense.proofpoint.com/v2/url?u=http-3A__gpfsug.org_mailman_listinfo_gpfsug-2Ddiscuss&d=DwIFBA&c=jf_iaSHvJObTbx-siA1ZOg&r=1mZ896psa5caYzBeaugTlc7TtRejJp3uvKYxas3S7Xc&m=clDRNGIKhf6SYQ2ZZpZvBniUiqx1GU1bYEUbcbCunuo&s=ZLEoHFOFkjfuvNw57WqNn6-EVjHbASRHgnmRc2YYXpM&e= >>> > >> _______________________________________________ >> gpfsug-discuss mailing list >> gpfsug-discuss at spectrumscale.org >> https://urldefense.proofpoint. com/v2/url?u=http-3A__gpfsug.org_mailman_listinfo_gpfsug-2Ddiscuss&d=DwIFBA&c=jf_iaSHvJObTbx-siA1ZOg&r=1mZ896psa5caYzBeaugTlc7TtRejJp3uvKYxas3S7Xc&m=clDRNGIKhf6SYQ2ZZpZvBniUiqx1GU1bYEUbcbCunuo&s=ZLEoHFOFkjfuvNw57WqNn6-EVjHbASRHgnmRc2YYXpM&e= > > > -- > Luk?? Hejtm?nek > > Linux Administrator only because > Full Time Multitasking Ninja > is not an official job title > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > https://urldefense.proofpoint.com/v2/url?u=http-3A__gpfsug.org_mailman_listinfo_gpfsug-2Ddiscuss&d=DwIFBA&c=jf_iaSHvJObTbx-siA1ZOg&r=1mZ896psa5caYzBeaugTlc7TtRejJp3uvKYxas3S7Xc&m=clDRNGIKhf6SYQ2ZZpZvBniUiqx1GU1bYEUbcbCunuo&s=ZLEoHFOFkjfuvNw57WqNn6-EVjHbASRHgnmRc2YYXpM&e= > Ellei edell? ole toisin mainittu: / Unless stated otherwise above: Oy IBM Finland Ab PL 265, 00101 Helsinki, Finland Business ID, Y-tunnus: 0195876-3 Registered in Finland -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From S.J.Thompson at bham.ac.uk Wed Mar 14 10:24:39 2018 From: S.J.Thompson at bham.ac.uk (Simon Thompson (IT Research Support)) Date: Wed, 14 Mar 2018 10:24:39 +0000 Subject: [gpfsug-discuss] Preferred NSD In-Reply-To: <20180314092815.k7thtymu33wg65xt@ics.muni.cz> References: <20180312145105.yb24bzo6chlpthwm@ics.muni.cz> <20180313141630.k2lqvcndz7mrakgs@ics.muni.cz> <20180314092815.k7thtymu33wg65xt@ics.muni.cz> Message-ID: I would look at using LROC and possibly using HAWC ... Note you need to be a bit careful with HAWC client side and failure group placement. Simon ?On 14/03/2018, 09:28, "gpfsug-discuss-bounces at spectrumscale.org on behalf of xhejtman at ics.muni.cz" wrote: Hello, thank you for insight. Well, the point is, that I will get ~60 with 120 NVMe disks in it, each about 2TB size. It means that I will have 240TB in NVMe SSD that could build nice shared scratch. Moreover, I have no different HW or place to put these SSDs into. They have to be in the compute nodes. On Tue, Mar 13, 2018 at 10:48:21AM -0700, Alex Chekholko wrote: > I would like to discourage you from building a large distributed clustered > filesystem made of many unreliable components. You will need to > overprovision your interconnect and will also spend a lot of time in > "healing" or "degraded" state. > > It is typically cheaper to centralize the storage into a subset of nodes > and configure those to be more highly available. E.g. of your 60 nodes, > take 8 and put all the storage into those and make that a dedicated GPFS > cluster with no compute jobs on those nodes. Again, you'll still need > really beefy and reliable interconnect to make this work. > > Stepping back; what is the actual problem you're trying to solve? I have > certainly been in that situation before, where the problem is more like: "I > have a fixed hardware configuration that I can't change, and I want to try > to shoehorn a parallel filesystem onto that." > > I would recommend looking closer at your actual workloads. If this is a > "scratch" filesystem and file access is mostly from one node at a time, > it's not very useful to make two additional copies of that data on other > nodes, and it will only slow you down. > > Regards, > Alex > > On Tue, Mar 13, 2018 at 7:16 AM, Lukas Hejtmanek > wrote: > > > On Tue, Mar 13, 2018 at 10:37:43AM +0000, John Hearns wrote: > > > Lukas, > > > It looks like you are proposing a setup which uses your compute servers > > as storage servers also? > > > > yes, exactly. I would like to utilise NVMe SSDs that are in every compute > > servers.. Using them as a shared scratch area with GPFS is one of the > > options. > > > > > > > > * I'm thinking about the following setup: > > > ~ 60 nodes, each with two enterprise NVMe SSDs, FDR IB interconnected > > > > > > There is nothing wrong with this concept, for instance see > > > https://www.beegfs.io/wiki/BeeOND > > > > > > I have an NVMe filesystem which uses 60 drives, but there are 10 servers. > > > You should look at "failure zones" also. > > > > you still need the storage servers and local SSDs to use only for caching, > > do > > I understand correctly? > > > > > > > > From: gpfsug-discuss-bounces at spectrumscale.org [mailto:gpfsug-discuss- > > bounces at spectrumscale.org] On Behalf Of Knister, Aaron S. > > (GSFC-606.2)[COMPUTER SCIENCE CORP] > > > Sent: Monday, March 12, 2018 4:14 PM > > > To: gpfsug main discussion list > > > Subject: Re: [gpfsug-discuss] Preferred NSD > > > > > > Hi Lukas, > > > > > > Check out FPO mode. 
That mimics Hadoop's data placement features. You > > can have up to 3 replicas both data and metadata but still the downside, > > though, as you say is the wrong node failures will take your cluster down. > > > > > > You might want to check out something like Excelero's NVMesh (note: not > > an endorsement since I can't give such things) which can create logical > > volumes across all your NVMe drives. The product has erasure coding on > > their roadmap. I'm not sure if they've released that feature yet but in > > theory it will give better fault tolerance *and* you'll get more efficient > > usage of your SSDs. > > > > > > I'm sure there are other ways to skin this cat too. > > > > > > -Aaron > > > > > > > > > > > > On March 12, 2018 at 10:59:35 EDT, Lukas Hejtmanek > > wrote: > > > Hello, > > > > > > I'm thinking about the following setup: > > > ~ 60 nodes, each with two enterprise NVMe SSDs, FDR IB interconnected > > > > > > I would like to setup shared scratch area using GPFS and those NVMe > > SSDs. Each > > > SSDs as on NSD. > > > > > > I don't think like 5 or more data/metadata replicas are practical here. > > On the > > > other hand, multiple node failures is something really expected. > > > > > > Is there a way to instrument that local NSD is strongly preferred to > > store > > > data? I.e. node failure most probably does not result in unavailable > > data for > > > the other nodes? > > > > > > Or is there any other recommendation/solution to build shared scratch > > with > > > GPFS in such setup? (Do not do it including.) > > > > > > -- > > > Luk?? Hejtm?nek > > > _______________________________________________ > > > gpfsug-discuss mailing list > > > gpfsug-discuss at spectrumscale.org > > > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > > -- The information contained in this communication and any attachments > > is confidential and may be privileged, and is for the sole use of the > > intended recipient(s). Any unauthorized review, use, disclosure or > > distribution is prohibited. Unless explicitly stated otherwise in the body > > of this communication or the attachment thereto (if any), the information > > is provided on an AS-IS basis without any express or implied warranties or > > liabilities. To the extent you are relying on this information, you are > > doing so at your own risk. If you are not the intended recipient, please > > notify the sender immediately by replying to this message and destroy all > > copies of this message and any attachments. Neither the sender nor the > > company/group of companies he or she represents shall be liable for the > > proper and complete transmission of the information contained in this > > communication, or for any delay in its receipt. > > > > > _______________________________________________ > > > gpfsug-discuss mailing list > > > gpfsug-discuss at spectrumscale.org > > > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > > > > > -- > > Luk?? Hejtm?nek > > > > Linux Administrator only because > > Full Time Multitasking Ninja > > is not an official job title > > _______________________________________________ > > gpfsug-discuss mailing list > > gpfsug-discuss at spectrumscale.org > > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss -- Luk?? 
Hejtm?nek Linux Administrator only because Full Time Multitasking Ninja is not an official job title _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss From zacekm at img.cas.cz Wed Mar 14 10:57:36 2018 From: zacekm at img.cas.cz (Michal Zacek) Date: Wed, 14 Mar 2018 11:57:36 +0100 Subject: [gpfsug-discuss] Preferred NSD In-Reply-To: <20180312152317.mfc6xhsvthticrlh@ics.muni.cz> References: <20180312145105.yb24bzo6chlpthwm@ics.muni.cz> <188417.1520867920@turing-police.cc.vt.edu> <20180312152317.mfc6xhsvthticrlh@ics.muni.cz> Message-ID: <1e9df22c-5a50-5e16-f2e2-ec0ef5427712@img.cas.cz> Hi, I don't think the GPFS is good choice for your setup. Did you consider GlusterFS? It's used at Max Planck Institute at Dresden for HPC computing of? Molecular Biology data. They have similar setup,? tens (hundreds) of computers with shared local storage in glusterfs. But you will need 10Gb network. Michal Dne 12.3.2018 v 16:23 Lukas Hejtmanek napsal(a): > On Mon, Mar 12, 2018 at 11:18:40AM -0400, valdis.kletnieks at vt.edu wrote: >> On Mon, 12 Mar 2018 15:51:05 +0100, Lukas Hejtmanek said: >>> I don't think like 5 or more data/metadata replicas are practical here. On the >>> other hand, multiple node failures is something really expected. >> Umm.. do I want to ask *why*, out of only 60 nodes, multiple node >> failures are an expected event - to the point that you're thinking >> about needing 5 replicas to keep things running? > as of my experience with cluster management, we have multiple nodes down on > regular basis. (HW failure, SW maintenance and so on.) > > I'm basically thinking that 2-3 replicas might not be enough while 5 or more > are becoming too expensive (both disk space and required bandwidth being > scratch space - high i/o load expected). > -------------- next part -------------- A non-text attachment was scrubbed... Name: smime.p7s Type: application/pkcs7-signature Size: 3776 bytes Desc: Elektronicky podpis S/MIME URL: From aaron.s.knister at nasa.gov Wed Mar 14 15:28:53 2018 From: aaron.s.knister at nasa.gov (Aaron Knister) Date: Wed, 14 Mar 2018 11:28:53 -0400 Subject: [gpfsug-discuss] Preferred NSD In-Reply-To: <1e9df22c-5a50-5e16-f2e2-ec0ef5427712@img.cas.cz> References: <20180312145105.yb24bzo6chlpthwm@ics.muni.cz> <188417.1520867920@turing-police.cc.vt.edu> <20180312152317.mfc6xhsvthticrlh@ics.muni.cz> <1e9df22c-5a50-5e16-f2e2-ec0ef5427712@img.cas.cz> Message-ID: <7797cde3-d4be-7f37-315d-c4e672c3a49f@nasa.gov> I don't want to start a religious filesystem war, but I'd give pause to GlusterFS based on a number of operational issues I've personally experienced and seen others experience with it. I'm curious how glusterfs would resolve the issue here of multiple clients failing simultaneously (unless you're talking about using disperse volumes)? That does, actually, bring up an interesting question to IBM which is -- when will mestor see the light of day? This is admittedly something other filesystems can do that GPFS cannot. -Aaron On 3/14/18 6:57 AM, Michal Zacek wrote: > Hi, > > I don't think the GPFS is good choice for your setup. Did you consider > GlusterFS? It's used at Max Planck Institute at Dresden for HPC > computing of? Molecular Biology data. They have similar setup,? tens > (hundreds) of computers with shared local storage in glusterfs. But you > will need 10Gb network. 
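For the dispersed-volume idea mentioned above, a Gluster erasure-coded volume is the usual way to survive several node failures without keeping full replicas. A rough sketch, with hypothetical host and brick names:

    # 4+2 erasure coding across six nodes: any two nodes can fail,
    # at a capacity overhead of 50% instead of 200% for 3-way replicas
    gluster volume create scratch disperse 6 redundancy 2 \
        node01:/bricks/nvme/scratch node02:/bricks/nvme/scratch \
        node03:/bricks/nvme/scratch node04:/bricks/nvme/scratch \
        node05:/bricks/nvme/scratch node06:/bricks/nvme/scratch
    gluster volume start scratch

Erasure coding trades CPU and network for that capacity saving, so small-file and metadata-heavy workloads would need testing first.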
> > Michal > > > Dne 12.3.2018 v 16:23 Lukas Hejtmanek napsal(a): >> On Mon, Mar 12, 2018 at 11:18:40AM -0400, valdis.kletnieks at vt.edu wrote: >>> On Mon, 12 Mar 2018 15:51:05 +0100, Lukas Hejtmanek said: >>>> I don't think like 5 or more data/metadata replicas are practical here. On the >>>> other hand, multiple node failures is something really expected. >>> Umm.. do I want to ask *why*, out of only 60 nodes, multiple node >>> failures are an expected event - to the point that you're thinking >>> about needing 5 replicas to keep things running? >> as of my experience with cluster management, we have multiple nodes down on >> regular basis. (HW failure, SW maintenance and so on.) >> >> I'm basically thinking that 2-3 replicas might not be enough while 5 or more >> are becoming too expensive (both disk space and required bandwidth being >> scratch space - high i/o load expected). >> > > > > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > -- Aaron Knister NASA Center for Climate Simulation (Code 606.2) Goddard Space Flight Center (301) 286-2776 From skylar2 at u.washington.edu Wed Mar 14 15:42:37 2018 From: skylar2 at u.washington.edu (Skylar Thompson) Date: Wed, 14 Mar 2018 15:42:37 +0000 Subject: [gpfsug-discuss] Preferred NSD In-Reply-To: <7797cde3-d4be-7f37-315d-c4e672c3a49f@nasa.gov> References: <20180312145105.yb24bzo6chlpthwm@ics.muni.cz> <188417.1520867920@turing-police.cc.vt.edu> <20180312152317.mfc6xhsvthticrlh@ics.muni.cz> <1e9df22c-5a50-5e16-f2e2-ec0ef5427712@img.cas.cz> <7797cde3-d4be-7f37-315d-c4e672c3a49f@nasa.gov> Message-ID: <20180314154237.u4d3hqraqcn6a4xl@utumno.gs.washington.edu> I agree. We have a small Gluster filesystem we use to perform failover of our job scheduler, but it predates our use of GPFS. We've run into a number of strange failures and "soft failures" (i.e. filesystem admin tools don't work but the filesystem is available), and the logging is much more cryptic and jumbled than mmfs.log. We'll soon be retiring it in favor of GPFS. On Wed, Mar 14, 2018 at 11:28:53AM -0400, Aaron Knister wrote: > I don't want to start a religious filesystem war, but I'd give pause to > GlusterFS based on a number of operational issues I've personally > experienced and seen others experience with it. > > I'm curious how glusterfs would resolve the issue here of multiple clients > failing simultaneously (unless you're talking about using disperse volumes)? > That does, actually, bring up an interesting question to IBM which is -- > when will mestor see the light of day? This is admittedly something other > filesystems can do that GPFS cannot. > > -Aaron > > On 3/14/18 6:57 AM, Michal Zacek wrote: > > Hi, > > > > I don't think the GPFS is good choice for your setup. Did you consider > > GlusterFS? It's used at Max Planck Institute at Dresden for HPC > > computing of? Molecular Biology data. They have similar setup,? tens > > (hundreds) of computers with shared local storage in glusterfs. But you > > will need 10Gb network. > > > > Michal > > > > > > Dne 12.3.2018 v 16:23 Lukas Hejtmanek napsal(a): > > > On Mon, Mar 12, 2018 at 11:18:40AM -0400, valdis.kletnieks at vt.edu wrote: > > > > On Mon, 12 Mar 2018 15:51:05 +0100, Lukas Hejtmanek said: > > > > > I don't think like 5 or more data/metadata replicas are practical here. On the > > > > > other hand, multiple node failures is something really expected. > > > > Umm.. 
do I want to ask *why*, out of only 60 nodes, multiple node > > > > failures are an expected event - to the point that you're thinking > > > > about needing 5 replicas to keep things running? > > > as of my experience with cluster management, we have multiple nodes down on > > > regular basis. (HW failure, SW maintenance and so on.) > > > > > > I'm basically thinking that 2-3 replicas might not be enough while 5 or more > > > are becoming too expensive (both disk space and required bandwidth being > > > scratch space - high i/o load expected). > > > > > > > > > > > > > _______________________________________________ > > gpfsug-discuss mailing list > > gpfsug-discuss at spectrumscale.org > > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > > > -- > Aaron Knister > NASA Center for Climate Simulation (Code 606.2) > Goddard Space Flight Center > (301) 286-2776 > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss -- -- Skylar Thompson (skylar2 at u.washington.edu) -- Genome Sciences Department, System Administrator -- Foege Building S046, (206)-685-7354 -- University of Washington School of Medicine From JRLang at uwyo.edu Wed Mar 14 14:11:35 2018 From: JRLang at uwyo.edu (Jeffrey R. Lang) Date: Wed, 14 Mar 2018 14:11:35 +0000 Subject: [gpfsug-discuss] Preferred NSD In-Reply-To: <20180314092815.k7thtymu33wg65xt@ics.muni.cz> References: <20180312145105.yb24bzo6chlpthwm@ics.muni.cz> <20180313141630.k2lqvcndz7mrakgs@ics.muni.cz> <20180314092815.k7thtymu33wg65xt@ics.muni.cz> Message-ID: Something I haven't heard in this discussion, it that of licensing of GPFS. I believe that once you export disks from a node it then becomes a server node and the license may need to be changed, from client to server. There goes the budget. -----Original Message----- From: gpfsug-discuss-bounces at spectrumscale.org On Behalf Of Lukas Hejtmanek Sent: Wednesday, March 14, 2018 4:28 AM To: gpfsug main discussion list Subject: Re: [gpfsug-discuss] Preferred NSD Hello, thank you for insight. Well, the point is, that I will get ~60 with 120 NVMe disks in it, each about 2TB size. It means that I will have 240TB in NVMe SSD that could build nice shared scratch. Moreover, I have no different HW or place to put these SSDs into. They have to be in the compute nodes. On Tue, Mar 13, 2018 at 10:48:21AM -0700, Alex Chekholko wrote: > I would like to discourage you from building a large distributed > clustered filesystem made of many unreliable components. You will > need to overprovision your interconnect and will also spend a lot of > time in "healing" or "degraded" state. > > It is typically cheaper to centralize the storage into a subset of > nodes and configure those to be more highly available. E.g. of your > 60 nodes, take 8 and put all the storage into those and make that a > dedicated GPFS cluster with no compute jobs on those nodes. Again, > you'll still need really beefy and reliable interconnect to make this work. > > Stepping back; what is the actual problem you're trying to solve? I > have certainly been in that situation before, where the problem is > more like: "I have a fixed hardware configuration that I can't change, > and I want to try to shoehorn a parallel filesystem onto that." > > I would recommend looking closer at your actual workloads. 
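On the licensing point above: with the socket-based editions the designation is per node, so a compute node that starts serving an NSD does need a server (or FPO) designation. Checking and changing it is quick; the node list below is just an example:

    # show how every node in the cluster is currently designated
    mmlslicense -L
    # promote the nodes that will serve NSDs to server licenses
    mmchlicense server --accept -N node01,node02

With capacity-based Data Management Edition licensing the per-node designation stops being the cost driver, which is where the thread goes next.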
If this is > a "scratch" filesystem and file access is mostly from one node at a > time, it's not very useful to make two additional copies of that data > on other nodes, and it will only slow you down. > > Regards, > Alex > > On Tue, Mar 13, 2018 at 7:16 AM, Lukas Hejtmanek > > wrote: > > > On Tue, Mar 13, 2018 at 10:37:43AM +0000, John Hearns wrote: > > > Lukas, > > > It looks like you are proposing a setup which uses your compute > > > servers > > as storage servers also? > > > > yes, exactly. I would like to utilise NVMe SSDs that are in every > > compute servers.. Using them as a shared scratch area with GPFS is > > one of the options. > > > > > > > > * I'm thinking about the following setup: > > > ~ 60 nodes, each with two enterprise NVMe SSDs, FDR IB > > > interconnected > > > > > > There is nothing wrong with this concept, for instance see > > > https://www.beegfs.io/wiki/BeeOND > > > > > > I have an NVMe filesystem which uses 60 drives, but there are 10 servers. > > > You should look at "failure zones" also. > > > > you still need the storage servers and local SSDs to use only for > > caching, do I understand correctly? > > > > > > > > From: gpfsug-discuss-bounces at spectrumscale.org > > > [mailto:gpfsug-discuss- > > bounces at spectrumscale.org] On Behalf Of Knister, Aaron S. > > (GSFC-606.2)[COMPUTER SCIENCE CORP] > > > Sent: Monday, March 12, 2018 4:14 PM > > > To: gpfsug main discussion list > > > Subject: Re: [gpfsug-discuss] Preferred NSD > > > > > > Hi Lukas, > > > > > > Check out FPO mode. That mimics Hadoop's data placement features. > > > You > > can have up to 3 replicas both data and metadata but still the > > downside, though, as you say is the wrong node failures will take your cluster down. > > > > > > You might want to check out something like Excelero's NVMesh > > > (note: not > > an endorsement since I can't give such things) which can create > > logical volumes across all your NVMe drives. The product has erasure > > coding on their roadmap. I'm not sure if they've released that > > feature yet but in theory it will give better fault tolerance *and* > > you'll get more efficient usage of your SSDs. > > > > > > I'm sure there are other ways to skin this cat too. > > > > > > -Aaron > > > > > > > > > > > > On March 12, 2018 at 10:59:35 EDT, Lukas Hejtmanek > > > > > wrote: > > > Hello, > > > > > > I'm thinking about the following setup: > > > ~ 60 nodes, each with two enterprise NVMe SSDs, FDR IB > > > interconnected > > > > > > I would like to setup shared scratch area using GPFS and those > > > NVMe > > SSDs. Each > > > SSDs as on NSD. > > > > > > I don't think like 5 or more data/metadata replicas are practical here. > > On the > > > other hand, multiple node failures is something really expected. > > > > > > Is there a way to instrument that local NSD is strongly preferred > > > to > > store > > > data? I.e. node failure most probably does not result in > > > unavailable > > data for > > > the other nodes? > > > > > > Or is there any other recommendation/solution to build shared > > > scratch > > with > > > GPFS in such setup? (Do not do it including.) > > > > > > -- > > > Luk?? Hejtm?nek > > > _______________________________________________ > > > gpfsug-discuss mailing list > > > gpfsug-discuss at spectrumscale.org > > > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > > -- The information contained in this communication and any > > > attachments > > is confidential and may be privileged, and is for the sole use of > > the intended recipient(s). 
Any unauthorized review, use, disclosure > > or distribution is prohibited. Unless explicitly stated otherwise in > > the body of this communication or the attachment thereto (if any), > > the information is provided on an AS-IS basis without any express or > > implied warranties or liabilities. To the extent you are relying on > > this information, you are doing so at your own risk. If you are not > > the intended recipient, please notify the sender immediately by > > replying to this message and destroy all copies of this message and > > any attachments. Neither the sender nor the company/group of > > companies he or she represents shall be liable for the proper and > > complete transmission of the information contained in this communication, or for any delay in its receipt. > > > > > _______________________________________________ > > > gpfsug-discuss mailing list > > > gpfsug-discuss at spectrumscale.org > > > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > > > > > -- > > Luk?? Hejtm?nek > > > > Linux Administrator only because > > Full Time Multitasking Ninja > > is not an official job title > > _______________________________________________ > > gpfsug-discuss mailing list > > gpfsug-discuss at spectrumscale.org > > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss -- Luk?? Hejtm?nek Linux Administrator only because Full Time Multitasking Ninja is not an official job title _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss From S.J.Thompson at bham.ac.uk Wed Mar 14 16:54:16 2018 From: S.J.Thompson at bham.ac.uk (Simon Thompson (IT Research Support)) Date: Wed, 14 Mar 2018 16:54:16 +0000 Subject: [gpfsug-discuss] Preferred NSD In-Reply-To: References: <20180312145105.yb24bzo6chlpthwm@ics.muni.cz> <20180313141630.k2lqvcndz7mrakgs@ics.muni.cz> <20180314092815.k7thtymu33wg65xt@ics.muni.cz>, Message-ID: Not always true. 1. Use them with socket licenses as HAWC or LROC is OK on a client. 2. Have data management edition and capacity license the amount of storage. Simon ________________________________________ From: gpfsug-discuss-bounces at spectrumscale.org [gpfsug-discuss-bounces at spectrumscale.org] on behalf of Jeffrey R. Lang [JRLang at uwyo.edu] Sent: 14 March 2018 14:11 To: gpfsug main discussion list Subject: Re: [gpfsug-discuss] Preferred NSD Something I haven't heard in this discussion, it that of licensing of GPFS. I believe that once you export disks from a node it then becomes a server node and the license may need to be changed, from client to server. There goes the budget. -----Original Message----- From: gpfsug-discuss-bounces at spectrumscale.org On Behalf Of Lukas Hejtmanek Sent: Wednesday, March 14, 2018 4:28 AM To: gpfsug main discussion list Subject: Re: [gpfsug-discuss] Preferred NSD Hello, thank you for insight. Well, the point is, that I will get ~60 with 120 NVMe disks in it, each about 2TB size. It means that I will have 240TB in NVMe SSD that could build nice shared scratch. Moreover, I have no different HW or place to put these SSDs into. They have to be in the compute nodes. On Tue, Mar 13, 2018 at 10:48:21AM -0700, Alex Chekholko wrote: > I would like to discourage you from building a large distributed > clustered filesystem made of many unreliable components. 
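To make the HAWC suggestion above concrete: HAWC hardens small synchronous writes in the recovery log before they are flushed to the data disks, so the logs need to sit on fast, replicated storage whose failure groups are spread across nodes (the client-side placement caveat mentioned earlier). A rough sketch, assuming a file system called scratchfs and that the option name is unchanged in your release:

    # absorb synchronous writes up to 64 KiB in the recovery log (HAWC)
    mmchfs scratchfs --write-cache-threshold 64K
    # verify the current file system settings
    mmlsfs scratchfs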
You will > need to overprovision your interconnect and will also spend a lot of > time in "healing" or "degraded" state. > > It is typically cheaper to centralize the storage into a subset of > nodes and configure those to be more highly available. E.g. of your > 60 nodes, take 8 and put all the storage into those and make that a > dedicated GPFS cluster with no compute jobs on those nodes. Again, > you'll still need really beefy and reliable interconnect to make this work. > > Stepping back; what is the actual problem you're trying to solve? I > have certainly been in that situation before, where the problem is > more like: "I have a fixed hardware configuration that I can't change, > and I want to try to shoehorn a parallel filesystem onto that." > > I would recommend looking closer at your actual workloads. If this is > a "scratch" filesystem and file access is mostly from one node at a > time, it's not very useful to make two additional copies of that data > on other nodes, and it will only slow you down. > > Regards, > Alex > > On Tue, Mar 13, 2018 at 7:16 AM, Lukas Hejtmanek > > wrote: > > > On Tue, Mar 13, 2018 at 10:37:43AM +0000, John Hearns wrote: > > > Lukas, > > > It looks like you are proposing a setup which uses your compute > > > servers > > as storage servers also? > > > > yes, exactly. I would like to utilise NVMe SSDs that are in every > > compute servers.. Using them as a shared scratch area with GPFS is > > one of the options. > > > > > > > > * I'm thinking about the following setup: > > > ~ 60 nodes, each with two enterprise NVMe SSDs, FDR IB > > > interconnected > > > > > > There is nothing wrong with this concept, for instance see > > > https://www.beegfs.io/wiki/BeeOND > > > > > > I have an NVMe filesystem which uses 60 drives, but there are 10 servers. > > > You should look at "failure zones" also. > > > > you still need the storage servers and local SSDs to use only for > > caching, do I understand correctly? > > > > > > > > From: gpfsug-discuss-bounces at spectrumscale.org > > > [mailto:gpfsug-discuss- > > bounces at spectrumscale.org] On Behalf Of Knister, Aaron S. > > (GSFC-606.2)[COMPUTER SCIENCE CORP] > > > Sent: Monday, March 12, 2018 4:14 PM > > > To: gpfsug main discussion list > > > Subject: Re: [gpfsug-discuss] Preferred NSD > > > > > > Hi Lukas, > > > > > > Check out FPO mode. That mimics Hadoop's data placement features. > > > You > > can have up to 3 replicas both data and metadata but still the > > downside, though, as you say is the wrong node failures will take your cluster down. > > > > > > You might want to check out something like Excelero's NVMesh > > > (note: not > > an endorsement since I can't give such things) which can create > > logical volumes across all your NVMe drives. The product has erasure > > coding on their roadmap. I'm not sure if they've released that > > feature yet but in theory it will give better fault tolerance *and* > > you'll get more efficient usage of your SSDs. > > > > > > I'm sure there are other ways to skin this cat too. > > > > > > -Aaron > > > > > > > > > > > > On March 12, 2018 at 10:59:35 EDT, Lukas Hejtmanek > > > > > wrote: > > > Hello, > > > > > > I'm thinking about the following setup: > > > ~ 60 nodes, each with two enterprise NVMe SSDs, FDR IB > > > interconnected > > > > > > I would like to setup shared scratch area using GPFS and those > > > NVMe > > SSDs. Each > > > SSDs as on NSD. > > > > > > I don't think like 5 or more data/metadata replicas are practical here. 
> > On the > > > other hand, multiple node failures is something really expected. > > > > > > Is there a way to instrument that local NSD is strongly preferred > > > to > > store > > > data? I.e. node failure most probably does not result in > > > unavailable > > data for > > > the other nodes? > > > > > > Or is there any other recommendation/solution to build shared > > > scratch > > with > > > GPFS in such setup? (Do not do it including.) > > > > > > -- > > > Luk?? Hejtm?nek > > > _______________________________________________ > > > gpfsug-discuss mailing list > > > gpfsug-discuss at spectrumscale.org > > > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > > -- The information contained in this communication and any > > > attachments > > is confidential and may be privileged, and is for the sole use of > > the intended recipient(s). Any unauthorized review, use, disclosure > > or distribution is prohibited. Unless explicitly stated otherwise in > > the body of this communication or the attachment thereto (if any), > > the information is provided on an AS-IS basis without any express or > > implied warranties or liabilities. To the extent you are relying on > > this information, you are doing so at your own risk. If you are not > > the intended recipient, please notify the sender immediately by > > replying to this message and destroy all copies of this message and > > any attachments. Neither the sender nor the company/group of > > companies he or she represents shall be liable for the proper and > > complete transmission of the information contained in this communication, or for any delay in its receipt. > > > > > _______________________________________________ > > > gpfsug-discuss mailing list > > > gpfsug-discuss at spectrumscale.org > > > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > > > > > -- > > Luk?? Hejtm?nek > > > > Linux Administrator only because > > Full Time Multitasking Ninja > > is not an official job title > > _______________________________________________ > > gpfsug-discuss mailing list > > gpfsug-discuss at spectrumscale.org > > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss -- Luk?? Hejtm?nek Linux Administrator only because Full Time Multitasking Ninja is not an official job title _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss From r.sobey at imperial.ac.uk Wed Mar 14 17:33:02 2018 From: r.sobey at imperial.ac.uk (Sobey, Richard A) Date: Wed, 14 Mar 2018 17:33:02 +0000 Subject: [gpfsug-discuss] Preferred NSD In-Reply-To: References: <20180312145105.yb24bzo6chlpthwm@ics.muni.cz> <20180313141630.k2lqvcndz7mrakgs@ics.muni.cz> <20180314092815.k7thtymu33wg65xt@ics.muni.cz>, Message-ID: >> 2. Have data management edition and capacity license the amount of storage. There goes the budget ? Richard -----Original Message----- From: gpfsug-discuss-bounces at spectrumscale.org On Behalf Of Simon Thompson (IT Research Support) Sent: 14 March 2018 16:54 To: gpfsug main discussion list Subject: Re: [gpfsug-discuss] Preferred NSD Not always true. 1. Use them with socket licenses as HAWC or LROC is OK on a client. 2. 
Have data management edition and capacity license the amount of storage. Simon ________________________________________ From: gpfsug-discuss-bounces at spectrumscale.org [gpfsug-discuss-bounces at spectrumscale.org] on behalf of Jeffrey R. Lang [JRLang at uwyo.edu] Sent: 14 March 2018 14:11 To: gpfsug main discussion list Subject: Re: [gpfsug-discuss] Preferred NSD Something I haven't heard in this discussion, it that of licensing of GPFS. I believe that once you export disks from a node it then becomes a server node and the license may need to be changed, from client to server. There goes the budget. -----Original Message----- From: gpfsug-discuss-bounces at spectrumscale.org On Behalf Of Lukas Hejtmanek Sent: Wednesday, March 14, 2018 4:28 AM To: gpfsug main discussion list Subject: Re: [gpfsug-discuss] Preferred NSD Hello, thank you for insight. Well, the point is, that I will get ~60 with 120 NVMe disks in it, each about 2TB size. It means that I will have 240TB in NVMe SSD that could build nice shared scratch. Moreover, I have no different HW or place to put these SSDs into. They have to be in the compute nodes. On Tue, Mar 13, 2018 at 10:48:21AM -0700, Alex Chekholko wrote: > I would like to discourage you from building a large distributed > clustered filesystem made of many unreliable components. You will > need to overprovision your interconnect and will also spend a lot of > time in "healing" or "degraded" state. > > It is typically cheaper to centralize the storage into a subset of > nodes and configure those to be more highly available. E.g. of your > 60 nodes, take 8 and put all the storage into those and make that a > dedicated GPFS cluster with no compute jobs on those nodes. Again, > you'll still need really beefy and reliable interconnect to make this work. > > Stepping back; what is the actual problem you're trying to solve? I > have certainly been in that situation before, where the problem is > more like: "I have a fixed hardware configuration that I can't change, > and I want to try to shoehorn a parallel filesystem onto that." > > I would recommend looking closer at your actual workloads. If this is > a "scratch" filesystem and file access is mostly from one node at a > time, it's not very useful to make two additional copies of that data > on other nodes, and it will only slow you down. > > Regards, > Alex > > On Tue, Mar 13, 2018 at 7:16 AM, Lukas Hejtmanek > > wrote: > > > On Tue, Mar 13, 2018 at 10:37:43AM +0000, John Hearns wrote: > > > Lukas, > > > It looks like you are proposing a setup which uses your compute > > > servers > > as storage servers also? > > > > yes, exactly. I would like to utilise NVMe SSDs that are in every > > compute servers.. Using them as a shared scratch area with GPFS is > > one of the options. > > > > > > > > * I'm thinking about the following setup: > > > ~ 60 nodes, each with two enterprise NVMe SSDs, FDR IB > > > interconnected > > > > > > There is nothing wrong with this concept, for instance see > > > https://www.beegfs.io/wiki/BeeOND > > > > > > I have an NVMe filesystem which uses 60 drives, but there are 10 servers. > > > You should look at "failure zones" also. > > > > you still need the storage servers and local SSDs to use only for > > caching, do I understand correctly? > > > > > > > > From: gpfsug-discuss-bounces at spectrumscale.org > > > [mailto:gpfsug-discuss- > > bounces at spectrumscale.org] On Behalf Of Knister, Aaron S. 
> > (GSFC-606.2)[COMPUTER SCIENCE CORP] > > > Sent: Monday, March 12, 2018 4:14 PM > > > To: gpfsug main discussion list > > > Subject: Re: [gpfsug-discuss] Preferred NSD > > > > > > Hi Lukas, > > > > > > Check out FPO mode. That mimics Hadoop's data placement features. > > > You > > can have up to 3 replicas both data and metadata but still the > > downside, though, as you say is the wrong node failures will take your cluster down. > > > > > > You might want to check out something like Excelero's NVMesh > > > (note: not > > an endorsement since I can't give such things) which can create > > logical volumes across all your NVMe drives. The product has erasure > > coding on their roadmap. I'm not sure if they've released that > > feature yet but in theory it will give better fault tolerance *and* > > you'll get more efficient usage of your SSDs. > > > > > > I'm sure there are other ways to skin this cat too. > > > > > > -Aaron > > > > > > > > > > > > On March 12, 2018 at 10:59:35 EDT, Lukas Hejtmanek > > > > > wrote: > > > Hello, > > > > > > I'm thinking about the following setup: > > > ~ 60 nodes, each with two enterprise NVMe SSDs, FDR IB > > > interconnected > > > > > > I would like to setup shared scratch area using GPFS and those > > > NVMe > > SSDs. Each > > > SSDs as on NSD. > > > > > > I don't think like 5 or more data/metadata replicas are practical here. > > On the > > > other hand, multiple node failures is something really expected. > > > > > > Is there a way to instrument that local NSD is strongly preferred > > > to > > store > > > data? I.e. node failure most probably does not result in > > > unavailable > > data for > > > the other nodes? > > > > > > Or is there any other recommendation/solution to build shared > > > scratch > > with > > > GPFS in such setup? (Do not do it including.) > > > > > > -- > > > Luk?? Hejtm?nek > > > _______________________________________________ > > > gpfsug-discuss mailing list > > > gpfsug-discuss at spectrumscale.org > > > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > > -- The information contained in this communication and any > > > attachments > > is confidential and may be privileged, and is for the sole use of > > the intended recipient(s). Any unauthorized review, use, disclosure > > or distribution is prohibited. Unless explicitly stated otherwise in > > the body of this communication or the attachment thereto (if any), > > the information is provided on an AS-IS basis without any express or > > implied warranties or liabilities. To the extent you are relying on > > this information, you are doing so at your own risk. If you are not > > the intended recipient, please notify the sender immediately by > > replying to this message and destroy all copies of this message and > > any attachments. Neither the sender nor the company/group of > > companies he or she represents shall be liable for the proper and > > complete transmission of the information contained in this communication, or for any delay in its receipt. > > > > > _______________________________________________ > > > gpfsug-discuss mailing list > > > gpfsug-discuss at spectrumscale.org > > > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > > > > > -- > > Luk?? 
Hejtm?nek > > > > Linux Administrator only because > > Full Time Multitasking Ninja > > is not an official job title > > _______________________________________________ > > gpfsug-discuss mailing list > > gpfsug-discuss at spectrumscale.org > > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss -- Luk?? Hejtm?nek Linux Administrator only because Full Time Multitasking Ninja is not an official job title _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss From ulmer at ulmer.org Wed Mar 14 18:59:29 2018 From: ulmer at ulmer.org (Stephen Ulmer) Date: Wed, 14 Mar 2018 14:59:29 -0400 Subject: [gpfsug-discuss] Preferred NSD In-Reply-To: References: <20180312145105.yb24bzo6chlpthwm@ics.muni.cz> <20180313141630.k2lqvcndz7mrakgs@ics.muni.cz> <20180314092815.k7thtymu33wg65xt@ics.muni.cz> Message-ID: <11B42072-E3CA-4FCE-BD1C-1E3DCA16626A@ulmer.org> Depending on the size... I just quoted something both ways and DME (which is Advanced Edition equivalent) was about $400K cheaper than Standard Edition socket pricing for this particular customer and use case. It all depends. Also, for the case where the OP wants to distribute the file system around on NVMe in *every* node, there is always the FPO license. The FPO license can share NSDs with other FPO licensed nodes and servers (just not clients). -- Stephen > On Mar 14, 2018, at 1:33 PM, Sobey, Richard A > wrote: > >>> 2. Have data management edition and capacity license the amount of storage. > There goes the budget ? > > Richard > > -----Original Message----- > From: gpfsug-discuss-bounces at spectrumscale.org > On Behalf Of Simon Thompson (IT Research Support) > Sent: 14 March 2018 16:54 > To: gpfsug main discussion list > > Subject: Re: [gpfsug-discuss] Preferred NSD > > Not always true. > > 1. Use them with socket licenses as HAWC or LROC is OK on a client. > 2. Have data management edition and capacity license the amount of storage. > > Simon > ________________________________________ > From: gpfsug-discuss-bounces at spectrumscale.org [gpfsug-discuss-bounces at spectrumscale.org ] on behalf of Jeffrey R. Lang [JRLang at uwyo.edu ] > Sent: 14 March 2018 14:11 > To: gpfsug main discussion list > Subject: Re: [gpfsug-discuss] Preferred NSD > > Something I haven't heard in this discussion, it that of licensing of GPFS. > > I believe that once you export disks from a node it then becomes a server node and the license may need to be changed, from client to server. There goes the budget. > > > > -----Original Message----- > From: gpfsug-discuss-bounces at spectrumscale.org > On Behalf Of Lukas Hejtmanek > Sent: Wednesday, March 14, 2018 4:28 AM > To: gpfsug main discussion list > > Subject: Re: [gpfsug-discuss] Preferred NSD > > Hello, > > thank you for insight. Well, the point is, that I will get ~60 with 120 NVMe disks in it, each about 2TB size. It means that I will have 240TB in NVMe SSD that could build nice shared scratch. 
Moreover, I have no different HW or place to put these SSDs into. They have to be in the compute nodes. > > On Tue, Mar 13, 2018 at 10:48:21AM -0700, Alex Chekholko wrote: >> I would like to discourage you from building a large distributed >> clustered filesystem made of many unreliable components. You will >> need to overprovision your interconnect and will also spend a lot of >> time in "healing" or "degraded" state. >> >> It is typically cheaper to centralize the storage into a subset of >> nodes and configure those to be more highly available. E.g. of your >> 60 nodes, take 8 and put all the storage into those and make that a >> dedicated GPFS cluster with no compute jobs on those nodes. Again, >> you'll still need really beefy and reliable interconnect to make this work. >> >> Stepping back; what is the actual problem you're trying to solve? I >> have certainly been in that situation before, where the problem is >> more like: "I have a fixed hardware configuration that I can't change, >> and I want to try to shoehorn a parallel filesystem onto that." >> >> I would recommend looking closer at your actual workloads. If this is >> a "scratch" filesystem and file access is mostly from one node at a >> time, it's not very useful to make two additional copies of that data >> on other nodes, and it will only slow you down. >> >> Regards, >> Alex >> >> On Tue, Mar 13, 2018 at 7:16 AM, Lukas Hejtmanek >> > >> wrote: >> >>> On Tue, Mar 13, 2018 at 10:37:43AM +0000, John Hearns wrote: >>>> Lukas, >>>> It looks like you are proposing a setup which uses your compute >>>> servers >>> as storage servers also? >>> >>> yes, exactly. I would like to utilise NVMe SSDs that are in every >>> compute servers.. Using them as a shared scratch area with GPFS is >>> one of the options. >>> >>>> >>>> * I'm thinking about the following setup: >>>> ~ 60 nodes, each with two enterprise NVMe SSDs, FDR IB >>>> interconnected >>>> >>>> There is nothing wrong with this concept, for instance see >>>> https://www.beegfs.io/wiki/BeeOND >>>> >>>> I have an NVMe filesystem which uses 60 drives, but there are 10 servers. >>>> You should look at "failure zones" also. >>> >>> you still need the storage servers and local SSDs to use only for >>> caching, do I understand correctly? >>> >>>> >>>> From: gpfsug-discuss-bounces at spectrumscale.org >>>> [mailto:gpfsug-discuss- >>> bounces at spectrumscale.org ] On Behalf Of Knister, Aaron S. >>> (GSFC-606.2)[COMPUTER SCIENCE CORP] >>>> Sent: Monday, March 12, 2018 4:14 PM >>>> To: gpfsug main discussion list > >>>> Subject: Re: [gpfsug-discuss] Preferred NSD >>>> >>>> Hi Lukas, >>>> >>>> Check out FPO mode. That mimics Hadoop's data placement features. >>>> You >>> can have up to 3 replicas both data and metadata but still the >>> downside, though, as you say is the wrong node failures will take your cluster down. >>>> >>>> You might want to check out something like Excelero's NVMesh >>>> (note: not >>> an endorsement since I can't give such things) which can create >>> logical volumes across all your NVMe drives. The product has erasure >>> coding on their roadmap. I'm not sure if they've released that >>> feature yet but in theory it will give better fault tolerance *and* >>> you'll get more efficient usage of your SSDs. >>>> >>>> I'm sure there are other ways to skin this cat too. 
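For reference on the FPO mode described above, write affinity is switched on per storage pool when the file system is created, and each NSD gets a three-part failure group encoding rack, position and node. A rough sketch, with placeholder names and values that would need tuning:

    # fpo.stanza -- pool with write affinity so data lands on the local NSD first
    %pool:
      pool=fpodata
      blockSize=2M
      layoutMap=cluster
      allowWriteAffinity=yes
      writeAffinityDepth=1
      blockGroupFactor=128

    %nsd: device=/dev/nvme0n1 nsd=nsd_node01 servers=node01 usage=dataOnly failureGroup=1,0,1 pool=fpodata
    %nsd: device=/dev/nvme1n1 nsd=nsd_node02 servers=node02 usage=dataOnly failureGroup=1,0,2 pool=fpodata

writeAffinityDepth=1 keeps the first copy on the writing node, which is the behaviour asked about earlier in the thread; the downside already noted is that losing the wrong combination of nodes still takes replicated data offline.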
>>>> >>>> -Aaron >>>> >>>> >>>> >>>> On March 12, 2018 at 10:59:35 EDT, Lukas Hejtmanek >>>> >>> >> wrote: >>>> Hello, >>>> >>>> I'm thinking about the following setup: >>>> ~ 60 nodes, each with two enterprise NVMe SSDs, FDR IB >>>> interconnected >>>> >>>> I would like to setup shared scratch area using GPFS and those >>>> NVMe >>> SSDs. Each >>>> SSDs as on NSD. >>>> >>>> I don't think like 5 or more data/metadata replicas are practical here. >>> On the >>>> other hand, multiple node failures is something really expected. >>>> >>>> Is there a way to instrument that local NSD is strongly preferred >>>> to >>> store >>>> data? I.e. node failure most probably does not result in >>>> unavailable >>> data for >>>> the other nodes? >>>> >>>> Or is there any other recommendation/solution to build shared >>>> scratch >>> with >>>> GPFS in such setup? (Do not do it including.) >>>> >>>> -- >>>> Luk?? Hejtm?nek >>>> _______________________________________________ >>>> gpfsug-discuss mailing list >>>> gpfsug-discuss at spectrumscale.org >>>> http://gpfsug.org/mailman/listinfo/gpfsug-discuss >>>> -- The information contained in this communication and any >>>> attachments >>> is confidential and may be privileged, and is for the sole use of >>> the intended recipient(s). Any unauthorized review, use, disclosure >>> or distribution is prohibited. Unless explicitly stated otherwise in >>> the body of this communication or the attachment thereto (if any), >>> the information is provided on an AS-IS basis without any express or >>> implied warranties or liabilities. To the extent you are relying on >>> this information, you are doing so at your own risk. If you are not >>> the intended recipient, please notify the sender immediately by >>> replying to this message and destroy all copies of this message and >>> any attachments. Neither the sender nor the company/group of >>> companies he or she represents shall be liable for the proper and >>> complete transmission of the information contained in this communication, or for any delay in its receipt. >>> >>>> _______________________________________________ >>>> gpfsug-discuss mailing list >>>> gpfsug-discuss at spectrumscale.org >>>> http://gpfsug.org/mailman/listinfo/gpfsug-discuss >>> >>> >>> -- >>> Luk?? Hejtm?nek >>> >>> Linux Administrator only because >>> Full Time Multitasking Ninja >>> is not an official job title >>> _______________________________________________ >>> gpfsug-discuss mailing list >>> gpfsug-discuss at spectrumscale.org >>> http://gpfsug.org/mailman/listinfo/gpfsug-discuss >>> > >> _______________________________________________ >> gpfsug-discuss mailing list >> gpfsug-discuss at spectrumscale.org >> http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > > -- > Luk?? 
Hejtm?nek > > Linux Administrator only because > Full Time Multitasking Ninja > is not an official job title > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss -------------- next part -------------- An HTML attachment was scrubbed... URL: From stockf at us.ibm.com Wed Mar 14 19:23:18 2018 From: stockf at us.ibm.com (Frederick Stock) Date: Wed, 14 Mar 2018 14:23:18 -0500 Subject: [gpfsug-discuss] Preferred NSD In-Reply-To: <11B42072-E3CA-4FCE-BD1C-1E3DCA16626A@ulmer.org> References: <20180312145105.yb24bzo6chlpthwm@ics.muni.cz><20180313141630.k2lqvcndz7mrakgs@ics.muni.cz><20180314092815.k7thtymu33wg65xt@ics.muni.cz> <11B42072-E3CA-4FCE-BD1C-1E3DCA16626A@ulmer.org> Message-ID: My understanding is that with Spectrum Scale 5.0 there is no longer a standard edition, only data management and advanced, and the pricing is all done via storage not sockets. Now there may be some grandfathering for those with existing socket licenses but I really do not know. My point is that data management is not the same as advanced edition. Again I could be wrong because I tend not to concern myself with how the product is licensed. Fred __________________________________________________ Fred Stock | IBM Pittsburgh Lab | 720-430-8821 stockf at us.ibm.com From: Stephen Ulmer To: gpfsug main discussion list Date: 03/14/2018 03:06 PM Subject: Re: [gpfsug-discuss] Preferred NSD Sent by: gpfsug-discuss-bounces at spectrumscale.org Depending on the size... I just quoted something both ways and DME (which is Advanced Edition equivalent) was about $400K cheaper than Standard Edition socket pricing for this particular customer and use case. It all depends. Also, for the case where the OP wants to distribute the file system around on NVMe in *every* node, there is always the FPO license. The FPO license can share NSDs with other FPO licensed nodes and servers (just not clients). -- Stephen On Mar 14, 2018, at 1:33 PM, Sobey, Richard A wrote: 2. Have data management edition and capacity license the amount of storage. There goes the budget ? Richard -----Original Message----- From: gpfsug-discuss-bounces at spectrumscale.org < gpfsug-discuss-bounces at spectrumscale.org> On Behalf Of Simon Thompson (IT Research Support) Sent: 14 March 2018 16:54 To: gpfsug main discussion list Subject: Re: [gpfsug-discuss] Preferred NSD Not always true. 1. Use them with socket licenses as HAWC or LROC is OK on a client. 2. Have data management edition and capacity license the amount of storage. Simon ________________________________________ From: gpfsug-discuss-bounces at spectrumscale.org [ gpfsug-discuss-bounces at spectrumscale.org] on behalf of Jeffrey R. Lang [ JRLang at uwyo.edu] Sent: 14 March 2018 14:11 To: gpfsug main discussion list Subject: Re: [gpfsug-discuss] Preferred NSD Something I haven't heard in this discussion, it that of licensing of GPFS. 
I believe that once you export disks from a node it then becomes a server node and the license may need to be changed, from client to server. There goes the budget. -----Original Message----- From: gpfsug-discuss-bounces at spectrumscale.org < gpfsug-discuss-bounces at spectrumscale.org> On Behalf Of Lukas Hejtmanek Sent: Wednesday, March 14, 2018 4:28 AM To: gpfsug main discussion list Subject: Re: [gpfsug-discuss] Preferred NSD Hello, thank you for insight. Well, the point is, that I will get ~60 with 120 NVMe disks in it, each about 2TB size. It means that I will have 240TB in NVMe SSD that could build nice shared scratch. Moreover, I have no different HW or place to put these SSDs into. They have to be in the compute nodes. On Tue, Mar 13, 2018 at 10:48:21AM -0700, Alex Chekholko wrote: I would like to discourage you from building a large distributed clustered filesystem made of many unreliable components. You will need to overprovision your interconnect and will also spend a lot of time in "healing" or "degraded" state. It is typically cheaper to centralize the storage into a subset of nodes and configure those to be more highly available. E.g. of your 60 nodes, take 8 and put all the storage into those and make that a dedicated GPFS cluster with no compute jobs on those nodes. Again, you'll still need really beefy and reliable interconnect to make this work. Stepping back; what is the actual problem you're trying to solve? I have certainly been in that situation before, where the problem is more like: "I have a fixed hardware configuration that I can't change, and I want to try to shoehorn a parallel filesystem onto that." I would recommend looking closer at your actual workloads. If this is a "scratch" filesystem and file access is mostly from one node at a time, it's not very useful to make two additional copies of that data on other nodes, and it will only slow you down. Regards, Alex On Tue, Mar 13, 2018 at 7:16 AM, Lukas Hejtmanek wrote: On Tue, Mar 13, 2018 at 10:37:43AM +0000, John Hearns wrote: Lukas, It looks like you are proposing a setup which uses your compute servers as storage servers also? yes, exactly. I would like to utilise NVMe SSDs that are in every compute servers.. Using them as a shared scratch area with GPFS is one of the options. * I'm thinking about the following setup: ~ 60 nodes, each with two enterprise NVMe SSDs, FDR IB interconnected There is nothing wrong with this concept, for instance see https://www.beegfs.io/wiki/BeeOND I have an NVMe filesystem which uses 60 drives, but there are 10 servers. You should look at "failure zones" also. you still need the storage servers and local SSDs to use only for caching, do I understand correctly? From: gpfsug-discuss-bounces at spectrumscale.org [mailto:gpfsug-discuss- bounces at spectrumscale.org] On Behalf Of Knister, Aaron S. (GSFC-606.2)[COMPUTER SCIENCE CORP] Sent: Monday, March 12, 2018 4:14 PM To: gpfsug main discussion list Subject: Re: [gpfsug-discuss] Preferred NSD Hi Lukas, Check out FPO mode. That mimics Hadoop's data placement features. You can have up to 3 replicas both data and metadata but still the downside, though, as you say is the wrong node failures will take your cluster down. You might want to check out something like Excelero's NVMesh (note: not an endorsement since I can't give such things) which can create logical volumes across all your NVMe drives. The product has erasure coding on their roadmap. 
I'm not sure if they've released that feature yet but in theory it will give better fault tolerance *and* you'll get more efficient usage of your SSDs. I'm sure there are other ways to skin this cat too. -Aaron On March 12, 2018 at 10:59:35 EDT, Lukas Hejtmanek > wrote: Hello, I'm thinking about the following setup: ~ 60 nodes, each with two enterprise NVMe SSDs, FDR IB interconnected I would like to setup shared scratch area using GPFS and those NVMe SSDs. Each SSDs as on NSD. I don't think like 5 or more data/metadata replicas are practical here. On the other hand, multiple node failures is something really expected. Is there a way to instrument that local NSD is strongly preferred to store data? I.e. node failure most probably does not result in unavailable data for the other nodes? Or is there any other recommendation/solution to build shared scratch with GPFS in such setup? (Do not do it including.) -- Luk?? Hejtm?nek _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss -- The information contained in this communication and any attachments is confidential and may be privileged, and is for the sole use of the intended recipient(s). Any unauthorized review, use, disclosure or distribution is prohibited. Unless explicitly stated otherwise in the body of this communication or the attachment thereto (if any), the information is provided on an AS-IS basis without any express or implied warranties or liabilities. To the extent you are relying on this information, you are doing so at your own risk. If you are not the intended recipient, please notify the sender immediately by replying to this message and destroy all copies of this message and any attachments. Neither the sender nor the company/group of companies he or she represents shall be liable for the proper and complete transmission of the information contained in this communication, or for any delay in its receipt. _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss -- Luk?? Hejtm?nek Linux Administrator only because Full Time Multitasking Ninja is not an official job title _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss -- Luk?? 
Hejtmánek Linux Administrator only because Full Time Multitasking Ninja is not an official job title _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss

From S.J.Thompson at bham.ac.uk Wed Mar 14 19:27:57 2018 From: S.J.Thompson at bham.ac.uk (Simon Thompson (IT Research Support)) Date: Wed, 14 Mar 2018 19:27:57 +0000 Subject: [gpfsug-discuss] Preferred NSD In-Reply-To: References: <20180312145105.yb24bzo6chlpthwm@ics.muni.cz><20180313141630.k2lqvcndz7mrakgs@ics.muni.cz><20180314092815.k7thtymu33wg65xt@ics.muni.cz> <11B42072-E3CA-4FCE-BD1C-1E3DCA16626A@ulmer.org> Message-ID:

I don't think this is correct. My understanding is:
There is no longer an express edition. Grandfathered to standard.
Standard edition (sockets) remains.
Advanced edition (sockets) is available for existing advanced customers only. Grandfathering to DME available.
Data management (mostly capacity, but per disk in ESS and DSS-G configs, different cost for flash or spinning drives).
I'm sure Carl can correct me if I'm wrong here.

Simon

________________________________________ From: gpfsug-discuss-bounces at spectrumscale.org [gpfsug-discuss-bounces at spectrumscale.org] on behalf of stockf at us.ibm.com [stockf at us.ibm.com] Sent: 14 March 2018 19:23 To: gpfsug main discussion list Subject: Re: [gpfsug-discuss] Preferred NSD

My understanding is that with Spectrum Scale 5.0 there is no longer a standard edition, only data management and advanced, and the pricing is all done via storage not sockets. Now there may be some grandfathering for those with existing socket licenses but I really do not know. My point is that data management is not the same as advanced edition. Again I could be wrong because I tend not to concern myself with how the product is licensed.

Fred __________________________________________________ Fred Stock | IBM Pittsburgh Lab | 720-430-8821 stockf at us.ibm.com

From: Stephen Ulmer To: gpfsug main discussion list Date: 03/14/2018 03:06 PM Subject: Re: [gpfsug-discuss] Preferred NSD Sent by: gpfsug-discuss-bounces at spectrumscale.org

Depending on the size... I just quoted something both ways and DME (which is Advanced Edition equivalent) was about $400K cheaper than Standard Edition socket pricing for this particular customer and use case. It all depends. Also, for the case where the OP wants to distribute the file system around on NVMe in *every* node, there is always the FPO license.
The FPO license can share NSDs with other FPO licensed nodes and servers (just not clients). -- Stephen On Mar 14, 2018, at 1:33 PM, Sobey, Richard A > wrote: 2. Have data management edition and capacity license the amount of storage. There goes the budget ? Richard -----Original Message----- From: gpfsug-discuss-bounces at spectrumscale.org> On Behalf Of Simon Thompson (IT Research Support) Sent: 14 March 2018 16:54 To: gpfsug main discussion list > Subject: Re: [gpfsug-discuss] Preferred NSD Not always true. 1. Use them with socket licenses as HAWC or LROC is OK on a client. 2. Have data management edition and capacity license the amount of storage. Simon ________________________________________ From: gpfsug-discuss-bounces at spectrumscale.org[gpfsug-discuss-bounces at spectrumscale.org] on behalf of Jeffrey R. Lang [JRLang at uwyo.edu] Sent: 14 March 2018 14:11 To: gpfsug main discussion list Subject: Re: [gpfsug-discuss] Preferred NSD Something I haven't heard in this discussion, it that of licensing of GPFS. I believe that once you export disks from a node it then becomes a server node and the license may need to be changed, from client to server. There goes the budget. -----Original Message----- From: gpfsug-discuss-bounces at spectrumscale.org> On Behalf Of Lukas Hejtmanek Sent: Wednesday, March 14, 2018 4:28 AM To: gpfsug main discussion list > Subject: Re: [gpfsug-discuss] Preferred NSD Hello, thank you for insight. Well, the point is, that I will get ~60 with 120 NVMe disks in it, each about 2TB size. It means that I will have 240TB in NVMe SSD that could build nice shared scratch. Moreover, I have no different HW or place to put these SSDs into. They have to be in the compute nodes. On Tue, Mar 13, 2018 at 10:48:21AM -0700, Alex Chekholko wrote: I would like to discourage you from building a large distributed clustered filesystem made of many unreliable components. You will need to overprovision your interconnect and will also spend a lot of time in "healing" or "degraded" state. It is typically cheaper to centralize the storage into a subset of nodes and configure those to be more highly available. E.g. of your 60 nodes, take 8 and put all the storage into those and make that a dedicated GPFS cluster with no compute jobs on those nodes. Again, you'll still need really beefy and reliable interconnect to make this work. Stepping back; what is the actual problem you're trying to solve? I have certainly been in that situation before, where the problem is more like: "I have a fixed hardware configuration that I can't change, and I want to try to shoehorn a parallel filesystem onto that." I would recommend looking closer at your actual workloads. If this is a "scratch" filesystem and file access is mostly from one node at a time, it's not very useful to make two additional copies of that data on other nodes, and it will only slow you down. Regards, Alex On Tue, Mar 13, 2018 at 7:16 AM, Lukas Hejtmanek > wrote: On Tue, Mar 13, 2018 at 10:37:43AM +0000, John Hearns wrote: Lukas, It looks like you are proposing a setup which uses your compute servers as storage servers also? yes, exactly. I would like to utilise NVMe SSDs that are in every compute servers.. Using them as a shared scratch area with GPFS is one of the options. 
* I'm thinking about the following setup: ~ 60 nodes, each with two enterprise NVMe SSDs, FDR IB interconnected There is nothing wrong with this concept, for instance see https://www.beegfs.io/wiki/BeeOND I have an NVMe filesystem which uses 60 drives, but there are 10 servers. You should look at "failure zones" also. you still need the storage servers and local SSDs to use only for caching, do I understand correctly? From: gpfsug-discuss-bounces at spectrumscale.org [mailto:gpfsug-discuss- bounces at spectrumscale.org] On Behalf Of Knister, Aaron S. (GSFC-606.2)[COMPUTER SCIENCE CORP] Sent: Monday, March 12, 2018 4:14 PM To: gpfsug main discussion list > Subject: Re: [gpfsug-discuss] Preferred NSD Hi Lukas, Check out FPO mode. That mimics Hadoop's data placement features. You can have up to 3 replicas both data and metadata but still the downside, though, as you say is the wrong node failures will take your cluster down. You might want to check out something like Excelero's NVMesh (note: not an endorsement since I can't give such things) which can create logical volumes across all your NVMe drives. The product has erasure coding on their roadmap. I'm not sure if they've released that feature yet but in theory it will give better fault tolerance *and* you'll get more efficient usage of your SSDs. I'm sure there are other ways to skin this cat too. -Aaron On March 12, 2018 at 10:59:35 EDT, Lukas Hejtmanek > wrote: Hello, I'm thinking about the following setup: ~ 60 nodes, each with two enterprise NVMe SSDs, FDR IB interconnected I would like to setup shared scratch area using GPFS and those NVMe SSDs. Each SSDs as on NSD. I don't think like 5 or more data/metadata replicas are practical here. On the other hand, multiple node failures is something really expected. Is there a way to instrument that local NSD is strongly preferred to store data? I.e. node failure most probably does not result in unavailable data for the other nodes? Or is there any other recommendation/solution to build shared scratch with GPFS in such setup? (Do not do it including.) -- Luk?? Hejtm?nek _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss -- The information contained in this communication and any attachments is confidential and may be privileged, and is for the sole use of the intended recipient(s). Any unauthorized review, use, disclosure or distribution is prohibited. Unless explicitly stated otherwise in the body of this communication or the attachment thereto (if any), the information is provided on an AS-IS basis without any express or implied warranties or liabilities. To the extent you are relying on this information, you are doing so at your own risk. If you are not the intended recipient, please notify the sender immediately by replying to this message and destroy all copies of this message and any attachments. Neither the sender nor the company/group of companies he or she represents shall be liable for the proper and complete transmission of the information contained in this communication, or for any delay in its receipt. _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss -- Luk?? 
Hejtm?nek Linux Administrator only because Full Time Multitasking Ninja is not an official job title _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss -- Luk?? Hejtm?nek Linux Administrator only because Full Time Multitasking Ninja is not an official job title _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org https://urldefense.proofpoint.com/v2/url?u=http-3A__gpfsug.org_mailman_listinfo_gpfsug-2Ddiscuss&d=DwICAg&c=jf_iaSHvJObTbx-siA1ZOg&r=p_1XEUyoJ7-VJxF_w8h9gJh8_Wj0Pey73LCLLoxodpw&m=kB88vNQV9x5UFOu3tBxpRKmS3rSCi68KIBxOa_D5ji8&s=R9wxUL1IMkjtWZsFkSAXRUmuKi8uS1jpQRYVTvOYq3g&e= From makaplan at us.ibm.com Wed Mar 14 20:02:15 2018 From: makaplan at us.ibm.com (Marc A Kaplan) Date: Wed, 14 Mar 2018 15:02:15 -0500 Subject: [gpfsug-discuss] Editions and Licensing / Preferred NSD In-Reply-To: <11B42072-E3CA-4FCE-BD1C-1E3DCA16626A@ulmer.org> References: <20180312145105.yb24bzo6chlpthwm@ics.muni.cz><20180313141630.k2lqvcndz7mrakgs@ics.muni.cz><20180314092815.k7thtymu33wg65xt@ics.muni.cz> <11B42072-E3CA-4FCE-BD1C-1E3DCA16626A@ulmer.org> Message-ID: Thread seems to have gone off on a product editions and Licensing tangents -- refer to IBM website for official statements: https://www.ibm.com/support/knowledgecenter/en/STXKQY_5.0.0/com.ibm.spectrum.scale.v5r00.doc/bl1in_IntroducingIBMSpectrumScale.htm -------------- next part -------------- An HTML attachment was scrubbed... URL: From Mark.Bush at siriuscom.com Wed Mar 14 15:36:32 2018 From: Mark.Bush at siriuscom.com (Mark Bush) Date: Wed, 14 Mar 2018 15:36:32 +0000 Subject: [gpfsug-discuss] Underlying LUN mirroring NSD impact Message-ID: Is it possible (albeit not advisable) to mirror LUNs that are NSD's to another storage array in another site basically for DR purposes? Once it's mirrored to a new cluster elsewhere what would be the step to get the filesystem back up and running. I know that AFM-DR is meant for this but in this case my client only has Standard edition and has mirroring software purchased with the underlying disk array. Is this even doable? Mark -------------- next part -------------- An HTML attachment was scrubbed... URL: From carlz at us.ibm.com Wed Mar 14 20:31:01 2018 From: carlz at us.ibm.com (Carl Zetie) Date: Wed, 14 Mar 2018 20:31:01 +0000 Subject: [gpfsug-discuss] Editions and Licensing / Preferred NSD In-Reply-To: References: Message-ID: Simon's description is correct. For those who don't have it readily to hand I'll reiterate it here (in my own words): We discontinued Express a while back; everybody on that edition got a free upgrade to Standard. 
Standard continues to be licensed on sockets. This has certain advantages (clients and FPOs nodes are cheap, but as noted in the thread if you need to change them to servers, they get more expensive) Advanced was retired; those already on it were "grandfathered in" can continue to buy it, so no forced conversion. But no new customers. In place of Advanced, Data Management Edition is licensed by the TiB. This has the advantage of simplicity -- it is completely flat regardless of topology. It also allows you to add and subtract nodes, including clients, or change a client node to a server node, at will without having to go through a licensing transaction or keep count of clients or pay a penalty for putting clients in a separate compute cluster or ... BTW, I'll be at the UG in London and (probably) in Boston, if anybody wants to talk licensing... Carl Zetie Offering Manager for Spectrum Scale, IBM (540) 882 9353 ][ Research Triangle Park carlz at us.ibm.com ********************************************** From olaf.weiser at de.ibm.com Wed Mar 14 23:19:03 2018 From: olaf.weiser at de.ibm.com (Olaf Weiser) Date: Thu, 15 Mar 2018 00:19:03 +0100 Subject: [gpfsug-discuss] Underlying LUN mirroring NSD impact In-Reply-To: References: Message-ID: An HTML attachment was scrubbed... URL: From secretary at gpfsug.org Thu Mar 15 10:00:08 2018 From: secretary at gpfsug.org (Secretary GPFS UG) Date: Thu, 15 Mar 2018 10:00:08 +0000 Subject: [gpfsug-discuss] Meetup at the IBM System Z Technical University Message-ID: <738c1046e602fb96e1dc6e5772c0a65a@webmail.gpfsug.org> Dear members, We have another meet up opportunity for you! There's a Spectrum Scale Meet Up taking place at the System Z Technical University on 14th May in London. It's free to attend and is an ideal opportunity to learn about Spectrum Scale on IBM Z in particular and hear from the UK Met Office. Please email your registration to Par Hettinga par at nl.ibm.com and if you have any questions, please contact Par. Date: Monday 14th May 2018 Time: 4.15pm - 6:15 PM Agenda: 16.15 - Welcome & Introductions 16.25 - IBM Spectrum Scale and Industry Use Cases for IBM System Z 17.10 - UK Met Office - Why IBM Spectrum Scale with System Z 17.40 - Spectrum Scale on IBM Z 18.10 - Questions & Close 18.15 - Drinks & Networking Location: Room B4 Beaujolais Novotel London West 1 Shortlands London W6 8DR United Kingdom 020 7660 0680 Thanks, -- Claire O'Toole Spectrum Scale/GPFS User Group Secretary +44 (0)7508 033896 www.spectrumscaleug.org -------------- next part -------------- An HTML attachment was scrubbed... URL: From makaplan at us.ibm.com Thu Mar 15 14:57:41 2018 From: makaplan at us.ibm.com (Marc A Kaplan) Date: Thu, 15 Mar 2018 09:57:41 -0500 Subject: [gpfsug-discuss] Underlying LUN mirroring NSD impact In-Reply-To: References: Message-ID: Does the mirrored-storage vendor guarantee the sequence of all writes to all the LUNs at the remote-site exactly matches the sequence of writes to the local site....? If not.. the file system on the remote-site could be left in an inconsistent state when the communications line is cut... Guaranteeing sequencing to each LUN is not sufficient, because a typical GPFS file system has its data and metadata spread over several LUNs. From: "Olaf Weiser" To: gpfsug main discussion list Date: 03/14/2018 07:19 PM Subject: Re: [gpfsug-discuss] Underlying LUN mirroring NSD impact Sent by: gpfsug-discuss-bounces at spectrumscale.org HI Mark.. yes.. that's possible... at least , I'm sure.. 
there was a chapter in the former advanced admin guide of older releases describing how to do that with PPRC.. similar to PPRC, you might use other methods, but from the GPFS perspective this shouldn't make a difference.. and I have had a German customer who was doing this for years... (but that is some years back meanwhile ... hihi, time flies...)

From: Mark Bush To: "gpfsug-discuss at spectrumscale.org" Date: 03/14/2018 09:11 PM Subject: [gpfsug-discuss] Underlying LUN mirroring NSD impact Sent by: gpfsug-discuss-bounces at spectrumscale.org

Is it possible (albeit not advisable) to mirror LUNs that are NSD's to another storage array in another site basically for DR purposes? Once it's mirrored to a new cluster elsewhere, what would be the steps to get the filesystem back up and running? I know that AFM-DR is meant for this, but in this case my client only has Standard edition and has mirroring software purchased with the underlying disk array. Is this even doable? Mark

_______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss

From valdis.kletnieks at vt.edu Thu Mar 15 15:07:30 2018 From: valdis.kletnieks at vt.edu (valdis.kletnieks at vt.edu) Date: Thu, 15 Mar 2018 11:07:30 -0400 Subject: [gpfsug-discuss] Underlying LUN mirroring NSD impact In-Reply-To: References: Message-ID: <26547.1521126450@turing-police.cc.vt.edu>

On Wed, 14 Mar 2018 15:36:32 -0000, Mark Bush said:
> Is it possible (albeit not advisable) to mirror LUNs that are NSD's to
> another storage array in another site basically for DR purposes? Once it's
> mirrored to a new cluster elsewhere what would be the step to get the
> filesystem back up and running. I know that AFM-DR is meant for this but in
> this case my client only has Standard edition and has mirroring software
> purchased with the underlying disk array.
> Is this even doable?

We had a discussion on the list about this recently. The upshot is that it's sort of doable, but depends on what failure modes you're trying to protect against. The basic problem is that if you're doing mirroring at the array level, there's a certain amount of skew delay where GPFS has written stuff on the local disk and it hasn't been copied to the remote disk (basically the same reason why running fsck on a mounted disk partition can be problematic). There are also issues if things are scribbling on the local file system and generating enough traffic to saturate the network link you're doing the mirroring over, for a long enough time to overwhelm the mirroring mechanism (both sync and async mirroring have their good and bad sides in that scenario). We're using a stretch cluster with GPFS replication to storage about 95 cable miles away - that has the advantage that then GPFS knows there's a remote replica and can take more steps to make sure the remote copy is consistent.
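(As a rough illustration of that kind of setup -- a minimal, untested sketch with made-up NSD, device and server names; one failure group per site, and GPFS told to keep two copies of both data and metadata:

    # stretch.stanza -- one NSD per site shown, a real cluster would have many more
    %nsd: nsd=siteA_nsd01 device=/dev/mapper/lunA01 servers=nsdA01,nsdA02 usage=dataAndMetadata failureGroup=1 pool=system
    %nsd: nsd=siteB_nsd01 device=/dev/mapper/lunB01 servers=nsdB01,nsdB02 usage=dataAndMetadata failureGroup=2 pool=system

    mmcrnsd -F stretch.stanza
    mmcrfs fsname -F stretch.stanza -m 2 -M 2 -r 2 -R 2

With -m 2/-r 2 and the two failure groups, every block gets one replica per site, which is what lets GPFS keep the remote copy consistent itself instead of relying on the array.)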
In particular, if it knows there's replication that needs to be done and it's getting backlogged, it can present a slow-down to the local writers and ensure that the remote set of disks don't fall too far behind.... (There's some funkyness having to do with quorum - it's *really* hard to set up so you have both protection against split-brain and the ability to start up the remote site stand-alone - mostly because from the remote point of view, starting up stand-alone after the main site fails looks identical to split-brain) -------------- next part -------------- A non-text attachment was scrubbed... Name: not available Type: application/pgp-signature Size: 486 bytes Desc: not available URL: From janfrode at tanso.net Thu Mar 15 17:12:23 2018 From: janfrode at tanso.net (Jan-Frode Myklebust) Date: Thu, 15 Mar 2018 18:12:23 +0100 Subject: [gpfsug-discuss] GPFS autoload - wait for IB ports to becomeactive In-Reply-To: References: <0081EB235765E14395278B9AE1DF341846510A@MBX214.d.ethz.ch> <4AD44D34-5275-4ADB-8CC7-8E80170DDA7F@brown.edu> Message-ID: I found some discussion on this at https://www.ibm.com/developerworks/community/forums/html/threadTopic?id=77777777-0000-0000-0000-000014471957&ps=25 and there it's claimed that none of the callback events are early enough to resolve this. That we need a pre-preStartup trigger. Any idea if this has changed -- or is the callback option then only to do a "--onerror shutdown" if it has failed to connect IB ? On Thu, Mar 8, 2018 at 1:42 PM, Frederick Stock wrote: > You could also use the GPFS prestartup callback (mmaddcallback) to execute > a script synchronously that waits for the IB ports to become available > before returning and allowing GPFS to continue. Not systemd integrated but > it should work. > > Fred > __________________________________________________ > Fred Stock | IBM Pittsburgh Lab | 720-430-8821 <(720)%20430-8821> > stockf at us.ibm.com > > > > From: david_johnson at brown.edu > To: gpfsug main discussion list > Date: 03/08/2018 07:34 AM > Subject: Re: [gpfsug-discuss] GPFS autoload - wait for IB ports to > become active > Sent by: gpfsug-discuss-bounces at spectrumscale.org > ------------------------------ > > > > Until IBM provides a solution, here is my workaround. Add it so it runs > before the gpfs script, I call it from our custom xcat diskless boot > scripts. Based on rhel7, not fully systemd integrated. YMMV! > > Regards, > ? ddj > ??- > [ddj at storage041 ~]$ cat /etc/init.d/ibready > #! /bin/bash > # > # chkconfig: 2345 06 94 > # /etc/rc.d/init.d/ibready > # written in 2016 David D Johnson (ddj *brown.edu* > > ) > # > ### BEGIN INIT INFO > # Provides: ibready > # Required-Start: > # Required-Stop: > # Default-Stop: > # Description: Block until infiniband is ready > # Short-Description: Block until infiniband is ready > ### END INIT INFO > > RETVAL=0 > if [[ -d /sys/class/infiniband ]] > then > IBDEVICE=$(dirname $(grep -il infiniband > /sys/class/infiniband/*/ports/1/link* | head -n 1)) > fi > # See how we were called. > case "$1" in > start) > if [[ -n $IBDEVICE && -f $IBDEVICE/state ]] > then > echo -n "Polling for InfiniBand link up: " > for (( count = 60; count > 0; count-- )) > do > if grep -q ACTIVE $IBDEVICE/state > then > echo ACTIVE > break > fi > echo -n "." 
> sleep 5 > done > if (( count <= 0 )) > then > echo DOWN - $0 timed out > fi > fi > ;; > stop|restart|reload|force-reload|condrestart|try-restart) > ;; > status) > if [[ -n $IBDEVICE && -f $IBDEVICE/state ]] > then > echo "$IBDEVICE is $(< $IBDEVICE/state) $(< > $IBDEVICE/rate)" > else > echo "No IBDEVICE found" > fi > ;; > *) > echo "Usage: ibready {start|stop|status|restart| > reload|force-reload|condrestart|try-restart}" > exit 2 > esac > exit ${RETVAL} > ???? > > -- ddj > Dave Johnson > > On Mar 8, 2018, at 6:10 AM, Caubet Serrabou Marc (PSI) < > *marc.caubet at psi.ch* > wrote: > > Hi all, > > with autoload = yes we do not ensure that GPFS will be started after the > IB link becomes up. Is there a way to force GPFS waiting to start until IB > ports are up? This can be probably done by adding something like > After=network-online.target and Wants=network-online.target in the systemd > file but I would like to know if this is natively possible from the GPFS > configuration. > > Thanks a lot, > Marc > _________________________________________ > Paul Scherrer Institut > High Performance Computing > Marc Caubet Serrabou > WHGA/036 > 5232 Villigen PSI > Switzerland > > Telephone: +41 56 310 46 67 <+41%2056%20310%2046%2067> > E-Mail: *marc.caubet at psi.ch* > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at *spectrumscale.org* > > *http://gpfsug.org/mailman/listinfo/gpfsug-discuss* > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > https://urldefense.proofpoint.com/v2/url?u=http-3A__gpfsug. > org_mailman_listinfo_gpfsug-2Ddiscuss&d=DwICAg&c=jf_ > iaSHvJObTbx-siA1ZOg&r=p_1XEUyoJ7-VJxF_w8h9gJh8_ > Wj0Pey73LCLLoxodpw&m=u-EMob09-dkE6jZbD3dTjBi3vWhmDXtxiOK3nqF > yIgY&s=JCfJgq6pZnKUI6d-rIgJXVcdZh7vmA5ypB1_goP_FFA&e= > > > > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From stockf at us.ibm.com Thu Mar 15 17:23:38 2018 From: stockf at us.ibm.com (Frederick Stock) Date: Thu, 15 Mar 2018 12:23:38 -0500 Subject: [gpfsug-discuss] GPFS autoload - wait for IB ports tobecomeactive In-Reply-To: References: <0081EB235765E14395278B9AE1DF341846510A@MBX214.d.ethz.ch><4AD44D34-5275-4ADB-8CC7-8E80170DDA7F@brown.edu> Message-ID: The callback is the only way I know to use the "--onerror shutdown" option. Fred __________________________________________________ Fred Stock | IBM Pittsburgh Lab | 720-430-8821 stockf at us.ibm.com From: Jan-Frode Myklebust To: gpfsug main discussion list Date: 03/15/2018 01:14 PM Subject: Re: [gpfsug-discuss] GPFS autoload - wait for IB ports to becomeactive Sent by: gpfsug-discuss-bounces at spectrumscale.org I found some discussion on this at https://www.ibm.com/developerworks/community/forums/html/threadTopic?id=77777777-0000-0000-0000-000014471957&ps=25 and there it's claimed that none of the callback events are early enough to resolve this. That we need a pre-preStartup trigger. Any idea if this has changed -- or is the callback option then only to do a "--onerror shutdown" if it has failed to connect IB ? On Thu, Mar 8, 2018 at 1:42 PM, Frederick Stock wrote: You could also use the GPFS prestartup callback (mmaddcallback) to execute a script synchronously that waits for the IB ports to become available before returning and allowing GPFS to continue. 
Not systemd integrated but it should work. Fred __________________________________________________ Fred Stock | IBM Pittsburgh Lab | 720-430-8821 stockf at us.ibm.com From: david_johnson at brown.edu To: gpfsug main discussion list Date: 03/08/2018 07:34 AM Subject: Re: [gpfsug-discuss] GPFS autoload - wait for IB ports to become active Sent by: gpfsug-discuss-bounces at spectrumscale.org Until IBM provides a solution, here is my workaround. Add it so it runs before the gpfs script, I call it from our custom xcat diskless boot scripts. Based on rhel7, not fully systemd integrated. YMMV! Regards, ? ddj ??- [ddj at storage041 ~]$ cat /etc/init.d/ibready #! /bin/bash # # chkconfig: 2345 06 94 # /etc/rc.d/init.d/ibready # written in 2016 David D Johnson (ddj brown.edu) # ### BEGIN INIT INFO # Provides: ibready # Required-Start: # Required-Stop: # Default-Stop: # Description: Block until infiniband is ready # Short-Description: Block until infiniband is ready ### END INIT INFO RETVAL=0 if [[ -d /sys/class/infiniband ]] then IBDEVICE=$(dirname $(grep -il infiniband /sys/class/infiniband/*/ports/1/link* | head -n 1)) fi # See how we were called. case "$1" in start) if [[ -n $IBDEVICE && -f $IBDEVICE/state ]] then echo -n "Polling for InfiniBand link up: " for (( count = 60; count > 0; count-- )) do if grep -q ACTIVE $IBDEVICE/state then echo ACTIVE break fi echo -n "." sleep 5 done if (( count <= 0 )) then echo DOWN - $0 timed out fi fi ;; stop|restart|reload|force-reload|condrestart|try-restart) ;; status) if [[ -n $IBDEVICE && -f $IBDEVICE/state ]] then echo "$IBDEVICE is $(< $IBDEVICE/state) $(< $IBDEVICE/rate)" else echo "No IBDEVICE found" fi ;; *) echo "Usage: ibready {start|stop|status|restart|reload|force-reload|condrestart|try-restart}" exit 2 esac exit ${RETVAL} ???? -- ddj Dave Johnson On Mar 8, 2018, at 6:10 AM, Caubet Serrabou Marc (PSI) wrote: Hi all, with autoload = yes we do not ensure that GPFS will be started after the IB link becomes up. Is there a way to force GPFS waiting to start until IB ports are up? This can be probably done by adding something like After=network-online.target and Wants=network-online.target in the systemd file but I would like to know if this is natively possible from the GPFS configuration. 
Thanks a lot, Marc _________________________________________ Paul Scherrer Institut High Performance Computing Marc Caubet Serrabou WHGA/036 5232 Villigen PSI Switzerland Telephone: +41 56 310 46 67 E-Mail: marc.caubet at psi.ch _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org https://urldefense.proofpoint.com/v2/url?u=http-3A__gpfsug.org_mailman_listinfo_gpfsug-2Ddiscuss&d=DwICAg&c=jf_iaSHvJObTbx-siA1ZOg&r=p_1XEUyoJ7-VJxF_w8h9gJh8_Wj0Pey73LCLLoxodpw&m=u-EMob09-dkE6jZbD3dTjBi3vWhmDXtxiOK3nqFyIgY&s=JCfJgq6pZnKUI6d-rIgJXVcdZh7vmA5ypB1_goP_FFA&e= _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org https://urldefense.proofpoint.com/v2/url?u=http-3A__gpfsug.org_mailman_listinfo_gpfsug-2Ddiscuss&d=DwICAg&c=jf_iaSHvJObTbx-siA1ZOg&r=p_1XEUyoJ7-VJxF_w8h9gJh8_Wj0Pey73LCLLoxodpw&m=79jdzLLNtYEi36P6EifUd1cEI2GcLu2QWCwYwln12xg&s=AgoxRgQ2Ht0ZWCfogYsyg72RZn33CfTEyW7h1JQWRrM&e= -------------- next part -------------- An HTML attachment was scrubbed... URL: From olaf.weiser at de.ibm.com Thu Mar 15 17:30:49 2018 From: olaf.weiser at de.ibm.com (Olaf Weiser) Date: Thu, 15 Mar 2018 18:30:49 +0100 Subject: [gpfsug-discuss] GPFS autoload - wait for IB ports tobecomeactive In-Reply-To: References: <0081EB235765E14395278B9AE1DF341846510A@MBX214.d.ethz.ch><4AD44D34-5275-4ADB-8CC7-8E80170DDA7F@brown.edu> Message-ID: An HTML attachment was scrubbed... URL: From chris.schlipalius at pawsey.org.au Fri Mar 16 06:11:39 2018 From: chris.schlipalius at pawsey.org.au (Chris Schlipalius) Date: Fri, 16 Mar 2018 14:11:39 +0800 Subject: [gpfsug-discuss] Reminder, 2018 March 26th Singapore Spectrum Scale User Group event is on soon. In-Reply-To: <54D1282A-175F-42AD-8BCA-DDA48326B80C@pawsey.org.au> References: <54D1282A-175F-42AD-8BCA-DDA48326B80C@pawsey.org.au> Message-ID: <988B0149-D942-41AD-93B9-E9A0ACAF7D9F@pawsey.org.au> Hello, This is a reminder for the the inaugural Spectrum Scale Usergroup Singapore on Monday 26th March 2018, Sentosa, Singapore. This event occurs just before SCA18 starts and is being held in conjunction with SCA18 https://sc-asia.org/ All current Singapore Spectrum Scale User Group event details can be found here: http://goo.gl/dXtqvS Feel free to circulate this event link to all that may need it. Please reserve your tickets now as tickets for places will close soon. There are some great speakers and topics, for details please see the agenda on Eventbrite. We are looking forwards to a great new Usergroup in a fabulous venue. Thanks again to NSCC and IBM for helping to arrange the venue and event booking. 
Regards, Chris Schlipalius IBM Champion 2018 Team Lead, Storage Infrastructure, Data & Visualisation, The Pawsey Supercomputing Centre (CSIRO) 13 Burvill Court Kensington WA 6151 Australia Tel +61 8 6436 8815 Email chris.schlipalius at pawsey.org.au Web www.pawsey.org.au From janfrode at tanso.net Fri Mar 16 08:29:59 2018 From: janfrode at tanso.net (Jan-Frode Myklebust) Date: Fri, 16 Mar 2018 09:29:59 +0100 Subject: [gpfsug-discuss] GPFS autoload - wait for IB ports tobecomeactive In-Reply-To: References: <0081EB235765E14395278B9AE1DF341846510A@MBX214.d.ethz.ch> <4AD44D34-5275-4ADB-8CC7-8E80170DDA7F@brown.edu> Message-ID: Thanks Olaf, but we don't use NetworkManager on this cluster.. I now created this simple script: ------------------------------------------------------------------------------------------------------------------------------------------------------------- #! /bin/bash - # # Fail mmstartup if not all configured IB ports are active. # # Install with: # # mmaddcallback fail-if-ibfail --command /var/mmfs/etc/fail-if-ibfail --event preStartup --sync --onerror shutdown # for port in $(/usr/lpp/mmfs/bin/mmdiag --config|grep verbsPorts | cut -f 4- -d " ") do grep -q ACTIVE /sys/class/infiniband/${port%/*}/ports/${port##*/}/state || exit 1 done ------------------------------------------------------------------------------------------------------------------------------------------------------------- which I haven't tested, but assume should work. Suggestions for improvements would be much appreciated! -jf On Thu, Mar 15, 2018 at 6:30 PM, Olaf Weiser wrote: > > you can try : > systemctl enable NetworkManager-wait-online > ln -s '/usr/lib/systemd/system/NetworkManager-wait-online.service' > '/etc/systemd/system/multi-user.target.wants/NetworkManager-wait-online. > service' > > in many cases .. it helps .. > > > > > > From: Jan-Frode Myklebust > To: gpfsug main discussion list > Date: 03/15/2018 06:18 PM > Subject: Re: [gpfsug-discuss] GPFS autoload - wait for IB ports to > becomeactive > Sent by: gpfsug-discuss-bounces at spectrumscale.org > ------------------------------ > > > > I found some discussion on this at > *https://www.ibm.com/developerworks/community/forums/html/threadTopic?id=77777777-0000-0000-0000-000014471957&ps=25* > and > there it's claimed that none of the callback events are early enough to > resolve this. That we need a pre-preStartup trigger. Any idea if this has > changed -- or is the callback option then only to do a "--onerror > shutdown" if it has failed to connect IB ? > > > On Thu, Mar 8, 2018 at 1:42 PM, Frederick Stock <*stockf at us.ibm.com* > > wrote: > You could also use the GPFS prestartup callback (mmaddcallback) to execute > a script synchronously that waits for the IB ports to become available > before returning and allowing GPFS to continue. Not systemd integrated but > it should work. > > Fred > __________________________________________________ > Fred Stock | IBM Pittsburgh Lab | *720-430-8821* <(720)%20430-8821> > *stockf at us.ibm.com* > > > > From: *david_johnson at brown.edu* > To: gpfsug main discussion list <*gpfsug-discuss at spectrumscale.org* > > > Date: 03/08/2018 07:34 AM > Subject: Re: [gpfsug-discuss] GPFS autoload - wait for IB ports to > become active > Sent by: *gpfsug-discuss-bounces at spectrumscale.org* > > ------------------------------ > > > > > Until IBM provides a solution, here is my workaround. Add it so it runs > before the gpfs script, I call it from our custom xcat diskless boot > scripts. 
Based on rhel7, not fully systemd integrated. YMMV! > > Regards, > ? ddj > ??- > [ddj at storage041 ~]$ cat /etc/init.d/ibready > #! /bin/bash > # > # chkconfig: 2345 06 94 > # /etc/rc.d/init.d/ibready > # written in 2016 David D Johnson (ddj *brown.edu* > > ) > # > ### BEGIN INIT INFO > # Provides: ibready > # Required-Start: > # Required-Stop: > # Default-Stop: > # Description: Block until infiniband is ready > # Short-Description: Block until infiniband is ready > ### END INIT INFO > > RETVAL=0 > if [[ -d /sys/class/infiniband ]] > then > IBDEVICE=$(dirname $(grep -il infiniband > /sys/class/infiniband/*/ports/1/link* | head -n 1)) > fi > # See how we were called. > case "$1" in > start) > if [[ -n $IBDEVICE && -f $IBDEVICE/state ]] > then > echo -n "Polling for InfiniBand link up: " > for (( count = 60; count > 0; count-- )) > do > if grep -q ACTIVE $IBDEVICE/state > then > echo ACTIVE > break > fi > echo -n "." > sleep 5 > done > if (( count <= 0 )) > then > echo DOWN - $0 timed out > fi > fi > ;; > stop|restart|reload|force-reload|condrestart|try-restart) > ;; > status) > if [[ -n $IBDEVICE && -f $IBDEVICE/state ]] > then > echo "$IBDEVICE is $(< $IBDEVICE/state) $(< > $IBDEVICE/rate)" > else > echo "No IBDEVICE found" > fi > ;; > *) > echo "Usage: ibready {start|stop|status|restart| > reload|force-reload|condrestart|try-restart}" > exit 2 > esac > exit ${RETVAL} > ???? > > -- ddj > Dave Johnson > > On Mar 8, 2018, at 6:10 AM, Caubet Serrabou Marc (PSI) < > *marc.caubet at psi.ch* > wrote: > > Hi all, > > with autoload = yes we do not ensure that GPFS will be started after the > IB link becomes up. Is there a way to force GPFS waiting to start until IB > ports are up? This can be probably done by adding something like > After=network-online.target and Wants=network-online.target in the systemd > file but I would like to know if this is natively possible from the GPFS > configuration. > > Thanks a lot, > Marc > _________________________________________ > Paul Scherrer Institut > High Performance Computing > Marc Caubet Serrabou > WHGA/036 > 5232 Villigen PSI > Switzerland > > Telephone: *+41 56 310 46 67* <+41%2056%20310%2046%2067> > E-Mail: *marc.caubet at psi.ch* > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at *spectrumscale.org* > > *http://gpfsug.org/mailman/listinfo/gpfsug-discuss* > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at *spectrumscale.org* > > *https://urldefense.proofpoint.com/v2/url?u=http-3A__gpfsug.org_mailman_listinfo_gpfsug-2Ddiscuss&d=DwICAg&c=jf_iaSHvJObTbx-siA1ZOg&r=p_1XEUyoJ7-VJxF_w8h9gJh8_Wj0Pey73LCLLoxodpw&m=u-EMob09-dkE6jZbD3dTjBi3vWhmDXtxiOK3nqFyIgY&s=JCfJgq6pZnKUI6d-rIgJXVcdZh7vmA5ypB1_goP_FFA&e=* > > > > > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at *spectrumscale.org* > *http://gpfsug.org/mailman/listinfo/gpfsug-discuss* > > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > > > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From YARD at il.ibm.com Fri Mar 16 08:46:37 2018 From: YARD at il.ibm.com (Yaron Daniel) Date: Fri, 16 Mar 2018 10:46:37 +0200 Subject: [gpfsug-discuss] Underlying LUN mirroring NSD impact In-Reply-To: References: Message-ID: Hi You can have few options: 1) Active/Active GPFS sites - with sync replication of the storage - take into account the latency you have. 2) Active/StandBy Gpfs sites- with a-sync replication of the storage. All info can be found at : https://www.ibm.com/support/knowledgecenter/en/STXKQY_4.2.1/com.ibm.spectrum.scale.v4r21.doc/bl1adv_continous_replication_SSdata.htm Synchronous mirroring with GPFS replication In a configuration utilizing GPFS? replication, a single GPFS cluster is defined over three geographically-separate sites consisting of two production sites and a third tiebreaker site. One or more file systems are created, mounted, and accessed concurrently from the two active production sites. Synchronous mirroring utilizing storage based replication This topic describes synchronous mirroring utilizing storage-based replication. Point In Time Copy of IBM Spectrum Scale data Most storage systems provides functionality to make a point-in-time copy of data as an online backup mechanism. This function provides an instantaneous copy of the original data on the target disk, while the actual copy of data takes place asynchronously and is fully transparent to the user. Regards Yaron Daniel 94 Em Ha'Moshavot Rd Storage Architect Petach Tiqva, 49527 IBM Global Markets, Systems HW Sales Israel Phone: +972-3-916-5672 Fax: +972-3-916-5672 Mobile: +972-52-8395593 e-mail: yard at il.ibm.com IBM Israel From: Mark Bush To: "gpfsug-discuss at spectrumscale.org" Date: 03/14/2018 10:10 PM Subject: [gpfsug-discuss] Underlying LUN mirroring NSD impact Sent by: gpfsug-discuss-bounces at spectrumscale.org Is it possible (albeit not advisable) to mirror LUNs that are NSD?s to another storage array in another site basically for DR purposes? Once it?s mirrored to a new cluster elsewhere what would be the step to get the filesystem back up and running. I know that AFM-DR is meant for this but in this case my client only has Standard edition and has mirroring software purchased with the underlying disk array. Is this even doable? Mark _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org https://urldefense.proofpoint.com/v2/url?u=http-3A__gpfsug.org_mailman_listinfo_gpfsug-2Ddiscuss&d=DwICAg&c=jf_iaSHvJObTbx-siA1ZOg&r=Bn1XE9uK2a9CZQ8qKnJE3Q&m=c9HNr6pLit8n4hQKpcYyyRg9ZnITpo_2OiEx6hbukYA&s=qFgC1ebi1SJvnCRlc92cI4hZqZYpK7EneZ0Sati5s5E&e= -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: not available Type: image/gif Size: 1851 bytes Desc: not available URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: not available Type: image/gif Size: 4376 bytes Desc: not available URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: not available Type: image/gif Size: 5093 bytes Desc: not available URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: not available Type: image/gif Size: 4746 bytes Desc: not available URL: -------------- next part -------------- A non-text attachment was scrubbed... 
Name: not available Type: image/gif Size: 4557 bytes Desc: not available URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: not available Type: image/jpeg Size: 11294 bytes Desc: not available URL: From stockf at us.ibm.com Fri Mar 16 12:05:29 2018 From: stockf at us.ibm.com (Frederick Stock) Date: Fri, 16 Mar 2018 07:05:29 -0500 Subject: [gpfsug-discuss] GPFS autoload - wait for IB portstobecomeactive In-Reply-To: References: <0081EB235765E14395278B9AE1DF341846510A@MBX214.d.ethz.ch><4AD44D34-5275-4ADB-8CC7-8E80170DDA7F@brown.edu> Message-ID: I have my doubts that mmdiag can be used in this script. In general the guidance is to avoid or be very careful with mm* commands in a callback due to the potential for deadlock. Fred __________________________________________________ Fred Stock | IBM Pittsburgh Lab | 720-430-8821 stockf at us.ibm.com From: Jan-Frode Myklebust To: gpfsug main discussion list Date: 03/16/2018 04:30 AM Subject: Re: [gpfsug-discuss] GPFS autoload - wait for IB ports tobecomeactive Sent by: gpfsug-discuss-bounces at spectrumscale.org Thanks Olaf, but we don't use NetworkManager on this cluster.. I now created this simple script: ------------------------------------------------------------------------------------------------------------------------------------------------------------- #! /bin/bash - # # Fail mmstartup if not all configured IB ports are active. # # Install with: # # mmaddcallback fail-if-ibfail --command /var/mmfs/etc/fail-if-ibfail --event preStartup --sync --onerror shutdown # for port in $(/usr/lpp/mmfs/bin/mmdiag --config|grep verbsPorts | cut -f 4- -d " ") do grep -q ACTIVE /sys/class/infiniband/${port%/*}/ports/${port##*/}/state || exit 1 done ------------------------------------------------------------------------------------------------------------------------------------------------------------- which I haven't tested, but assume should work. Suggestions for improvements would be much appreciated! -jf On Thu, Mar 15, 2018 at 6:30 PM, Olaf Weiser wrote: you can try : systemctl enable NetworkManager-wait-online ln -s '/usr/lib/systemd/system/NetworkManager-wait-online.service' '/etc/systemd/system/multi-user.target.wants/NetworkManager-wait-online.service' in many cases .. it helps .. From: Jan-Frode Myklebust To: gpfsug main discussion list Date: 03/15/2018 06:18 PM Subject: Re: [gpfsug-discuss] GPFS autoload - wait for IB ports to becomeactive Sent by: gpfsug-discuss-bounces at spectrumscale.org I found some discussion on this at https://www.ibm.com/developerworks/community/forums/html/threadTopic?id=77777777-0000-0000-0000-000014471957&ps=25 and there it's claimed that none of the callback events are early enough to resolve this. That we need a pre-preStartup trigger. Any idea if this has changed -- or is the callback option then only to do a "--onerror shutdown" if it has failed to connect IB ? On Thu, Mar 8, 2018 at 1:42 PM, Frederick Stock wrote: You could also use the GPFS prestartup callback (mmaddcallback) to execute a script synchronously that waits for the IB ports to become available before returning and allowing GPFS to continue. Not systemd integrated but it should work. 
Fred __________________________________________________ Fred Stock | IBM Pittsburgh Lab | 720-430-8821 stockf at us.ibm.com From: david_johnson at brown.edu To: gpfsug main discussion list Date: 03/08/2018 07:34 AM Subject: Re: [gpfsug-discuss] GPFS autoload - wait for IB ports to become active Sent by: gpfsug-discuss-bounces at spectrumscale.org Until IBM provides a solution, here is my workaround. Add it so it runs before the gpfs script, I call it from our custom xcat diskless boot scripts. Based on rhel7, not fully systemd integrated. YMMV! Regards, ? ddj ??- [ddj at storage041 ~]$ cat /etc/init.d/ibready #! /bin/bash # # chkconfig: 2345 06 94 # /etc/rc.d/init.d/ibready # written in 2016 David D Johnson (ddj brown.edu) # ### BEGIN INIT INFO # Provides: ibready # Required-Start: # Required-Stop: # Default-Stop: # Description: Block until infiniband is ready # Short-Description: Block until infiniband is ready ### END INIT INFO RETVAL=0 if [[ -d /sys/class/infiniband ]] then IBDEVICE=$(dirname $(grep -il infiniband /sys/class/infiniband/*/ports/1/link* | head -n 1)) fi # See how we were called. case "$1" in start) if [[ -n $IBDEVICE && -f $IBDEVICE/state ]] then echo -n "Polling for InfiniBand link up: " for (( count = 60; count > 0; count-- )) do if grep -q ACTIVE $IBDEVICE/state then echo ACTIVE break fi echo -n "." sleep 5 done if (( count <= 0 )) then echo DOWN - $0 timed out fi fi ;; stop|restart|reload|force-reload|condrestart|try-restart) ;; status) if [[ -n $IBDEVICE && -f $IBDEVICE/state ]] then echo "$IBDEVICE is $(< $IBDEVICE/state) $(< $IBDEVICE/rate)" else echo "No IBDEVICE found" fi ;; *) echo "Usage: ibready {start|stop|status|restart|reload|force-reload|condrestart|try-restart}" exit 2 esac exit ${RETVAL} ???? -- ddj Dave Johnson On Mar 8, 2018, at 6:10 AM, Caubet Serrabou Marc (PSI) wrote: Hi all, with autoload = yes we do not ensure that GPFS will be started after the IB link becomes up. Is there a way to force GPFS waiting to start until IB ports are up? This can be probably done by adding something like After=network-online.target and Wants=network-online.target in the systemd file but I would like to know if this is natively possible from the GPFS configuration. 
Thanks a lot, Marc _________________________________________ Paul Scherrer Institut High Performance Computing Marc Caubet Serrabou WHGA/036 5232 Villigen PSI Switzerland Telephone: +41 56 310 46 67 E-Mail: marc.caubet at psi.ch _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss

From Tomasz.Wolski at ts.fujitsu.com Fri Mar 16 14:25:52 2018 From: Tomasz.Wolski at ts.fujitsu.com (Tomasz.Wolski at ts.fujitsu.com) Date: Fri, 16 Mar 2018 14:25:52 +0000 Subject: [gpfsug-discuss] [item 435]: GPFS fails to start - NSD thread configuration needs more threads Message-ID: <1843bf34db594a0b8f7c85d52b1e0e28@R01UKEXCASM223.r01.fujitsu.local>

Hello GPFS Team,

We are observing strange behavior of GPFS during startup on a SLES12 node. In our test cluster, we reinstalled the VLP1 node with SLES 12 SP3 as a base, and when GPFS starts for the first time on this node, it complains about having too few NSD threads:

..
2018-03-16_13:11:28.947+0100: GPFS: 6027-310 [I] mmfsd initializing. {Version: 4.2.3.7 Built: Feb 15 2018 11:38:38} ...
2018-03-16_13:11:28.947+0100: [I] Cleaning old shared memory ...
2018-03-16_13:11:28.947+0100: [I] First pass parsing mmfs.cfg ...
..
2018-03-16_13:11:29.375+0100: [I] Initializing the cluster manager ...
2018-03-16_13:11:29.523+0100: [I] Initializing the token manager ...
2018-03-16_13:11:29.524+0100: [I] Initializing network shared disks ...
2018-03-16_13:11:29.626+0100: [E] NSD thread configuration needs 413 more threads, exceeds max thread count 1024
2018-03-16_13:11:29.628+0100: GPFS: 6027-311 [N] mmfsd is shutting down.
2018-03-16_13:11:29.628+0100: [N] Reason for shutdown: Could not initialize network shared disks
2018-03-16_13:11:29.633+0100: [E] processStart: fork: err 11
2018-03-16_13:11:30.701+0100: runmmfs starting
Removing old /var/adm/ras/mmfs.log.* files:
2018-03-16_13:11:30.713+0100 runmmfs: respawn 32 waiting 336 seconds before restarting mmfsd
2018-03-16_13:13:13.298+0100: [I] Calling user exit script mmSdrBackup: event mmSdrBackup, async command /var/mmfs/etc/mmsdrbackup

GPFS then enters a loop and tries to respawn mmfsd periodically:

2018-03-16_13:11:30.713+0100 runmmfs: respawn 32 waiting 336 seconds before restarting mmfsd

It seems that this issue can be resolved by doing mmshutdown. Later, when we manually perform mmstartup, the problem is gone.

We are running GPFS 4.2.3.7, and all nodes except VLP1 are running SLES11 SP4. Only on VLP1 did we install SLES12 SP3. The test cluster looks as below:

Node  Daemon node name  IP address       Admin node name  Designation
-----------------------------------------------------------------------
   1   VLP0.cs-intern    192.168.101.210  VLP0.cs-intern   quorum-manager-snmp_collector
   2   VLP1.cs-intern    192.168.101.211  VLP1.cs-intern   quorum-manager
   3   TBP0.cs-intern    192.168.101.215  TBP0.cs-intern   quorum
   4   IDP0.cs-intern    192.168.101.110  IDP0.cs-intern
   5   IDP1.cs-intern    192.168.101.111  IDP1.cs-intern
   6   IDP2.cs-intern    192.168.101.112  IDP2.cs-intern
   7   IDP3.cs-intern    192.168.101.113  IDP3.cs-intern
   8   ICP0.cs-intern    192.168.101.10   ICP0.cs-intern
   9   ICP1.cs-intern    192.168.101.11   ICP1.cs-intern
  10   ICP2.cs-intern    192.168.101.12   ICP2.cs-intern
  11   ICP3.cs-intern    192.168.101.13   ICP3.cs-intern
  12   ICP4.cs-intern    192.168.101.14   ICP4.cs-intern
  13   ICP5.cs-intern    192.168.101.15   ICP5.cs-intern

We have enabled traces and reproduced the issue as follows:
1. When the GPFS daemon was in a respawn loop, we started traces; all files from this period can be found in the uploaded archive under the 1_nsd_threads_problem directory.
2. We manually stopped the "respawn" loop on VLP1 by executing mmshutdown and started GPFS manually with mmstartup. All traces from this execution can be found in the archive under the 2_mmshutdown_mmstartup directory.

All data related to this problem has been uploaded to our FTP as: ftp.ts.fujitsu.com/CS-Diagnose/IBM, (fe_cs_oem, 12Monkeys) item435_nsd_threads.tar.gz

Could you please have a look at this problem?

Best regards,
Tomasz Wolski

From aaron.s.knister at nasa.gov Fri Mar 16 14:52:11 2018 From: aaron.s.knister at nasa.gov (Aaron Knister) Date: Fri, 16 Mar 2018 10:52:11 -0400 Subject: [gpfsug-discuss] [item 435]: GPFS fails to start - NSD thread configuration needs more threads In-Reply-To: <1843bf34db594a0b8f7c85d52b1e0e28@R01UKEXCASM223.r01.fujitsu.local> References: <1843bf34db594a0b8f7c85d52b1e0e28@R01UKEXCASM223.r01.fujitsu.local> Message-ID: <46b43ff1-a8b4-649f-0304-ccae73d5851b@nasa.gov>

Ah. You, my friend, have been struck by a smooth criminal. And by smooth criminal I mean systemd. I ran into this last week and spent many hours banging my head against the wall trying to figure it out.

systemd by default limits cgroups to I think 512 tasks and since a thread counts as a task that's likely what you're running into.

Try setting DefaultTasksMax=infinity in /etc/systemd/system.conf and then reboot (and yes, I mean reboot. changing it live doesn't seem possible because of the infinite wisdom of the systemd developers).
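(For reference, a minimal sketch of that workaround -- assuming a systemd-managed node where mmfsd runs under the gpfs.service unit; the drop-in file name below is arbitrary:

    # global default for all units, in /etc/systemd/system.conf (reboot afterwards):
    DefaultTasksMax=infinity

    # alternative: raise the limit only for the GPFS unit via a drop-in
    mkdir -p /etc/systemd/system/gpfs.service.d
    cat > /etc/systemd/system/gpfs.service.d/tasksmax.conf <<'EOF'
    [Service]
    TasksMax=infinity
    EOF
    systemctl daemon-reload

    # check which limit the unit actually ends up with
    systemctl show -p TasksMax gpfs.service

Untested here, so treat it as a sketch rather than a recipe.)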
The pid limit of a given slice/unit cgroup may already be overriden to something more reasonable than the 512 default so if, for example, you were logging in and startng it via ssh the limit may be different than if its started from the gpfs.service unit because mmfsd effectively is running in different cgroups in each case. Hope that helps! -Aaron On 3/16/18 10:25 AM, Tomasz.Wolski at ts.fujitsu.com wrote: > Hello GPFS Team, > > We are observing strange behavior of GPFS during startup on SLES12 node. > > In our test cluster, we reinstalled VLP1 node with SLES 12 SP3 as a base > and when GPFS starts for the first time on this node, it complains about > > too little NSD threads: > > .. > > 2018-03-16_13:11:28.947+0100: GPFS: 6027-310 [I] mmfsd initializing. > {Version: 4.2.3.7?? Built: Feb 15 2018 11:38:38} ... > > 2018-03-16_13:11:28.947+0100: [I] Cleaning old shared memory ... > > 2018-03-16_13:11:28.947+0100: [I] First pass parsing mmfs.cfg ... > > .. > > 2018-03-16_13:11:29.375+0100: [I] Initializing the cluster manager ... > > 2018-03-16_13:11:29.523+0100: [I] Initializing the token manager ... > > 2018-03-16_13:11:29.524+0100: [I] Initializing network shared disks ... > > *_2018-03-16_13:11:29.626+0100: [E] NSD thread configuration needs 413 > more threads, exceeds max thread count 1024_* > > 2018-03-16_13:11:29.628+0100: GPFS: 6027-311 [N] mmfsd is shutting down. > > 2018-03-16_13:11:29.628+0100: [N] Reason for shutdown: Could not > initialize network shared disks > > 2018-03-16_13:11:29.633+0100: [E] processStart: fork: err 11 > > 2018-03-16_13:11:30.701+0100: runmmfs starting > > Removing old /var/adm/ras/mmfs.log.* files: > > 2018-03-16_13:11:30.713+0100 runmmfs: respawn 32 waiting 336 seconds > before restarting mmfsd > > 2018-03-16_13:13:13.298+0100: [I] Calling user exit script mmSdrBackup: > event mmSdrBackup, async command /var/mmfs/etc/mmsdrbackup > > GPFS starts loop and tries to respawn mmfsd periodically: > > *_2018-03-16_13:11:30.713+0100 runmmfs: respawn 32 waiting 336 seconds > before restarting mmfsd_* > > It seems that this issue can be resolved by doing mmshutdown. Later, > when we manually perform mmstartup the problem is gone. > > We are running GPFS 4.2.3.7 and all nodes except VLP1 are running SLES11 > SP4. Only on VLP1 we installed SLES12 SP3. > > The test cluster looks as below: > > Node? Daemon node name? IP address?????? Admin node name? Designation > > ----------------------------------------------------------------------- > > ?? 1?? VLP0.cs-intern??? 192.168.101.210? VLP0.cs-intern > quorum-manager-snmp_collector > > ?? 2?? VLP1.cs-intern??? 192.168.101.211? VLP1.cs-intern?? quorum-manager > > ?? 3?? TBP0.cs-intern??? 192.168.101.215? TBP0.cs-intern?? quorum > > ?? 4?? IDP0.cs-intern??? 192.168.101.110? IDP0.cs-intern > > ?? 5?? IDP1.cs-intern??? 192.168.101.111? IDP1.cs-intern > > ?? 6?? IDP2.cs-intern??? 192.168.101.112? IDP2.cs-intern > > ?? 7?? IDP3.cs-intern??? 192.168.101.113? IDP3.cs-intern > > ?? 8?? ICP0.cs-intern??? 192.168.101.10?? ICP0.cs-intern > > ?? 9?? ICP1.cs-intern??? 192.168.101.11?? ICP1.cs-intern > > ? 10?? ICP2.cs-intern??? 192.168.101.12?? ICP2.cs-intern > > ? 11?? ICP3.cs-intern??? 192.168.101.13?? ICP3.cs-intern > > ? 12?? ICP4.cs-intern??? 192.168.101.14?? ICP4.cs-intern > > ? 13?? ICP5.cs-intern??? 192.168.101.15?? 
ICP5.cs-intern > > We have enabled traces and reproduced the issue as follows: > > 1.When GPFS daemon was in a respawn loop, we have started traces, all > files from this period you can find in uploaded archive under > *_1_nsd_threads_problem_* directory > > 2.We have manually stopped the ?respawn? loop on VLP1 by executing > mmshutdown and start GPFS manually by mmstartup. All traces from this > execution can be found in archive file under *_2_mmshutdown_mmstartup > _*directory > > All data related to this problem is uploaded to our ftp to file: > > ftp.ts.fujitsu.com/CS-Diagnose/IBM > , (fe_cs_oem, 12Monkeys) > item435_nsd_threads.tar.gz > > Could you please have a look at this problem? > > Best regards, > > Tomasz Wolski > > > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > -- Aaron Knister NASA Center for Climate Simulation (Code 606.2) Goddard Space Flight Center (301) 286-2776 From Tomasz.Wolski at ts.fujitsu.com Fri Mar 16 15:01:08 2018 From: Tomasz.Wolski at ts.fujitsu.com (Tomasz.Wolski at ts.fujitsu.com) Date: Fri, 16 Mar 2018 15:01:08 +0000 Subject: [gpfsug-discuss] [item 435]: GPFS fails to start - NSD thread configuration needs more threads In-Reply-To: <46b43ff1-a8b4-649f-0304-ccae73d5851b@nasa.gov> References: <1843bf34db594a0b8f7c85d52b1e0e28@R01UKEXCASM223.r01.fujitsu.local> <46b43ff1-a8b4-649f-0304-ccae73d5851b@nasa.gov> Message-ID: <679be18ca4ea4a29b0ba8cb5f49d0f1b@R01UKEXCASM223.r01.fujitsu.local> Hi Aaron, Thanks for the hint! :) Best regards, Tomasz Wolski > -----Original Message----- > From: gpfsug-discuss-bounces at spectrumscale.org [mailto:gpfsug-discuss- > bounces at spectrumscale.org] On Behalf Of Aaron Knister > Sent: Friday, March 16, 2018 3:52 PM > To: gpfsug-discuss at spectrumscale.org > Subject: Re: [gpfsug-discuss] [item 435]: GPFS fails to start - NSD thread > configuration needs more threads > > Ah. You, my friend, have been struck by a smooth criminal. And by smooth > criminal I mean systemd. I ran into this last week and spent many hours > banging my head against the wall trying to figure it out. > > systemd by default limits cgroups to I think 512 tasks and since a thread > counts as a task that's likely what you're running into. > > Try setting DefaultTasksMax=infinity in /etc/systemd/system.conf and then > reboot (and yes, I mean reboot. changing it live doesn't seem possible > because of the infinite wisdom of the systemd developers). > > The pid limit of a given slice/unit cgroup may already be overriden to > something more reasonable than the 512 default so if, for example, you > were logging in and startng it via ssh the limit may be different than if its > started from the gpfs.service unit because mmfsd effectively is running in > different cgroups in each case. > > Hope that helps! > > -Aaron > > On 3/16/18 10:25 AM, Tomasz.Wolski at ts.fujitsu.com wrote: > > Hello GPFS Team, > > > > We are observing strange behavior of GPFS during startup on SLES12 node. > > > > In our test cluster, we reinstalled VLP1 node with SLES 12 SP3 as a > > base and when GPFS starts for the first time on this node, it > > complains about > > > > too little NSD threads: > > > > .. > > > > 2018-03-16_13:11:28.947+0100: GPFS: 6027-310 [I] mmfsd initializing. > > {Version: 4.2.3.7?? Built: Feb 15 2018 11:38:38} ... > > > > 2018-03-16_13:11:28.947+0100: [I] Cleaning old shared memory ... 
> > > > 2018-03-16_13:11:28.947+0100: [I] First pass parsing mmfs.cfg ... > > > > .. > > > > 2018-03-16_13:11:29.375+0100: [I] Initializing the cluster manager ... > > > > 2018-03-16_13:11:29.523+0100: [I] Initializing the token manager ... > > > > 2018-03-16_13:11:29.524+0100: [I] Initializing network shared disks ... > > > > *_2018-03-16_13:11:29.626+0100: [E] NSD thread configuration needs 413 > > more threads, exceeds max thread count 1024_* > > > > 2018-03-16_13:11:29.628+0100: GPFS: 6027-311 [N] mmfsd is shutting > down. > > > > 2018-03-16_13:11:29.628+0100: [N] Reason for shutdown: Could not > > initialize network shared disks > > > > 2018-03-16_13:11:29.633+0100: [E] processStart: fork: err 11 > > > > 2018-03-16_13:11:30.701+0100: runmmfs starting > > > > Removing old /var/adm/ras/mmfs.log.* files: > > > > 2018-03-16_13:11:30.713+0100 runmmfs: respawn 32 waiting 336 seconds > > before restarting mmfsd > > > > 2018-03-16_13:13:13.298+0100: [I] Calling user exit script mmSdrBackup: > > event mmSdrBackup, async command /var/mmfs/etc/mmsdrbackup > > > > GPFS starts loop and tries to respawn mmfsd periodically: > > > > *_2018-03-16_13:11:30.713+0100 runmmfs: respawn 32 waiting 336 > seconds > > before restarting mmfsd_* > > > > It seems that this issue can be resolved by doing mmshutdown. Later, > > when we manually perform mmstartup the problem is gone. > > > > We are running GPFS 4.2.3.7 and all nodes except VLP1 are running > > SLES11 SP4. Only on VLP1 we installed SLES12 SP3. > > > > The test cluster looks as below: > > > > Node? Daemon node name? IP address?????? Admin node name? Designation > > > > ---------------------------------------------------------------------- > > - > > > > ?? 1?? VLP0.cs-intern??? 192.168.101.210? VLP0.cs-intern > > quorum-manager-snmp_collector > > > > ?? 2?? VLP1.cs-intern??? 192.168.101.211? VLP1.cs-intern > > quorum-manager > > > > ?? 3?? TBP0.cs-intern??? 192.168.101.215? TBP0.cs-intern?? quorum > > > > ?? 4?? IDP0.cs-intern??? 192.168.101.110? IDP0.cs-intern > > > > ?? 5?? IDP1.cs-intern??? 192.168.101.111? IDP1.cs-intern > > > > ?? 6?? IDP2.cs-intern??? 192.168.101.112? IDP2.cs-intern > > > > ?? 7?? IDP3.cs-intern??? 192.168.101.113? IDP3.cs-intern > > > > ?? 8?? ICP0.cs-intern??? 192.168.101.10?? ICP0.cs-intern > > > > ?? 9?? ICP1.cs-intern??? 192.168.101.11?? ICP1.cs-intern > > > > ? 10?? ICP2.cs-intern??? 192.168.101.12?? ICP2.cs-intern > > > > ? 11?? ICP3.cs-intern??? 192.168.101.13?? ICP3.cs-intern > > > > ? 12?? ICP4.cs-intern??? 192.168.101.14?? ICP4.cs-intern > > > > ? 13?? ICP5.cs-intern??? 192.168.101.15?? ICP5.cs-intern > > > > We have enabled traces and reproduced the issue as follows: > > > > 1.When GPFS daemon was in a respawn loop, we have started traces, all > > files from this period you can find in uploaded archive under > > *_1_nsd_threads_problem_* directory > > > > 2.We have manually stopped the ?respawn? loop on VLP1 by executing > > mmshutdown and start GPFS manually by mmstartup. All traces from this > > execution can be found in archive file under > *_2_mmshutdown_mmstartup > > _*directory > > > > All data related to this problem is uploaded to our ftp to file: > > > > ftp.ts.fujitsu.com/CS-Diagnose/IBM > > , (fe_cs_oem, 12Monkeys) > > item435_nsd_threads.tar.gz > > > > Could you please have a look at this problem? 
> > > > Best regards, > > > > Tomasz Wolski > > > > > > > > _______________________________________________ > > gpfsug-discuss mailing list > > gpfsug-discuss at spectrumscale.org > > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > > > -- > Aaron Knister > NASA Center for Climate Simulation (Code 606.2) Goddard Space Flight > Center > (301) 286-2776 > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss From secretary at gpfsug.org Tue Mar 20 08:48:19 2018 From: secretary at gpfsug.org (Secretary GPFS UG) Date: Tue, 20 Mar 2018 08:48:19 +0000 Subject: [gpfsug-discuss] Upcoming meetings Message-ID: <785558aa15b26dbd44c9e22de3b13ef9@webmail.gpfsug.org> Dear members, There are a number of opportunities over the coming weeks for you to meet face to face with other group members and hear from Spectrum Scale experts. We'd love to see you at one of the events! If you plan to attend, please register: Spectrum Scale Usergroup, Singapore, March 26, https://www.eventbrite.com/e/spectrum-scale-user-group-singapore-march-2018-tickets-40429354287 [1] UK 2018 User Group Event, London, April 18 - April 19, https://www.eventbrite.com/e/spectrum-scale-gpfs-user-group-2018-registration-41489952565?aff=MailingList [2] IBM Technical University: Spectrum Scale Meet Up, London, May 14 Please email Par Hettinga par at nl.ibm.com USA 2018 Spectrum Scale User Group, Boston, May 16 - May 17, https://www.eventbrite.com/e/spectrum-scale-gpfs-user-group-us-spring-2018-meeting-tickets-43662759489?aff=mailinglist [3] Thanks for your support, Claire -- Claire O'Toole Spectrum Scale/GPFS User Group Secretary +44 (0)7508 033896 www.spectrumscaleug.org Links: ------ [1] https://www.eventbrite.com/e/spectrum-scale-user-group-singapore-march-2018-tickets-40429354287 [2] https://www.eventbrite.com/e/spectrum-scale-gpfs-user-group-2018-registration-41489952565?aff=MailingList [3] https://www.eventbrite.com/e/spectrum-scale-gpfs-user-group-us-spring-2018-meeting-tickets-43662759489?aff=mailinglist -------------- next part -------------- An HTML attachment was scrubbed... URL: From willi.engeli at id.ethz.ch Wed Mar 21 16:04:10 2018 From: willi.engeli at id.ethz.ch (Engeli Willi (ID SD)) Date: Wed, 21 Mar 2018 16:04:10 +0000 Subject: [gpfsug-discuss] CTDB RFE opened @ IBM Would like to ask for your votes Message-ID: Dear Collegues, [WE] I have missed the discussion on the CTDB upgradeability with interruption free methods. However, I hit this topic as well and some of our users where hit by the short interruption badly because of the kind of work they had running. This motivated me to open an Request for Enhancement for CTDB to support in a future release the interruption-less Upgrade. Here is the Link for the RFE: http://www.ibm.com/developerworks/rfe/execute?use_case=viewRfe&CR_ID=117919 I hope this time it works at 1. Place...... Thanks in advance Willi -------------- next part -------------- A non-text attachment was scrubbed... Name: smime.p7s Type: application/pkcs7-signature Size: 5461 bytes Desc: not available URL: From puthuppu at iu.edu Wed Mar 21 17:30:19 2018 From: puthuppu at iu.edu (Uthuppuru, Peter K) Date: Wed, 21 Mar 2018 17:30:19 +0000 Subject: [gpfsug-discuss] Hello Message-ID: <857be7f3815441c0a8e55816e61b6735@BL-CCI-D2S08.ads.iu.edu> Hello all, My name is Peter Uthuppuru and I work at Indiana University on the Research Storage team. I'm new to GPFS, HPC, etc. so I'm excited to learn more. 
Thanks, Peter -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: smime.p7s Type: application/pkcs7-signature Size: 5615 bytes Desc: not available URL: From Robert.Oesterlin at nuance.com Fri Mar 23 12:59:51 2018 From: Robert.Oesterlin at nuance.com (Oesterlin, Robert) Date: Fri, 23 Mar 2018 12:59:51 +0000 Subject: [gpfsug-discuss] Reminder - SSUG-US Spring meeting - Call for Speakers and Registration Message-ID: <28FF6725-9698-4F68-BC4E-BBCD6164BB3D@nuance.com> Reminder: The registration for the Spring meeting of the SSUG-USA is now open. This is a Free two-day and will include a large number of Spectrum Scale updates and breakout tracks. Please note that we have limited meeting space so please register early if you plan on attending. If you are interested in presenting, please contact me. We do have a few more slots for user presentations ? these do not need to be long. You can register here: https://www.eventbrite.com/e/spectrum-scale-gpfs-user-group-us-spring-2018-meeting-tickets-43662759489 DATE AND TIME Wed, May 16, 2018, 9:00 AM ? Thu, May 17, 2018, 5:00 PM EDT LOCATION IBM Cambridge Innovation Center One Rogers Street Cambridge, MA 02142-1203 Bob Oesterlin Sr Principal Storage Engineer, Nuance -------------- next part -------------- An HTML attachment was scrubbed... URL: From Paul.Caron at sig.com Fri Mar 23 20:10:05 2018 From: Paul.Caron at sig.com (Caron, Paul) Date: Fri, 23 Mar 2018 20:10:05 +0000 Subject: [gpfsug-discuss] Pool layoutMap option changes following GPFS upgrades Message-ID: <980f9a27290342e28197154b924d0abf@msx.bala.susq.com> Hi, Has anyone run into a situation where the layoutMap option for a pool changes from "scatter" to "cluster" following a GPFS software upgrade? We recently upgraded a file system from 3.5.0.21, to 4.1.1.17, and finally to 4.2.3.6. We noticed that the layoutMap option for two of our pools changed following the upgrades. We didn't recreate the file system or any of the pools. Further lab testing has revealed that the layoutMap option change actually occurred during the first upgrade to 4.1.1.17, and it was simply carried forward to 4.2.3.6. We have a PMR open with IBM on this problem, but they have told us that layoutMap option changes are impossible for existing pools, and that a software upgrade couldn't do this. I sent the results of my lab testing today, so I'm hoping to get a better response. We would rather not have to recreate all the pools, but it is starting to look like that may be the only option to fix this. Also, it's unclear if this could happen again during future upgrades. Here's some additional background. * The "-j" option for the file system is "cluster" * We have a pretty small cluster; just 13 nodes * When reproducing the problem, we noted that the layoutMap option didn't change until the final node was upgraded * The layoutMap option changed before running the "mmchconfig release=LATEST" and "mmchfs -V full" commands, so those don't seem to be related to the problem Thanks, Paul C. SIG ________________________________ IMPORTANT: The information contained in this email and/or its attachments is confidential. If you are not the intended recipient, please notify the sender immediately by reply and immediately delete this message and all its attachments. Any review, use, reproduction, disclosure or dissemination of this message or any attachment by an unintended recipient is strictly prohibited. 
Neither this message nor any attachment is intended as or should be construed as an offer, solicitation or recommendation to buy or sell any security or other financial instrument. Neither the sender, his or her employer nor any of their respective affiliates makes any warranties as to the completeness or accuracy of any of the information contained herein or that this message or any of its attachments is free of viruses. -------------- next part -------------- An HTML attachment was scrubbed... URL: From G.Horton at bham.ac.uk Mon Mar 26 12:25:26 2018 From: G.Horton at bham.ac.uk (Gareth Horton) Date: Mon, 26 Mar 2018 11:25:26 +0000 Subject: [gpfsug-discuss] GPFS Encryption Message-ID: <7F99F86D-9818-45FC-92AD-971CF44B0462@bham.ac.uk> Hi. All, I would be interested to hear if any members have experience implementing Encryption?, any gotchas, tips or any other information which may help with the preparation and implementation stages would be appreciated. I am currently reading through the documentation and reviewing the preparation steps, and with a scheduled maintenance window on the horizon it would be a good opportunity to carry out any preparatory steps requiring an outage. If there are any aspects of the configuration which in hindsight could have been done at the preparation stage this would be especially useful. Many Thanks Gareth ---------------------- On 24/03/2018, 12:00, "gpfsug-discuss-bounces at spectrumscale.org on behalf of gpfsug-discuss-request at spectrumscale.org" wrote: Send gpfsug-discuss mailing list submissions to gpfsug-discuss at spectrumscale.org To subscribe or unsubscribe via the World Wide Web, visit http://gpfsug.org/mailman/listinfo/gpfsug-discuss or, via email, send a message with subject or body 'help' to gpfsug-discuss-request at spectrumscale.org You can reach the person managing the list at gpfsug-discuss-owner at spectrumscale.org When replying, please edit your Subject line so it is more specific than "Re: Contents of gpfsug-discuss digest..." Today's Topics: 1. Reminder - SSUG-US Spring meeting - Call for Speakers and Registration (Oesterlin, Robert) 2. Pool layoutMap option changes following GPFS upgrades (Caron, Paul) ---------------------------------------------------------------------- Message: 1 Date: Fri, 23 Mar 2018 12:59:51 +0000 From: "Oesterlin, Robert" To: gpfsug main discussion list Subject: [gpfsug-discuss] Reminder - SSUG-US Spring meeting - Call for Speakers and Registration Message-ID: <28FF6725-9698-4F68-BC4E-BBCD6164BB3D at nuance.com> Content-Type: text/plain; charset="utf-8" Reminder: The registration for the Spring meeting of the SSUG-USA is now open. This is a Free two-day and will include a large number of Spectrum Scale updates and breakout tracks. Please note that we have limited meeting space so please register early if you plan on attending. If you are interested in presenting, please contact me. We do have a few more slots for user presentations ? these do not need to be long. You can register here: https://www.eventbrite.com/e/spectrum-scale-gpfs-user-group-us-spring-2018-meeting-tickets-43662759489 DATE AND TIME Wed, May 16, 2018, 9:00 AM ? Thu, May 17, 2018, 5:00 PM EDT LOCATION IBM Cambridge Innovation Center One Rogers Street Cambridge, MA 02142-1203 Bob Oesterlin Sr Principal Storage Engineer, Nuance -------------- next part -------------- An HTML attachment was scrubbed... 
URL: ------------------------------ Message: 2 Date: Fri, 23 Mar 2018 20:10:05 +0000 From: "Caron, Paul" To: "gpfsug-discuss at spectrumscale.org" Subject: [gpfsug-discuss] Pool layoutMap option changes following GPFS upgrades Message-ID: <980f9a27290342e28197154b924d0abf at msx.bala.susq.com> Content-Type: text/plain; charset="us-ascii" Hi, Has anyone run into a situation where the layoutMap option for a pool changes from "scatter" to "cluster" following a GPFS software upgrade? We recently upgraded a file system from 3.5.0.21, to 4.1.1.17, and finally to 4.2.3.6. We noticed that the layoutMap option for two of our pools changed following the upgrades. We didn't recreate the file system or any of the pools. Further lab testing has revealed that the layoutMap option change actually occurred during the first upgrade to 4.1.1.17, and it was simply carried forward to 4.2.3.6. We have a PMR open with IBM on this problem, but they have told us that layoutMap option changes are impossible for existing pools, and that a software upgrade couldn't do this. I sent the results of my lab testing today, so I'm hoping to get a better response. We would rather not have to recreate all the pools, but it is starting to look like that may be the only option to fix this. Also, it's unclear if this could happen again during future upgrades. Here's some additional background. * The "-j" option for the file system is "cluster" * We have a pretty small cluster; just 13 nodes * When reproducing the problem, we noted that the layoutMap option didn't change until the final node was upgraded * The layoutMap option changed before running the "mmchconfig release=LATEST" and "mmchfs -V full" commands, so those don't seem to be related to the problem Thanks, Paul C. SIG ________________________________ IMPORTANT: The information contained in this email and/or its attachments is confidential. If you are not the intended recipient, please notify the sender immediately by reply and immediately delete this message and all its attachments. Any review, use, reproduction, disclosure or dissemination of this message or any attachment by an unintended recipient is strictly prohibited. Neither this message nor any attachment is intended as or should be construed as an offer, solicitation or recommendation to buy or sell any security or other financial instrument. Neither the sender, his or her employer nor any of their respective affiliates makes any warranties as to the completeness or accuracy of any of the information contained herein or that this message or any of its attachments is free of viruses. -------------- next part -------------- An HTML attachment was scrubbed... URL: ------------------------------ _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss End of gpfsug-discuss Digest, Vol 74, Issue 45 ********************************************** From chair at spectrumscale.org Mon Mar 26 12:52:26 2018 From: chair at spectrumscale.org (Simon Thompson (Spectrum Scale User Group Chair)) Date: Mon, 26 Mar 2018 12:52:26 +0100 Subject: [gpfsug-discuss] RFE Process ... Burning Issues Message-ID: <563267E8-EAE7-4C73-BA54-266DDE94AB02@spectrumscale.org> Hi All, We?ve been talking with product management about the RFE process and have agreed that we?ll try out a community-voting process. First up, we are piloting this idea, hopefully it will work out, but it may also need tweaks as we move forward. 
One of the things we've been asking for is a better way for the Spectrum Scale user group community to vote on RFEs. Sure we get people posting to the list, but we're looking at whether we can make it a better/more formal process to support this. Talking with IBM, we also recognise that with a large number of RFEs, it can be difficult for them to track work tasks being completed, but with the community RFEs, there is a commitment to try and track them closely and report back on progress later in the year.

To submit an RFE using this process, you must complete the form available at: https://ibm.box.com/v/EnhBlitz  (Enhancement Blitz template v1.pptx)

The form provides some guidance on a good and bad RFE. Sure a lot of us are techie/engineers, so please try to explain what problem you are solving rather than trying to provide a solution. (i.e. leave the technical implementation details to those with the source code).

Each site is limited to 2 submissions and they will be looked over by the Spectrum Scale community leaders; we may ask people to merge requests, send back for more info etc, or there may be some that we know will just never be progressed for various reasons.

At the April user group in the UK, we have an RFE (Burning issues) session planned. Submitters of the RFE will be expected to provide a 1-3 minute pitch for their RFE. We've placed the session at the end of the day (UK time) to try and ensure USA people can participate. Remote presentation of your RFE is fine and we plan to live-stream the session.

Each person will have 3 votes to choose what they think are their highest priority requests. Again remote voting is perfectly fine but only 3 votes per person. The requests with the highest number of votes will then be given a higher chance of being implemented. There's a possibility that some may even make the winter release cycle. Either way, we plan to track the "chosen" RFEs more closely and provide an update at the November USA meeting (likely the SC18 one). The submission and voting process is also planned to be run again in time for the November meeting.

Anyone wanting to submit an RFE for consideration should submit the form by email to rfe at spectrumscaleug.org *before* 13th April. We'll be posting the submitted RFEs up at the box site as well; you are encouraged to visit the site regularly and check the submissions as you may want to contact the author of an RFE to provide more information/support the RFE. Anything received after this date will be held over to the November cycle. The earlier you submit, the better chance it has of being included (we plan to limit the number to be considered) and it will give us time to review the RFE and come back for more information/clarification if needed. You must also be prepared to provide a 1-3 minute pitch for your RFE (in person or remote) for the UK user group meeting.

You are welcome to submit any RFE you have already put into the RFE portal for this process to garner community votes for it. There is space on the form to provide the existing RFE number.

If you have any comments on the process, you can also email them to rfe at spectrumscaleug.org as well.

Thanks to Carl Zeite for supporting this plan...

Get submitting!

Simon
(UK Group Chair)
-------------- next part --------------
An HTML attachment was scrubbed...
URL: From john.hearns at asml.com Mon Mar 26 13:14:35 2018 From: john.hearns at asml.com (John Hearns) Date: Mon, 26 Mar 2018 12:14:35 +0000 Subject: [gpfsug-discuss] GPFS Encryption In-Reply-To: <7F99F86D-9818-45FC-92AD-971CF44B0462@bham.ac.uk> References: <7F99F86D-9818-45FC-92AD-971CF44B0462@bham.ac.uk> Message-ID: Fbeel Tnergu. Pnaabg nqq nalguvta hfrshy urer. -----Original Message----- From: gpfsug-discuss-bounces at spectrumscale.org [mailto:gpfsug-discuss-bounces at spectrumscale.org] On Behalf Of Gareth Horton Sent: Monday, March 26, 2018 1:25 PM To: gpfsug-discuss at spectrumscale.org Subject: Re: [gpfsug-discuss] GPFS Encryption Hi. All, I would be interested to hear if any members have experience implementing Encryption?, any gotchas, tips or any other information which may help with the preparation and implementation stages would be appreciated. -- The information contained in this communication and any attachments is confidential and may be privileged, and is for the sole use of the intended recipient(s). Any unauthorized review, use, disclosure or distribution is prohibited. Unless explicitly stated otherwise in the body of this communication or the attachment thereto (if any), the information is provided on an AS-IS basis without any express or implied warranties or liabilities. To the extent you are relying on this information, you are doing so at your own risk. If you are not the intended recipient, please notify the sender immediately by replying to this message and destroy all copies of this message and any attachments. Neither the sender nor the company/group of companies he or she represents shall be liable for the proper and complete transmission of the information contained in this communication, or for any delay in its receipt. From S.J.Thompson at bham.ac.uk Mon Mar 26 13:46:47 2018 From: S.J.Thompson at bham.ac.uk (Simon Thompson (IT Research Support)) Date: Mon, 26 Mar 2018 12:46:47 +0000 Subject: [gpfsug-discuss] GPFS Encryption Message-ID: <45854D94-EE43-4931-BF07-E1BD6773EAAE@bham.ac.uk> John, I think we might need the decrypt key ... Simon ?On 26/03/2018, 13:29, "gpfsug-discuss-bounces at spectrumscale.org on behalf of john.hearns at asml.com" wrote: Fbeel Tnergu. Pnaabg nqq nalguvta hfrshy urer. From jtucker at pixitmedia.com Mon Mar 26 13:48:56 2018 From: jtucker at pixitmedia.com (Jez Tucker) Date: Mon, 26 Mar 2018 13:48:56 +0100 Subject: [gpfsug-discuss] GPFS Encryption In-Reply-To: <45854D94-EE43-4931-BF07-E1BD6773EAAE@bham.ac.uk> References: <45854D94-EE43-4931-BF07-E1BD6773EAAE@bham.ac.uk> Message-ID: <3c977368-266f-1ec3-bafc-03adf872bb4d@pixitmedia.com> Try.... http://www.rot13.com/ On 26/03/18 13:46, Simon Thompson (IT Research Support) wrote: > John, > > I think we might need the decrypt key ... > > Simon > > ?On 26/03/2018, 13:29, "gpfsug-discuss-bounces at spectrumscale.org on behalf of john.hearns at asml.com" wrote: > > Fbeel Tnergu. Pnaabg nqq nalguvta hfrshy urer. > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss -- *Jez Tucker* Head of Research and Development, Pixit Media 07764193820 | jtucker at pixitmedia.com www.pixitmedia.com | Tw:@pixitmedia.com -- This email is confidential in that it is intended for the exclusive attention of the addressee(s) indicated. If you are not the intended recipient, this email should not be read or disclosed to any other person. 
Please notify the sender immediately and delete this email from your computer system. Any opinions expressed are not necessarily those of the company from which this email was sent and, whilst to the best of our knowledge no viruses or defects exist, no responsibility can be accepted for any loss or damage arising from its receipt or subsequent use of this email. -------------- next part -------------- An HTML attachment was scrubbed... URL: From eric.wonderley at vt.edu Mon Mar 26 13:19:11 2018 From: eric.wonderley at vt.edu (J. Eric Wonderley) Date: Mon, 26 Mar 2018 08:19:11 -0400 Subject: [gpfsug-discuss] GPFS Encryption In-Reply-To: <7F99F86D-9818-45FC-92AD-971CF44B0462@bham.ac.uk> References: <7F99F86D-9818-45FC-92AD-971CF44B0462@bham.ac.uk> Message-ID: Hi Gareth: We have the spectrum archive product with encryption. It encrypts data on disk and tape...but not metadata. We originally had hoped to write small files with metadata...that does not happen with encryption. My guess is that the system pool(where metadata lives) cannot be encrypted. So you may pay a performance penalty for small files using encryption depending on what backends your data write policy. Eric On Mon, Mar 26, 2018 at 7:25 AM, Gareth Horton wrote: > Hi. All, > > I would be interested to hear if any members have experience implementing > Encryption?, any gotchas, tips or any other information which may help with > the preparation and implementation stages would be appreciated. > > I am currently reading through the documentation and reviewing the > preparation steps, and with a scheduled maintenance window on the horizon > it would be a good opportunity to carry out any preparatory steps requiring > an outage. > > If there are any aspects of the configuration which in hindsight could > have been done at the preparation stage this would be especially useful. > > Many Thanks > > Gareth > > ---------------------- > > On 24/03/2018, 12:00, "gpfsug-discuss-bounces at spectrumscale.org on behalf > of gpfsug-discuss-request at spectrumscale.org" spectrumscale.org on behalf of gpfsug-discuss-request at spectrumscale.org> > wrote: > > Send gpfsug-discuss mailing list submissions to > gpfsug-discuss at spectrumscale.org > > To subscribe or unsubscribe via the World Wide Web, visit > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > or, via email, send a message with subject or body 'help' to > gpfsug-discuss-request at spectrumscale.org > > You can reach the person managing the list at > gpfsug-discuss-owner at spectrumscale.org > > When replying, please edit your Subject line so it is more specific > than "Re: Contents of gpfsug-discuss digest..." > > > Today's Topics: > > 1. Reminder - SSUG-US Spring meeting - Call for Speakers and > Registration (Oesterlin, Robert) > 2. Pool layoutMap option changes following GPFS upgrades > (Caron, Paul) > > > ---------------------------------------------------------------------- > > Message: 1 > Date: Fri, 23 Mar 2018 12:59:51 +0000 > From: "Oesterlin, Robert" > To: gpfsug main discussion list > Subject: [gpfsug-discuss] Reminder - SSUG-US Spring meeting - Call for > Speakers and Registration > Message-ID: <28FF6725-9698-4F68-BC4E-BBCD6164BB3D at nuance.com> > Content-Type: text/plain; charset="utf-8" > > Reminder: The registration for the Spring meeting of the SSUG-USA is > now open. This is a Free two-day and will include a large number of > Spectrum Scale updates and breakout tracks. 
> > Please note that we have limited meeting space so please register > early if you plan on attending. If you are interested in presenting, please > contact me. We do have a few more slots for user presentations ? these do > not need to be long. > > You can register here: > > https://www.eventbrite.com/e/spectrum-scale-gpfs-user- > group-us-spring-2018-meeting-tickets-43662759489 > > DATE AND TIME > Wed, May 16, 2018, 9:00 AM ? > Thu, May 17, 2018, 5:00 PM EDT > > LOCATION > IBM Cambridge Innovation Center > One Rogers Street > Cambridge, MA 02142-1203 > > > Bob Oesterlin > Sr Principal Storage Engineer, Nuance > > -------------- next part -------------- > An HTML attachment was scrubbed... > URL: 20180323/824dbcdc/attachment-0001.html> > > ------------------------------ > > Message: 2 > Date: Fri, 23 Mar 2018 20:10:05 +0000 > From: "Caron, Paul" > To: "gpfsug-discuss at spectrumscale.org" > > Subject: [gpfsug-discuss] Pool layoutMap option changes following GPFS > upgrades > Message-ID: <980f9a27290342e28197154b924d0abf at msx.bala.susq.com> > Content-Type: text/plain; charset="us-ascii" > > Hi, > > Has anyone run into a situation where the layoutMap option for a pool > changes from "scatter" to "cluster" following a GPFS software upgrade? We > recently upgraded a file system from 3.5.0.21, to 4.1.1.17, and finally to > 4.2.3.6. We noticed that the layoutMap option for two of our pools changed > following the upgrades. We didn't recreate the file system or any of the > pools. Further lab testing has revealed that the layoutMap option change > actually occurred during the first upgrade to 4.1.1.17, and it was simply > carried forward to 4.2.3.6. We have a PMR open with IBM on this problem, > but they have told us that layoutMap option changes are impossible for > existing pools, and that a software upgrade couldn't do this. I sent the > results of my lab testing today, so I'm hoping to get a better response. > > We would rather not have to recreate all the pools, but it is starting > to look like that may be the only option to fix this. Also, it's unclear > if this could happen again during future upgrades. > > Here's some additional background. > > * The "-j" option for the file system is "cluster" > > * We have a pretty small cluster; just 13 nodes > > * When reproducing the problem, we noted that the layoutMap > option didn't change until the final node was upgraded > > * The layoutMap option changed before running the "mmchconfig > release=LATEST" and "mmchfs -V full" commands, so those don't seem to > be related to the problem > > Thanks, > > Paul C. > SIG > > > ________________________________ > > IMPORTANT: The information contained in this email and/or its > attachments is confidential. If you are not the intended recipient, please > notify the sender immediately by reply and immediately delete this message > and all its attachments. Any review, use, reproduction, disclosure or > dissemination of this message or any attachment by an unintended recipient > is strictly prohibited. Neither this message nor any attachment is intended > as or should be construed as an offer, solicitation or recommendation to > buy or sell any security or other financial instrument. Neither the sender, > his or her employer nor any of their respective affiliates makes any > warranties as to the completeness or accuracy of any of the information > contained herein or that this message or any of its attachments is free of > viruses. 
> -------------- next part -------------- > An HTML attachment was scrubbed... > URL: 20180323/181b0ac7/attachment-0001.html> > > ------------------------------ > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > > End of gpfsug-discuss Digest, Vol 74, Issue 45 > ********************************************** > > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > -------------- next part -------------- An HTML attachment was scrubbed... URL: From Paul.Caron at sig.com Mon Mar 26 16:43:24 2018 From: Paul.Caron at sig.com (Caron, Paul) Date: Mon, 26 Mar 2018 15:43:24 +0000 Subject: [gpfsug-discuss] Pool layoutMap option changes following GPFS upgrades Message-ID: <9b442159716e43f6a621c21f71067c0a@msx.bala.susq.com> By the way, the command to check the layoutMap option for your pools is "mmlspool all -L". Has anyone else noticed if this option changed during your GPFS software upgrades? Here's how our mmlspool output looked for our lab/test environment under GPFS Version 3.5.0-21: Pool: name = system poolID = 0 blockSize = 1024 KB usage = metadataOnly maxDiskSize = 16 TB layoutMap = cluster allowWriteAffinity = no writeAffinityDepth = 0 blockGroupFactor = 1 Pool: name = writecache poolID = 65537 blockSize = 1024 KB usage = dataOnly maxDiskSize = 8.0 TB layoutMap = scatter allowWriteAffinity = no writeAffinityDepth = 0 blockGroupFactor = 1 Pool: name = data poolID = 65538 blockSize = 1024 KB usage = dataOnly maxDiskSize = 8.2 TB layoutMap = scatter allowWriteAffinity = no writeAffinityDepth = 0 blockGroupFactor = 1 Here's the mmlspool output immediately after the upgrade to 4.1.1-17: Pool: name = system poolID = 0 blockSize = 1024 KB usage = metadataOnly maxDiskSize = 16 TB layoutMap = cluster allowWriteAffinity = no writeAffinityDepth = 0 blockGroupFactor = 1 Pool: name = writecache poolID = 65537 blockSize = 1024 KB usage = dataOnly maxDiskSize = 8.0 TB layoutMap = cluster allowWriteAffinity = no writeAffinityDepth = 0 blockGroupFactor = 1 Pool: name = data poolID = 65538 blockSize = 1024 KB usage = dataOnly maxDiskSize = 8.2 TB layoutMap = cluster allowWriteAffinity = no writeAffinityDepth = 0 blockGroupFactor = 1 We also determined the following: * The layoutMap option changes back to "scatter" if we revert back to 3.5.0.21. It only changes back after the last node is downgraded. * Restarting GPFS under 4.1.1-17 (via mmshutdown and mmstartup) has no effect on layoutMap in the lab (as expected). So, a simple restart doesn't fix the problem. Our production and lab deployments are using SLES 11, SP3 (3.0.101-0.47.71-default). Thanks, Paul C. SIG From: gpfsug-discuss-bounces at spectrumscale.org [mailto:gpfsug-discuss-bounces at spectrumscale.org] On Behalf Of Caron, Paul Sent: Friday, March 23, 2018 4:10 PM To: gpfsug-discuss at spectrumscale.org Subject: [gpfsug-discuss] Pool layoutMap option changes following GPFS upgrades Hi, Has anyone run into a situation where the layoutMap option for a pool changes from "scatter" to "cluster" following a GPFS software upgrade? We recently upgraded a file system from 3.5.0.21, to 4.1.1.17, and finally to 4.2.3.6. We noticed that the layoutMap option for two of our pools changed following the upgrades. We didn't recreate the file system or any of the pools. 
Further lab testing has revealed that the layoutMap option change actually occurred during the first upgrade to 4.1.1.17, and it was simply carried forward to 4.2.3.6. We have a PMR open with IBM on this problem, but they have told us that layoutMap option changes are impossible for existing pools, and that a software upgrade couldn't do this. I sent the results of my lab testing today, so I'm hoping to get a better response. We would rather not have to recreate all the pools, but it is starting to look like that may be the only option to fix this. Also, it's unclear if this could happen again during future upgrades.

Here's some additional background.

* The "-j" option for the file system is "cluster"
* We have a pretty small cluster; just 13 nodes
* When reproducing the problem, we noted that the layoutMap option didn't change until the final node was upgraded
* The layoutMap option changed before running the "mmchconfig release=LATEST" and "mmchfs -V full" commands, so those don't seem to be related to the problem

Thanks,

Paul C.
SIG

________________________________
IMPORTANT: The information contained in this email and/or its attachments is confidential. If you are not the intended recipient, please notify the sender immediately by reply and immediately delete this message and all its attachments. Any review, use, reproduction, disclosure or dissemination of this message or any attachment by an unintended recipient is strictly prohibited. Neither this message nor any attachment is intended as or should be construed as an offer, solicitation or recommendation to buy or sell any security or other financial instrument. Neither the sender, his or her employer nor any of their respective affiliates makes any warranties as to the completeness or accuracy of any of the information contained herein or that this message or any of its attachments is free of viruses.
________________________________
IMPORTANT: The information contained in this email and/or its attachments is confidential. If you are not the intended recipient, please notify the sender immediately by reply and immediately delete this message and all its attachments. Any review, use, reproduction, disclosure or dissemination of this message or any attachment by an unintended recipient is strictly prohibited. Neither this message nor any attachment is intended as or should be construed as an offer, solicitation or recommendation to buy or sell any security or other financial instrument. Neither the sender, his or her employer nor any of their respective affiliates makes any warranties as to the completeness or accuracy of any of the information contained herein or that this message or any of its attachments is free of viruses.
-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From JRLang at uwyo.edu  Mon Mar 26 22:13:39 2018
From: JRLang at uwyo.edu (Jeffrey R. Lang)
Date: Mon, 26 Mar 2018 21:13:39 +0000
Subject: [gpfsug-discuss] GPFS autoload - wait for IB portstobecomeactive
In-Reply-To:
References: <0081EB235765E14395278B9AE1DF341846510A@MBX214.d.ethz.ch><4AD44D34-5275-4ADB-8CC7-8E80170DDA7F@brown.edu>
Message-ID:

Can someone provide some clarification to this error message in my system logs:

mmfs: [E] The on-disk StripeGroup descriptor of dcs3800u31b_lun7 sgId 0x0B00620A:9C84DF56 is not valid because of bad checksum: Mar 26 12:25:50 mmmnsd2 mmfs: 'mmfsadm writeDesc sg 0B00620A:9C84DF56 2 /var/mmfs/tmp/sg_gscratch_dcs3800u31b_lun7', where device is the device name of that NSD.
I've been unable to find anything while googling that provides any details about the error. Anyone have any thoughts or commands?

We are using GPFS 4.2.3-6, under RedHat 6 and 7. The NSD nodes are all RHEL 6.

Any help appreciated.

Thanks
Jeff
-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From Renar.Grunenberg at huk-coburg.de  Tue Mar 27 07:29:06 2018
From: Renar.Grunenberg at huk-coburg.de (Grunenberg, Renar)
Date: Tue, 27 Mar 2018 06:29:06 +0000
Subject: [gpfsug-discuss] GPFS autoload - wait for IB portstobecomeactive
In-Reply-To:
References: <0081EB235765E14395278B9AE1DF341846510A@MBX214.d.ethz.ch><4AD44D34-5275-4ADB-8CC7-8E80170DDA7F@brown.edu>
Message-ID: <9a95b4b2c71748dfb4b39e23ffd4debf@SMXRF105.msg.hukrf.de>

Hallo Jeff,
you can check this with the following cmd:

mmfsadm dump nsdcksum

Your in-memory info is inconsistent with your descriptor structure on disk. The reason for this, I have no idea.

Renar Grunenberg
Abteilung Informatik - Betrieb
HUK-COBURG
Bahnhofsplatz
96444 Coburg
Telefon: 09561 96-44110
Telefax: 09561 96-44104
E-Mail: Renar.Grunenberg at huk-coburg.de
Internet: www.huk.de
________________________________
HUK-COBURG Haftpflicht-Unterstützungs-Kasse kraftfahrender Beamter Deutschlands a. G. in Coburg
Reg.-Gericht Coburg HRB 100; St.-Nr. 9212/101/00021
Sitz der Gesellschaft: Bahnhofsplatz, 96444 Coburg
Vorsitzender des Aufsichtsrats: Prof. Dr. Heinrich R. Schradin.
Vorstand: Klaus-Jürgen Heitmann (Sprecher), Stefan Gronbach, Dr. Hans Olav Herøy, Dr. Jörg Rheinländer (stv.), Sarah Rössler, Daniel Thomas.
________________________________
Diese Nachricht enthält vertrauliche und/oder rechtlich geschützte Informationen. Wenn Sie nicht der richtige Adressat sind oder diese Nachricht irrtümlich erhalten haben, informieren Sie bitte sofort den Absender und vernichten Sie diese Nachricht. Das unerlaubte Kopieren sowie die unbefugte Weitergabe dieser Nachricht ist nicht gestattet.
This information may contain confidential and/or privileged information. If you are not the intended recipient (or have received this information in error) please notify the sender immediately and destroy this information. Any unauthorized copying, disclosure or distribution of the material in this information is strictly forbidden.
________________________________
Von: gpfsug-discuss-bounces at spectrumscale.org [mailto:gpfsug-discuss-bounces at spectrumscale.org] Im Auftrag von Jeffrey R. Lang
Gesendet: Montag, 26. März 2018 23:14
An: gpfsug main discussion list
Betreff: Re: [gpfsug-discuss] GPFS autoload - wait for IB portstobecomeactive

Can someone provide some clarification to this error message in my system logs:

mmfs: [E] The on-disk StripeGroup descriptor of dcs3800u31b_lun7 sgId 0x0B00620A:9C84DF56 is not valid because of bad checksum: Mar 26 12:25:50 mmmnsd2 mmfs: 'mmfsadm writeDesc sg 0B00620A:9C84DF56 2 /var/mmfs/tmp/sg_gscratch_dcs3800u31b_lun7', where device is the device name of that NSD.

I've been unable to find anything while googling that provides any details about the error. Anyone have any thoughts or commands?

We are using GPFS 4.2.3-6, under RedHat 6 and 7. The NSD nodes are all RHEL 6.

Any help appreciated.

Thanks
Jeff
-------------- next part --------------
An HTML attachment was scrubbed...
URL: From scale at us.ibm.com Tue Mar 27 07:44:29 2018 From: scale at us.ibm.com (IBM Spectrum Scale) Date: Tue, 27 Mar 2018 12:14:29 +0530 Subject: [gpfsug-discuss] GPFS autoload - wait for IB portstobecomeactive In-Reply-To: References: <0081EB235765E14395278B9AE1DF341846510A@MBX214.d.ethz.ch><4AD44D34-5275-4ADB-8CC7-8E80170DDA7F@brown.edu> Message-ID: This means that the stripe group descriptor on the disk dcs3800u31b_lun7 is corrupted. As we maintain copies of the stripe group descriptor on other disks as well we can copy the good descriptor from one of those disks to this one. Please open a PMR and work with IBM support to get this fixed. Regards, The Spectrum Scale (GPFS) team ------------------------------------------------------------------------------------------------------------------ If you feel that your question can benefit other users of Spectrum Scale (GPFS), then please post it to the public IBM developerWroks Forum at https://www.ibm.com/developerworks/community/forums/html/forum?id=11111111-0000-0000-0000-000000000479 . If your query concerns a potential software error in Spectrum Scale (GPFS) and you have an IBM software maintenance contract please contact 1-800-237-5511 in the United States or your local IBM Service Center in other countries. The forum is informally monitored as time permits and should not be used for priority messages to the Spectrum Scale (GPFS) team. From: "Jeffrey R. Lang" To: gpfsug main discussion list Date: 03/27/2018 04:15 AM Subject: Re: [gpfsug-discuss] GPFS autoload - wait for IB portstobecomeactive Sent by: gpfsug-discuss-bounces at spectrumscale.org Can someone provide some clarification to this error message in my system logs: mmfs: [E] The on-disk StripeGroup descriptor of dcs3800u31b_lun7 sgId 0x0B00620A:9C84DF56 is not valid because of bad checksum: Mar 26 12:25:50 mmmnsd2 mmfs: 'mmfsadm writeDesc sg 0B00620A:9C84DF56 2 /var/mmfs/tmp/sg_gscratch_dcs3800u31b_lun7', where device is the device name of that NSD. I?ve been unable to find anything while googling that provides any details about the error. Anyone have any thoughts or commands? We are using GPFS 4.2.3.-6, under RedHat 6 and 7. The NSD nodes are all RHEL 6. Any help appreciated. Thanks Jeff_______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org https://urldefense.proofpoint.com/v2/url?u=http-3A__gpfsug.org_mailman_listinfo_gpfsug-2Ddiscuss&d=DwICAg&c=jf_iaSHvJObTbx-siA1ZOg&r=IbxtjdkPAM2Sbon4Lbbi4w&m=3u8q7zs1oLvf23bMVLe5YO_0SFSILFiL1d85LRDp9aQ&s=lf2ivnySwvhLDS-AnJSbm6cWcpO2R-vdHOll5TvkBDU&e= -------------- next part -------------- An HTML attachment was scrubbed... URL: From sandeep.patil at in.ibm.com Tue Mar 27 12:53:50 2018 From: sandeep.patil at in.ibm.com (Sandeep Ramesh) Date: Tue, 27 Mar 2018 17:23:50 +0530 Subject: [gpfsug-discuss] Latest Technical Blogs on Spectrum Scale In-Reply-To: References: Message-ID: Dear User Group Members, In continuation , here are list of development blogs in the this quarter (Q1 2018). As discussed in User Groups, passing it along: GDPR Compliance and Unstructured Data Storage https://developer.ibm.com/storage/2018/03/27/gdpr-compliance-unstructure-data-storage/ IBM Spectrum Scale for Linux on IBM Z ? 
Release 5.0 features and highlights https://developer.ibm.com/storage/2018/03/09/ibm-spectrum-scale-linux-ibm-z-release-5-0-features-highlights/ Management GUI enhancements in IBM Spectrum Scale release 5.0.0 https://developer.ibm.com/storage/2018/01/18/gui-enhancements-in-spectrum-scale-release-5-0-0/ IBM Spectrum Scale 5.0.0 ? What?s new in NFS? https://developer.ibm.com/storage/2018/01/18/ibm-spectrum-scale-5-0-0-whats-new-nfs/ Benefits and implementation of Spectrum Scale sudo wrappers https://developer.ibm.com/storage/2018/01/15/benefits-implementation-spectrum-scale-sudo-wrappers/ IBM Spectrum Scale: Big Data and Analytics Solution Brief https://developer.ibm.com/storage/2018/01/15/ibm-spectrum-scale-big-data-analytics-solution-brief/ Variant Sub-blocks in Spectrum Scale 5.0 https://developer.ibm.com/storage/2018/01/11/spectrum-scale-variant-sub-blocks/ Compression support in Spectrum Scale 5.0.0 https://developer.ibm.com/storage/2018/01/11/compression-support-spectrum-scale-5-0-0/ IBM Spectrum Scale Versus Apache Hadoop HDFS https://developer.ibm.com/storage/2018/01/10/spectrumscale_vs_hdfs/ ESS Fault Tolerance https://developer.ibm.com/storage/2018/01/09/ess-fault-tolerance/ Genomic Workloads ? How To Get it Right From Infrastructure Point Of View. https://developer.ibm.com/storage/2018/01/06/genomic-workloads-get-right-infrastructure-point-view/ IBM Spectrum Scale On AWS Cloud : This video explains how to deploy IBM Spectrum Scale on AWS. This solution helps the users who require highly available access to a shared name space across multiple instances with good performance, without requiring an in-depth knowledge of IBM Spectrum Scale. Detailed Demo : https://www.youtube.com/watch?v=6j5Xj_d0bh4 Brief Demo : https://www.youtube.com/watch?v=-aMQKPW_RfY. For more : Search /browse here: https://developer.ibm.com/storage/blog Consolidation list: https://www.ibm.com/developerworks/community/wikis/home?lang=en#!/wiki/General%20Parallel%20File%20System%20(GPFS)/page/White%20Papers%20%26%20Media From: Sandeep Ramesh/India/IBM To: gpfsug-discuss at spectrumscale.org Cc: Doris Conti/Poughkeepsie/IBM at IBMUS Date: 01/10/2018 12:13 PM Subject: Re: Latest Technical Blogs on Spectrum Scale Dear User Group Members, Here are list of development blogs in the last quarter. Passing it to this email group as Doris had got a feedback in the UG meetings to notify the members with the latest updates periodically. Genomic Workloads ? How To Get it Right From Infrastructure Point Of View. https://developer.ibm.com/storage/2018/01/06/genomic-workloads-get-right-infrastructure-point-view/ IBM Spectrum Scale Versus Apache Hadoop HDFS https://developer.ibm.com/storage/2018/01/10/spectrumscale_vs_hdfs/ ESS Fault Tolerance https://developer.ibm.com/storage/2018/01/09/ess-fault-tolerance/ IBM Spectrum Scale MMFSCK ? Savvy Enhancements https://developer.ibm.com/storage/2018/01/05/ibm-spectrum-scale-mmfsck-savvy-enhancements/ ESS Disk Management https://developer.ibm.com/storage/2018/01/02/ess-disk-management/ IBM Spectrum Scale Object Protocol On Ubuntu https://developer.ibm.com/storage/2018/01/01/ibm-spectrum-scale-object-protocol-ubuntu/ IBM Spectrum Scale 5.0 ? Whats new in Unified File and Object https://developer.ibm.com/storage/2017/12/20/ibm-spectrum-scale-5-0-whats-new-object/ A Complete Guide to ? Protocol Problem Determination Guide for IBM Spectrum Scale? ? 
Part 1 https://developer.ibm.com/storage/2017/12/19/complete-guide-protocol-problem-determination-guide-ibm-spectrum-scale-1/ IBM Spectrum Scale installation toolkit ? enhancements over releases https://developer.ibm.com/storage/2017/12/15/ibm-spectrum-scale-installation-toolkit-enhancements-releases/ Network requirements in an Elastic Storage Server Setup https://developer.ibm.com/storage/2017/12/13/network-requirements-in-an-elastic-storage-server-setup/ Co-resident migration with Transparent cloud tierin https://developer.ibm.com/storage/2017/12/05/co-resident-migration-transparent-cloud-tierin/ IBM Spectrum Scale on Hortonworks HDP Hadoop clusters : A Complete Big Data Solution https://developer.ibm.com/storage/2017/12/05/ibm-spectrum-scale-hortonworks-hdp-hadoop-clusters-complete-big-data-solution/ Big data analytics with Spectrum Scale using remote cluster mount & multi-filesystem support https://developer.ibm.com/storage/2017/11/28/big-data-analytics-spectrum-scale-using-remote-cluster-mount-multi-filesystem-support/ IBM Spectrum Scale HDFS Transparency Short Circuit Write Support https://developer.ibm.com/storage/2017/11/28/ibm-spectrum-scale-hdfs-transparency-short-circuit-write-support/ IBM Spectrum Scale HDFS Transparency Federation Support https://developer.ibm.com/storage/2017/11/27/ibm-spectrum-scale-hdfs-transparency-federation-support/ How to configure and performance tuning different system workloads on IBM Spectrum Scale Sharing Nothing Cluster https://developer.ibm.com/storage/2017/11/27/configure-performance-tuning-different-system-workloads-ibm-spectrum-scale-sharing-nothing-cluster/ How to configure and performance tuning Spark workloads on IBM Spectrum Scale Sharing Nothing Cluster https://developer.ibm.com/storage/2017/11/27/configure-performance-tuning-spark-workloads-ibm-spectrum-scale-sharing-nothing-cluster/ How to configure and performance tuning database workloads on IBM Spectrum Scale Sharing Nothing Cluster https://developer.ibm.com/storage/2017/11/27/configure-performance-tuning-database-workloads-ibm-spectrum-scale-sharing-nothing-cluster/ How to configure and performance tuning Hadoop workloads on IBM Spectrum Scale Sharing Nothing Cluster https://developer.ibm.com/storage/2017/11/24/configure-performance-tuning-hadoop-workloads-ibm-spectrum-scale-sharing-nothing-cluster/ IBM Spectrum Scale Sharing Nothing Cluster Performance Tuning https://developer.ibm.com/storage/2017/11/24/ibm-spectrum-scale-sharing-nothing-cluster-performance-tuning/ How to Configure IBM Spectrum Scale? with NIS based Authentication. https://developer.ibm.com/storage/2017/11/21/configure-ibm-spectrum-scale-nis-based-authentication/ For more : Search /browse here: https://developer.ibm.com/storage/blog Consolidation list: https://www.ibm.com/developerworks/community/wikis/home?lang=en#!/wiki/General%20Parallel%20File%20System%20(GPFS)/page/White%20Papers%20%26%20Media From: Sandeep Ramesh/India/IBM To: gpfsug-discuss at spectrumscale.org Cc: Doris Conti/Poughkeepsie/IBM at IBMUS Date: 11/16/2017 08:15 PM Subject: Latest Technical Blogs on Spectrum Scale Dear User Group members, Here are the Development Blogs in last 3 months on Spectrum Scale Technical Topics. Spectrum Scale Monitoring ? Know More ? https://developer.ibm.com/storage/2017/11/16/spectrum-scale-monitoring-know/ IBM Spectrum Scale 5.0 Release ? What?s coming ! 
https://developer.ibm.com/storage/2017/11/14/ibm-spectrum-scale-5-0-release-whats-coming/ Four Essentials things to know for managing data ACLs on IBM Spectrum Scale? from Windows https://developer.ibm.com/storage/2017/11/13/four-essentials-things-know-managing-data-acls-ibm-spectrum-scale-windows/ GSSUTILS: A new way of running SSR, Deploying or Upgrading ESS Server https://developer.ibm.com/storage/2017/11/13/gssutils/ IBM Spectrum Scale Object Authentication https://developer.ibm.com/storage/2017/11/02/spectrum-scale-object-authentication/ Video Surveillance ? Choosing the right storage https://developer.ibm.com/storage/2017/11/02/video-surveillance-choosing-right-storage/ IBM Spectrum scale object deep dive training with problem determination https://www.slideshare.net/SmitaRaut/ibm-spectrum-scale-object-deep-dive-training Spectrum Scale as preferred software defined storage for Ubuntu OpenStack https://developer.ibm.com/storage/2017/09/29/spectrum-scale-preferred-software-defined-storage-ubuntu-openstack/ IBM Elastic Storage Server 2U24 Storage ? an All-Flash offering, a performance workhorse https://developer.ibm.com/storage/2017/10/06/ess-5-2-flash-storage/ A Complete Guide to Configure LDAP-based authentication with IBM Spectrum Scale? for File Access https://developer.ibm.com/storage/2017/09/21/complete-guide-configure-ldap-based-authentication-ibm-spectrum-scale-file-access/ Deploying IBM Spectrum Scale on AWS Quick Start https://developer.ibm.com/storage/2017/09/18/deploy-ibm-spectrum-scale-on-aws-quick-start/ Monitoring Spectrum Scale Object metrics https://developer.ibm.com/storage/2017/09/14/monitoring-spectrum-scale-object-metrics/ Tier your data with ease to Spectrum Scale Private Cloud(s) using Moonwalk Universal https://developer.ibm.com/storage/2017/09/14/tier-data-ease-spectrum-scale-private-clouds-using-moonwalk-universal/ Why do I see owner as ?Nobody? for my export mounted using NFSV4 Protocol on IBM Spectrum Scale?? https://developer.ibm.com/storage/2017/09/08/see-owner-nobody-export-mounted-using-nfsv4-protocol-ibm-spectrum-scale/ IBM Spectrum Scale? Authentication using Active Directory and LDAP https://developer.ibm.com/storage/2017/09/01/ibm-spectrum-scale-authentication-using-active-directory-ldap/ IBM Spectrum Scale? Authentication using Active Directory and RFC2307 https://developer.ibm.com/storage/2017/09/01/ibm-spectrum-scale-authentication-using-active-directory-rfc2307/ High Availability Implementation with IBM Spectrum Virtualize and IBM Spectrum Scale https://developer.ibm.com/storage/2017/08/30/high-availability-implementation-ibm-spectrum-virtualize-ibm-spectrum-scale/ 10 Frequently asked Questions on configuring Authentication using AD + AUTO ID mapping on IBM Spectrum Scale?. https://developer.ibm.com/storage/2017/08/04/10-frequently-asked-questions-configuring-authentication-using-ad-auto-id-mapping-ibm-spectrum-scale/ IBM Spectrum Scale? Authentication using Active Directory https://developer.ibm.com/storage/2017/07/30/ibm-spectrum-scale-auth-using-active-directory/ Five cool things that you didn?t know Transparent Cloud Tiering on Spectrum Scale can do https://developer.ibm.com/storage/2017/07/29/five-cool-things-didnt-know-transparent-cloud-tiering-spectrum-scale-can/ IBM Spectrum Scale GUI videos https://developer.ibm.com/storage/2017/07/25/ibm-spectrum-scale-gui-videos/ IBM Spectrum Scale? Authentication ? 
Planning for NFS Access https://developer.ibm.com/storage/2017/07/24/ibm-spectrum-scale-planning-nfs-access/ For more : Search /browse here: https://developer.ibm.com/storage/blog Consolidation list: https://www.ibm.com/developerworks/community/wikis/home?lang=en#!/wiki/General%20Parallel%20File%20System%20(GPFS)/page/White%20Papers%20%26%20Media -------------- next part -------------- An HTML attachment was scrubbed... URL: From bipcuds at gmail.com Tue Mar 27 23:26:16 2018 From: bipcuds at gmail.com (Keith Ball) Date: Tue, 27 Mar 2018 18:26:16 -0400 Subject: [gpfsug-discuss] Control of where mmcrsnapshot runs? Message-ID: Hi All, Two questions on snapshots: 1.) I note that neither "mmcrsnapshot" nor "mmdelsnapshot" appear to have an "-N" option as "PIT" commands typically do. Is there any way to control where threads for snapshot creation/deletion run? (I assume the filesystem manager will always be involved regardless). 2.) When mmdelsnapshot hangs or times out, the error messages tend to appear on client nodes, and not necessarily the node where mmdelsnapshot is run from, not the FS manager. Besides telling all users "don't use any I/O" when runnign these commands, are there ways that folks have found to avoid hangs and timeouts of mmdelsnapshot? FWIW our filesystem manager is probably overextended (replication factor 2 on data+MD, 30 daily snapshots kept, a number of client clusters served, plus the FS manager is also an NSD server). Many Thanks, Keith RedLine Performance Solutions LLC -------------- next part -------------- An HTML attachment was scrubbed... URL: From bbanister at jumptrading.com Wed Mar 28 00:44:33 2018 From: bbanister at jumptrading.com (Bryan Banister) Date: Tue, 27 Mar 2018 23:44:33 +0000 Subject: [gpfsug-discuss] Control of where mmcrsnapshot runs? In-Reply-To: References: Message-ID: <7ae89940fa234b79b3538be339109cba@jumptrading.com> What version of GPFS are you running Keith? All nodes mounting the file system must briefly quiesce I/O operations during the snapshot create operations, hence the ?Quiescing all file system operations.? message in the output. So don?t really see a way to specify a specific set of nodes for these operations. They have made updates in newer releases of GPFS to combine operations (e.g. create and delete snapshots at the same time) which IBM says ?system performance is increased by batching operations and reducing overhead.?. Trying to isolate GPFS resources (e.g. cgroups) on the clients (e.g. CPU and memory resources dedicated to GPFS/SSH/kernel/networking/etc) can help them respond more quickly to quiesce I/O requests. HTH, -Bryan From: gpfsug-discuss-bounces at spectrumscale.org [mailto:gpfsug-discuss-bounces at spectrumscale.org] On Behalf Of Keith Ball Sent: Tuesday, March 27, 2018 5:26 PM To: gpfsug-discuss at spectrumscale.org Subject: [gpfsug-discuss] Control of where mmcrsnapshot runs? Note: External Email ________________________________ Hi All, Two questions on snapshots: 1.) I note that neither "mmcrsnapshot" nor "mmdelsnapshot" appear to have an "-N" option as "PIT" commands typically do. Is there any way to control where threads for snapshot creation/deletion run? (I assume the filesystem manager will always be involved regardless). 2.) When mmdelsnapshot hangs or times out, the error messages tend to appear on client nodes, and not necessarily the node where mmdelsnapshot is run from, not the FS manager. 
Besides telling all users "don't use any I/O" when runnign these commands, are there ways that folks have found to avoid hangs and timeouts of mmdelsnapshot? FWIW our filesystem manager is probably overextended (replication factor 2 on data+MD, 30 daily snapshots kept, a number of client clusters served, plus the FS manager is also an NSD server). Many Thanks, Keith RedLine Performance Solutions LLC ________________________________ Note: This email is for the confidential use of the named addressee(s) only and may contain proprietary, confidential or privileged information. If you are not the intended recipient, you are hereby notified that any review, dissemination or copying of this email is strictly prohibited, and to please notify the sender immediately and destroy this email and any attachments. Email transmission cannot be guaranteed to be secure or error-free. The Company, therefore, does not make any guarantees as to the completeness or accuracy of this email or any attachments. This email is for informational purposes only and does not constitute a recommendation, offer, request or solicitation of any kind to buy, sell, subscribe, redeem or perform any type of transaction of a financial product. -------------- next part -------------- An HTML attachment was scrubbed... URL: From Dwayne.Hart at med.mun.ca Wed Mar 28 15:56:55 2018 From: Dwayne.Hart at med.mun.ca (Dwayne.Hart at med.mun.ca) Date: Wed, 28 Mar 2018 14:56:55 +0000 Subject: [gpfsug-discuss] Introduction to the "gpfsug-discuss" mailing list Message-ID: Hi, My name is Dwayne Hart. I currently work for the Center for Health Informatics & Analytics (CHIA), Faculty of Medicine at Memorial University of Newfoundland as a Systems/Network Security Administrator. In this role I am responsible for several HPC (Intel and Power) instances, OpenStack cloud environment and research data. We leverage IBM Spectrum Scale Storage as our primary storage solution. I have been working with GPFS since 2015. Best, Dwayne --- Systems Administrator Center for Health Informatics & Analytics (CHIA) Craig L. Dobbin Center for Genetics Room 4M409 300 Prince Philip Dr. St. John?s, NL Canada A1B 3V6 Tel: (709) 864-6631 E Mail: dwayne.hart at med.mun.ca -------------- next part -------------- An HTML attachment was scrubbed... URL: From ingo.altenburger at id.ethz.ch Thu Mar 29 13:20:45 2018 From: ingo.altenburger at id.ethz.ch (Altenburger Ingo (ID SD)) Date: Thu, 29 Mar 2018 12:20:45 +0000 Subject: [gpfsug-discuss] REST API function for 'mmsmb exportacl list' Message-ID: We were very hopeful to replace our storage provisioning automation based on cli commands with the new functions provided in REST API. Since it seems that almost all protocol related commands are already implemented with 5.0.0.1 REST interface, we have still not found an equivalent for mmsmb exportacl list to get the share permissions of a share. Does anybody know that this is already in but not yet documented or is it for sure still not under consideration? Thanks Ingo -------------- next part -------------- An HTML attachment was scrubbed... URL: From delmard at br.ibm.com Thu Mar 29 14:41:53 2018 From: delmard at br.ibm.com (Delmar Demarchi) Date: Thu, 29 Mar 2018 10:41:53 -0300 Subject: [gpfsug-discuss] AFM-DR Questions Message-ID: Hello experts. We have a Scale project with AFM-DR to be implemented and after read the KC documentation, we have some questions about. 
- Do you know any reason why we changed the Recovery point objective (RPO) snapshots by 15 to 720 minutes in the version 5.0.0 of IBM Spectrum Scale AFM-DR? - Can we use additional Independent Peer-snapshots to reduce the RPO interval (720 minutes) of IBM Spectrum Scale AFM-DR? - In addition to the above question, can we use these snapshots to update the new primary site after a failover occur for the most up to date snapshot? - According to the documentation, we are not able to replicate Dependent filesets, but if these dependents filesets are part of an existing Independent fileset. Do you see any issues/concerns with this? Thank you in advance. Delmar Demarchi .'. (delmard at br.ibm.com) -------------- next part -------------- An HTML attachment was scrubbed... URL: From daniel.kidger at uk.ibm.com Thu Mar 29 17:00:57 2018 From: daniel.kidger at uk.ibm.com (Daniel Kidger) Date: Thu, 29 Mar 2018 16:00:57 +0000 Subject: [gpfsug-discuss] GPFS Encryption In-Reply-To: <3c977368-266f-1ec3-bafc-03adf872bb4d@pixitmedia.com> Message-ID: I tried a dictionary attack, but ?nalguvta? was a typo. Should have been: ?Fbeel Tnergu. Pnaabg nqq nalguvat hfrshy urer? ? John: anythign (sic) to add? :-) Daniel Dr Daniel Kidger IBM Technical Sales Specialist Software Defined Solution Sales +44-(0)7818 522 266 daniel.kidger at uk.ibm.com > On 26 Mar 2018, at 14:49, Jez Tucker wrote: > > Try.... http://www.rot13.com/ > >> On 26/03/18 13:46, Simon Thompson (IT Research Support) wrote: >> John, >> >> I think we might need the decrypt key ... >> >> Simon >> >> ?On 26/03/2018, 13:29, "gpfsug-discuss-bounces at spectrumscale.org on behalf of john.hearns at asml.com" wrote: >> >> Fbeel Tnergu. Pnaabg nqq nalguvta hfrshy urer. >> >> _______________________________________________ >> gpfsug-discuss mailing list >> gpfsug-discuss at spectrumscale.org >> http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > > -- > Jez Tucker > Head of Research and Development, Pixit Media > 07764193820 | jtucker at pixitmedia.com > www.pixitmedia.com | Tw:@pixitmedia.com > > > This email is confidential in that it is intended for the exclusive attention of the addressee(s) indicated. If you are not the intended recipient, this email should not be read or disclosed to any other person. Please notify the sender immediately and delete this email from your computer system. Any opinions expressed are not necessarily those of the company from which this email was sent and, whilst to the best of our knowledge no viruses or defects exist, no responsibility can be accepted for any loss or damage arising from its receipt or subsequent use of this email. Unless stated otherwise above: IBM United Kingdom Limited - Registered in England and Wales with number 741598. Registered office: PO Box 41, North Harbour, Portsmouth, Hampshire PO6 3AU -------------- next part -------------- An HTML attachment was scrubbed... URL: From bipcuds at gmail.com Thu Mar 29 17:15:19 2018 From: bipcuds at gmail.com (Keith Ball) Date: Thu, 29 Mar 2018 12:15:19 -0400 Subject: [gpfsug-discuss] Control of where mmcrsnapshot runs? Message-ID: You're right, Brian, the key load will be on the filesystem manager in any case, and as you say, all nodes nodes must quiesce - it's not really an issue of where to run the command, like it would be for mmfsck, etc. GPFS version is 3.5.0.26. We'll investigate upgrade to later version that accommodates combined operations. 
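As a rough illustration of the commands being discussed here (the file system name, snapshot naming scheme and retention count below are invented, and the mmlssnapshot parsing is deliberately simplistic), a daily snapshot rotation might look something like this:

fs=gpfs01
today=$(date +%Y%m%d)

# Create a dated global snapshot of the file system.
mmcrsnapshot "$fs" "daily_${today}"

# Keep roughly 30 dailies: find the oldest "daily_" snapshot and delete it
# once the count exceeds the retention window.
count=$(mmlssnapshot "$fs" | grep -c '^daily_')
oldest=$(mmlssnapshot "$fs" | awk '/^daily_/ {print $1}' | sort | head -n 1)
if [ "$count" -gt 30 ] && [ -n "$oldest" ]; then
    mmdelsnapshot "$fs" "$oldest"
fi

Regardless of where such a script runs, the create and delete still quiesce I/O across every node that has the file system mounted, which is the point made earlier in the thread.
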
I will also look into the cgroups approach - is this a documented thing, or just something that people have tinkered with/hand rolled? Thanks, Keith On Wed, Mar 28, 2018 at 7:00 AM, > > What version of GPFS are you running Keith? > > All nodes mounting the file system must briefly quiesce I/O operations > during the snapshot create operations, hence the ?Quiescing all file system > operations.? message in the output. So don?t really see a way to specify a > specific set of nodes for these operations. They have made updates in > newer releases of GPFS to combine operations (e.g. create and delete > snapshots at the same time) which IBM says ?system performance is increased > by batching operations and reducing overhead.?. > > Trying to isolate GPFS resources (e.g. cgroups) on the clients (e.g. CPU > and memory resources dedicated to GPFS/SSH/kernel/networking/etc) can > help them respond more quickly to quiesce I/O requests. > > HTH, > -Bryan > > From: gpfsug-discuss-bounces at spectrumscale.org [mailto:gpfsug-discuss- > bounces at spectrumscale.org] On Behalf Of Keith Ball > Sent: Tuesday, March 27, 2018 5:26 PM > To: gpfsug-discuss at spectrumscale.org > Subject: [gpfsug-discuss] Control of where mmcrsnapshot runs? > > Note: External Email > ________________________________ > Hi All, > Two questions on snapshots: > 1.) I note that neither "mmcrsnapshot" nor "mmdelsnapshot" appear to have > an "-N" option as "PIT" commands typically do. Is there any way to control > where threads for snapshot creation/deletion run? (I assume the filesystem > manager will always be involved regardless). > > 2.) When mmdelsnapshot hangs or times out, the error messages tend to > appear on client nodes, and not necessarily the node where mmdelsnapshot is > run from, not the FS manager. Besides telling all users "don't use any I/O" > when runnign these commands, are there ways that folks have found to avoid > hangs and timeouts of mmdelsnapshot? > FWIW our filesystem manager is probably overextended (replication factor 2 > on data+MD, 30 daily snapshots kept, a number of client clusters served, > plus the FS manager is also an NSD server). > > Many Thanks, > Keith > RedLine Performance Solutions LLC > > ________________________________ > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From bbanister at jumptrading.com Thu Mar 29 18:33:30 2018 From: bbanister at jumptrading.com (Bryan Banister) Date: Thu, 29 Mar 2018 17:33:30 +0000 Subject: [gpfsug-discuss] Control of where mmcrsnapshot runs? In-Reply-To: References: Message-ID: The cgroups are something we moved onto, which has helped a lot with GPFS Clients responding to necessary GPFS commands demanding a low latency response (e.g. mmcrsnapshots, byte range locks, quota reporting, etc). Good luck! -Bryan From: gpfsug-discuss-bounces at spectrumscale.org [mailto:gpfsug-discuss-bounces at spectrumscale.org] On Behalf Of Keith Ball Sent: Thursday, March 29, 2018 11:15 AM To: gpfsug-discuss at spectrumscale.org Subject: Re: [gpfsug-discuss] Control of where mmcrsnapshot runs? Note: External Email ________________________________ You're right, Brian, the key load will be on the filesystem manager in any case, and as you say, all nodes nodes must quiesce - it's not really an issue of where to run the command, like it would be for mmfsck, etc. GPFS version is 3.5.0.26. We'll investigate upgrade to later version that accommodates combined operations. 
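The cgroup isolation Bryan describes is a hand-rolled, site-specific setup rather than a documented GPFS feature. As one rough sketch of the general idea (the CPU numbers and memory limit below are entirely made up), user jobs can be confined so that mmfsd, SSH and kernel/network threads keep some headroom to answer quiesce requests:

# Confine batch/user processes to CPUs 4-31 and a capped amount of memory,
# leaving CPUs 0-3 and the remaining memory for GPFS and system services.
cgcreate -g cpuset,memory:userjobs
cgset -r cpuset.mems=0 userjobs
cgset -r cpuset.cpus=4-31 userjobs
cgset -r memory.limit_in_bytes=115G userjobs    # node-dependent, hypothetical

# The batch system (or a wrapper) then launches user work inside the cgroup:
cgexec -g cpuset,memory:userjobs ./user_application
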
I will also look into the cgroups approach - is this a documented thing, or just something that people have tinkered with/hand rolled? Thanks, Keith On Wed, Mar 28, 2018 at 7:00 AM, What version of GPFS are you running Keith? All nodes mounting the file system must briefly quiesce I/O operations during the snapshot create operations, hence the ?Quiescing all file system operations.? message in the output. So don?t really see a way to specify a specific set of nodes for these operations. They have made updates in newer releases of GPFS to combine operations (e.g. create and delete snapshots at the same time) which IBM says ?system performance is increased by batching operations and reducing overhead.?. Trying to isolate GPFS resources (e.g. cgroups) on the clients (e.g. CPU and memory resources dedicated to GPFS/SSH/kernel/networking/etc) can help them respond more quickly to quiesce I/O requests. HTH, -Bryan From: gpfsug-discuss-bounces at spectrumscale.org [mailto:gpfsug-discuss-bounces at spectrumscale.org] On Behalf Of Keith Ball Sent: Tuesday, March 27, 2018 5:26 PM To: gpfsug-discuss at spectrumscale.org Subject: [gpfsug-discuss] Control of where mmcrsnapshot runs? Note: External Email ________________________________ Hi All, Two questions on snapshots: 1.) I note that neither "mmcrsnapshot" nor "mmdelsnapshot" appear to have an "-N" option as "PIT" commands typically do. Is there any way to control where threads for snapshot creation/deletion run? (I assume the filesystem manager will always be involved regardless). 2.) When mmdelsnapshot hangs or times out, the error messages tend to appear on client nodes, and not necessarily the node where mmdelsnapshot is run from, not the FS manager. Besides telling all users "don't use any I/O" when runnign these commands, are there ways that folks have found to avoid hangs and timeouts of mmdelsnapshot? FWIW our filesystem manager is probably overextended (replication factor 2 on data+MD, 30 daily snapshots kept, a number of client clusters served, plus the FS manager is also an NSD server). Many Thanks, Keith RedLine Performance Solutions LLC ________________________________ ________________________________ Note: This email is for the confidential use of the named addressee(s) only and may contain proprietary, confidential or privileged information. If you are not the intended recipient, you are hereby notified that any review, dissemination or copying of this email is strictly prohibited, and to please notify the sender immediately and destroy this email and any attachments. Email transmission cannot be guaranteed to be secure or error-free. The Company, therefore, does not make any guarantees as to the completeness or accuracy of this email or any attachments. This email is for informational purposes only and does not constitute a recommendation, offer, request or solicitation of any kind to buy, sell, subscribe, redeem or perform any type of transaction of a financial product. -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From scale at us.ibm.com Fri Mar 30 08:35:33 2018 From: scale at us.ibm.com (IBM Spectrum Scale) Date: Fri, 30 Mar 2018 13:05:33 +0530 Subject: [gpfsug-discuss] AFM-DR Questions In-Reply-To: References: Message-ID: + Venkat to provide answers on AFM queries Regards, The Spectrum Scale (GPFS) team ------------------------------------------------------------------------------------------------------------------ If you feel that your question can benefit other users of Spectrum Scale (GPFS), then please post it to the public IBM developerWroks Forum at https://www.ibm.com/developerworks/community/forums/html/forum?id=11111111-0000-0000-0000-000000000479 . If your query concerns a potential software error in Spectrum Scale (GPFS) and you have an IBM software maintenance contract please contact 1-800-237-5511 in the United States or your local IBM Service Center in other countries. The forum is informally monitored as time permits and should not be used for priority messages to the Spectrum Scale (GPFS) team. From: "Delmar Demarchi" To: gpfsug-discuss at spectrumscale.org Date: 03/29/2018 07:12 PM Subject: [gpfsug-discuss] AFM-DR Questions Sent by: gpfsug-discuss-bounces at spectrumscale.org Hello experts. We have a Scale project with AFM-DR to be implemented and after read the KC documentation, we have some questions about. - Do you know any reason why we changed the Recovery point objective (RPO) snapshots by 15 to 720 minutes in the version 5.0.0 of IBM Spectrum Scale AFM-DR? - Can we use additional Independent Peer-snapshots to reduce the RPO interval (720 minutes) of IBM Spectrum Scale AFM-DR? - In addition to the above question, can we use these snapshots to update the new primary site after a failover occur for the most up to date snapshot? - According to the documentation, we are not able to replicate Dependent filesets, but if these dependents filesets are part of an existing Independent fileset. Do you see any issues/concerns with this? Thank you in advance. Delmar Demarchi .'. (delmard at br.ibm.com)_______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org https://urldefense.proofpoint.com/v2/url?u=http-3A__gpfsug.org_mailman_listinfo_gpfsug-2Ddiscuss&d=DwICAg&c=jf_iaSHvJObTbx-siA1ZOg&r=IbxtjdkPAM2Sbon4Lbbi4w&m=nBTENLroUhlIPgOEVV1rqTmcYxRh7ErhZ7jLWdpprlY&s=V0Xb_-yxttxff7X31CfkaegWKSGc-1ehsXrDpdO5dTI&e= -------------- next part -------------- An HTML attachment was scrubbed... URL: From Robert.Oesterlin at nuance.com Fri Mar 30 14:54:01 2018 From: Robert.Oesterlin at nuance.com (Oesterlin, Robert) Date: Fri, 30 Mar 2018 13:54:01 +0000 Subject: [gpfsug-discuss] Tentative Agenda - SSUG-US Spring Meeting - May 16/17, Cambridge MA Message-ID: Here is the Tentative Agenda for the upcoming SSUG-US meeting. It?s close to final. I do have one (possibly two) spots for customer talks still open. This is a fantastic agenda, and a big thanks to Ulf Troppens at IBM for pulling together all the IBM speakers. 
Register here: https://www.eventbrite.com/e/spectrum-scale-gpfs-user-group-us-spring-2018-meeting-tickets-43662759489?aff=mailinglist

Wednesday, May 16th

 8:30 -  9:00  Registration and Networking
 9:00 -  9:20  Welcome
 9:20 -  9:45  Keynote: Cognitive Computing and Spectrum Scale
 9:45 - 10:10  Spectrum Scale Big Data & Analytics Initiative
10:10 - 10:30  Customer Talk
10:30 - 10:45  Break
10:45 - 11:10  Spectrum Scale Cloud Initiative
11:10 - 11:35  Composable Infrastructure for Technical Computing
11:35 - 11:55  Customer Talk
11:55 - 12:00  Agenda
12:00 - 13:00  Lunch and Networking
13:00 - 13:30  What is new in Spectrum Scale
13:30 - 13:45  What is new in ESS?
13:45 - 14:15  File System Audit Log
14:15 - 14:45  Coffee and Networking
14:45 - 15:15  Lifting the 32 subblock limit
15:15 - 15:35  Customer Talk
15:35 - 16:05  Spectrum Scale CCR Internals
16:05 - 16:20  Break
16:20 - 16:40  Customer Talk
16:40 - 17:25  Field Update
17:25 - 18:15  Meet the Devs - Ask us Anything
Evening Networking Event - TBD

Thursday, May 17th

 8:30 -  9:00  Kaffee und Networking
 9:00 - 10:00  1) Life Science Track  2) System Health, Performance Monitoring & Call Home  3) Policy Engine Best Practices
10:00 - 11:00  1) Life Science Track  2) Big Data & Analytics  3) Multi-cloud with Transparent Cloud Tiering
11:00 - 12:00  1) Life Science Track  2) Cloud Deployments  3) Installation Best Practices
12:00 - 13:00  Lunch and Networking
13:00 - 13:20  Customer Talk
13:20 - 14:10  Network Best Practices
14:10 - 14:30  Customer Talk
14:30 - 15:00  Kaffee und Networking
15:00 - 16:00  Enhancements for CORAL

Bob Oesterlin
Sr Principal Storage Engineer, Nuance

-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From valleru at cbio.mskcc.org  Fri Mar 30 17:15:13 2018
From: valleru at cbio.mskcc.org (valleru at cbio.mskcc.org)
Date: Fri, 30 Mar 2018 12:15:13 -0400
Subject: [gpfsug-discuss] sublocks per block in GPFS 5.0
Message-ID: <68905b2c-8b1a-4a3d-8ded-c5aa56b765aa@Spark>

Hello Everyone,

I am a little bit confused about the number of sub-blocks per block for a 16M block size in GPFS 5.0. The documentation below says that the number of sub-blocks per block is 16K, but "only for Spectrum Scale RAID":
https://developer.ibm.com/storage/2018/01/11/spectrum-scale-variant-sub-blocks/

However, when I created the filesystem without Spectrum Scale RAID, I still see that the number of sub-blocks per block is 1024:

mmlsfs --subblocks-per-full-block
flag                         value     description
------------------- ------------------------ -----------------------------------
 --subblocks-per-full-block  1024      Number of subblocks per full block

So may I know if the number of sub-blocks per block is really 16K, or am I missing something?

Regards,
Lohit

-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From makaplan at us.ibm.com  Fri Mar 30 17:45:41 2018
From: makaplan at us.ibm.com (Marc A Kaplan)
Date: Fri, 30 Mar 2018 11:45:41 -0500
Subject: [gpfsug-discuss] sublocks per block in GPFS 5.0
In-Reply-To: <68905b2c-8b1a-4a3d-8ded-c5aa56b765aa@Spark>
References: <68905b2c-8b1a-4a3d-8ded-c5aa56b765aa@Spark>
Message-ID: 

Apparently, a small mistake in that developerWorks post. I always advise testing of new features on a scratchable system...

Here's what I see on my test system:

# mmcrfs mak -F /vds/mn.nsd -A no -T /mak -B 16M -f 1K -i 1K
Value '1024' for option '-f' is out of range. Valid values are 4096 through 524288.
# mmcrfs mak -F /vds/mn.nsd -A no -T /mak -B 16M -f 4K -i 1K
(runs okay)

# mmlsfs mak
flag                value                    description
------------------- ------------------------ -----------------------------------
 -f                 4096                     Minimum fragment (subblock) size in bytes
 -i                 1024                     Inode size in bytes
 -I                 32768                    Indirect block size in bytes
...
 -B                 16777216                 Block size
...
 -V                 18.00 (5.0.0.0)          File system version
...
 --subblocks-per-full-block 4096             Number of subblocks per full block
...

From: valleru at cbio.mskcc.org
To: gpfsug main discussion list 
Date: 03/30/2018 12:21 PM
Subject: [gpfsug-discuss] sublocks per block in GPFS 5.0
Sent by: gpfsug-discuss-bounces at spectrumscale.org

Hello Everyone,

I am a little bit confused about the number of sub-blocks per block for a 16M block size in GPFS 5.0. The documentation below says that the number of sub-blocks per block is 16K, but "only for Spectrum Scale RAID":
https://developer.ibm.com/storage/2018/01/11/spectrum-scale-variant-sub-blocks/

However, when I created the filesystem without Spectrum Scale RAID, I still see that the number of sub-blocks per block is 1024:

mmlsfs --subblocks-per-full-block
flag                         value     description
------------------- ------------------------ -----------------------------------
 --subblocks-per-full-block  1024      Number of subblocks per full block

So may I know if the number of sub-blocks per block is really 16K, or am I missing something?

Regards,
Lohit
_______________________________________________
gpfsug-discuss mailing list
gpfsug-discuss at spectrumscale.org
https://urldefense.proofpoint.com/v2/url?u=http-3A__gpfsug.org_mailman_listinfo_gpfsug-2Ddiscuss&d=DwICAg&c=jf_iaSHvJObTbx-siA1ZOg&r=cvpnBBH0j41aQy0RPiG2xRL_M8mTc1izuQD3_PmtjZ8&m=HNrrMTazEN37eiIyxj9LWFMt2v1vCWeYuAGeHXXgIN8&s=Q6RUpDte4cePcCa_VU9ClyOvHMwhOWg8H1sRVLv9ocU&e=

-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From valleru at cbio.mskcc.org  Fri Mar 30 18:47:27 2018
From: valleru at cbio.mskcc.org (valleru at cbio.mskcc.org)
Date: Fri, 30 Mar 2018 13:47:27 -0400
Subject: [gpfsug-discuss] sublocks per block in GPFS 5.0
In-Reply-To: 
References: <68905b2c-8b1a-4a3d-8ded-c5aa56b765aa@Spark>
Message-ID: <18518530-0d1f-4937-b2ec-9c16c6c80995@Spark>

Thanks Marc,

I did not know we could explicitly specify the sub-block size when creating a file system; it is nowhere mentioned in "man mmcrfs". Is this a new GPFS 5.0 feature?

Also, I see from "man mmcrfs" that the default sub-block size for 8M and 16M block sizes is 16K:

+-------------------------------+---------------+
| Block size                    | Subblock size |
+-------------------------------+---------------+
| 64 KiB                        | 2 KiB         |
+-------------------------------+---------------+
| 128 KiB                       | 4 KiB         |
+-------------------------------+---------------+
| 256 KiB, 512 KiB, 1 MiB,      | 8 KiB         |
| 2 MiB, 4 MiB                  |               |
+-------------------------------+---------------+
| 8 MiB, 16 MiB                 | 16 KiB        |
+-------------------------------+---------------+

So you can create more than 1024 sub-blocks per block, and 4K is the sub-block size for a 16M block size? That is great, since 4K files will go into the data pool, and anything less than 4K will go to the system (metadata) pool?
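As a quick sanity check on the numbers in this thread, the sub-block count is simply the block size divided by the sub-block size; the shell arithmetic below is purely illustrative:

# Marc's example above: 16 MiB blocks with a 4 KiB subblock (-B 16M -f 4K)
echo $(( 16 * 1024 * 1024 / (4 * 1024) ))     # 4096 subblocks per full block

# The documented default of a 16 KiB subblock for a 16 MiB block explains
# the 1024 reported by mmlsfs earlier in the thread:
echo $(( 16 * 1024 * 1024 / (16 * 1024) ))    # 1024 subblocks per full block

# Pre-5.0 file system formats were fixed at 32 subblocks per block, so a
# 16 MiB block size used to imply a 512 KiB minimum allocation unit:
echo $(( 16 * 1024 * 1024 / 32 ))             # 524288 bytes = 512 KiB
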
Do you think - there would be any performance degradation for reducing the sub-blocks to 4K - 8K, from the default 16K for 16M filesystem? If we are not loosing any blocks by choosing a bigger block-size (16M) for filesystem, why would we want to choose a smaller block-size for filesystem (4M)? What advantage would smaller block-size (4M) give, compared to 16M with performance since 16M filesystem could store small files and read small files too at the respective sizes? And Near Line Rotating disks would be happy with bigger block-size than smaller block-size i guess? Regards, Lohit On Mar 30, 2018, 12:45 PM -0400, Marc A Kaplan , wrote: > > subblock -------------- next part -------------- An HTML attachment was scrubbed... URL: From makaplan at us.ibm.com Fri Mar 30 19:47:47 2018 From: makaplan at us.ibm.com (Marc A Kaplan) Date: Fri, 30 Mar 2018 13:47:47 -0500 Subject: [gpfsug-discuss] sublocks per block in GPFS 5.0 In-Reply-To: <18518530-0d1f-4937-b2ec-9c16c6c80995@Spark> References: <68905b2c-8b1a-4a3d-8ded-c5aa56b765aa@Spark> <18518530-0d1f-4937-b2ec-9c16c6c80995@Spark> Message-ID: Look at my example, again, closely. I chose the blocksize as 16M and subblock size as 4K and the inodesize as 1K.... Developer works is a good resource, but articles you read there may be incomplete or contain mistakes. The official IBM Spectrum Scale cmd and admin guide documents, are "trustworthy" but may not be perfect in all respects. "Trust but Verify" and YMMV. ;-) As for why/how to choose "good sizes", that depends what objectives you want to achieve, and "optimal" may depend on what hardware you are running. Run your own trials and/or ask performance experts. There are usually "tradeoffs" and OTOH when you get down to it, some choices may not be all-that-important in actual deployment and usage. That's why we have defaults values - try those first and leave the details and tweaking aside until you have good reason ;-) -------------- next part -------------- An HTML attachment was scrubbed... URL: From chair at spectrumscale.org Thu Mar 1 11:26:12 2018 From: chair at spectrumscale.org (Simon Thompson) Date: Thu, 01 Mar 2018 11:26:12 +0000 Subject: [gpfsug-discuss] UK April meeting Message-ID: <26357FF0-F04B-4A37-A8A5-062CB0160D19@spectrumscale.org> Hi All, We?ve just posted the draft agenda for the UK meeting in April at: http://www.spectrumscaleug.org/event/uk-2018-user-group-event/ So far, we?ve issued over 50% of the available places, so if you are planning to attend, please do register now! Please register at: https://www.eventbrite.com/e/spectrum-scale-gpfs-user-group-2018-registration-41489952565?aff=MailingList We?ve also confirmed our evening networking/social event between days 1 and 2 with thanks to our sponsors for supporting this. Please remember that we are currently limiting to two registrations per organisation. We?d like to thank our sponsors from DDN, E8, Ellexus, IBM, Lenovo, NEC and OCF for supporting the event. Simon -------------- next part -------------- An HTML attachment was scrubbed... URL: From john.hearns at asml.com Thu Mar 1 08:41:59 2018 From: john.hearns at asml.com (John Hearns) Date: Thu, 1 Mar 2018 08:41:59 +0000 Subject: [gpfsug-discuss] Problems with remote mount via routed IB In-Reply-To: References: <471B111F-5DAA-4912-829C-9AA75DCB76FA@kit.edu> Message-ID: In reply to Stuart, our setup is entirely Infiniband. We boot and install over IB, and rely heavily on IP over Infiniband. 
As for users being 'confused' due to multiple IPs, I would appreciate some more depth on that one. Sure, all batch systems are sensitive to hostnames (as I know to my cost!) but once you get that straightened out why should users care? I am not being aggressive, just keen to find out more. -----Original Message----- From: gpfsug-discuss-bounces at spectrumscale.org [mailto:gpfsug-discuss-bounces at spectrumscale.org] On Behalf Of Stuart Barkley Sent: Wednesday, February 28, 2018 6:50 PM To: gpfsug main discussion list Subject: Re: [gpfsug-discuss] Problems with remote mount via routed IB The problem with CM is that it seems to require configuring IP over Infiniband. I'm rather strongly opposed to IP over IB. We did run IPoIB years ago, but pulled it out of our environment as adding unneeded complexity. It requires provisioning IP addresses across the Infiniband infrastructure and possibly adding routers to other portions of the IP infrastructure. It was also confusing some users due to multiple IPs on the compute infrastructure. We have recently been in discussions with a vendor about their support for GPFS over IB and they kept directing us to using CM (which still didn't work). CM wasn't necessary once we found out about the actual problem (we needed the undocumented verbsRdmaUseGidIndexZero configuration option among other things due to their use of SR-IOV based virtual IB interfaces). We don't use routed Infiniband and it might be that CM and IPoIB is required for IB routing, but I doubt it. It sounds like the OP is keeping IB and IP infrastructure separate. Stuart Barkley On Mon, 26 Feb 2018 at 14:16 -0000, Aaron Knister wrote: > Date: Mon, 26 Feb 2018 14:16:34 > From: Aaron Knister > Reply-To: gpfsug main discussion list > > To: gpfsug-discuss at spectrumscale.org > Subject: Re: [gpfsug-discuss] Problems with remote mount via routed IB > > Hi Jan Erik, > > It was my understanding that the IB hardware router required RDMA CM to work. > By default GPFS doesn't use the RDMA Connection Manager but it can be > enabled (e.g. verbsRdmaCm=enable). I think this requires a restart on > clients/servers (in both clusters) to take effect. Maybe someone else > on the list can comment in more detail-- I've been told folks have > successfully deployed IB routers with GPFS. > > -Aaron > > On 2/26/18 11:38 AM, Sundermann, Jan Erik (SCC) wrote: > > > > Dear all > > > > we are currently trying to remote mount a file system in a routed > > Infiniband test setup and face problems with dropped RDMA > > connections. The setup is the > > following: > > > > - Spectrum Scale Cluster 1 is setup on four servers which are > > connected to the same infiniband network. Additionally they are > > connected to a fast ethernet providing ip communication in the network 192.168.11.0/24. > > > > - Spectrum Scale Cluster 2 is setup on four additional servers which > > are connected to a second infiniband network. These servers have IPs > > on their IB interfaces in the network 192.168.12.0/24. > > > > - IP is routed between 192.168.11.0/24 and 192.168.12.0/24 on a > > dedicated machine. > > > > - We have a dedicated IB hardware router connected to both IB subnets. > > > > > > We tested that the routing, both IP and IB, is working between the > > two clusters without problems and that RDMA is working fine both for > > internal communication inside cluster 1 and cluster 2 > > > > When trying to remote mount a file system from cluster 1 in cluster > > 2, RDMA communication is not working as expected. 
Instead we see > > error messages on the remote host (cluster 2) > > > > > > 2018-02-23_13:48:47.037+0100: [I] VERBS RDMA connecting to > > 192.168.11.4 (iccn004-gpfs in gpfsstorage.localdomain) on mlx4_0 > > port 1 fabnum 0 index 2 > > 2018-02-23_13:48:49.890+0100: [I] VERBS RDMA connected to > > 192.168.11.4 (iccn004-gpfs in gpfsstorage.localdomain) on mlx4_0 > > port 1 fabnum 0 sl 0 index 2 > > 2018-02-23_13:48:53.138+0100: [E] VERBS RDMA closed connection to > > 192.168.11.1 (iccn001-gpfs in gpfsstorage.localdomain) on mlx4_0 > > port 1 fabnum 0 error 733 index 3 > > 2018-02-23_13:48:53.854+0100: [I] VERBS RDMA connecting to > > 192.168.11.1 (iccn001-gpfs in gpfsstorage.localdomain) on mlx4_0 > > port 1 fabnum 0 index 3 > > 2018-02-23_13:48:54.954+0100: [E] VERBS RDMA closed connection to > > 192.168.11.3 (iccn003-gpfs in gpfsstorage.localdomain) on mlx4_0 > > port 1 fabnum 0 error 733 index 1 > > 2018-02-23_13:48:55.601+0100: [I] VERBS RDMA connected to > > 192.168.11.1 (iccn001-gpfs in gpfsstorage.localdomain) on mlx4_0 > > port 1 fabnum 0 sl 0 index 3 > > 2018-02-23_13:48:57.775+0100: [I] VERBS RDMA connecting to > > 192.168.11.3 (iccn003-gpfs in gpfsstorage.localdomain) on mlx4_0 > > port 1 fabnum 0 index 1 > > 2018-02-23_13:48:59.557+0100: [I] VERBS RDMA connected to > > 192.168.11.3 (iccn003-gpfs in gpfsstorage.localdomain) on mlx4_0 > > port 1 fabnum 0 sl 0 index 1 > > 2018-02-23_13:48:59.876+0100: [E] VERBS RDMA closed connection to > > 192.168.11.2 (iccn002-gpfs in gpfsstorage.localdomain) on mlx4_0 > > port 1 fabnum 0 error 733 index 0 > > 2018-02-23_13:49:02.020+0100: [I] VERBS RDMA connecting to > > 192.168.11.2 (iccn002-gpfs in gpfsstorage.localdomain) on mlx4_0 > > port 1 fabnum 0 index 0 > > 2018-02-23_13:49:03.477+0100: [I] VERBS RDMA connected to > > 192.168.11.2 (iccn002-gpfs in gpfsstorage.localdomain) on mlx4_0 > > port 1 fabnum 0 sl 0 index 0 > > 2018-02-23_13:49:05.119+0100: [E] VERBS RDMA closed connection to > > 192.168.11.4 (iccn004-gpfs in gpfsstorage.localdomain) on mlx4_0 > > port 1 fabnum 0 error 733 index 2 > > 2018-02-23_13:49:06.191+0100: [I] VERBS RDMA connecting to > > 192.168.11.4 (iccn004-gpfs in gpfsstorage.localdomain) on mlx4_0 > > port 1 fabnum 0 index 2 > > 2018-02-23_13:49:06.548+0100: [I] VERBS RDMA connected to > > 192.168.11.4 (iccn004-gpfs in gpfsstorage.localdomain) on mlx4_0 > > port 1 fabnum 0 sl 0 index 2 > > 2018-02-23_13:49:11.578+0100: [E] VERBS RDMA closed connection to > > 192.168.11.1 (iccn001-gpfs in gpfsstorage.localdomain) on mlx4_0 > > port 1 fabnum 0 error 733 index 3 > > 2018-02-23_13:49:11.937+0100: [I] VERBS RDMA connecting to > > 192.168.11.1 (iccn001-gpfs in gpfsstorage.localdomain) on mlx4_0 > > port 1 fabnum 0 index 3 > > 2018-02-23_13:49:11.939+0100: [I] VERBS RDMA connected to > > 192.168.11.1 (iccn001-gpfs in gpfsstorage.localdomain) on mlx4_0 > > port 1 fabnum 0 sl 0 index 3 > > > > > > and in the cluster with the file system (cluster 1) > > > > 2018-02-23_13:47:36.112+0100: [E] VERBS RDMA rdma read error > > IBV_WC_RETRY_EXC_ERR to 192.168.12.5 (iccn005-ib in > > gpfsremoteclients.localdomain) on mlx4_0 port 1 fabnum 0 vendor_err > > 129 > > 2018-02-23_13:47:36.112+0100: [E] VERBS RDMA closed connection to > > 192.168.12.5 (iccn005-ib in gpfsremoteclients.localdomain) on mlx4_0 > > port 1 fabnum 0 due to RDMA read error IBV_WC_RETRY_EXC_ERR index 3 > > 2018-02-23_13:47:47.161+0100: [I] VERBS RDMA accepted and connected > > to > > 192.168.12.5 (iccn005-ib in gpfsremoteclients.localdomain) on mlx4_0 > > port 1 
fabnum 0 sl 0 index 3 > > 2018-02-23_13:48:04.317+0100: [E] VERBS RDMA rdma read error > > IBV_WC_RETRY_EXC_ERR to 192.168.12.5 (iccn005-ib in > > gpfsremoteclients.localdomain) on mlx4_0 port 1 fabnum 0 vendor_err > > 129 > > 2018-02-23_13:48:04.317+0100: [E] VERBS RDMA closed connection to > > 192.168.12.5 (iccn005-ib in gpfsremoteclients.localdomain) on mlx4_0 > > port 1 fabnum 0 due to RDMA read error IBV_WC_RETRY_EXC_ERR index 3 > > 2018-02-23_13:48:11.560+0100: [I] VERBS RDMA accepted and connected > > to > > 192.168.12.5 (iccn005-ib in gpfsremoteclients.localdomain) on mlx4_0 > > port 1 fabnum 0 sl 0 index 3 > > 2018-02-23_13:48:32.523+0100: [E] VERBS RDMA rdma read error > > IBV_WC_RETRY_EXC_ERR to 192.168.12.5 (iccn005-ib in > > gpfsremoteclients.localdomain) on mlx4_0 port 1 fabnum 0 vendor_err > > 129 > > 2018-02-23_13:48:32.523+0100: [E] VERBS RDMA closed connection to > > 192.168.12.5 (iccn005-ib in gpfsremoteclients.localdomain) on mlx4_0 > > port 1 fabnum 0 due to RDMA read error IBV_WC_RETRY_EXC_ERR index 3 > > 2018-02-23_13:48:35.398+0100: [I] VERBS RDMA accepted and connected > > to > > 192.168.12.5 (iccn005-ib in gpfsremoteclients.localdomain) on mlx4_0 > > port 1 fabnum 0 sl 0 index 3 > > 2018-02-23_13:48:53.135+0100: [E] VERBS RDMA rdma read error > > IBV_WC_RETRY_EXC_ERR to 192.168.12.5 (iccn005-ib in > > gpfsremoteclients.localdomain) on mlx4_0 port 1 fabnum 0 vendor_err > > 129 > > 2018-02-23_13:48:53.135+0100: [E] VERBS RDMA closed connection to > > 192.168.12.5 (iccn005-ib in gpfsremoteclients.localdomain) on mlx4_0 > > port 1 fabnum 0 due to RDMA read error IBV_WC_RETRY_EXC_ERR index 3 > > 2018-02-23_13:48:55.600+0100: [I] VERBS RDMA accepted and connected > > to > > 192.168.12.5 (iccn005-ib in gpfsremoteclients.localdomain) on mlx4_0 > > port 1 fabnum 0 sl 0 index 3 > > 2018-02-23_13:49:11.577+0100: [E] VERBS RDMA rdma read error > > IBV_WC_RETRY_EXC_ERR to 192.168.12.5 (iccn005-ib in > > gpfsremoteclients.localdomain) on mlx4_0 port 1 fabnum 0 vendor_err > > 129 > > 2018-02-23_13:49:11.577+0100: [E] VERBS RDMA closed connection to > > 192.168.12.5 (iccn005-ib in gpfsremoteclients.localdomain) on mlx4_0 > > port 1 fabnum 0 due to RDMA read error IBV_WC_RETRY_EXC_ERR index 3 > > 2018-02-23_13:49:11.939+0100: [I] VERBS RDMA accepted and connected > > to > > 192.168.12.5 (iccn005-ib in gpfsremoteclients.localdomain) on mlx4_0 > > port 1 fabnum 0 sl 0 index 3 > > > > > > > > Any advice on how to configure the setup in a way that would allow > > the remote mount via routed IB would be very appreciated. 
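For what it is worth, the configuration knobs mentioned earlier in this thread translate into commands along the following lines. This is only a sketch: the value for verbsRdmaUseGidIndexZero is assumed rather than confirmed here, and GPFS must be restarted on the affected nodes in both clusters for either change to take effect.

# Enable the RDMA connection manager (suggested by Aaron for the IB router case):
mmchconfig verbsRdmaCm=enable

# The undocumented option Stuart needed for SR-IOV virtual IB interfaces
# (value assumed):
mmchconfig verbsRdmaUseGidIndexZero=yes

# After restarting the daemons, check what was actually picked up:
mmdiag --config | grep -i verbs
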
> > > > > > Thank you and best regards > > Jan Erik > > > > > > > > > > _______________________________________________ > > gpfsug-discuss mailing list > > gpfsug-discuss at spectrumscale.org > > https://emea01.safelinks.protection.outlook.com/?url=http%3A%2F%2Fgp > > fsug.org%2Fmailman%2Flistinfo%2Fgpfsug-discuss&data=01%7C01%7Cjohn.h > > earns%40asml.com%7Ce40045550fc3467dd62808d57ed4d0d9%7Caf73baa8f5944e > > b2a39d93e96cad61fc%7C1&sdata=v%2F35G6ZnlHFBm%2BfVddvcuraFd9FRChyOSRE > > YpqcNNP8%3D&reserved=0 > > > > -- > Aaron Knister > NASA Center for Climate Simulation (Code 606.2) Goddard Space Flight > Center > (301) 286-2776 > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > https://emea01.safelinks.protection.outlook.com/?url=http%3A%2F%2Fgpfs > ug.org%2Fmailman%2Flistinfo%2Fgpfsug-discuss&data=01%7C01%7Cjohn.hearn > s%40asml.com%7Ce40045550fc3467dd62808d57ed4d0d9%7Caf73baa8f5944eb2a39d > 93e96cad61fc%7C1&sdata=v%2F35G6ZnlHFBm%2BfVddvcuraFd9FRChyOSREYpqcNNP8 > %3D&reserved=0 > -- I've never been lost; I was once bewildered for three days, but never lost! -- Daniel Boone _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org https://emea01.safelinks.protection.outlook.com/?url=http%3A%2F%2Fgpfsug.org%2Fmailman%2Flistinfo%2Fgpfsug-discuss&data=01%7C01%7Cjohn.hearns%40asml.com%7Ce40045550fc3467dd62808d57ed4d0d9%7Caf73baa8f5944eb2a39d93e96cad61fc%7C1&sdata=v%2F35G6ZnlHFBm%2BfVddvcuraFd9FRChyOSREYpqcNNP8%3D&reserved=0 -- The information contained in this communication and any attachments is confidential and may be privileged, and is for the sole use of the intended recipient(s). Any unauthorized review, use, disclosure or distribution is prohibited. Unless explicitly stated otherwise in the body of this communication or the attachment thereto (if any), the information is provided on an AS-IS basis without any express or implied warranties or liabilities. To the extent you are relying on this information, you are doing so at your own risk. If you are not the intended recipient, please notify the sender immediately by replying to this message and destroy all copies of this message and any attachments. Neither the sender nor the company/group of companies he or she represents shall be liable for the proper and complete transmission of the information contained in this communication, or for any delay in its receipt. From lavila at illinois.edu Thu Mar 1 15:02:24 2018 From: lavila at illinois.edu (Avila-Diaz, Leandro) Date: Thu, 1 Mar 2018 15:02:24 +0000 Subject: [gpfsug-discuss] Meltdown, Spectre, and impacts on GPFS In-Reply-To: References: <5D655862-7F60-47F6-8BD2-A5298F73F70F@vanderbilt.edu> Message-ID: Good morning, Does anyone know if IBM has an official statement and/or perhaps a FAQ document about the Spectre/Meltdown impact on GPFS? Thank you From: on behalf of IBM Spectrum Scale Reply-To: gpfsug main discussion list Date: Thursday, January 4, 2018 at 20:36 To: gpfsug main discussion list Subject: Re: [gpfsug-discuss] Meltdown, Spectre, and impacts on GPFS Kevin, The team is aware of Meltdown and Spectre. Due to the late availability of production-ready test patches (they became available today) we started today working on evaluating the impact of applying these patches. The focus would be both on any potential functional impacts (especially to the kernel modules shipped with GPFS) and on the performance degradation which affects user/kernel mode transitions. 
Performance characterization will be complex, as some system calls which may get invoked often by the mmfsd daemon will suddenly become significantly more expensive because of the kernel changes. Depending on the main areas affected, code changes might be possible to alleviate the impact, by reducing frequency of certain calls, etc. Any such changes will be deployed over time. At this point, we can't say what impact this will have on stability or Performance on systems running GPFS ? until IBM issues an official statement on this topic. We hope to have some basic answers soon. Regards, The Spectrum Scale (GPFS) team ------------------------------------------------------------------------------------------------------------------ If you feel that your question can benefit other users of Spectrum Scale (GPFS), then please post it to the public IBM developerWroks Forum at https://www.ibm.com/developerworks/community/forums/html/forum?id=11111111-0000-0000-0000-000000000479. If your query concerns a potential software error in Spectrum Scale (GPFS) and you have an IBM software maintenance contract please contact 1-800-237-5511 in the United States or your local IBM Service Center in other countries. The forum is informally monitored as time permits and should not be used for priority messages to the Spectrum Scale (GPFS) team. [Inactive hide details for "Buterbaugh, Kevin L" ---01/04/2018 01:11:59 PM---Happy New Year everyone, I?m sure that everyone is]"Buterbaugh, Kevin L" ---01/04/2018 01:11:59 PM---Happy New Year everyone, I?m sure that everyone is aware of Meltdown and Spectre by now ? we, like m From: "Buterbaugh, Kevin L" To: gpfsug main discussion list Date: 01/04/2018 01:11 PM Subject: [gpfsug-discuss] Meltdown, Spectre, and impacts on GPFS Sent by: gpfsug-discuss-bounces at spectrumscale.org ________________________________ Happy New Year everyone, I?m sure that everyone is aware of Meltdown and Spectre by now ? we, like many other institutions, will be patching for it at the earliest possible opportunity. Our understanding is that the most serious of the negative performance impacts of these patches will be for things like I/O (disk / network) ? given that, we are curious if IBM has any plans for a GPFS update that could help mitigate those impacts? Or is there simply nothing that can be done? If there is a GPFS update planned for this we?d be interested in knowing so that we could coordinate the kernel and GPFS upgrades on our cluster. Thanks? Kevin P.S. The ?Happy New Year? wasn?t intended as sarcasm ? I hope it is a good year for everyone despite how it?s starting out. :-O ? Kevin Buterbaugh - Senior System Administrator Vanderbilt University - Advanced Computing Center for Research and Education Kevin.Buterbaugh at vanderbilt.edu - (615)875-9633 _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org https://urldefense.proofpoint.com/v2/url?u=http-3A__gpfsug.org_mailman_listinfo_gpfsug-2Ddiscuss&d=DwICAg&c=jf_iaSHvJObTbx-siA1ZOg&r=IbxtjdkPAM2Sbon4Lbbi4w&m=m7Pdt9KL82CJT_AT-PwkmO3PbHg88-IQ7Jq-dwhDOdY&s=5i66Rx3vse5LcaN4-WlyCwi_TDTOQGQR2-X_XyjbBpw&e= -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... 
Name: image001.gif Type: image/gif Size: 106 bytes Desc: image001.gif URL: From bzhang at ca.ibm.com Thu Mar 1 22:47:57 2018 From: bzhang at ca.ibm.com (Bohai Zhang) Date: Thu, 1 Mar 2018 17:47:57 -0500 Subject: [gpfsug-discuss] Spectrum Scale Support Webinar - File Audit Logging Message-ID: You are receiving this message because you are an IBM Spectrum Scale Client and in GPFS User Group. IBM Spectrum Scale Support Webinar File Audit Logging About this Webinar IBM Spectrum Scale Webinars are hosted by IBM Spectrum Scale Support to share expertise and knowledge of the Spectrum Scale product, as well as product updates and best practices based on various use cases. This webinar will discuss fundamentals of the new File Audit Logging function including configuration and key best practices that will aid you in successful deployment and use of File Audit Logging within Spectrum Scale. Please note that our webinars are free of charge and will be held online via WebEx. Agenda: ? Overview of File Audit Logging ? Installation and deployment of File Audit Logging ? Using File Audit Logging ? Monitoring and troubleshooting File Audit Logging ? Q&A NA/EU Session Date: March 14, 2018 Time: 11 AM ? 12PM EDT (4PM GMT) Registration: https://ibm.biz/BdZsZz Audience: Spectrum Scale Administrators AP/JP Session Date: March 15, 2018 Time: 10AM ? 11AM Beijing Time (11AM Tokyo Time) Registration: https://ibm.biz/BdZsZf Audience: Spectrum Scale Administrators If you have any questions, please contact Robert Simon, Jun Hui Bu, Vlad Spoiala and Bohai Zhang. Regards, IBM Spectrum Scale Support Team Regards, IBM Spectrum Computing Bohai Zhang Critical Senior Technical Leader, IBM Systems Situation Tel: 1-905-316-2727 Resolver Mobile: 1-416-897-7488 Expert Badge Email: bzhang at ca.ibm.com 3600 STEELES AVE EAST, MARKHAM, ON, L3R 9Z7, Canada Live Chat at IBMStorageSuptMobile Apps Support Portal | Fix Central | Knowledge Center | Request for Enhancement | Product SMC IBM | dWA We meet our service commitment only when you are very satisfied and EXTREMELY LIKELY to recommend IBM. -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: 71100731.jpg Type: image/jpeg Size: 21904 bytes Desc: not available URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: 71151195.jpg Type: image/jpeg Size: 17787 bytes Desc: not available URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: 71943442.gif Type: image/gif Size: 2665 bytes Desc: not available URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: ecblank.gif Type: image/gif Size: 45 bytes Desc: not available URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: 71224521.gif Type: image/gif Size: 275 bytes Desc: not available URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: 71350284.gif Type: image/gif Size: 305 bytes Desc: not available URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: 71371859.gif Type: image/gif Size: 331 bytes Desc: not available URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: 71584384.gif Type: image/gif Size: 3621 bytes Desc: not available URL: -------------- next part -------------- A non-text attachment was scrubbed... 
Name: 71592777.gif Type: image/gif Size: 1243 bytes Desc: not available URL: From Greg.Lehmann at csiro.au Fri Mar 2 03:48:44 2018 From: Greg.Lehmann at csiro.au (Greg.Lehmann at csiro.au) Date: Fri, 2 Mar 2018 03:48:44 +0000 Subject: [gpfsug-discuss] wondering about outage free protocols upgrades Message-ID: <5912dbafbb044700ba723195a0e5e2f9@exch1-cdc.nexus.csiro.au> Hi All, It appears a rolling node by node upgrade of a protocols cluster is not possible. Ctdb is the sticking point as it won't run with 2 different versions at the same time. Are there any plans to address this and make it a real Enterprise product? Cheers, Greg -------------- next part -------------- An HTML attachment was scrubbed... URL: From scale at us.ibm.com Fri Mar 2 05:15:21 2018 From: scale at us.ibm.com (IBM Spectrum Scale) Date: Fri, 2 Mar 2018 13:15:21 +0800 Subject: [gpfsug-discuss] Meltdown, Spectre, and impacts on GPFS In-Reply-To: References: <5D655862-7F60-47F6-8BD2-A5298F73F70F@vanderbilt.edu> Message-ID: Hi, The verification/test work is still ongoing. Hopefully GPFS will publish statement soon. I think it would be available through several channels, such as FAQ. Regards, The Spectrum Scale (GPFS) team ------------------------------------------------------------------------------------------------------------------ If you feel that your question can benefit other users of Spectrum Scale (GPFS), then please post it to the public IBM developerWroks Forum at https://www.ibm.com/developerworks/community/forums/html/forum?id=11111111-0000-0000-0000-000000000479. If your query concerns a potential software error in Spectrum Scale (GPFS) and you have an IBM software maintenance contract please contact 1-800-237-5511 in the United States or your local IBM Service Center in other countries. The forum is informally monitored as time permits and should not be used for priority messages to the Spectrum Scale (GPFS) team. From: "Avila-Diaz, Leandro" To: gpfsug main discussion list Date: 03/01/2018 11:17 PM Subject: Re: [gpfsug-discuss] Meltdown, Spectre, and impacts on GPFS Sent by: gpfsug-discuss-bounces at spectrumscale.org Good morning, Does anyone know if IBM has an official statement and/or perhaps a FAQ document about the Spectre/Meltdown impact on GPFS? Thank you From: on behalf of IBM Spectrum Scale Reply-To: gpfsug main discussion list Date: Thursday, January 4, 2018 at 20:36 To: gpfsug main discussion list Subject: Re: [gpfsug-discuss] Meltdown, Spectre, and impacts on GPFS Kevin, The team is aware of Meltdown and Spectre. Due to the late availability of production-ready test patches (they became available today) we started today working on evaluating the impact of applying these patches. The focus would be both on any potential functional impacts (especially to the kernel modules shipped with GPFS) and on the performance degradation which affects user/kernel mode transitions. Performance characterization will be complex, as some system calls which may get invoked often by the mmfsd daemon will suddenly become significantly more expensive because of the kernel changes. Depending on the main areas affected, code changes might be possible to alleviate the impact, by reducing frequency of certain calls, etc. Any such changes will be deployed over time. At this point, we can't say what impact this will have on stability or Performance on systems running GPFS ? until IBM issues an official statement on this topic. We hope to have some basic answers soon. 
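In the meantime, one quick way to see whether a given node's kernel already carries the mitigations is the sysfs interface added by the January 2018 kernel updates; the example output below is only illustrative:

# Each file reports the mitigation state for one issue on this node.
grep . /sys/devices/system/cpu/vulnerabilities/* 2>/dev/null

# Typical (illustrative) output on a patched RHEL 7 node:
# /sys/devices/system/cpu/vulnerabilities/meltdown:Mitigation: PTI
# /sys/devices/system/cpu/vulnerabilities/spectre_v1:Mitigation: Load fences
# /sys/devices/system/cpu/vulnerabilities/spectre_v2:Mitigation: IBRS
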
Regards, The Spectrum Scale (GPFS) team ------------------------------------------------------------------------------------------------------------------ If you feel that your question can benefit other users of Spectrum Scale (GPFS), then please post it to the public IBM developerWroks Forum at https://www.ibm.com/developerworks/community/forums/html/forum?id=11111111-0000-0000-0000-000000000479 . If your query concerns a potential software error in Spectrum Scale (GPFS) and you have an IBM software maintenance contract please contact 1-800-237-5511 in the United States or your local IBM Service Center in other countries. The forum is informally monitored as time permits and should not be used for priority messages to the Spectrum Scale (GPFS) team. Inactive hide details for "Buterbaugh, Kevin L" ---01/04/2018 01:11:59 PM---Happy New Year everyone, I?m sure that everyone is"Buterbaugh, Kevin L" ---01/04/2018 01:11:59 PM---Happy New Year everyone, I?m sure that everyone is aware of Meltdown and Spectre by now ? we, like m From: "Buterbaugh, Kevin L" To: gpfsug main discussion list Date: 01/04/2018 01:11 PM Subject: [gpfsug-discuss] Meltdown, Spectre, and impacts on GPFS Sent by: gpfsug-discuss-bounces at spectrumscale.org Happy New Year everyone, I?m sure that everyone is aware of Meltdown and Spectre by now ? we, like many other institutions, will be patching for it at the earliest possible opportunity. Our understanding is that the most serious of the negative performance impacts of these patches will be for things like I/O (disk / network) ? given that, we are curious if IBM has any plans for a GPFS update that could help mitigate those impacts? Or is there simply nothing that can be done? If there is a GPFS update planned for this we?d be interested in knowing so that we could coordinate the kernel and GPFS upgrades on our cluster. Thanks? Kevin P.S. The ?Happy New Year? wasn?t intended as sarcasm ? I hope it is a good year for everyone despite how it?s starting out. :-O ? Kevin Buterbaugh - Senior System Administrator Vanderbilt University - Advanced Computing Center for Research and Education Kevin.Buterbaugh at vanderbilt.edu - (615)875-9633 _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org https://urldefense.proofpoint.com/v2/url?u=http-3A__gpfsug.org_mailman_listinfo_gpfsug-2Ddiscuss&d=DwICAg&c=jf_iaSHvJObTbx-siA1ZOg&r=IbxtjdkPAM2Sbon4Lbbi4w&m=m7Pdt9KL82CJT_AT-PwkmO3PbHg88-IQ7Jq-dwhDOdY&s=5i66Rx3vse5LcaN4-WlyCwi_TDTOQGQR2-X_XyjbBpw&e= _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org https://urldefense.proofpoint.com/v2/url?u=http-3A__gpfsug.org_mailman_listinfo_gpfsug-2Ddiscuss&d=DwICAg&c=jf_iaSHvJObTbx-siA1ZOg&r=IbxtjdkPAM2Sbon4Lbbi4w&m=qFtjLJBRsEewfEfVZBW__Xk8CD9w04bJZpK0sJiCze0&s=LyDrwavwKGQHDl4DVW6-vpW2bjmJBtXrGGcFfDYyI4o&e= -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: graycol.gif Type: image/gif Size: 105 bytes Desc: not available URL: -------------- next part -------------- A non-text attachment was scrubbed... 
Name: 19119307.gif Type: image/gif Size: 106 bytes Desc: not available URL: From Kevin.Buterbaugh at Vanderbilt.Edu Fri Mar 2 16:33:46 2018 From: Kevin.Buterbaugh at Vanderbilt.Edu (Buterbaugh, Kevin L) Date: Fri, 2 Mar 2018 16:33:46 +0000 Subject: [gpfsug-discuss] Meltdown, Spectre, and impacts on GPFS In-Reply-To: References: <5D655862-7F60-47F6-8BD2-A5298F73F70F@vanderbilt.edu> Message-ID: <6BBDFC67-D61F-4477-BF8A-1551925AF955@vanderbilt.edu> Hi Leandro, I think the silence in response to your question says a lot, don?t you? :-O IBM has said (on this list, I believe) that the Meltdown / Spectre patches do not impact GPFS functionality. They?ve been silent as to performance impacts, which can and will be taken various ways. In the absence of information from IBM, the approach we have chosen to take is to patch everything except our GPFS servers ? only we (the SysAdmins, oh, and the NSA, of course!) can log in to them, so we feel that the risk of not patching them is minimal. HTHAL? Kevin On Mar 1, 2018, at 9:02 AM, Avila-Diaz, Leandro > wrote: Good morning, Does anyone know if IBM has an official statement and/or perhaps a FAQ document about the Spectre/Meltdown impact on GPFS? Thank you From: > on behalf of IBM Spectrum Scale > Reply-To: gpfsug main discussion list > Date: Thursday, January 4, 2018 at 20:36 To: gpfsug main discussion list > Subject: Re: [gpfsug-discuss] Meltdown, Spectre, and impacts on GPFS Kevin, The team is aware of Meltdown and Spectre. Due to the late availability of production-ready test patches (they became available today) we started today working on evaluating the impact of applying these patches. The focus would be both on any potential functional impacts (especially to the kernel modules shipped with GPFS) and on the performance degradation which affects user/kernel mode transitions. Performance characterization will be complex, as some system calls which may get invoked often by the mmfsd daemon will suddenly become significantly more expensive because of the kernel changes. Depending on the main areas affected, code changes might be possible to alleviate the impact, by reducing frequency of certain calls, etc. Any such changes will be deployed over time. At this point, we can't say what impact this will have on stability or Performance on systems running GPFS ? until IBM issues an official statement on this topic. We hope to have some basic answers soon. Regards, The Spectrum Scale (GPFS) team ------------------------------------------------------------------------------------------------------------------ If you feel that your question can benefit other users of Spectrum Scale (GPFS), then please post it to the public IBM developerWroks Forum athttps://www.ibm.com/developerworks/community/forums/html/forum?id=11111111-0000-0000-0000-000000000479. If your query concerns a potential software error in Spectrum Scale (GPFS) and you have an IBM software maintenance contract please contact 1-800-237-5511 in the United States or your local IBM Service Center in other countries. The forum is informally monitored as time permits and should not be used for priority messages to the Spectrum Scale (GPFS) team. "Buterbaugh, Kevin L" ---01/04/2018 01:11:59 PM---Happy New Year everyone, I?m sure that everyone is aware of Meltdown and Spectre by now ? 
we, like m From: "Buterbaugh, Kevin L" > To: gpfsug main discussion list > Date: 01/04/2018 01:11 PM Subject: [gpfsug-discuss] Meltdown, Spectre, and impacts on GPFS Sent by: gpfsug-discuss-bounces at spectrumscale.org ________________________________ Happy New Year everyone, I?m sure that everyone is aware of Meltdown and Spectre by now ? we, like many other institutions, will be patching for it at the earliest possible opportunity. Our understanding is that the most serious of the negative performance impacts of these patches will be for things like I/O (disk / network) ? given that, we are curious if IBM has any plans for a GPFS update that could help mitigate those impacts? Or is there simply nothing that can be done? If there is a GPFS update planned for this we?d be interested in knowing so that we could coordinate the kernel and GPFS upgrades on our cluster. Thanks? Kevin P.S. The ?Happy New Year? wasn?t intended as sarcasm ? I hope it is a good year for everyone despite how it?s starting out. :-O ? Kevin Buterbaugh - Senior System Administrator Vanderbilt University - Advanced Computing Center for Research and Education Kevin.Buterbaugh at vanderbilt.edu - (615)875-9633 _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org https://urldefense.proofpoint.com/v2/url?u=http-3A__gpfsug.org_mailman_listinfo_gpfsug-2Ddiscuss&d=DwICAg&c=jf_iaSHvJObTbx-siA1ZOg&r=IbxtjdkPAM2Sbon4Lbbi4w&m=m7Pdt9KL82CJT_AT-PwkmO3PbHg88-IQ7Jq-dwhDOdY&s=5i66Rx3vse5LcaN4-WlyCwi_TDTOQGQR2-X_XyjbBpw&e= _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org https://na01.safelinks.protection.outlook.com/?url=http%3A%2F%2Fgpfsug.org%2Fmailman%2Flistinfo%2Fgpfsug-discuss&data=02%7C01%7CKevin.Buterbaugh%40vanderbilt.edu%7Ceec49ab3ce144a81db3d08d57f86b59d%7Cba5a7f39e3be4ab3b45067fa80faecad%7C0%7C0%7C636555138937139546&sdata=%2FFS%2FQzdMP4d%2Bgf4wCUPR7KOQxIIV6OABoaNrc0ySHdI%3D&reserved=0 ? Kevin Buterbaugh - Senior System Administrator Vanderbilt University - Advanced Computing Center for Research and Education Kevin.Buterbaugh at vanderbilt.edu - (615)875-9633 -------------- next part -------------- An HTML attachment was scrubbed... URL: From Robert.Oesterlin at nuance.com Mon Mar 5 15:01:28 2018 From: Robert.Oesterlin at nuance.com (Oesterlin, Robert) Date: Mon, 5 Mar 2018 15:01:28 +0000 Subject: [gpfsug-discuss] More Details: US Spring Meeting - May 16-17th, Boston Message-ID: A few more details on the Spectrum Scale User Group US meeting. We are still finalizing the agenda, but expect two full days on presentations by IBM, users, and breakout sessions. We?re still looking for user presentations ? please contact me if you would like to present! Or if you have any topics that you?d like to see covered. Dates: Wednesday May 16th and Thursday May 17th Location: IBM Cambridge Innovation Center, One Rogers St , Cambridge, MA 02142-1203 (Near MIT and Boston) https://goo.gl/5oHSKo There are a number of nearby hotels. If you are considering coming, please book early. Boston has good public transport options, so if you book a bit farther out you may get a better price. More details on the agenda and a link to the sign-up coming in a few weeks. Bob Oesterlin Sr Principal Storage Engineer, Nuance -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From kkr at lbl.gov Mon Mar 5 23:49:04 2018 From: kkr at lbl.gov (Kristy Kallback-Rose) Date: Mon, 5 Mar 2018 15:49:04 -0800 Subject: [gpfsug-discuss] RDMA data from Zimon In-Reply-To: References: Message-ID: <8EB2B774-1640-4AEA-A4ED-2D6DBEC3324E@lbl.gov> Thanks Eric. No one who is a ZIMon developer has jumped up to contradict this, so I?ll go with it :-) Many thanks. This is helpful to understand where the data is coming from and would be a welcome addition to the documentation. Cheers, Kristy > On Feb 15, 2018, at 9:08 AM, Eric Agar wrote: > > Kristy, > > I experimented a bit with this some months ago and looked at the ZIMon source code. I came to the conclusion that ZIMon is reporting values obtained from the IB counters (actually, delta values adjusted for time) and that yes, for port_xmit_data and port_rcv_data, one would need to multiply the values by 4 to make sense of them. > > To obtain a port_xmit_data value, the ZIMon sensor first looks for /sys/class/infiniband//ports//counters_ext/port_xmit_data_64, and if that is not found then looks for /sys/class/infiniband//ports//counters/port_xmit_data. Similarly for other counters/metrics. > > Full disclosure: I am not an IB expert nor a ZIMon developer. > > I hope this helps. > > > Eric M. Agar > agar at us.ibm.com > > > Kristy Kallback-Rose ---02/14/2018 08:47:59 PM---Hi, Can one of the IBMers tell me if port_xmit_data and port_rcv_data from Zimon can be interpreted > > From: Kristy Kallback-Rose > To: gpfsug main discussion list > Date: 02/14/2018 08:47 PM > Subject: [gpfsug-discuss] RDMA data from Zimon > Sent by: gpfsug-discuss-bounces at spectrumscale.org > > > > > Hi, > > Can one of the IBMers tell me if port_xmit_data and port_rcv_data from Zimon can be interpreted as RDMA Bytes/sec? Ideally, also how this data is being collected? I?m looking here: https://www.ibm.com/support/knowledgecenter/en/STXKQY_5.0.0/com.ibm.spectrum.scale.v5r00.doc/bl1hlp_monnetworksmetrics.htm > > But then I also look here: https://community.mellanox.com/docs/DOC-2751 > > and see "Total number of data octets, divided by 4 (lanes), received on all VLs. This is 64 bit counter.? So I wasn?t sure if some multiplication by 4 was in order. > > Please advise. > > Cheers, > Kristy_______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > https://urldefense.proofpoint.com/v2/url?u=http-3A__gpfsug.org_mailman_listinfo_gpfsug-2Ddiscuss&d=DwICAg&c=jf_iaSHvJObTbx-siA1ZOg&r=IbxtjdkPAM2Sbon4Lbbi4w&m=zIRb70L9sx_FvvC9IcWVKLOSOOFnx-hIGfjw0kUN7bw&s=D1g4YTG5WeUiHI3rCPr_kkPxbG9V9E-18UGXBeCvfB8&e= > > > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss -------------- next part -------------- An HTML attachment was scrubbed... URL: From S.J.Thompson at bham.ac.uk Tue Mar 6 12:49:26 2018 From: S.J.Thompson at bham.ac.uk (Simon Thompson (IT Research Support)) Date: Tue, 6 Mar 2018 12:49:26 +0000 Subject: [gpfsug-discuss] tscCmdPortRange question Message-ID: <95B22F2C-F59C-4271-9528-16BEBCA179C8@bham.ac.uk> We are looking at setting a value for tscCmdPortRange so that we can apply firewalls to a small number of GPFS nodes in one of our clusters. The docs don?t give an indication on the number of ports that are required to be in the range. Could anyone make a suggestion on this? 
It doesn?t appear as a parameter for ?mmchconfig -i?, so I assume that it requires the nodes to be restarted, however I?m not clear if we could do a rolling restart on this? Thanks Simon -------------- next part -------------- An HTML attachment was scrubbed... URL: From r.sobey at imperial.ac.uk Tue Mar 6 18:48:40 2018 From: r.sobey at imperial.ac.uk (Sobey, Richard A) Date: Tue, 6 Mar 2018 18:48:40 +0000 Subject: [gpfsug-discuss] wondering about outage free protocols upgrades In-Reply-To: <5912dbafbb044700ba723195a0e5e2f9@exch1-cdc.nexus.csiro.au> References: <5912dbafbb044700ba723195a0e5e2f9@exch1-cdc.nexus.csiro.au> Message-ID: Thanks for raising this, I was going to ask. The last I heard it was baked into the 5.0 release of Scale but the release notes are eerily quiet on the matter. Would be good to get some input from IBM on this. Richard Get Outlook for Android ________________________________ From: gpfsug-discuss-bounces at spectrumscale.org on behalf of Greg.Lehmann at csiro.au Sent: Friday, March 2, 2018 3:48:44 AM To: gpfsug-discuss at spectrumscale.org Subject: [gpfsug-discuss] wondering about outage free protocols upgrades Hi All, It appears a rolling node by node upgrade of a protocols cluster is not possible. Ctdb is the sticking point as it won?t run with 2 different versions at the same time. Are there any plans to address this and make it a real Enterprise product? Cheers, Greg -------------- next part -------------- An HTML attachment was scrubbed... URL: From christof.schmitt at us.ibm.com Tue Mar 6 18:50:00 2018 From: christof.schmitt at us.ibm.com (Christof Schmitt) Date: Tue, 6 Mar 2018 18:50:00 +0000 Subject: [gpfsug-discuss] wondering about outage free protocols upgrades In-Reply-To: <5912dbafbb044700ba723195a0e5e2f9@exch1-cdc.nexus.csiro.au> References: <5912dbafbb044700ba723195a0e5e2f9@exch1-cdc.nexus.csiro.au> Message-ID: An HTML attachment was scrubbed... URL: From Kevin.Buterbaugh at Vanderbilt.Edu Tue Mar 6 17:17:59 2018 From: Kevin.Buterbaugh at Vanderbilt.Edu (Buterbaugh, Kevin L) Date: Tue, 6 Mar 2018 17:17:59 +0000 Subject: [gpfsug-discuss] mmfind performance Message-ID: Hi All, In the README for the mmfind command it says: mmfind A highly efficient file system traversal tool, designed to serve as a drop-in replacement for the 'find' command as used against GPFS FSes. And: mmfind is expected to be slower than find on file systems with relatively few inodes. This is due to the overhead of using mmapplypolicy. However, if you make use of the -exec flag to carry out a relatively expensive operation on each file (e.g. compute a checksum), using mmfind should yield a significant performance improvement, even on a file system with relatively few inodes. 
I have a list of just shy of 50 inode numbers that I need to figure out what file they correspond to, so I decided to give mmfind a try: + cd /usr/lpp/mmfs/samples/ilm + ./mmfind /gpfs23 -inum 113769917 -o -inum 132539418 -o -inum 135584191 -o -inum 136471839 -o -inum 137009371 -o -inum 137314798 -o -inum 137939675 -o -inum 137997971 -o -inum 138013736 -o -inum 138029061 -o -inum 138029065 -o -inum 138029076 -o -inum 138029086 -o -inum 138029093 -o -inum 138029099 -o -inum 138029101 -o -inum 138029102 -o -inum 138029106 -o -inum 138029112 -o -inum 138029113 -o -inum 138029114 -o -inum 138029119 -o -inum 138029120 -o -inum 138029121 -o -inum 138029130 -o -inum 138029131 -o -inum 138029132 -o -inum 138029141 -o -inum 138029146 -o -inum 138029147 -o -inum 138029152 -o -inum 138029153 -o -inum 138029154 -o -inum 138029163 -o -inum 138029164 -o -inum 138029165 -o -inum 138029174 -o -inum 138029175 -o -inum 138029176 -o -inum 138083075 -o -inum 138083148 -o -inum 138083149 -o -inum 138083155 -o -inum 138216465 -o -inum 138216483 -o -inum 138216507 -o -inum 138216535 -o -inum 138235320 -ls I kicked that off last Friday and it is _still_ running. By comparison, I have a Perl script that I have run in the past that simple traverses the entire filesystem tree and stat?s each file and outputs that to a log file. That script would ?only? run ~24 hours. Clearly mmfind as I invoked it is much slower than the corresponding Perl script, so what am I doing wrong? Thanks? Kevin ? Kevin Buterbaugh - Senior System Administrator Vanderbilt University - Advanced Computing Center for Research and Education Kevin.Buterbaugh at vanderbilt.edu - (615)875-9633 -------------- next part -------------- An HTML attachment was scrubbed... URL: From r.sobey at imperial.ac.uk Tue Mar 6 18:54:47 2018 From: r.sobey at imperial.ac.uk (Sobey, Richard A) Date: Tue, 6 Mar 2018 18:54:47 +0000 Subject: [gpfsug-discuss] wondering about outage free protocols upgrades In-Reply-To: References: <5912dbafbb044700ba723195a0e5e2f9@exch1-cdc.nexus.csiro.au>, Message-ID: The sales pitch my colleagues heard suggested it was already in v5.. That's a big shame to hear that we all misunderstood. Get Outlook for Android ________________________________ From: gpfsug-discuss-bounces at spectrumscale.org on behalf of Christof Schmitt Sent: Tuesday, March 6, 2018 6:50:00 PM To: gpfsug-discuss at spectrumscale.org Cc: gpfsug-discuss at spectrumscale.org Subject: Re: [gpfsug-discuss] wondering about outage free protocols upgrades Hi, at this point there are no plans to support "node by node" upgrade for SMB. Some background: The technical reason for this restriction is that the records shared between protocol nodes for the SMB service (ctdb and Samba) are not versioned and no mechanism is in place to handle different versions. Changing this would be a large development task that has not been included in any current plans. Note that this only affects the SMB service and that the knowledge center outlines a procedure to minimize the outage, by getting half of the protocol nodes ready with the new Samba version and then only taking a brief outage when switching from the "old" to the "new" Samba version: https://www.ibm.com/support/knowledgecenter/en/STXKQY_4.2.2/com.ibm.spectrum.scale.v4r22.doc/bl1ins_updatingsmb.htm The toolkit follows the same approach during an upgrade to minimize the outage. 
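[In rough outline, the minimized-outage switch described above amounts to something like the following. This is only a sketch to make the shape of the procedure concrete -- the node names are placeholders and the knowledge center steps linked above are the authoritative reference:

  # Phase 1: take half of the protocol nodes out of SMB service and upgrade them
  mmces node suspend -N prot1,prot2        # their CES IPs move to the remaining nodes
  mmces service stop SMB -N prot1,prot2
  # ...install the new gpfs.smb / Spectrum Scale packages on prot1 and prot2...

  # Phase 2: the brief outage -- stop SMB everywhere, then bring it up on the upgraded nodes
  mmces service stop SMB -a
  mmces node resume -N prot1,prot2
  mmces service start SMB -N prot1,prot2

  # Phase 3: upgrade the remaining nodes, then resume them and start SMB there as well
]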
We know that this is not ideal, but as mentioned above this is limited by the large effort that would be required which has to be weighed against other requirements and priorities. Regards, Christof Schmitt || IBM || Spectrum Scale Development || Tucson, AZ christof.schmitt at us.ibm.com || +1-520-799-2469 (T/L: 321-2469) ----- Original message ----- From: Sent by: gpfsug-discuss-bounces at spectrumscale.org To: Cc: Subject: [gpfsug-discuss] wondering about outage free protocols upgrades Date: Tue, Mar 6, 2018 10:19 AM Hi All, It appears a rolling node by node upgrade of a protocols cluster is not possible. Ctdb is the sticking point as it won?t run with 2 different versions at the same time. Are there any plans to address this and make it a real Enterprise product? Cheers, Greg _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org https://urldefense.proofpoint.com/v2/url?u=http-3A__gpfsug.org_mailman_listinfo_gpfsug-2Ddiscuss&d=DwICAg&c=jf_iaSHvJObTbx-siA1ZOg&r=5Nn7eUPeYe291x8f39jKybESLKv_W_XtkTkS8fTR-NI&m=p5fg7X1tKGwi1BsYiw-wHTxmaG-PLihwHV0yTBQNaUs&s=3ZHS5vAoxeC6ikuOpTRLWNTpvgKEC3thI-qUgyU_hYo&e= -------------- next part -------------- An HTML attachment was scrubbed... URL: From christof.schmitt at us.ibm.com Tue Mar 6 18:57:32 2018 From: christof.schmitt at us.ibm.com (Christof Schmitt) Date: Tue, 6 Mar 2018 18:57:32 +0000 Subject: [gpfsug-discuss] wondering about outage free protocols upgrades In-Reply-To: References: , <5912dbafbb044700ba723195a0e5e2f9@exch1-cdc.nexus.csiro.au> Message-ID: An HTML attachment was scrubbed... URL: From dod2014 at med.cornell.edu Tue Mar 6 18:23:41 2018 From: dod2014 at med.cornell.edu (Douglas Duckworth) Date: Tue, 6 Mar 2018 13:23:41 -0500 Subject: [gpfsug-discuss] 100G RoCEE and Spectrum Scale Performance Message-ID: Hi We are currently running Spectrum Scale over FDR Infiniband. We plan on upgrading to EDR since I have not really encountered documentation saying to abandon the lower-latency advantage found in Infiniband. Our workloads generally benefit from lower latency. It looks like, ignoring GPFS, EDR still has higher throughput and lower latency when compared to 100G RoCEE. http://sc16.supercomputing.org/sc-archive/tech_poster/poster_files/post149s2-file3.pdf However, I wanted to get feedback on how GPFS performs with 100G Ethernet instead of FDR. Thanks very much! Doug Thanks, Douglas Duckworth, MSc, LFCS HPC System Administrator Scientific Computing Unit Physiology and Biophysics Weill Cornell Medicine E: doug at med.cornell.edu O: 212-746-6305 F: 212-746-8690 -------------- next part -------------- An HTML attachment was scrubbed... URL: From olaf.weiser at de.ibm.com Tue Mar 6 19:46:59 2018 From: olaf.weiser at de.ibm.com (Olaf Weiser) Date: Tue, 6 Mar 2018 20:46:59 +0100 Subject: [gpfsug-discuss] tscCmdPortRange question In-Reply-To: <95B22F2C-F59C-4271-9528-16BEBCA179C8@bham.ac.uk> References: <95B22F2C-F59C-4271-9528-16BEBCA179C8@bham.ac.uk> Message-ID: An HTML attachment was scrubbed... URL: From knop at us.ibm.com Tue Mar 6 23:11:38 2018 From: knop at us.ibm.com (Felipe Knop) Date: Tue, 6 Mar 2018 18:11:38 -0500 Subject: [gpfsug-discuss] tscCmdPortRange question In-Reply-To: References: <95B22F2C-F59C-4271-9528-16BEBCA179C8@bham.ac.uk> Message-ID: Olaf, Correct. mmchconfig -i is accepted for tscCmdPortRange . The change should take place immediately, upon invocation of the next command. 
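[Concretely, that amounts to something like the following on each node you want to restrict -- the 50000-50010 range is only an example (the one Olaf mentions below), not a sizing recommendation:

  mmchconfig tscCmdPortRange=50000-50010 -i   # -i applies it immediately and makes it permanent
  mmlsconfig tscCmdPortRange                  # verify the setting

The firewall on those nodes still needs to allow the daemon port 1191/tcp in addition to whatever range you choose here.]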
Felipe ---- Felipe Knop knop at us.ibm.com GPFS Development and Security IBM Systems IBM Building 008 2455 South Rd, Poughkeepsie, NY 12601 (845) 433-9314 T/L 293-9314 From: "Olaf Weiser" To: gpfsug main discussion list Date: 03/06/2018 02:47 PM Subject: Re: [gpfsug-discuss] tscCmdPortRange question Sent by: gpfsug-discuss-bounces at spectrumscale.org this parameter is just for administrative commands.. "where" to send the output of a command... and for those admin ports .. so called ephemeral ports... it depends , how much admin commands ( = sessions = sockets) you want to run in parallel in my experience.. 10 ports is more than enough we use those in a range from 50000-50010 to be clear .. demon - to - demon .. communication always uses 1191 cheers From: "Simon Thompson (IT Research Support)" To: "gpfsug-discuss at spectrumscale.org" Date: 03/06/2018 06:55 PM Subject: [gpfsug-discuss] tscCmdPortRange question Sent by: gpfsug-discuss-bounces at spectrumscale.org We are looking at setting a value for tscCmdPortRange so that we can apply firewalls to a small number of GPFS nodes in one of our clusters. The docs don?t give an indication on the number of ports that are required to be in the range. Could anyone make a suggestion on this? It doesn?t appear as a parameter for ?mmchconfig -i?, so I assume that it requires the nodes to be restarted, however I?m not clear if we could do a rolling restart on this? Thanks Simon_______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org https://urldefense.proofpoint.com/v2/url?u=http-3A__gpfsug.org_mailman_listinfo_gpfsug-2Ddiscuss&d=DwICAg&c=jf_iaSHvJObTbx-siA1ZOg&r=oNT2koCZX0xmWlSlLblR9Q&m=pezsJOeWDWSnEkh5d3dp175Vx4opvikABgoTzUt-9pQ&s=S_Qe62jYseR2Y2yjiovXwvVz3d2SFW-jCf0Pw5VB_f4&e= -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: graycol.gif Type: image/gif Size: 105 bytes Desc: not available URL: From makaplan at us.ibm.com Tue Mar 6 22:27:34 2018 From: makaplan at us.ibm.com (Marc A Kaplan) Date: Tue, 6 Mar 2018 17:27:34 -0500 Subject: [gpfsug-discuss] mmfind performance In-Reply-To: References: Message-ID: Please try: mmfind --polFlags '-N a_node_list -g /gpfs23/tmp' directory find-flags ... Where a_node_list is a node list of your choice and /gpfs23/tmp is a temp directory of your choice... And let us know how that goes. Also, you have chosen a special case, just looking for some inode numbers -- so find can skip stating the other inodes... whereas mmfind is not smart enough to do that -- but still with parallelism, I'd guess mmapplypolicy might still beat find in elapsed time to complete, even for this special case. -- Marc K of GPFS From: "Buterbaugh, Kevin L" To: gpfsug main discussion list Date: 03/06/2018 01:52 PM Subject: [gpfsug-discuss] mmfind performance Sent by: gpfsug-discuss-bounces at spectrumscale.org Hi All, In the README for the mmfind command it says: mmfind A highly efficient file system traversal tool, designed to serve as a drop-in replacement for the 'find' command as used against GPFS FSes. And: mmfind is expected to be slower than find on file systems with relatively few inodes. This is due to the overhead of using mmapplypolicy. 
However, if you make use of the -exec flag to carry out a relatively expensive operation on each file (e.g. compute a checksum), using mmfind should yield a significant performance improvement, even on a file system with relatively few inodes. I have a list of just shy of 50 inode numbers that I need to figure out what file they correspond to, so I decided to give mmfind a try: + cd /usr/lpp/mmfs/samples/ilm + ./mmfind /gpfs23 -inum 113769917 -o -inum 132539418 -o -inum 135584191 -o -inum 136471839 -o -inum 137009371 -o -inum 137314798 -o -inum 137939675 -o -inum 137997971 -o -inum 138013736 -o -inum 138029061 -o -inum 138029065 -o -inum 138029076 -o -inum 138029086 -o -inum 138029093 -o -inum 138029099 -o -inum 138029101 -o -inum 138029102 -o -inum 138029106 -o -inum 138029112 -o -inum 138029113 -o -inum 138029114 -o -inum 138029119 -o -inum 138029120 -o -inum 138029121 -o -inum 138029130 -o -inum 138029131 -o -inum 138029132 -o -inum 138029141 -o -inum 138029146 -o -inum 138029147 -o -inum 138029152 -o -inum 138029153 -o -inum 138029154 -o -inum 138029163 -o -inum 138029164 -o -inum 138029165 -o -inum 138029174 -o -inum 138029175 -o -inum 138029176 -o -inum 138083075 -o -inum 138083148 -o -inum 138083149 -o -inum 138083155 -o -inum 138216465 -o -inum 138216483 -o -inum 138216507 -o -inum 138216535 -o -inum 138235320 -ls I kicked that off last Friday and it is _still_ running. By comparison, I have a Perl script that I have run in the past that simple traverses the entire filesystem tree and stat?s each file and outputs that to a log file. That script would ?only? run ~24 hours. Clearly mmfind as I invoked it is much slower than the corresponding Perl script, so what am I doing wrong? Thanks? Kevin ? Kevin Buterbaugh - Senior System Administrator Vanderbilt University - Advanced Computing Center for Research and Education Kevin.Buterbaugh at vanderbilt.edu - (615)875-9633 _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org https://urldefense.proofpoint.com/v2/url?u=http-3A__gpfsug.org_mailman_listinfo_gpfsug-2Ddiscuss&d=DwICAg&c=jf_iaSHvJObTbx-siA1ZOg&r=cvpnBBH0j41aQy0RPiG2xRL_M8mTc1izuQD3_PmtjZ8&m=48WYhVkWI1kr_BM-Wg_VaXEOi7xfGusnZcJtkiA98zg&s=IXUhEC_thuGAVwGJ02oazCCnKEuAdGeg890fBelP4kE&e= -------------- next part -------------- An HTML attachment was scrubbed... URL: From aaron.s.knister at nasa.gov Wed Mar 7 01:30:14 2018 From: aaron.s.knister at nasa.gov (Aaron Knister) Date: Tue, 6 Mar 2018 20:30:14 -0500 Subject: [gpfsug-discuss] [non-nasa source] Re: pagepool shrink doesn't release all memory In-Reply-To: <65453649-77df-2efa-8776-eb2775ca9efa@nasa.gov> References: <65453649-77df-2efa-8776-eb2775ca9efa@nasa.gov> Message-ID: Following up on this... On one of the nodes on which I'd bounced the pagepool around I managed to cause what appeared to that node as filesystem corruption (i/o errors and fsstruct errors) on every single fs. Thankfully none of the other nodes in the cluster seemed to agree that the fs was corrupt. I'll open a PMR on that but I thought it was interesting none the less. I haven't run an fsck on any of the filesystems but my belief is that they're OK since so far none of the other nodes in the cluster have complained. Secondly, I can see the pagepool allocations that align with registered verbs mr's (looking at mmfsadm dump verbs). 
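[For anyone who wants to poke at the same data, roughly:

  mmdiag --memory        # daemon memory usage, including the page pool
  mmfsadm dump verbs     # RDMA state; the registered memory regions show up in here

mmfsadm is an unsupported/undocumented tool, so treat its output format as subject to change.]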
In theory one can free an ib mr after registration as long as it's not in use but one has to track that and I could see that being a tricky thing (although in theory given the fact that GPFS has its own page allocator it might be relatively trivial to figure it out but it might also require re-establishing RDMA connections depending on whether or not a given QP is associated with a PD that uses the MR trying to be freed...I think that makes sense). Anyway, I'm wondering if the need to free the ib MR on pagepool shrink could be avoided all together by limiting the amount of memory that gets allocated to verbs MR's (e.g. something like verbsPagePoolMaxMB) so that those regions never need to be freed but the amount of memory available for user caching could grow and shrink as required. It's probably not that simple, though :) Another thought I had was doing something like creating a file in /dev/shm, registering it as a loopback device, and using that as an LROC device. I just don't think that's feasible at scale given the current method of LROC device registration (e.g. via the mmsdrfs file). I think there's much to be gained from the ability to dynamically change the memory-based file cache size on a per-job basis so I'm really hopeful we can find a way to make this work. -Aaron On 2/25/18 11:45 AM, Aaron Knister wrote: > Hmm...interesting. It sure seems to try :) > > The pmap command was this: > > pmap $(pidof mmfsd) | sort -n -k3 | tail > > -Aaron > > On 2/23/18 9:35 AM, IBM Spectrum Scale wrote: >> AFAIK you can increase the pagepool size dynamically but you cannot >> shrink it dynamically. ?To shrink it you must restart the GPFS daemon. >> Also, could you please provide the actual pmap commands you executed? >> >> Regards, The Spectrum Scale (GPFS) team >> >> ------------------------------------------------------------------------------------------------------------------ >> >> If you feel that your question can benefit other users of ?Spectrum >> Scale (GPFS), then please post it to the public IBM developerWroks >> Forum at >> https://www.ibm.com/developerworks/community/forums/html/forum?id=11111111-0000-0000-0000-000000000479. >> >> >> If your query concerns a potential software error in Spectrum Scale >> (GPFS) and you have an IBM software maintenance contract please >> contact ??1-800-237-5511 in the United States or your local IBM >> Service Center in other countries. >> >> The forum is informally monitored as time permits and should not be >> used for priority messages to the Spectrum Scale (GPFS) team. >> >> >> >> From: Aaron Knister >> To: >> Date: 02/22/2018 10:30 PM >> Subject: Re: [gpfsug-discuss] pagepool shrink doesn't release all memory >> Sent by: gpfsug-discuss-bounces at spectrumscale.org >> ------------------------------------------------------------------------ >> >> >> >> This is also interesting (although I don't know what it really means). >> Looking at pmap run against mmfsd I can see what happens after each step: >> >> # baseline >> 00007fffe4639000 ?59164K ? ? ?0K ? ? ?0K ? ? ?0K ? ? ?0K ---p [anon] >> 00007fffd837e000 ?61960K ? ? ?0K ? ? ?0K ? ? ?0K ? ? ?0K ---p [anon] >> 0000020000000000 1048576K 1048576K 1048576K 1048576K ? ? ?0K rwxp [anon] >> Total: ? ? ? ? ? 1613580K 1191020K 1189650K 1171836K ? ? ?0K >> >> # tschpool 64G >> 00007fffe4639000 ?59164K ? ? ?0K ? ? ?0K ? ? ?0K ? ? ?0K ---p [anon] >> 00007fffd837e000 ?61960K ? ? ?0K ? ? ?0K ? ? ?0K ? ? ?0K ---p [anon] >> 0000020000000000 67108864K 67108864K 67108864K 67108864K ?0K rwxp [anon] >> Total: ? ? ? ? 
? 67706636K 67284108K 67282625K 67264920K ? ? ?0K >> >> # tschpool 1G >> 00007fffe4639000 ?59164K ? ? ?0K ? ? ?0K ? ? ?0K ? ? ?0K ---p [anon] >> 00007fffd837e000 ?61960K ? ? ?0K ? ? ?0K ? ? ?0K ? ? ?0K ---p [anon] >> 0000020001400000 139264K 139264K 139264K 139264K ? ? ?0K rwxp [anon] >> 0000020fc9400000 897024K 897024K 897024K 897024K ? ? ?0K rwxp [anon] >> 0000020009c00000 66052096K ? ? ?0K ? ? ?0K ? ? ?0K ? ? ?0K rwxp [anon] >> Total: ? ? ? ? ? 67706636K 1223820K 1222451K 1204632K ? ? ?0K >> >> Even though mmfsd has that 64G chunk allocated there's none of it >> *used*. I wonder why Linux seems to be accounting it as allocated. >> >> -Aaron >> >> On 2/22/18 10:17 PM, Aaron Knister wrote: >> ?> I've been exploring the idea for a while of writing a SLURM SPANK >> plugin >> ?> to allow users to dynamically change the pagepool size on a node. >> Every >> ?> now and then we have some users who would benefit significantly from a >> ?> much larger pagepool on compute nodes but by default keep it on the >> ?> smaller side to make as much physmem available as possible to batch >> work. >> ?> >> ?> In testing, though, it seems as though reducing the pagepool doesn't >> ?> quite release all of the memory. I don't really understand it because >> ?> I've never before seen memory that was previously resident become >> ?> un-resident but still maintain the virtual memory allocation. >> ?> >> ?> Here's what I mean. Let's take a node with 128G and a 1G pagepool. >> ?> >> ?> If I do the following to simulate what might happen as various jobs >> ?> tweak the pagepool: >> ?> >> ?> - tschpool 64G >> ?> - tschpool 1G >> ?> - tschpool 32G >> ?> - tschpool 1G >> ?> - tschpool 32G >> ?> >> ?> I end up with this: >> ?> >> ?> mmfsd thinks there's 32G resident but 64G virt >> ?> # ps -o vsz,rss,comm -p 24397 >> ?> ??? VSZ?? RSS COMMAND >> ?> 67589400 33723236 mmfsd >> ?> >> ?> however, linux thinks there's ~100G used >> ?> >> ?> # free -g >> ?> total?????? used free???? shared??? buffers cached >> ?> Mem:?????????? 125 100???????? 25 0????????? 0 0 >> ?> -/+ buffers/cache: 98???????? 26 >> ?> Swap: 7????????? 0 7 >> ?> >> ?> I can jump back and forth between 1G and 32G *after* allocating 64G >> ?> pagepool and the overall amount of memory in use doesn't balloon but I >> ?> can't seem to shed that original 64G. >> ?> >> ?> I don't understand what's going on... :) Any ideas? This is with Scale >> ?> 4.2.3.6. 
>> ?> >> ?> -Aaron >> ?> >> >> -- >> Aaron Knister >> NASA Center for Climate Simulation (Code 606.2) >> Goddard Space Flight Center >> (301) 286-2776 >> _______________________________________________ >> gpfsug-discuss mailing list >> gpfsug-discuss at spectrumscale.org >> https://urldefense.proofpoint.com/v2/url?u=http-3A__gpfsug.org_mailman_listinfo_gpfsug-2Ddiscuss&d=DwIGaQ&c=jf_iaSHvJObTbx-siA1ZOg&r=IbxtjdkPAM2Sbon4Lbbi4w&m=OrZQeEmI6chBdguG-h4YPHsxXZ4gTU3CtIuN4e3ijdY&s=hvVIRG5kB1zom2Iql2_TOagchsgl99juKiZfJt5S1tM&e= >> >> >> >> >> >> >> >> _______________________________________________ >> gpfsug-discuss mailing list >> gpfsug-discuss at spectrumscale.org >> http://gpfsug.org/mailman/listinfo/gpfsug-discuss >> > -- Aaron Knister NASA Center for Climate Simulation (Code 606.2) Goddard Space Flight Center (301) 286-2776 From Greg.Lehmann at csiro.au Tue Mar 6 23:36:12 2018 From: Greg.Lehmann at csiro.au (Greg.Lehmann at csiro.au) Date: Tue, 6 Mar 2018 23:36:12 +0000 Subject: [gpfsug-discuss] wondering about outage free protocols upgrades In-Reply-To: References: <5912dbafbb044700ba723195a0e5e2f9@exch1-cdc.nexus.csiro.au> Message-ID: <766f8a8ce5f84164af13f2361ef3b3c3@exch1-cdc.nexus.csiro.au> In theory it only affects SMB, but in practice if NFS depends on winbind for authorisation then it is affected too. I can understand the need for changes to happen every so often and that maybe outages will be required then. But, I would like to see some effort to avoid doing this unnecessarily. IBM, please consider my suggestion. The message I get from the ctdb service implies it is the sticking point. Can some consideration be given to keeping the ctdb version compatible between releases? Christof, you are saying something about the SMB service version compatibility. I am unclear as to whether you are talking about the Spectrum Scale Protocols SMB service or the default samba SMB over the wire protocol version being used to communicate between client and server. If the latter, is it possible to peg the version to the older version manually while doing the upgrade so that all nodes can be updated? You can then take an outage at a later time to update the over the wire version. From: gpfsug-discuss-bounces at spectrumscale.org [mailto:gpfsug-discuss-bounces at spectrumscale.org] On Behalf Of Christof Schmitt Sent: Wednesday, 7 March 2018 4:50 AM To: gpfsug-discuss at spectrumscale.org Cc: gpfsug-discuss at spectrumscale.org Subject: Re: [gpfsug-discuss] wondering about outage free protocols upgrades Hi, at this point there are no plans to support "node by node" upgrade for SMB. Some background: The technical reason for this restriction is that the records shared between protocol nodes for the SMB service (ctdb and Samba) are not versioned and no mechanism is in place to handle different versions. Changing this would be a large development task that has not been included in any current plans. Note that this only affects the SMB service and that the knowledge center outlines a procedure to minimize the outage, by getting half of the protocol nodes ready with the new Samba version and then only taking a brief outage when switching from the "old" to the "new" Samba version: https://www.ibm.com/support/knowledgecenter/en/STXKQY_4.2.2/com.ibm.spectrum.scale.v4r22.doc/bl1ins_updatingsmb.htm The toolkit follows the same approach during an upgrade to minimize the outage. 
We know that this is not ideal, but as mentioned above this is limited by the large effort that would be required which has to be weighed against other requirements and priorities. Regards, Christof Schmitt || IBM || Spectrum Scale Development || Tucson, AZ christof.schmitt at us.ibm.com || +1-520-799-2469 (T/L: 321-2469) ----- Original message ----- From: > Sent by: gpfsug-discuss-bounces at spectrumscale.org To: > Cc: Subject: [gpfsug-discuss] wondering about outage free protocols upgrades Date: Tue, Mar 6, 2018 10:19 AM Hi All, It appears a rolling node by node upgrade of a protocols cluster is not possible. Ctdb is the sticking point as it won?t run with 2 different versions at the same time. Are there any plans to address this and make it a real Enterprise product? Cheers, Greg _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org https://urldefense.proofpoint.com/v2/url?u=http-3A__gpfsug.org_mailman_listinfo_gpfsug-2Ddiscuss&d=DwICAg&c=jf_iaSHvJObTbx-siA1ZOg&r=5Nn7eUPeYe291x8f39jKybESLKv_W_XtkTkS8fTR-NI&m=p5fg7X1tKGwi1BsYiw-wHTxmaG-PLihwHV0yTBQNaUs&s=3ZHS5vAoxeC6ikuOpTRLWNTpvgKEC3thI-qUgyU_hYo&e= -------------- next part -------------- An HTML attachment was scrubbed... URL: From S.J.Thompson at bham.ac.uk Wed Mar 7 13:45:24 2018 From: S.J.Thompson at bham.ac.uk (Simon Thompson (IT Research Support)) Date: Wed, 7 Mar 2018 13:45:24 +0000 Subject: [gpfsug-discuss] mmfind performance Message-ID: <90F48570-7294-4032-8A6A-73DD51169A55@bham.ac.uk> I can?t comment on mmfind vs perl, but have you looked at trying ?tsfindinode? ? Simon From: on behalf of "Buterbaugh, Kevin L" Reply-To: "gpfsug-discuss at spectrumscale.org" Date: Tuesday, 6 March 2018 at 18:52 To: "gpfsug-discuss at spectrumscale.org" Subject: [gpfsug-discuss] mmfind performance Hi All, In the README for the mmfind command it says: mmfind A highly efficient file system traversal tool, designed to serve as a drop-in replacement for the 'find' command as used against GPFS FSes. And: mmfind is expected to be slower than find on file systems with relatively few inodes. This is due to the overhead of using mmapplypolicy. However, if you make use of the -exec flag to carry out a relatively expensive operation on each file (e.g. compute a checksum), using mmfind should yield a significant performance improvement, even on a file system with relatively few inodes. 
I have a list of just shy of 50 inode numbers that I need to figure out what file they correspond to, so I decided to give mmfind a try: + cd /usr/lpp/mmfs/samples/ilm + ./mmfind /gpfs23 -inum 113769917 -o -inum 132539418 -o -inum 135584191 -o -inum 136471839 -o -inum 137009371 -o -inum 137314798 -o -inum 137939675 -o -inum 137997971 -o -inum 138013736 -o -inum 138029061 -o -inum 138029065 -o -inum 138029076 -o -inum 138029086 -o -inum 138029093 -o -inum 138029099 -o -inum 138029101 -o -inum 138029102 -o -inum 138029106 -o -inum 138029112 -o -inum 138029113 -o -inum 138029114 -o -inum 138029119 -o -inum 138029120 -o -inum 138029121 -o -inum 138029130 -o -inum 138029131 -o -inum 138029132 -o -inum 138029141 -o -inum 138029146 -o -inum 138029147 -o -inum 138029152 -o -inum 138029153 -o -inum 138029154 -o -inum 138029163 -o -inum 138029164 -o -inum 138029165 -o -inum 138029174 -o -inum 138029175 -o -inum 138029176 -o -inum 138083075 -o -inum 138083148 -o -inum 138083149 -o -inum 138083155 -o -inum 138216465 -o -inum 138216483 -o -inum 138216507 -o -inum 138216535 -o -inum 138235320 -ls I kicked that off last Friday and it is _still_ running. By comparison, I have a Perl script that I have run in the past that simple traverses the entire filesystem tree and stat?s each file and outputs that to a log file. That script would ?only? run ~24 hours. Clearly mmfind as I invoked it is much slower than the corresponding Perl script, so what am I doing wrong? Thanks? Kevin ? Kevin Buterbaugh - Senior System Administrator Vanderbilt University - Advanced Computing Center for Research and Education Kevin.Buterbaugh at vanderbilt.edu - (615)875-9633 -------------- next part -------------- An HTML attachment was scrubbed... URL: From Kevin.Buterbaugh at Vanderbilt.Edu Wed Mar 7 15:18:24 2018 From: Kevin.Buterbaugh at Vanderbilt.Edu (Buterbaugh, Kevin L) Date: Wed, 7 Mar 2018 15:18:24 +0000 Subject: [gpfsug-discuss] mmfind performance In-Reply-To: References: Message-ID: Hi Marc, Thanks, I?m going to give this a try as the first mmfind finally finished overnight, but produced no output: /root root at gpfsmgrb# bash -x ~/bin/klb.sh + cd /usr/lpp/mmfs/samples/ilm + ./mmfind /gpfs23 -inum 113769917 -o -inum 132539418 -o -inum 135584191 -o -inum 136471839 -o -inum 137009371 -o -inum 137314798 -o -inum 137939675 -o -inum 137997971 -o -inum 138013736 -o -inum 138029061 -o -inum 138029065 -o -inum 138029076 -o -inum 138029086 -o -inum 138029093 -o -inum 138029099 -o -inum 138029101 -o -inum 138029102 -o -inum 138029106 -o -inum 138029112 -o -inum 138029113 -o -inum 138029114 -o -inum 138029119 -o -inum 138029120 -o -inum 138029121 -o -inum 138029130 -o -inum 138029131 -o -inum 138029132 -o -inum 138029141 -o -inum 138029146 -o -inum 138029147 -o -inum 138029152 -o -inum 138029153 -o -inum 138029154 -o -inum 138029163 -o -inum 138029164 -o -inum 138029165 -o -inum 138029174 -o -inum 138029175 -o -inum 138029176 -o -inum 138083075 -o -inum 138083148 -o -inum 138083149 -o -inum 138083155 -o -inum 138216465 -o -inum 138216483 -o -inum 138216507 -o -inum 138216535 -o -inum 138235320 -ls /root root at gpfsmgrb# BTW, I had put that in a simple script simply because I had a list of those inodes and it was easier for me to get that in the format I wanted via a script that I was editing than trying to do that on the command line. However, in the log file it was producing it ?hit? on 48 files: [I] Inodes scan: 978275821 files, 99448202 directories, 37189547 other objects, 1967508 'skipped' files and/or errors. 
[I] 2018-03-06@23:43:15.988 Policy evaluation. 1114913570 files scanned.
[I] 2018-03-06@23:43:16.016 Sorting 48 candidate file list records.
[I] 2018-03-06@23:43:16.040 Sorting 48 candidate file list records.
[I] 2018-03-06@23:43:16.065 Choosing candidate files. 0 records scanned.
[I] 2018-03-06@23:43:16.066 Choosing candidate files. 48 records scanned.
[I] Summary of Rule Applicability and File Choices:
 Rule#    Hit_Cnt        KB_Hit    Chosen     KB_Chosen    KB_Ill   Rule
     0         48    1274453504        48    1274453504         0   RULE 'mmfind' LIST 'mmfindList' DIRECTORIES_PLUS SHOW(.) WHERE(.)
[I] Filesystem objects with no applicable rules: 1112946014.
[I] GPFS Policy Decisions and File Choice Totals:
 Chose to list 1274453504KB: 48 of 48 candidates;
Predicted Data Pool Utilization in KB and %:
Pool_Name           KB_Occupied      KB_Total        Percent_Occupied
gpfs23capacity     564722407424    624917749760      90.367477583%
gpfs23data         304797672448    531203506176      57.378701177%
system                        0               0       0.000000000% (no user data)
[I] 2018-03-06@23:43:16.066 Policy execution. 0 files dispatched.
[I] 2018-03-06@23:43:16.102 Policy execution. 0 files dispatched.
[I] A total of 0 files have been migrated, deleted or processed by an EXTERNAL EXEC/script; 0 'skipped' files and/or errors.

While I'm going to follow your suggestion next, if you (or anyone else on the list) can explain why the 'Hit_Cnt' is 48 but the '-ls' I passed to mmfind didn't result in anything being listed, my curiosity is piqued.

And I'll go ahead and say it before someone else does - I haven't just chosen a special case, I AM a special case... ;-)

Kevin

On Mar 6, 2018, at 4:27 PM, Marc A Kaplan wrote:

Please try:

mmfind --polFlags '-N a_node_list -g /gpfs23/tmp' directory find-flags ...

Where a_node_list is a node list of your choice and /gpfs23/tmp is a temp directory of your choice...

And let us know how that goes.

Also, you have chosen a special case, just looking for some inode numbers -- so find can skip stating the other inodes... whereas mmfind is not smart enough to do that -- but still with parallelism, I'd guess mmapplypolicy might still beat find in elapsed time to complete, even for this special case.

-- Marc K of GPFS

From: "Buterbaugh, Kevin L"
To: gpfsug main discussion list
Date: 03/06/2018 01:52 PM
Subject: [gpfsug-discuss] mmfind performance
Sent by: gpfsug-discuss-bounces at spectrumscale.org

Hi All,

In the README for the mmfind command it says:

mmfind
  A highly efficient file system traversal tool, designed to serve
  as a drop-in replacement for the 'find' command as used against GPFS FSes.

And:

mmfind is expected to be slower than find on file systems with relatively few inodes.
This is due to the overhead of using mmapplypolicy.
However, if you make use of the -exec flag to carry out a relatively expensive operation
on each file (e.g. compute a checksum), using mmfind should yield a significant
performance improvement, even on a file system with relatively few inodes.
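[An aside on the mmapplypolicy point: for a pure inode-number lookup you can also hand a small policy straight to mmapplypolicy and skip the find emulation. A sketch only -- the policy file name and the shortened WHERE clause are placeholders, and -N a_node_list / -g /gpfs23/tmp follow Marc's example above:

  cat > /tmp/by_inode.pol <<'EOF'
  /* list any object whose inode number matches; extend the OR list to cover all 48 numbers */
  RULE 'byInode' LIST 'hits' DIRECTORIES_PLUS
    WHERE INODE = 113769917 OR INODE = 132539418 OR INODE = 135584191
  EOF
  mmapplypolicy /gpfs23 -P /tmp/by_inode.pol -I defer -f /gpfs23/tmp/byinode \
                -N a_node_list -g /gpfs23/tmp

With -I defer the matching paths should end up in /gpfs23/tmp/byinode.list.hits rather than being acted on.]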
I have a list of just shy of 50 inode numbers that I need to figure out what file they correspond to, so I decided to give mmfind a try: + cd /usr/lpp/mmfs/samples/ilm + ./mmfind /gpfs23 -inum 113769917 -o -inum 132539418 -o -inum 135584191 -o -inum 136471839 -o -inum 137009371 -o -inum 137314798 -o -inum 137939675 -o -inum 137997971 -o -inum 138013736 -o -inum 138029061 -o -inum 138029065 -o -inum 138029076 -o -inum 138029086 -o -inum 138029093 -o -inum 138029099 -o -inum 138029101 -o -inum 138029102 -o -inum 138029106 -o -inum 138029112 -o -inum 138029113 -o -inum 138029114 -o -inum 138029119 -o -inum 138029120 -o -inum 138029121 -o -inum 138029130 -o -inum 138029131 -o -inum 138029132 -o -inum 138029141 -o -inum 138029146 -o -inum 138029147 -o -inum 138029152 -o -inum 138029153 -o -inum 138029154 -o -inum 138029163 -o -inum 138029164 -o -inum 138029165 -o -inum 138029174 -o -inum 138029175 -o -inum 138029176 -o -inum 138083075 -o -inum 138083148 -o -inum 138083149 -o -inum 138083155 -o -inum 138216465 -o -inum 138216483 -o -inum 138216507 -o -inum 138216535 -o -inum 138235320 -ls I kicked that off last Friday and it is _still_ running. By comparison, I have a Perl script that I have run in the past that simple traverses the entire filesystem tree and stat?s each file and outputs that to a log file. That script would ?only? run ~24 hours. Clearly mmfind as I invoked it is much slower than the corresponding Perl script, so what am I doing wrong? Thanks? Kevin ? Kevin Buterbaugh - Senior System Administrator Vanderbilt University - Advanced Computing Center for Research and Education Kevin.Buterbaugh at vanderbilt.edu- (615)875-9633 _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org https://urldefense.proofpoint.com/v2/url?u=http-3A__gpfsug.org_mailman_listinfo_gpfsug-2Ddiscuss&d=DwICAg&c=jf_iaSHvJObTbx-siA1ZOg&r=cvpnBBH0j41aQy0RPiG2xRL_M8mTc1izuQD3_PmtjZ8&m=48WYhVkWI1kr_BM-Wg_VaXEOi7xfGusnZcJtkiA98zg&s=IXUhEC_thuGAVwGJ02oazCCnKEuAdGeg890fBelP4kE&e= _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org https://na01.safelinks.protection.outlook.com/?url=http%3A%2F%2Fgpfsug.org%2Fmailman%2Flistinfo%2Fgpfsug-discuss&data=02%7C01%7CKevin.Buterbaugh%40vanderbilt.edu%7C724521c8034241913d8508d58412dcf8%7Cba5a7f39e3be4ab3b45067fa80faecad%7C0%7C0%7C636560138922366489&sdata=faXozQ%2FGGDf8nARmk52%2B2W5eIEBfnYwNapJyH%2FagqIQ%3D&reserved=0 -------------- next part -------------- An HTML attachment was scrubbed... URL: From olaf.weiser at de.ibm.com Wed Mar 7 16:48:40 2018 From: olaf.weiser at de.ibm.com (Olaf Weiser) Date: Wed, 7 Mar 2018 17:48:40 +0100 Subject: [gpfsug-discuss] 100G RoCEE and Spectrum Scale Performance In-Reply-To: References: Message-ID: An HTML attachment was scrubbed... URL: From makaplan at us.ibm.com Wed Mar 7 19:15:59 2018 From: makaplan at us.ibm.com (Marc A Kaplan) Date: Wed, 7 Mar 2018 14:15:59 -0500 Subject: [gpfsug-discuss] mmfind -ls and so forth In-Reply-To: References: Message-ID: As always when dealing with computers and potentially long running jobs, run a test on a handful of files first, so you can rapidly debug. Did you read the mmfind.README ? It mentions...that this sample utility "some user assembly required..." ... 
mmfindUtil_processOutputFile.c A utility to parse the "list file" produced by mmapplypolicy and to print it in a find-compatible format mmfind invokes it once mmapplypolicy begins to populate the "list file" mmfindUtil_processOutputFile.sampleMakefile copy to 'makefile', modify as needed, and run 'make' to compile mmfindUtil_processOutputFile.c This should produce a binary called mmfindUtil_processOutputFile mmfind will not be able to run until this utility has been compiled on the node from which you launch mmfind. Works for me... [root at n2 ilm]# ./mmfind /goo/zdbig -ls 2463649 256 drwxr-xr-x 2 root root 262144 Feb 9 11:41 /goo/zdbig 6804497 0 -rw-r--r-- 1 root root 0 Feb 9 11:41 /goo/zdbig/xy -------------- next part -------------- An HTML attachment was scrubbed... URL: From christof.schmitt at us.ibm.com Wed Mar 7 21:53:34 2018 From: christof.schmitt at us.ibm.com (Christof Schmitt) Date: Wed, 7 Mar 2018 21:53:34 +0000 Subject: [gpfsug-discuss] wondering about outage free protocols upgrades In-Reply-To: <766f8a8ce5f84164af13f2361ef3b3c3@exch1-cdc.nexus.csiro.au> References: <766f8a8ce5f84164af13f2361ef3b3c3@exch1-cdc.nexus.csiro.au>, <5912dbafbb044700ba723195a0e5e2f9@exch1-cdc.nexus.csiro.au> Message-ID: An HTML attachment was scrubbed... URL: From r.sobey at imperial.ac.uk Thu Mar 8 09:41:56 2018 From: r.sobey at imperial.ac.uk (Sobey, Richard A) Date: Thu, 8 Mar 2018 09:41:56 +0000 Subject: [gpfsug-discuss] wondering about outage free protocols upgrades In-Reply-To: References: <766f8a8ce5f84164af13f2361ef3b3c3@exch1-cdc.nexus.csiro.au>, <5912dbafbb044700ba723195a0e5e2f9@exch1-cdc.nexus.csiro.au> Message-ID: Whether or not you meant it your words ?that is not available today.? Implies that something is coming in the future? Would you be reliant on the Samba/CTDB development team or would you roll your own.. supposing it?s possible in the first place. Thanks Richard From: gpfsug-discuss-bounces at spectrumscale.org [mailto:gpfsug-discuss-bounces at spectrumscale.org] On Behalf Of Christof Schmitt Sent: 07 March 2018 21:54 To: gpfsug-discuss at spectrumscale.org Subject: Re: [gpfsug-discuss] wondering about outage free protocols upgrades The problem with the SMB upgrade is with the data shared between the protocol nodes. It is not tied to the protocol version used between SMB clients and the protocol nodes. Samba stores internal data (e.g. for the SMB state of open files) in tdb database files. ctdb then makes these tdb databases available across all protocol nodes. A concurrent upgrade for SMB would require correct handling of ctdb communications and the tdb records across multiple versions; that is not available today. Regards, Christof Schmitt || IBM || Spectrum Scale Development || Tucson, AZ christof.schmitt at us.ibm.com || +1-520-799-2469 (T/L: 321-2469) ----- Original message ----- From: > Sent by: gpfsug-discuss-bounces at spectrumscale.org To: > Cc: Subject: Re: [gpfsug-discuss] wondering about outage free protocols upgrades Date: Wed, Mar 7, 2018 4:35 AM In theory it only affects SMB, but in practice if NFS depends on winbind for authorisation then it is affected too. I can understand the need for changes to happen every so often and that maybe outages will be required then. But, I would like to see some effort to avoid doing this unnecessarily. IBM, please consider my suggestion. The message I get from the ctdb service implies it is the sticking point. Can some consideration be given to keeping the ctdb version compatible between releases? 
Christof, you are saying something about the SMB service version compatibility. I am unclear as to whether you are talking about the Spectrum Scale Protocols SMB service or the default samba SMB over the wire protocol version being used to communicate between client and server. If the latter, is it possible to peg the version to the older version manually while doing the upgrade so that all nodes can be updated? You can then take an outage at a later time to update the over the wire version. From: gpfsug-discuss-bounces at spectrumscale.org [mailto:gpfsug-discuss-bounces at spectrumscale.org] On Behalf Of Christof Schmitt Sent: Wednesday, 7 March 2018 4:50 AM To: gpfsug-discuss at spectrumscale.org Cc: gpfsug-discuss at spectrumscale.org Subject: Re: [gpfsug-discuss] wondering about outage free protocols upgrades Hi, at this point there are no plans to support "node by node" upgrade for SMB. Some background: The technical reason for this restriction is that the records shared between protocol nodes for the SMB service (ctdb and Samba) are not versioned and no mechanism is in place to handle different versions. Changing this would be a large development task that has not been included in any current plans. Note that this only affects the SMB service and that the knowledge center outlines a procedure to minimize the outage, by getting half of the protocol nodes ready with the new Samba version and then only taking a brief outage when switching from the "old" to the "new" Samba version: https://www.ibm.com/support/knowledgecenter/en/STXKQY_4.2.2/com.ibm.spectrum.scale.v4r22.doc/bl1ins_updatingsmb.htm The toolkit follows the same approach during an upgrade to minimize the outage. We know that this is not ideal, but as mentioned above this is limited by the large effort that would be required which has to be weighed against other requirements and priorities. Regards, Christof Schmitt || IBM || Spectrum Scale Development || Tucson, AZ christof.schmitt at us.ibm.com || +1-520-799-2469 (T/L: 321-2469) ----- Original message ----- From: > Sent by: gpfsug-discuss-bounces at spectrumscale.org To: > Cc: Subject: [gpfsug-discuss] wondering about outage free protocols upgrades Date: Tue, Mar 6, 2018 10:19 AM Hi All, It appears a rolling node by node upgrade of a protocols cluster is not possible. Ctdb is the sticking point as it won?t run with 2 different versions at the same time. Are there any plans to address this and make it a real Enterprise product? Cheers, Greg _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org https://urldefense.proofpoint.com/v2/url?u=http-3A__gpfsug.org_mailman_listinfo_gpfsug-2Ddiscuss&d=DwICAg&c=jf_iaSHvJObTbx-siA1ZOg&r=5Nn7eUPeYe291x8f39jKybESLKv_W_XtkTkS8fTR-NI&m=p5fg7X1tKGwi1BsYiw-wHTxmaG-PLihwHV0yTBQNaUs&s=3ZHS5vAoxeC6ikuOpTRLWNTpvgKEC3thI-qUgyU_hYo&e= _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org https://urldefense.proofpoint.com/v2/url?u=http-3A__gpfsug.org_mailman_listinfo_gpfsug-2Ddiscuss&d=DwICAg&c=jf_iaSHvJObTbx-siA1ZOg&r=5Nn7eUPeYe291x8f39jKybESLKv_W_XtkTkS8fTR-NI&m=7bQRypv0JL7swEYobmepaynWFvV8HYtBa2_S1kAyirk&s=2bUeeDk-8VbtwU8KtV4RlENEcbpOr_GQJQvR_8gt-ug&e= -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From john.hearns at asml.com Thu Mar 8 08:29:56 2018 From: john.hearns at asml.com (John Hearns) Date: Thu, 8 Mar 2018 08:29:56 +0000 Subject: [gpfsug-discuss] mmfind -ls and so forth In-Reply-To: References: Message-ID: On the subject of mmfind, I would like to find files which have the misc attributes relevant to AFM. For instance files which have the attribute 'v' The file is newly created, not yet copied to home I can write a policy to do this, and I have a relevant policy written. However I would like to do this using mmfind, which seems a nice general utility. This syntax does not work: mmfind /hpc -eaWithValue MISC_ATTRIBUTES===v Before anyone says it, I am mixing up MISC_ATTRIBUTES and extended attributes! My question really is - has anyone done this sort of serch using mmfind? Thankyou From: gpfsug-discuss-bounces at spectrumscale.org [mailto:gpfsug-discuss-bounces at spectrumscale.org] On Behalf Of Marc A Kaplan Sent: Wednesday, March 07, 2018 8:16 PM To: gpfsug main discussion list Subject: Re: [gpfsug-discuss] mmfind -ls and so forth As always when dealing with computers and potentially long running jobs, run a test on a handful of files first, so you can rapidly debug. Did you read the mmfind.README ? It mentions...that this sample utility "some user assembly required..." ... mmfindUtil_processOutputFile.c A utility to parse the "list file" produced by mmapplypolicy and to print it in a find-compatible format mmfind invokes it once mmapplypolicy begins to populate the "list file" mmfindUtil_processOutputFile.sampleMakefile copy to 'makefile', modify as needed, and run 'make' to compile mmfindUtil_processOutputFile.c This should produce a binary called mmfindUtil_processOutputFile mmfind will not be able to run until this utility has been compiled on the node from which you launch mmfind. Works for me... [root at n2 ilm]# ./mmfind /goo/zdbig -ls 2463649 256 drwxr-xr-x 2 root root 262144 Feb 9 11:41 /goo/zdbig 6804497 0 -rw-r--r-- 1 root root 0 Feb 9 11:41 /goo/zdbig/xy -- The information contained in this communication and any attachments is confidential and may be privileged, and is for the sole use of the intended recipient(s). Any unauthorized review, use, disclosure or distribution is prohibited. Unless explicitly stated otherwise in the body of this communication or the attachment thereto (if any), the information is provided on an AS-IS basis without any express or implied warranties or liabilities. To the extent you are relying on this information, you are doing so at your own risk. If you are not the intended recipient, please notify the sender immediately by replying to this message and destroy all copies of this message and any attachments. Neither the sender nor the company/group of companies he or she represents shall be liable for the proper and complete transmission of the information contained in this communication, or for any delay in its receipt. -------------- next part -------------- An HTML attachment was scrubbed... URL: From marc.caubet at psi.ch Thu Mar 8 11:10:24 2018 From: marc.caubet at psi.ch (Caubet Serrabou Marc (PSI)) Date: Thu, 8 Mar 2018 11:10:24 +0000 Subject: [gpfsug-discuss] GPFS autoload - wait for IB ports to become active Message-ID: <0081EB235765E14395278B9AE1DF341846510A@MBX214.d.ethz.ch> Hi all, with autoload = yes we do not ensure that GPFS will be started after the IB link becomes up. Is there a way to force GPFS waiting to start until IB ports are up? 
This can be probably done by adding something like After=network-online.target and Wants=network-online.target in the systemd file but I would like to know if this is natively possible from the GPFS configuration. Thanks a lot, Marc _________________________________________ Paul Scherrer Institut High Performance Computing Marc Caubet Serrabou WHGA/036 5232 Villigen PSI Switzerland Telephone: +41 56 310 46 67 E-Mail: marc.caubet at psi.ch -------------- next part -------------- An HTML attachment was scrubbed... URL: From david_johnson at brown.edu Thu Mar 8 12:33:41 2018 From: david_johnson at brown.edu (david_johnson at brown.edu) Date: Thu, 8 Mar 2018 07:33:41 -0500 Subject: [gpfsug-discuss] GPFS autoload - wait for IB ports to become active In-Reply-To: <0081EB235765E14395278B9AE1DF341846510A@MBX214.d.ethz.ch> References: <0081EB235765E14395278B9AE1DF341846510A@MBX214.d.ethz.ch> Message-ID: <4AD44D34-5275-4ADB-8CC7-8E80170DDA7F@brown.edu> Until IBM provides a solution, here is my workaround. Add it so it runs before the gpfs script, I call it from our custom xcat diskless boot scripts. Based on rhel7, not fully systemd integrated. YMMV! Regards, ? ddj ??- [ddj at storage041 ~]$ cat /etc/init.d/ibready #! /bin/bash # # chkconfig: 2345 06 94 # /etc/rc.d/init.d/ibready # written in 2016 David D Johnson (ddj brown.edu) # ### BEGIN INIT INFO # Provides: ibready # Required-Start: # Required-Stop: # Default-Stop: # Description: Block until infiniband is ready # Short-Description: Block until infiniband is ready ### END INIT INFO RETVAL=0 if [[ -d /sys/class/infiniband ]] then IBDEVICE=$(dirname $(grep -il infiniband /sys/class/infiniband/*/ports/1/link* | head -n 1)) fi # See how we were called. case "$1" in start) if [[ -n $IBDEVICE && -f $IBDEVICE/state ]] then echo -n "Polling for InfiniBand link up: " for (( count = 60; count > 0; count-- )) do if grep -q ACTIVE $IBDEVICE/state then echo ACTIVE break fi echo -n "." sleep 5 done if (( count <= 0 )) then echo DOWN - $0 timed out fi fi ;; stop|restart|reload|force-reload|condrestart|try-restart) ;; status) if [[ -n $IBDEVICE && -f $IBDEVICE/state ]] then echo "$IBDEVICE is $(< $IBDEVICE/state) $(< $IBDEVICE/rate)" else echo "No IBDEVICE found" fi ;; *) echo "Usage: ibready {start|stop|status|restart|reload|force-reload|condrestart|try-restart}" exit 2 esac exit ${RETVAL} ???? -- ddj Dave Johnson > On Mar 8, 2018, at 6:10 AM, Caubet Serrabou Marc (PSI) wrote: > > Hi all, > > with autoload = yes we do not ensure that GPFS will be started after the IB link becomes up. Is there a way to force GPFS waiting to start until IB ports are up? This can be probably done by adding something like After=network-online.target and Wants=network-online.target in the systemd file but I would like to know if this is natively possible from the GPFS configuration. > > Thanks a lot, > Marc > _________________________________________ > Paul Scherrer Institut > High Performance Computing > Marc Caubet Serrabou > WHGA/036 > 5232 Villigen PSI > Switzerland > > Telephone: +41 56 310 46 67 > E-Mail: marc.caubet at psi.ch > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss -------------- next part -------------- An HTML attachment was scrubbed... 
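For a systemd-based variant of the same idea, a drop-in for the GPFS unit can express the dependency Marc mentions and add a polling ExecStartPre. This is an untested sketch: it assumes the unit is named gpfs.service on your distribution, that the first port of the first IB device is the one that matters, and the 5-minute timeout is arbitrary.

# create a drop-in rather than editing the packaged unit file
mkdir -p /etc/systemd/system/gpfs.service.d
cat > /etc/systemd/system/gpfs.service.d/ibready.conf <<'EOF'
[Unit]
After=network-online.target
Wants=network-online.target
[Service]
# poll for an ACTIVE IB link for up to 5 minutes, then start anyway
ExecStartPre=/bin/bash -c 'for i in {1..60}; do grep -q ACTIVE /sys/class/infiniband/*/ports/1/state && exit 0; sleep 5; done; exit 0'
EOF
systemctl daemon-reload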
URL: From stockf at us.ibm.com Thu Mar 8 12:42:47 2018 From: stockf at us.ibm.com (Frederick Stock) Date: Thu, 8 Mar 2018 07:42:47 -0500 Subject: [gpfsug-discuss] GPFS autoload - wait for IB ports to becomeactive In-Reply-To: <4AD44D34-5275-4ADB-8CC7-8E80170DDA7F@brown.edu> References: <0081EB235765E14395278B9AE1DF341846510A@MBX214.d.ethz.ch> <4AD44D34-5275-4ADB-8CC7-8E80170DDA7F@brown.edu> Message-ID: You could also use the GPFS prestartup callback (mmaddcallback) to execute a script synchronously that waits for the IB ports to become available before returning and allowing GPFS to continue. Not systemd integrated but it should work. Fred __________________________________________________ Fred Stock | IBM Pittsburgh Lab | 720-430-8821 stockf at us.ibm.com From: david_johnson at brown.edu To: gpfsug main discussion list Date: 03/08/2018 07:34 AM Subject: Re: [gpfsug-discuss] GPFS autoload - wait for IB ports to become active Sent by: gpfsug-discuss-bounces at spectrumscale.org Until IBM provides a solution, here is my workaround. Add it so it runs before the gpfs script, I call it from our custom xcat diskless boot scripts. Based on rhel7, not fully systemd integrated. YMMV! Regards, ? ddj ??- [ddj at storage041 ~]$ cat /etc/init.d/ibready #! /bin/bash # # chkconfig: 2345 06 94 # /etc/rc.d/init.d/ibready # written in 2016 David D Johnson (ddj brown.edu) # ### BEGIN INIT INFO # Provides: ibready # Required-Start: # Required-Stop: # Default-Stop: # Description: Block until infiniband is ready # Short-Description: Block until infiniband is ready ### END INIT INFO RETVAL=0 if [[ -d /sys/class/infiniband ]] then IBDEVICE=$(dirname $(grep -il infiniband /sys/class/infiniband/*/ports/1/link* | head -n 1)) fi # See how we were called. case "$1" in start) if [[ -n $IBDEVICE && -f $IBDEVICE/state ]] then echo -n "Polling for InfiniBand link up: " for (( count = 60; count > 0; count-- )) do if grep -q ACTIVE $IBDEVICE/state then echo ACTIVE break fi echo -n "." sleep 5 done if (( count <= 0 )) then echo DOWN - $0 timed out fi fi ;; stop|restart|reload|force-reload|condrestart|try-restart) ;; status) if [[ -n $IBDEVICE && -f $IBDEVICE/state ]] then echo "$IBDEVICE is $(< $IBDEVICE/state) $(< $IBDEVICE/rate)" else echo "No IBDEVICE found" fi ;; *) echo "Usage: ibready {start|stop|status|restart|reload|force-reload|condrestart|try-restart}" exit 2 esac exit ${RETVAL} ???? -- ddj Dave Johnson On Mar 8, 2018, at 6:10 AM, Caubet Serrabou Marc (PSI) wrote: Hi all, with autoload = yes we do not ensure that GPFS will be started after the IB link becomes up. Is there a way to force GPFS waiting to start until IB ports are up? This can be probably done by adding something like After=network-online.target and Wants=network-online.target in the systemd file but I would like to know if this is natively possible from the GPFS configuration. 
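A hedged sketch of what Fred describes might look like the following; the callback name, script path and polling loop are placeholders, and the available event names (preStartup is assumed here) and options should be checked against the mmaddcallback documentation for your release. The script has to exist on every node that starts GPFS.

cat > /usr/local/sbin/wait-for-ib.sh <<'EOF'
#!/bin/bash
# block until at least one IB port reports ACTIVE, but give up after ~5 minutes
for i in {1..60}; do
    grep -q ACTIVE /sys/class/infiniband/*/ports/1/state && exit 0
    sleep 5
done
exit 0
EOF
chmod +x /usr/local/sbin/wait-for-ib.sh
# register it as a synchronous callback so GPFS waits for it to return
mmaddcallback ibWait --command /usr/local/sbin/wait-for-ib.sh --event preStartup --sync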
Thanks a lot, Marc _________________________________________ Paul Scherrer Institut High Performance Computing Marc Caubet Serrabou WHGA/036 5232 Villigen PSI Switzerland Telephone: +41 56 310 46 67 E-Mail: marc.caubet at psi.ch _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org https://urldefense.proofpoint.com/v2/url?u=http-3A__gpfsug.org_mailman_listinfo_gpfsug-2Ddiscuss&d=DwICAg&c=jf_iaSHvJObTbx-siA1ZOg&r=p_1XEUyoJ7-VJxF_w8h9gJh8_Wj0Pey73LCLLoxodpw&m=u-EMob09-dkE6jZbD3dTjBi3vWhmDXtxiOK3nqFyIgY&s=JCfJgq6pZnKUI6d-rIgJXVcdZh7vmA5ypB1_goP_FFA&e= -------------- next part -------------- An HTML attachment was scrubbed... URL: From makaplan at us.ibm.com Thu Mar 8 13:59:27 2018 From: makaplan at us.ibm.com (Marc A Kaplan) Date: Thu, 8 Mar 2018 08:59:27 -0500 Subject: [gpfsug-discuss] mmfind -ls and so forth In-Reply-To: References: Message-ID: (John Hearns, et. al.) Some minor script hacking would be the easiest way add test(s) for other MISC_ATTRIBUTES Notice mmfind concentrates on providing the most popular classic(POSIX) and Linux predicates, BUT also adds a few gpfs specific predicates (mmfind --help show you these) -ea -eaWithValue -gpfsImmut -gpfsAppOnly Look at the implementation of -gpfsImmut in tr_findToPol.pl ... sub tr_gpfsImmut{ return "( /* -gpfsImmut */ MISC_ATTRIBUTES LIKE '%X%')"; } So easy to extend this for any or all the others.... True it's perl, but you don't have to be a perl expert to cut-paste-hack another predicate into the script. Let us know how you make out with this... Perhaps we shall add a general predicate -gpfsMiscAttrLike '...' to the next version... -- Marc K of GPFS From: John Hearns To: gpfsug main discussion list Date: 03/08/2018 04:59 AM Subject: Re: [gpfsug-discuss] mmfind -ls and so forth Sent by: gpfsug-discuss-bounces at spectrumscale.org On the subject of mmfind, I would like to find files which have the misc attributes relevant to AFM. For instance files which have the attribute ?v? The file is newly created, not yet copied to home I can write a policy to do this, and I have a relevant policy written. However I would like to do this using mmfind, which seems a nice general utility. This syntax does not work: mmfind /hpc -eaWithValue MISC_ATTRIBUTES===v Before anyone says it, I am mixing up MISC_ATTRIBUTES and extended attributes! My question really is ? has anyone done this sort of serch using mmfind? Thankyou From: gpfsug-discuss-bounces at spectrumscale.org [ mailto:gpfsug-discuss-bounces at spectrumscale.org] On Behalf Of Marc A Kaplan Sent: Wednesday, March 07, 2018 8:16 PM To: gpfsug main discussion list Subject: Re: [gpfsug-discuss] mmfind -ls and so forth As always when dealing with computers and potentially long running jobs, run a test on a handful of files first, so you can rapidly debug. Did you read the mmfind.README ? It mentions...that this sample utility "some user assembly required..." ... 
mmfindUtil_processOutputFile.c A utility to parse the "list file" produced by mmapplypolicy and to print it in a find-compatible format mmfind invokes it once mmapplypolicy begins to populate the "list file" mmfindUtil_processOutputFile.sampleMakefile copy to 'makefile', modify as needed, and run 'make' to compile mmfindUtil_processOutputFile.c This should produce a binary called mmfindUtil_processOutputFile mmfind will not be able to run until this utility has been compiled on the node from which you launch mmfind. Works for me... [root at n2 ilm]# ./mmfind /goo/zdbig -ls 2463649 256 drwxr-xr-x 2 root root 262144 Feb 9 11:41 /goo/zdbig 6804497 0 -rw-r--r-- 1 root root 0 Feb 9 11:41 /goo/zdbig/xy -- The information contained in this communication and any attachments is confidential and may be privileged, and is for the sole use of the intended recipient(s). Any unauthorized review, use, disclosure or distribution is prohibited. Unless explicitly stated otherwise in the body of this communication or the attachment thereto (if any), the information is provided on an AS-IS basis without any express or implied warranties or liabilities. To the extent you are relying on this information, you are doing so at your own risk. If you are not the intended recipient, please notify the sender immediately by replying to this message and destroy all copies of this message and any attachments. Neither the sender nor the company/group of companies he or she represents shall be liable for the proper and complete transmission of the information contained in this communication, or for any delay in its receipt. _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org https://urldefense.proofpoint.com/v2/url?u=http-3A__gpfsug.org_mailman_listinfo_gpfsug-2Ddiscuss&d=DwICAg&c=jf_iaSHvJObTbx-siA1ZOg&r=cvpnBBH0j41aQy0RPiG2xRL_M8mTc1izuQD3_PmtjZ8&m=LDC-t-w-jkuH2fJZ1lME_JUjzABDz3y90ptTlYWM3rc&s=xrFd1LD5dWq9GogfeOGs9ZCtqoptErjmGfJzD3eXhz4&e= -------------- next part -------------- An HTML attachment was scrubbed... URL: From Kevin.Buterbaugh at Vanderbilt.Edu Thu Mar 8 15:16:10 2018 From: Kevin.Buterbaugh at Vanderbilt.Edu (Buterbaugh, Kevin L) Date: Thu, 8 Mar 2018 15:16:10 +0000 Subject: [gpfsug-discuss] mmfind -ls and so forth In-Reply-To: References: Message-ID: <8D4EED0B-A9F8-46FB-8BA2-359A3CF1C630@vanderbilt.edu> Hi Marc, I test in production ? just kidding. But - not kidding - I did read the entire mmfind.README, compiled the binary as described therein, and read the output of ?mmfind -h?. But what I forgot was that when you run a bash shell script with ?bash -x? it doesn?t show you the redirection you did to a file ? and since the mmfind ran for ~5 days, including over a weekend, and including Monday which I took off from work to have our 16 1/2 year old Siberian Husky put to sleep, I simply forgot that in the script itself I had redirected the output to a file. Stupid of me, I know, but unlike Delusional Donald, I?ll admit my stupid mistakes. Thanks, and sorry. I will try the mmfind as you suggested in your previous response the next time I need to run one to see if that significantly improves the performance? Kevin On Mar 7, 2018, at 1:15 PM, Marc A Kaplan > wrote: As always when dealing with computers and potentially long running jobs, run a test on a handful of files first, so you can rapidly debug. Did you read the mmfind.README ? It mentions...that this sample utility "some user assembly required..." ... 
mmfindUtil_processOutputFile.c A utility to parse the "list file" produced by mmapplypolicy and to print it in a find-compatible format mmfind invokes it once mmapplypolicy begins to populate the "list file" mmfindUtil_processOutputFile.sampleMakefile copy to 'makefile', modify as needed, and run 'make' to compile mmfindUtil_processOutputFile.c This should produce a binary called mmfindUtil_processOutputFile mmfind will not be able to run until this utility has been compiled on the node from which you launch mmfind. Works for me... [root at n2 ilm]# ./mmfind /goo/zdbig -ls 2463649 256 drwxr-xr-x 2 root root 262144 Feb 9 11:41 /goo/zdbig 6804497 0 -rw-r--r-- 1 root root 0 Feb 9 11:41 /goo/zdbig/xy _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org https://na01.safelinks.protection.outlook.com/?url=http%3A%2F%2Fgpfsug.org%2Fmailman%2Flistinfo%2Fgpfsug-discuss&data=02%7C01%7CKevin.Buterbaugh%40vanderbilt.edu%7C7c170869f3294124be3608d5845fdecc%7Cba5a7f39e3be4ab3b45067fa80faecad%7C0%7C0%7C636560469687764985&sdata=yNvpm34DY0AtEm2Y4OIMll5IW1v5kP3X3vHx3sQ%2B8Rs%3D&reserved=0 -------------- next part -------------- An HTML attachment was scrubbed... URL: From oluwasijibomi.saula at ndsu.edu Thu Mar 8 15:06:03 2018 From: oluwasijibomi.saula at ndsu.edu (Saula, Oluwasijibomi) Date: Thu, 8 Mar 2018 15:06:03 +0000 Subject: [gpfsug-discuss] Thoughts on GPFS on IB & MTU sizes Message-ID: Hi Folks, As this is my first post to the group, let me start by saying I applaud the commentary from the user group as it has been a resource to those of us watching from the sidelines. That said, we have a GPFS layered on IPoIB, and recently, we started having some issues on our IB FDR fabric which manifested when GPFS began sending persistent expel messages to particular nodes. Shortly after, we embarked on a tuning exercise using IBM tuning recommendations but this page is quite old and we've run into some snags, specifically with setting 4k MTUs using mlx4_core/mlx4_en module options. While setting 4k MTUs as the guide recommends is our general inclination, I'd like to solicit some advice as to whether 4k MTUs are a good idea and any hitch-free steps to accomplishing this. I'm getting some conflicting remarks from Mellanox support asking why we'd want to use 4k MTUs with Unreliable Datagram mode. Also, any pointers to best practices or resources for network configurations for heavy I/O clusters would be much appreciated. Thanks, Siji Saula HPC System Administrator Center for Computationally Assisted Science & Technology NORTH DAKOTA STATE UNIVERSITY Research 2 Building ? Room 220B Dept 4100, PO Box 6050 / Fargo, ND 58108-6050 p:701.231.7749 www.ccast.ndsu.edu | www.ndsu.edu -------------- next part -------------- An HTML attachment was scrubbed... URL: From christof.schmitt at us.ibm.com Thu Mar 8 17:37:12 2018 From: christof.schmitt at us.ibm.com (Christof Schmitt) Date: Thu, 8 Mar 2018 17:37:12 +0000 Subject: [gpfsug-discuss] wondering about outage free protocols upgrades In-Reply-To: Message-ID: An HTML attachment was scrubbed... URL: From Wei1.Guo at UTSouthwestern.edu Thu Mar 8 21:50:11 2018 From: Wei1.Guo at UTSouthwestern.edu (Wei Guo) Date: Thu, 8 Mar 2018 21:50:11 +0000 Subject: [gpfsug-discuss] Thoughts on GPFS on IB & MTU sizes In-Reply-To: References: Message-ID: <1520545811808.33125@UTSouthwestern.edu> Hi, Saula, Can the expelled node and expelling node ping each other? 
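A few quick checks along those lines, run from one of the nodes doing the expelling; the interface name ib0 and the victim addresses are placeholders for whatever carries your daemon traffic:

ping -c 3 <victim-daemon-ip>                  # basic reachability over the daemon network
ping -c 3 -M do -s 4068 <victim-daemon-ip>    # non-fragmenting ping just under a 4096 MTU; adjust to your IPoIB MTU
ip addr show ib0 | grep -w inet               # compare prefix length (/20 vs /24 and so on) on both ends
mmlscluster | grep -i <victim-hostname>       # confirm which address GPFS is actually using for that node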
We expanded our gpfs IB network from /24 to /20 but some clients still used /24, they cannot talk to the added new clients using /20 and expelled the new clients persistently. Changing the netmask all to /20 works out. FYI.

Wei Guo
HPC Administrator
UT Southwestern Medical Center
wei1.guo at utsouthwestern.edu
URL: ------------------------------ _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss End of gpfsug-discuss Digest, Vol 74, Issue 17 ********************************************** ________________________________ UT Southwestern Medical Center The future of medicine, today. From Greg.Lehmann at csiro.au Fri Mar 9 00:23:10 2018 From: Greg.Lehmann at csiro.au (Greg.Lehmann at csiro.au) Date: Fri, 9 Mar 2018 00:23:10 +0000 Subject: [gpfsug-discuss] wondering about outage free protocols upgrades In-Reply-To: References: <766f8a8ce5f84164af13f2361ef3b3c3@exch1-cdc.nexus.csiro.au>, <5912dbafbb044700ba723195a0e5e2f9@exch1-cdc.nexus.csiro.au> Message-ID: <2b7547fd8aec467a958d8e10e88bd1e6@exch1-cdc.nexus.csiro.au> That last little bit ?not available today? gives me hope. It would be nice to get there ?one day.? Our situation is we are using NFS for access to images that VMs run from. An outage means shutting down a lot of guests. An NFS outage of even short duration would result in the system disks of VMs going read only due to IO timeouts. Cheers, Greg From: gpfsug-discuss-bounces at spectrumscale.org [mailto:gpfsug-discuss-bounces at spectrumscale.org] On Behalf Of Christof Schmitt Sent: Thursday, 8 March 2018 7:54 AM To: gpfsug-discuss at spectrumscale.org Subject: Re: [gpfsug-discuss] wondering about outage free protocols upgrades The problem with the SMB upgrade is with the data shared between the protocol nodes. It is not tied to the protocol version used between SMB clients and the protocol nodes. Samba stores internal data (e.g. for the SMB state of open files) in tdb database files. ctdb then makes these tdb databases available across all protocol nodes. A concurrent upgrade for SMB would require correct handling of ctdb communications and the tdb records across multiple versions; that is not available today. Regards, Christof Schmitt || IBM || Spectrum Scale Development || Tucson, AZ christof.schmitt at us.ibm.com || +1-520-799-2469 (T/L: 321-2469) ----- Original message ----- From: > Sent by: gpfsug-discuss-bounces at spectrumscale.org To: > Cc: Subject: Re: [gpfsug-discuss] wondering about outage free protocols upgrades Date: Wed, Mar 7, 2018 4:35 AM In theory it only affects SMB, but in practice if NFS depends on winbind for authorisation then it is affected too. I can understand the need for changes to happen every so often and that maybe outages will be required then. But, I would like to see some effort to avoid doing this unnecessarily. IBM, please consider my suggestion. The message I get from the ctdb service implies it is the sticking point. Can some consideration be given to keeping the ctdb version compatible between releases? Christof, you are saying something about the SMB service version compatibility. I am unclear as to whether you are talking about the Spectrum Scale Protocols SMB service or the default samba SMB over the wire protocol version being used to communicate between client and server. If the latter, is it possible to peg the version to the older version manually while doing the upgrade so that all nodes can be updated? You can then take an outage at a later time to update the over the wire version. 
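For completeness: the client-facing dialect can in principle be capped with the standard Samba option "server max protocol", which on CES would be driven through mmsmb. The exact syntax below is from memory and should be verified against the mmsmb documentation before use; more importantly, as Christof notes, pinning the wire dialect does not address the internal ctdb/tdb record compatibility, so it does not by itself make a node-by-node SMB upgrade safe.

# inspect the current setting, then (syntax to be verified) cap the dialect
mmsmb config list | grep -i 'max protocol'
mmsmb config change --option='server max protocol=SMB3_00'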
From: gpfsug-discuss-bounces at spectrumscale.org [mailto:gpfsug-discuss-bounces at spectrumscale.org] On Behalf Of Christof Schmitt Sent: Wednesday, 7 March 2018 4:50 AM To: gpfsug-discuss at spectrumscale.org Cc: gpfsug-discuss at spectrumscale.org Subject: Re: [gpfsug-discuss] wondering about outage free protocols upgrades Hi, at this point there are no plans to support "node by node" upgrade for SMB. Some background: The technical reason for this restriction is that the records shared between protocol nodes for the SMB service (ctdb and Samba) are not versioned and no mechanism is in place to handle different versions. Changing this would be a large development task that has not been included in any current plans. Note that this only affects the SMB service and that the knowledge center outlines a procedure to minimize the outage, by getting half of the protocol nodes ready with the new Samba version and then only taking a brief outage when switching from the "old" to the "new" Samba version: https://www.ibm.com/support/knowledgecenter/en/STXKQY_4.2.2/com.ibm.spectrum.scale.v4r22.doc/bl1ins_updatingsmb.htm The toolkit follows the same approach during an upgrade to minimize the outage. We know that this is not ideal, but as mentioned above this is limited by the large effort that would be required which has to be weighed against other requirements and priorities. Regards, Christof Schmitt || IBM || Spectrum Scale Development || Tucson, AZ christof.schmitt at us.ibm.com || +1-520-799-2469 (T/L: 321-2469) ----- Original message ----- From: > Sent by: gpfsug-discuss-bounces at spectrumscale.org To: > Cc: Subject: [gpfsug-discuss] wondering about outage free protocols upgrades Date: Tue, Mar 6, 2018 10:19 AM Hi All, It appears a rolling node by node upgrade of a protocols cluster is not possible. Ctdb is the sticking point as it won?t run with 2 different versions at the same time. Are there any plans to address this and make it a real Enterprise product? Cheers, Greg _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org https://urldefense.proofpoint.com/v2/url?u=http-3A__gpfsug.org_mailman_listinfo_gpfsug-2Ddiscuss&d=DwICAg&c=jf_iaSHvJObTbx-siA1ZOg&r=5Nn7eUPeYe291x8f39jKybESLKv_W_XtkTkS8fTR-NI&m=p5fg7X1tKGwi1BsYiw-wHTxmaG-PLihwHV0yTBQNaUs&s=3ZHS5vAoxeC6ikuOpTRLWNTpvgKEC3thI-qUgyU_hYo&e= _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org https://urldefense.proofpoint.com/v2/url?u=http-3A__gpfsug.org_mailman_listinfo_gpfsug-2Ddiscuss&d=DwICAg&c=jf_iaSHvJObTbx-siA1ZOg&r=5Nn7eUPeYe291x8f39jKybESLKv_W_XtkTkS8fTR-NI&m=7bQRypv0JL7swEYobmepaynWFvV8HYtBa2_S1kAyirk&s=2bUeeDk-8VbtwU8KtV4RlENEcbpOr_GQJQvR_8gt-ug&e= -------------- next part -------------- An HTML attachment was scrubbed... URL: From willi.engeli at id.ethz.ch Fri Mar 9 12:21:27 2018 From: willi.engeli at id.ethz.ch (Engeli Willi (ID SD)) Date: Fri, 9 Mar 2018 12:21:27 +0000 Subject: [gpfsug-discuss] asking for your vote for an RFE to support NFS V4.1 Message-ID: Hello Group, I?ve just created a request for enhancement (RFE) to have ganesha supporting NFS V4.1. It is important, to have this new Protocol version supported, since our Linux clients default support is more then 80% based in this version by default and Linux distributions are actively pushing this Protocol. The protocol also brings important corrections and enhancements with it. I would like to ask you all very kindly to vote for this RFE please. 
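If the immediate concern is clients silently negotiating 4.1 against a server that does not offer it, they can be pinned to 4.0 explicitly until the RFE lands; server and export names below are placeholders:

# force an NFSv4.0 mount instead of letting the client negotiate 4.1/4.2
mount -t nfs -o vers=4.0,hard ces.example.org:/export/data /mnt/data
# confirm what an existing mount actually negotiated
nfsstat -m | grep -A1 /mnt/data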
You find it here: https://www.ibm.com/developerworks/rfe/execute Headline:NFS V4.1 Support ID:117398 Freundliche Grüsse Willi Engeli ETH Zuerich ID Speicherdienste Weinbergstrasse 11 WEC C 18 8092 Zuerich Tel: +41 44 632 02 69 From jonathan.buzzard at strath.ac.uk Fri Mar 9 12:37:22 2018 From: jonathan.buzzard at strath.ac.uk (Jonathan Buzzard) Date: Fri, 09 Mar 2018 12:37:22 +0000 Subject: [gpfsug-discuss] wondering about outage free protocols upgrades In-Reply-To: References: <766f8a8ce5f84164af13f2361ef3b3c3@exch1-cdc.nexus.csiro.au> , <5912dbafbb044700ba723195a0e5e2f9@exch1-cdc.nexus.csiro.au> Message-ID: <1520599042.1554.1.camel@strath.ac.uk> On Thu, 2018-03-08 at 09:41 +0000, Sobey, Richard A wrote: > Whether or not you meant it your words "that is not available today." > Implies that something is coming in the future? Would you be reliant > on the Samba/CTDB development team or would you roll your own.. > supposing it's possible in the first place. Back in the day when one had to roll your own Samba for this stuff, rolling Samba upgrades worked. What changed, or was it never supported? JAB. -- Jonathan A. Buzzard Tel: +44141-5483420 HPC System Administrator, ARCHIE-WeSt. University of Strathclyde, John Anderson Building, Glasgow. G4 0NG From stijn.deweirdt at ugent.be Fri Mar 9 12:42:50 2018 From: stijn.deweirdt at ugent.be (Stijn De Weirdt) Date: Fri, 9 Mar 2018 13:42:50 +0100 Subject: [gpfsug-discuss] asking for your vote for an RFE to support NFS V4.1 In-Reply-To: References: Message-ID: <8e5684b5-8981-0710-c5ca-693f6ad56890@ugent.be> hi all, i would second the request to upvote this. the fact that 4.1 support was dropped in a subsubminor update (4.2.3.5 to 4.2.3.6 afaik) was already pretty bad to discover, but at the very least there should be an option to re-enable it. i'm also interested why this was removed (or actively prevented from being enabled). i can understand that e.g. pNFS is not supported, but basic protocol features wrt HA are a must have. only with 4.1 are we able to do ces+ganesha failover without IO errors, something that should be a basic feature nowadays. stijn On 03/09/2018 01:21 PM, Engeli Willi (ID SD) wrote: > Hello Group, > > I've just created a request for enhancement (RFE) to have ganesha supporting > NFS V4.1. > > It is important, to have this new Protocol version supported, since our > Linux clients default support is more then 80% based in this version by > default and Linux distributions are actively pushing this Protocol. > > The protocol also brings important corrections and enhancements with it. > > > > I would like to ask you all very kindly to vote for this RFE please.
> > You find it here: https://www.ibm.com/developerworks/rfe/execute > > Headline:NFS V4.1 Support > > ID:117398 > > > > > > Freundliche Gr?sse > > > > Willi Engeli > > ETH Zuerich > > ID Speicherdienste > > Weinbergstrasse 11 > > WEC C 18 > > 8092 Zuerich > > > > Tel: +41 44 632 02 69 > > > > > > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > From Marcelo.Garcia at EMEA.NEC.COM Fri Mar 9 12:51:22 2018 From: Marcelo.Garcia at EMEA.NEC.COM (Marcelo Garcia) Date: Fri, 9 Mar 2018 12:51:22 +0000 Subject: [gpfsug-discuss] asking for your vote for an RFE to support NFS V4.1 In-Reply-To: References: Message-ID: Hi I got the following error when trying the URL below: {e: 'Exception usecase string is null'} Regards mg. From: gpfsug-discuss-bounces at spectrumscale.org [mailto:gpfsug-discuss-bounces at spectrumscale.org] On Behalf Of Engeli Willi (ID SD) Sent: Freitag, 9. M?rz 2018 13:21 To: gpfsug-discuss at spectrumscale.org Subject: [gpfsug-discuss] asking for your vote for an RFE to support NFS V4.1 Hello Group, I've just created a request for enhancement (RFE) to have ganesha supporting NFS V4.1. It is important, to have this new Protocol version supported, since our Linux clients default support is more then 80% based in this version by default and Linux distributions are actively pushing this Protocol. The protocol also brings important corrections and enhancements with it. I would like to ask you all very kindly to vote for this RFE please. You find it here: https://www.ibm.com/developerworks/rfe/execute Headline:NFS V4.1 Support ID:117398 Freundliche Gr?sse Willi Engeli ETH Zuerich ID Speicherdienste Weinbergstrasse 11 WEC C 18 8092 Zuerich Tel: +41 44 632 02 69 -------------- next part -------------- An HTML attachment was scrubbed... URL: From stijn.deweirdt at ugent.be Fri Mar 9 14:09:59 2018 From: stijn.deweirdt at ugent.be (Stijn De Weirdt) Date: Fri, 9 Mar 2018 15:09:59 +0100 Subject: [gpfsug-discuss] asking for your vote for an RFE to support NFS V4.1 In-Reply-To: References: Message-ID: <9091b9ef-b734-bfe0-fa19-57559b5866cf@ugent.be> hi marcelo, can you try https://www.ibm.com/developerworks/rfe/execute?use_case=viewRfe&CR_ID=117398 stijn On 03/09/2018 01:51 PM, Marcelo Garcia wrote: > Hi > > I got the following error when trying the URL below: > {e: 'Exception usecase string is null'} > > Regards > > mg. > > > From: gpfsug-discuss-bounces at spectrumscale.org [mailto:gpfsug-discuss-bounces at spectrumscale.org] On Behalf Of Engeli Willi (ID SD) > Sent: Freitag, 9. M?rz 2018 13:21 > To: gpfsug-discuss at spectrumscale.org > Subject: [gpfsug-discuss] asking for your vote for an RFE to support NFS V4.1 > > Hello Group, > I've just created a request for enhancement (RFE) to have ganesha supporting NFS V4.1. > It is important, to have this new Protocol version supported, since our Linux clients default support is more then 80% based in this version by default and Linux distributions are actively pushing this Protocol. > The protocol also brings important corrections and enhancements with it. > > I would like to ask you all very kindly to vote for this RFE please. 
> You find it here: https://www.ibm.com/developerworks/rfe/execute > > Headline:NFS V4.1 Support > > ID:117398 > > > Freundliche Gr?sse > > Willi Engeli > ETH Zuerich > ID Speicherdienste > Weinbergstrasse 11 > WEC C 18 > 8092 Zuerich > > Tel: +41 44 632 02 69 > > > > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > From Marcelo.Garcia at EMEA.NEC.COM Fri Mar 9 14:11:35 2018 From: Marcelo.Garcia at EMEA.NEC.COM (Marcelo Garcia) Date: Fri, 9 Mar 2018 14:11:35 +0000 Subject: [gpfsug-discuss] asking for your vote for an RFE to support NFS V4.1 In-Reply-To: <9091b9ef-b734-bfe0-fa19-57559b5866cf@ugent.be> References: <9091b9ef-b734-bfe0-fa19-57559b5866cf@ugent.be> Message-ID: Hi stijn Now it's working. Cheers m. -----Original Message----- From: gpfsug-discuss-bounces at spectrumscale.org [mailto:gpfsug-discuss-bounces at spectrumscale.org] On Behalf Of Stijn De Weirdt Sent: Freitag, 9. M?rz 2018 15:10 To: gpfsug-discuss at spectrumscale.org Subject: Re: [gpfsug-discuss] asking for your vote for an RFE to support NFS V4.1 hi marcelo, can you try https://www.ibm.com/developerworks/rfe/execute?use_case=viewRfe&CR_ID=117398 stijn On 03/09/2018 01:51 PM, Marcelo Garcia wrote: > Hi > > I got the following error when trying the URL below: > {e: 'Exception usecase string is null'} > > Regards > > mg. > > > From: gpfsug-discuss-bounces at spectrumscale.org [mailto:gpfsug-discuss-bounces at spectrumscale.org] On Behalf Of Engeli Willi (ID SD) > Sent: Freitag, 9. M?rz 2018 13:21 > To: gpfsug-discuss at spectrumscale.org > Subject: [gpfsug-discuss] asking for your vote for an RFE to support NFS V4.1 > > Hello Group, > I've just created a request for enhancement (RFE) to have ganesha supporting NFS V4.1. > It is important, to have this new Protocol version supported, since our Linux clients default support is more then 80% based in this version by default and Linux distributions are actively pushing this Protocol. > The protocol also brings important corrections and enhancements with it. > > I would like to ask you all very kindly to vote for this RFE please. > You find it here: https://www.ibm.com/developerworks/rfe/execute > > Headline:NFS V4.1 Support > > ID:117398 > > > Freundliche Gr?sse > > Willi Engeli > ETH Zuerich > ID Speicherdienste > Weinbergstrasse 11 > WEC C 18 > 8092 Zuerich > > Tel: +41 44 632 02 69 > > > > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss Click https://www.mailcontrol.com/sr/NavEVlEkpX3GX2PQPOmvUqrlA1!9RTN2ec8I4RU35plgh6Q4vQM4vfVPrCpIvwaSEkP!v72X8H9IWrzEXY2ZCw== to report this email as spam. From ewahl at osc.edu Fri Mar 9 14:19:10 2018 From: ewahl at osc.edu (Edward Wahl) Date: Fri, 9 Mar 2018 09:19:10 -0500 Subject: [gpfsug-discuss] Thoughts on GPFS on IB & MTU sizes In-Reply-To: References: Message-ID: <20180309091910.0334604a@osc.edu> Welcome to the list. If Hideaki Kikuchi is still around CCAST, say "Oh hisashiburi, des ne?" for me. Though I recall he may have left. A couple of questions as I, unfortunately, have a good deal of expel experience. -Are you set up to use verbs or only IPoIB? "mmlsconfig verbsRdma" -Are you using the IB as the administrative IP network? 
-As Wei asked, can nodes sending the expel requests ping the victim over whatever interface is being used administratively? Other interfaces do NOT matter for expels. Nodes that cannot even mount the file systems can still request expels. Many many things can cause issues here from routing and firewalls to bad switch software which will not update ARP tables, and you get nodes trying to expel each other. -are your NSDs logging the expels in /tmp/mmfs? You can mmchconfig expelDataCollectionDailyLimit if you need more captures to narrow down what is happening outside the mmfs.log.latest. Just be wary of the disk space if you have "expel storms". -That tuning page is very out of date and appears to be mostly focused on GPFS 3.5.x tuning. While there is also a Spectrum Scale wiki, it's Linux tuning page does not appear to be kernel and network focused and is dated even older. Ed On Thu, 8 Mar 2018 15:06:03 +0000 "Saula, Oluwasijibomi" wrote: > Hi Folks, > > > As this is my first post to the group, let me start by saying I applaud the > commentary from the user group as it has been a resource to those of us > watching from the sidelines. > > > That said, we have a GPFS layered on IPoIB, and recently, we started having > some issues on our IB FDR fabric which manifested when GPFS began sending > persistent expel messages to particular nodes. > > > Shortly after, we embarked on a tuning exercise using IBM tuning > recommendations > but this page is quite old and we've run into some snags, specifically with > setting 4k MTUs using mlx4_core/mlx4_en module options. > > > While setting 4k MTUs as the guide recommends is our general inclination, I'd > like to solicit some advice as to whether 4k MTUs are a good idea and any > hitch-free steps to accomplishing this. I'm getting some conflicting remarks > from Mellanox support asking why we'd want to use 4k MTUs with Unreliable > Datagram mode. > > > Also, any pointers to best practices or resources for network configurations > for heavy I/O clusters would be much appreciated. > > > Thanks, > > Siji Saula > HPC System Administrator > Center for Computationally Assisted Science & Technology > NORTH DAKOTA STATE UNIVERSITY > > > Research 2 > Building > ? Room 220B Dept 4100, PO Box 6050 / Fargo, ND 58108-6050 p:701.231.7749 > www.ccast.ndsu.edu | > www.ndsu.edu > -- Ed Wahl Ohio Supercomputer Center 614-292-9302 From christof.schmitt at us.ibm.com Fri Mar 9 16:16:41 2018 From: christof.schmitt at us.ibm.com (Christof Schmitt) Date: Fri, 9 Mar 2018 16:16:41 +0000 Subject: [gpfsug-discuss] wondering about outage free protocols upgrades In-Reply-To: <1520599042.1554.1.camel@strath.ac.uk> References: <1520599042.1554.1.camel@strath.ac.uk>, <766f8a8ce5f84164af13f2361ef3b3c3@exch1-cdc.nexus.csiro.au>, <5912dbafbb044700ba723195a0e5e2f9@exch1-cdc.nexus.csiro.au> Message-ID: An HTML attachment was scrubbed... URL: From oluwasijibomi.saula at ndsu.edu Sat Mar 10 14:29:33 2018 From: oluwasijibomi.saula at ndsu.edu (Saula, Oluwasijibomi) Date: Sat, 10 Mar 2018 14:29:33 +0000 Subject: [gpfsug-discuss] Thoughts on GPFS on IB & MTU sizes In-Reply-To: <20180309091910.0334604a@osc.edu> References: , <20180309091910.0334604a@osc.edu> Message-ID: Wei - So the expelled node could ping the rest of the cluster just fine. In fact, after adding this new node to the cluster I could traverse the filesystem for simple lookups, however, heavy data moves in or out of the filesystem seemed to trigger the expel messages to the new node. 
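One pattern worth ruling out here is a mix of IPoIB modes: an MTU of 65520 implies connected mode, 2044 implies datagram mode on a fabric with a 2K IB MTU, and a 4096 datagram MTU only works once 4K MTU is enabled in the subnet manager's partition configuration. A quick audit, assuming mmdsh and an ib0 interface name:

# print hostname, IPoIB mode (connected/datagram) and MTU for every node
mmdsh -N all 'echo "$(hostname): $(cat /sys/class/net/ib0/mode) $(cat /sys/class/net/ib0/mtu)"' 2>/dev/null | sort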
This experience prompted my tunning exercise on the node and has since resolved the expel messages to node even during times of high I/O activity. Nevertheless, I still have this nagging feeling that the IPoIB tuning for GPFS may not be optimal. To answer your questions, Ed - IB supports both administrative and daemon communications, and we have verbsRdma configured. Currently, we have both 2044 and 65520 MTU nodes on our IB network and I've been told this should not be the case. I'm hoping to settle on 4096 MTU nodes for the entire cluster but I fear there may be some caveats - any thoughts on this? (Oh, Ed - Hideaki was my mentor for a short while when I began my HPC career with NDSU but he left us shortly after. Maybe like you I can tune up my Japanese as well once my GPFS issues are put to rest! ? ) Thanks, Siji Saula HPC System Administrator Center for Computationally Assisted Science & Technology NORTH DAKOTA STATE UNIVERSITY Research 2 Building ? Room 220B Dept 4100, PO Box 6050 / Fargo, ND 58108-6050 p:701.231.7749 www.ccast.ndsu.edu | www.ndsu.edu ________________________________ From: Edward Wahl Sent: Friday, March 9, 2018 8:19:10 AM To: gpfsug-discuss at spectrumscale.org Cc: Saula, Oluwasijibomi Subject: Re: [gpfsug-discuss] Thoughts on GPFS on IB & MTU sizes Welcome to the list. If Hideaki Kikuchi is still around CCAST, say "Oh hisashiburi, des ne?" for me. Though I recall he may have left. A couple of questions as I, unfortunately, have a good deal of expel experience. -Are you set up to use verbs or only IPoIB? "mmlsconfig verbsRdma" -Are you using the IB as the administrative IP network? -As Wei asked, can nodes sending the expel requests ping the victim over whatever interface is being used administratively? Other interfaces do NOT matter for expels. Nodes that cannot even mount the file systems can still request expels. Many many things can cause issues here from routing and firewalls to bad switch software which will not update ARP tables, and you get nodes trying to expel each other. -are your NSDs logging the expels in /tmp/mmfs? You can mmchconfig expelDataCollectionDailyLimit if you need more captures to narrow down what is happening outside the mmfs.log.latest. Just be wary of the disk space if you have "expel storms". -That tuning page is very out of date and appears to be mostly focused on GPFS 3.5.x tuning. While there is also a Spectrum Scale wiki, it's Linux tuning page does not appear to be kernel and network focused and is dated even older. Ed On Thu, 8 Mar 2018 15:06:03 +0000 "Saula, Oluwasijibomi" wrote: > Hi Folks, > > > As this is my first post to the group, let me start by saying I applaud the > commentary from the user group as it has been a resource to those of us > watching from the sidelines. > > > That said, we have a GPFS layered on IPoIB, and recently, we started having > some issues on our IB FDR fabric which manifested when GPFS began sending > persistent expel messages to particular nodes. > > > Shortly after, we embarked on a tuning exercise using IBM tuning > recommendations > but this page is quite old and we've run into some snags, specifically with > setting 4k MTUs using mlx4_core/mlx4_en module options. > > > While setting 4k MTUs as the guide recommends is our general inclination, I'd > like to solicit some advice as to whether 4k MTUs are a good idea and any > hitch-free steps to accomplishing this. I'm getting some conflicting remarks > from Mellanox support asking why we'd want to use 4k MTUs with Unreliable > Datagram mode. 
> > > Also, any pointers to best practices or resources for network configurations > for heavy I/O clusters would be much appreciated. > > > Thanks, > > Siji Saula > HPC System Administrator > Center for Computationally Assisted Science & Technology > NORTH DAKOTA STATE UNIVERSITY > > > Research 2 > Building > ? Room 220B Dept 4100, PO Box 6050 / Fargo, ND 58108-6050 p:701.231.7749 > www.ccast.ndsu.edu | > www.ndsu.edu > -- Ed Wahl Ohio Supercomputer Center 614-292-9302 -------------- next part -------------- An HTML attachment was scrubbed... URL: From Wei1.Guo at UTSouthwestern.edu Sat Mar 10 16:31:36 2018 From: Wei1.Guo at UTSouthwestern.edu (Wei Guo) Date: Sat, 10 Mar 2018 16:31:36 +0000 Subject: [gpfsug-discuss] Thoughts on GPFS on IB & MTU sizes In-Reply-To: References: , <20180309091910.0334604a@osc.edu>, Message-ID: <419134D7E7CA914C.fd9e46bf-5e64-4c46-9022-56a976185334@mail.outlook.com> Hi, Saula, This sounds like the problem with the jumbo frame. Ping or metadata query use small packets, so any time you can ping or ls file. However, data transferring use large packets, the MTU size. Your MTU 65536 nodes send out large packets, but they get dropped to the 2044 nodes, because the packet size cannot fit in 2044 size limit. The reverse is ok. I think the gpfs client nodes always communicate with each other to sync the sdr repo files, or other user job mpi communications if there are any. I think all the nodes should agree on a single MTU. I guess ipoib supports up to 4096. I might missed your Ethernet network switch part whether jumbo frame is enabled or not, if you are using any. Wei Guo On Sat, Mar 10, 2018 at 8:29 AM -0600, "Saula, Oluwasijibomi" > wrote: Wei - So the expelled node could ping the rest of the cluster just fine. In fact, after adding this new node to the cluster I could traverse the filesystem for simple lookups, however, heavy data moves in or out of the filesystem seemed to trigger the expel messages to the new node. This experience prompted my tunning exercise on the node and has since resolved the expel messages to node even during times of high I/O activity. Nevertheless, I still have this nagging feeling that the IPoIB tuning for GPFS may not be optimal. To answer your questions, Ed - IB supports both administrative and daemon communications, and we have verbsRdma configured. Currently, we have both 2044 and 65520 MTU nodes on our IB network and I've been told this should not be the case. I'm hoping to settle on 4096 MTU nodes for the entire cluster but I fear there may be some caveats - any thoughts on this? (Oh, Ed - Hideaki was my mentor for a short while when I began my HPC career with NDSU but he left us shortly after. Maybe like you I can tune up my Japanese as well once my GPFS issues are put to rest! ? ) Thanks, Siji Saula HPC System Administrator Center for Computationally Assisted Science & Technology NORTH DAKOTA STATE UNIVERSITY Research 2 Building ? Room 220B Dept 4100, PO Box 6050 / Fargo, ND 58108-6050 p:701.231.7749 www.ccast.ndsu.edu | www.ndsu.edu ________________________________ From: Edward Wahl Sent: Friday, March 9, 2018 8:19:10 AM To: gpfsug-discuss at spectrumscale.org Cc: Saula, Oluwasijibomi Subject: Re: [gpfsug-discuss] Thoughts on GPFS on IB & MTU sizes Welcome to the list. If Hideaki Kikuchi is still around CCAST, say "Oh hisashiburi, des ne?" for me. Though I recall he may have left. A couple of questions as I, unfortunately, have a good deal of expel experience. -Are you set up to use verbs or only IPoIB? 
"mmlsconfig verbsRdma" -Are you using the IB as the administrative IP network? -As Wei asked, can nodes sending the expel requests ping the victim over whatever interface is being used administratively? Other interfaces do NOT matter for expels. Nodes that cannot even mount the file systems can still request expels. Many many things can cause issues here from routing and firewalls to bad switch software which will not update ARP tables, and you get nodes trying to expel each other. -are your NSDs logging the expels in /tmp/mmfs? You can mmchconfig expelDataCollectionDailyLimit if you need more captures to narrow down what is happening outside the mmfs.log.latest. Just be wary of the disk space if you have "expel storms". -That tuning page is very out of date and appears to be mostly focused on GPFS 3.5.x tuning. While there is also a Spectrum Scale wiki, it's Linux tuning page does not appear to be kernel and network focused and is dated even older. Ed On Thu, 8 Mar 2018 15:06:03 +0000 "Saula, Oluwasijibomi" wrote: > Hi Folks, > > > As this is my first post to the group, let me start by saying I applaud the > commentary from the user group as it has been a resource to those of us > watching from the sidelines. > > > That said, we have a GPFS layered on IPoIB, and recently, we started having > some issues on our IB FDR fabric which manifested when GPFS began sending > persistent expel messages to particular nodes. > > > Shortly after, we embarked on a tuning exercise using IBM tuning > recommendations > but this page is quite old and we've run into some snags, specifically with > setting 4k MTUs using mlx4_core/mlx4_en module options. > > > While setting 4k MTUs as the guide recommends is our general inclination, I'd > like to solicit some advice as to whether 4k MTUs are a good idea and any > hitch-free steps to accomplishing this. I'm getting some conflicting remarks > from Mellanox support asking why we'd want to use 4k MTUs with Unreliable > Datagram mode. > > > Also, any pointers to best practices or resources for network configurations > for heavy I/O clusters would be much appreciated. > > > Thanks, > > Siji Saula > HPC System Administrator > Center for Computationally Assisted Science & Technology > NORTH DAKOTA STATE UNIVERSITY > > > Research 2 > Building > ? Room 220B Dept 4100, PO Box 6050 / Fargo, ND 58108-6050 p:701.231.7749 > www.ccast.ndsu.edu | > www.ndsu.edu > -- Ed Wahl Ohio Supercomputer Center 614-292-9302 ________________________________ UT Southwestern Medical Center The future of medicine, today. -------------- next part -------------- An HTML attachment was scrubbed... URL: From aaron.s.knister at nasa.gov Sat Mar 10 16:57:49 2018 From: aaron.s.knister at nasa.gov (Aaron Knister) Date: Sat, 10 Mar 2018 11:57:49 -0500 Subject: [gpfsug-discuss] Thoughts on GPFS on IB & MTU sizes In-Reply-To: <419134D7E7CA914C.fd9e46bf-5e64-4c46-9022-56a976185334@mail.outlook.com> References: <20180309091910.0334604a@osc.edu> <419134D7E7CA914C.fd9e46bf-5e64-4c46-9022-56a976185334@mail.outlook.com> Message-ID: <8fff8715-e67f-b048-f37d-2498c0cac2f7@nasa.gov> I, personally, haven't been burned by mixing UD and RC IPoIB clients on the same fabric but that doesn't mean it can't happen. What I *have* been bitten by a couple times is not having enough entries in the arp cache after bringing a bunch of new nodes online (that made for a long Christmas Eve one year...). You can toggle that via the gc_thresh settings. These settings work for ~3700 nodes (and technically could go much higher). 
net.ipv4.neigh.default.gc_thresh3 = 10240 net.ipv4.neigh.default.gc_thresh2 = 9216 net.ipv4.neigh.default.gc_thresh1 = 8192 It's the kind of thing that will bite you when you expand the cluster and it may make sense that it's exacerbated by metadata operations because those may require initiating connections to many nodes in the cluster which could blow your arp cache. -Aaron On 3/10/18 11:31 AM, Wei Guo wrote: > Hi, Saula, > > This sounds like the problem with the jumbo frame. > > Ping or metadata query use small packets, so any time you can ping or ls > file. > > However, data transferring use large packets, the MTU size. Your MTU > 65536 nodes send out large packets, but they get dropped to the 2044 > nodes, because the packet size cannot fit in 2044 size limit. The > reverse is ok. > > I think the gpfs client nodes always communicate with each other to sync > the sdr repo files, or other user job mpi communications if there are > any. I think all the nodes should agree on a single MTU. I guess ipoib > supports up to 4096. > > I might missed your Ethernet network switch part whether jumbo frame is > enabled or not, if you are using any. > > Wei Guo > > > > > > > On Sat, Mar 10, 2018 at 8:29 AM -0600, "Saula, Oluwasijibomi" > > wrote: > > Wei -? So the expelled node could ping the rest of the cluster just > fine. In fact, after adding this new node to the cluster I could > traverse the filesystem for simple lookups, however, heavy data > moves in or out of the filesystem seemed to trigger the expel > messages to the new node. > > > This experience prompted my?tunning exercise on the node and has > since resolved the expel messages to node even during times of high > I/O activity. > > > Nevertheless, I still have this nagging feeling that the IPoIB > tuning for GPFS may not be optimal. > > > To answer your questions,?Ed - IB supports both administrative and > daemon communications, and we have verbsRdma configured. > > > Currently, we have both 2044 and 65520 MTU nodes on our IB network > and I've been told this should not be the case. I'm hoping to settle > on 4096 MTU nodes for the entire cluster but I fear there may be > some caveats - any thoughts on this? > > > (Oh, Ed - Hideaki was my mentor for a short while when I began my > HPC career with NDSU but he left us shortly after. Maybe like you I > can tune up my Japanese as well once my GPFS issues are put to rest! > ? ) > > > Thanks, > Siji Saula > HPC System Administrator > Center for Computationally Assisted Science & Technology > *NORTH DAKOTA STATE UNIVERSITY* > > Research 2 > Building > ?? > Room 220B > Dept 4100, PO Box 6050? / Fargo, ND 58108-6050 > p:701.231.7749 > www.ccast.ndsu.edu > ?| > www.ndsu.edu > > ------------------------------------------------------------------------ > *From:* Edward Wahl > *Sent:* Friday, March 9, 2018 8:19:10 AM > *To:* gpfsug-discuss at spectrumscale.org > *Cc:* Saula, Oluwasijibomi > *Subject:* Re: [gpfsug-discuss] Thoughts on GPFS on IB & MTU sizes > > Welcome to the list. > > If Hideaki Kikuchi is still around CCAST, say "Oh hisashiburi, des > ne?" for me. > Though I recall he may have left. > > > A couple of questions as I, unfortunately, have a good deal of expel > experience. > > -Are you set up to use verbs or only IPoIB? "mmlsconfig verbsRdma" > > -Are you using the IB as the administrative IP network? > > -As Wei asked, can nodes sending the expel requests ping the victim over > whatever interface is being used administratively?? Other interfaces > do NOT > matter for expels. 
Nodes that cannot even mount the file systems can > still > request expels.? Many many things can cause issues here from routing and > firewalls to bad switch software which will not update ARP tables, > and you get > nodes trying to expel each other. > > -are your NSDs logging the expels in /tmp/mmfs?? You can mmchconfig > expelDataCollectionDailyLimit if you need more captures to narrow > down what is > happening outside the mmfs.log.latest.? Just be wary of the disk > space if you > have "expel storms". > > -That tuning page is very out of date and appears to be mostly > focused on GPFS > 3.5.x tuning.?? While there is also a Spectrum Scale wiki, it's > Linux tuning > page does not appear to be kernel and network focused and is dated > even older. > > > Ed > > > > On Thu, 8 Mar 2018 15:06:03 +0000 > "Saula, Oluwasijibomi" wrote: > > > Hi Folks, > > > > > > As this is my first post to the group, let me start by saying I applaud the > > commentary from the user group as it has been a resource to those of us > > watching from the sidelines. > > > > > > That said, we have a GPFS layered on IPoIB, and recently, we started having > > some issues on our IB FDR fabric which manifested when GPFS began sending > > persistent expel messages to particular nodes. > > > > > > Shortly after, we embarked on a tuning exercise using IBM tuning > > recommendations > > but this page is quite old and we've run into some snags, specifically with > > setting 4k MTUs using mlx4_core/mlx4_en module options. > > > > > > While setting 4k MTUs as the guide recommends is our general inclination, I'd > > like to solicit some advice as to whether 4k MTUs are a good idea and any > > hitch-free steps to accomplishing this. I'm getting some conflicting remarks > > from Mellanox support asking why we'd want to use 4k MTUs with Unreliable > > Datagram mode. > > > > > > Also, any pointers to best practices or resources for network configurations > > for heavy I/O clusters would be much appreciated. > > > > > > Thanks, > > > > Siji Saula > > HPC System Administrator > > Center for Computationally Assisted Science & Technology > > NORTH DAKOTA STATE UNIVERSITY > > > > > > Research 2 > > Building > > ? Room 220B Dept 4100, PO Box 6050? / Fargo, ND 58108-6050 p:701.231.7749 > > www.ccast.ndsu.edu | > > www.ndsu.edu > > > > > > -- > > Ed Wahl > Ohio Supercomputer Center > 614-292-9302 > > > ------------------------------------------------------------------------ > > UTSouthwestern > > Medical Center > > The future of medicine, today. > > > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > -- Aaron Knister NASA Center for Climate Simulation (Code 606.2) Goddard Space Flight Center (301) 286-2776 From aaron.s.knister at nasa.gov Sat Mar 10 21:39:28 2018 From: aaron.s.knister at nasa.gov (Aaron Knister) Date: Sat, 10 Mar 2018 16:39:28 -0500 Subject: [gpfsug-discuss] gpfs.snap taking a really long time (4.2.3.6 efix17) Message-ID: <96bf7c94-f5ee-c046-d835-de500bd20c51@nasa.gov> Hey All, I've noticed after upgrading to 4.1 to 4.2.3.6 efix17 that a gpfs.snap now takes a really long time as in... a *really* long time. Digging into it I can see that the snap command is actually done but the sshd child is left waiting on a sleep process on the clients (a sleep 600 at that). Trying to get 3500 nodes snapped in chunks of 64 nodes that each take 10 minutes looks like it'll take a good 10 hours. 
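If you want to see the hang for yourself, the leftover watchdog is easy to spot on a client once
the snap work there has finished -- something along these lines (illustrative only):

   # on a client node the snap has already visited
   pgrep -af 'sleep 600'
   pstree -ps $(pgrep -f 'sleep 600' | head -1)   # should show it still hanging off the sshd session from the snap
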
It seems the trouble is in the runCommand function in gpfs.snap. The function creates a child process to act as a sort of alarm to kill the specified command if it exceeds the timeout. The problem while the alarm process gets killed the kill signal isn't passed to the sleep process (because the sleep command is run as a process inside the "alarm" child shell process). In gpfs.snap changing this: [[ -n $sleepingAgentPid ]] && $kill -9 $sleepingAgentPid > /dev/null 2>&1 to this: [[ -n $sleepingAgentPid ]] && $kill -9 $(findDescendants $sleepingAgentPid) $sleepingAgentPid > /dev/null 2>&1 seems to fix the behavior. I'll open a PMR for this shortly but I'm just wondering if anyone else has seen this. -Aaron -- Aaron Knister NASA Center for Climate Simulation (Code 606.2) Goddard Space Flight Center (301) 286-2776 From aaron.s.knister at nasa.gov Sat Mar 10 21:44:39 2018 From: aaron.s.knister at nasa.gov (Aaron Knister) Date: Sat, 10 Mar 2018 16:44:39 -0500 Subject: [gpfsug-discuss] spontaneous tracing? Message-ID: <3925216a-3146-357a-9aa0-0acc84bdb7b2@nasa.gov> I found myself with a little treat this morning to the tune of tracing running on the entire cluster of 3500 nodes. There were no logs I could find to indicate *why* the tracing had started but it was clear it was initiated by the cluster manager. Some sleuthing (thanks, collectl!) allowed me to figure out that the tracing started as the command: /usr/lpp/mmfs/bin/mmksh /usr/lpp/mmfs/bin/mmcommon notifyOverload _asmgr I thought that running "mmchocnfig deadlockOverloadThreshold=0 -i" would stop this from happening again but lo and behold tracing kicked off *again* (with the same caller) some time later even after setting that parameter. What's odd is there are no log events to indicate an overload occurred. Has anyone seen similar behavior? We're on 4.2.3.6 efix17. -Aaron -- Aaron Knister NASA Center for Climate Simulation (Code 606.2) Goddard Space Flight Center (301) 286-2776 From mnaineni at in.ibm.com Mon Mar 12 09:54:50 2018 From: mnaineni at in.ibm.com (Malahal R Naineni) Date: Mon, 12 Mar 2018 09:54:50 +0000 Subject: [gpfsug-discuss] asking for your vote for an RFE to support NFS V4.1 In-Reply-To: <8e5684b5-8981-0710-c5ca-693f6ad56890@ugent.be> References: <8e5684b5-8981-0710-c5ca-693f6ad56890@ugent.be>, Message-ID: An HTML attachment was scrubbed... URL: From stijn.deweirdt at ugent.be Mon Mar 12 10:01:15 2018 From: stijn.deweirdt at ugent.be (Stijn De Weirdt) Date: Mon, 12 Mar 2018 11:01:15 +0100 Subject: [gpfsug-discuss] asking for your vote for an RFE to support NFS V4.1 In-Reply-To: References: <8e5684b5-8981-0710-c5ca-693f6ad56890@ugent.be> Message-ID: <364c69aa-3964-0c7c-21ae-6f4fd36ab9e8@ugent.be> hi malahal, we already figured that out but were hesitant to share it in case ibm wanted to remove this loophole. but can we assume that manuanlly editing the ganesha.conf and pushing it to ccr is supported? the config file is heavily edited / rewritten when certain mm commands, so we want to make sure we can always do this. it would be even better if the main.conf that is generated/edited by the ccr commands just had an include statement so we can edit another file locally instead of doing mmccr magic. stijn On 03/12/2018 10:54 AM, Malahal R Naineni wrote: > Upstream Ganesha code allows all NFS versions including NFSv4.2. Most Linux > clients were defaulting to NFSv4.0, but now they started using NFS4.1 which IBM > doesn't support. To avoid people accidentally using NFSv4.1, we decided to > remove it by default. 
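(For reference, the edit Malahal describes below boils down to the NFSv4 block in
/var/mmfs/ces/nfs-config/gpfs.ganesha.main.conf ending up roughly like this -- a sketch only,
the surrounding options vary per release:

   NFSv4 {
       minor_versions = 0, 1;
   }

and, as he says, it has to be pushed back via mmccr or the next regeneration of the CES config
will undo it.)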
> We don't support NFSv4.1, so there is no spectrum command to enable NFSv4.1 > support with PTF6. Of course, if you are familiar with mmccr, you can change the > config and let it use NFSv4.1 but any issues with NFS4.1 will go to /dev/null. :-) > You need to add "minor_versions = 0,1;" to NFSv4{} block > in /var/mmfs/ces/nfs-config/gpfs.ganesha.main.conf to allow NFSv4.0 and NFsv4.1, > and make sure you use mmccr command to make this change permanent. > Regards, Malahal. > > ----- Original message ----- > From: Stijn De Weirdt > Sent by: gpfsug-discuss-bounces at spectrumscale.org > To: gpfsug-discuss at spectrumscale.org > Cc: > Subject: Re: [gpfsug-discuss] asking for your vote for an RFE to support NFS > V4.1 > Date: Fri, Mar 9, 2018 6:13 PM > hi all, > > i would second this request to upvote this. the fact that 4.1 support > was dropped in a subsubminor update (4.2.3.5 to 4.3.26 afaik) was > already pretty bad to discover, but at the very least there should be an > option to reenable it. > > i'm also interested why this was removed (or actively prevented to > enable). i can understand that eg pnfs is not support, bu basic protocol > features wrt HA are a must have. > only with 4.1 are we able to do ces+ganesha failover without IO error, > something that should be basic feature nowadays. > > stijn > > On 03/09/2018 01:21 PM, Engeli Willi (ID SD) wrote: > > Hello Group, > > > > I?ve just created a request for enhancement (RFE) to have ganesha supporting > > NFS V4.1. > > > > It is important, to have this new Protocol version supported, since our > > Linux clients default support is more then 80% based in this version by > > default and Linux distributions are actively pushing this Protocol. > > > > The protocol also brings important corrections and enhancements with it. > > > > > > > > I would like to ask you all very kindly to vote for this RFE please. 
> > > > You find it here: https://www.ibm.com/developerworks/rfe/execute > > > > Headline:NFS V4.1 Support > > > > ID:117398 > > > > > > > > > > > > Freundliche Gr?sse > > > > > > > > Willi Engeli > > > > ETH Zuerich > > > > ID Speicherdienste > > > > Weinbergstrasse 11 > > > > WEC C 18 > > > > 8092 Zuerich > > > > > > > > Tel: +41 44 632 02 69 > > > > > > > > > > > > > > _______________________________________________ > > gpfsug-discuss mailing list > > gpfsug-discuss at spectrumscale.org > > > https://urldefense.proofpoint.com/v2/url?u=http-3A__gpfsug.org_mailman_listinfo_gpfsug-2Ddiscuss&d=DwIF-g&c=jf_iaSHvJObTbx-siA1ZOg&r=oaQVLOYto6Ftb8wAbynvIiIdh2UEjHxQByDz70-6a_0&m=yq4xoVKCPWQTqZVp0BgG8fBpXrS2FehGlAua1Eixci4&s=9DJi6qkF4eRc81vv6SlC3gxKL9oJJ4efkktzNaZAnkA&e= > > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > https://urldefense.proofpoint.com/v2/url?u=http-3A__gpfsug.org_mailman_listinfo_gpfsug-2Ddiscuss&d=DwIF-g&c=jf_iaSHvJObTbx-siA1ZOg&r=oaQVLOYto6Ftb8wAbynvIiIdh2UEjHxQByDz70-6a_0&m=yq4xoVKCPWQTqZVp0BgG8fBpXrS2FehGlAua1Eixci4&s=9DJi6qkF4eRc81vv6SlC3gxKL9oJJ4efkktzNaZAnkA&e= > > > > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > From xhejtman at ics.muni.cz Mon Mar 12 14:51:05 2018 From: xhejtman at ics.muni.cz (Lukas Hejtmanek) Date: Mon, 12 Mar 2018 15:51:05 +0100 Subject: [gpfsug-discuss] Preferred NSD Message-ID: <20180312145105.yb24bzo6chlpthwm@ics.muni.cz> Hello, I'm thinking about the following setup: ~ 60 nodes, each with two enterprise NVMe SSDs, FDR IB interconnected I would like to setup shared scratch area using GPFS and those NVMe SSDs. Each SSDs as on NSD. I don't think like 5 or more data/metadata replicas are practical here. On the other hand, multiple node failures is something really expected. Is there a way to instrument that local NSD is strongly preferred to store data? I.e. node failure most probably does not result in unavailable data for the other nodes? Or is there any other recommendation/solution to build shared scratch with GPFS in such setup? (Do not do it including.) -- Luk?? Hejtm?nek From scale at us.ibm.com Mon Mar 12 15:13:00 2018 From: scale at us.ibm.com (IBM Spectrum Scale) Date: Mon, 12 Mar 2018 09:13:00 -0600 Subject: [gpfsug-discuss] spontaneous tracing? In-Reply-To: <3925216a-3146-357a-9aa0-0acc84bdb7b2@nasa.gov> References: <3925216a-3146-357a-9aa0-0acc84bdb7b2@nasa.gov> Message-ID: /usr/lpp/mmfs/bin/mmcommon notifyOverload will not cause tracing to be started. One can verify that using the underlying command being called as shown in the following example with /tmp/n containing node names one each line that will get the notification and the IP address being the file system manager from which the command is issued. /usr/lpp/mmfs/bin/mmsdrcli notifyOverload /tmp/n 1191 192.168.117.131 3 8 The only case that deadlock detection code will initiate tracing is that debugDataControl is set to "heavy" and tracing is not started. Then on deadlock detection tracing is turned on for 20 seconds and turned off. That can be tested using command like /usr/lpp/mmfs/bin/mmsdrcli notifyDeadlock /tmp/n 1191 192.168.117.131 3 8 And then mmfs.log will tell you what's going on. That's not a silent action. 
2018-03-12_10:16:11.243-0400: [N] sdrServ: Received deadlock notification from 192.168.117.131 2018-03-12_10:16:11.243-0400: [N] GPFS will attempt to collect debug data on this node. 2018-03-12_10:16:11.953-0400: [I] Tracing in overwrite mode <== tracing started Trace started: Wait 20 seconds before cut and stop trace 2018-03-12_10:16:37.147-0400: [I] Tracing disabled <== tracing stopped 20 seconds later mmtrace: move /tmp/mmfs/lxtrace.trc.c69bc2xn01.cpu0 /tmp/mmfs/trcfile.2018-03-12_10.16.11.2982.deadlock.c69bc2xn01.cpu0 mmtrace: formatting /tmp/mmfs/trcfile.2018-03-12_10.16.11.2982.deadlock.c69bc2xn01 to /tmp/mmfs/trcrpt.2018-03-12_10.16.11.2982.deadlock.c69bc2xn01.gz > What's odd is there are no log events to indicate an overload occurred. Overload msg is only seen in mmfs.log when debugDataControl is "heavy". mmdiag --deadlock shows overload related info starting from 4.2.3. # mmdiag --deadlock === mmdiag: deadlock === Effective deadlock detection threshold on c69bc2xn01 is 1800 seconds Effective deadlock detection threshold on c69bc2xn01 is 360 seconds for short waiters Cluster c69bc2xn01.gpfs.net is overloaded. The overload index on c69bc2xn01 is 0.01812 <== -------------- next part -------------- An HTML attachment was scrubbed... URL: From aaron.s.knister at nasa.gov Mon Mar 12 15:14:10 2018 From: aaron.s.knister at nasa.gov (Knister, Aaron S. (GSFC-606.2)[COMPUTER SCIENCE CORP]) Date: Mon, 12 Mar 2018 15:14:10 +0000 Subject: [gpfsug-discuss] Preferred NSD In-Reply-To: <20180312145105.yb24bzo6chlpthwm@ics.muni.cz> References: <20180312145105.yb24bzo6chlpthwm@ics.muni.cz> Message-ID: Hi Lukas, Check out FPO mode. That mimics Hadoop?s data placement features. You can have up to 3 replicas both data and metadata but still the downside, though, as you say is the wrong node failures will take your cluster down. You might want to check out something like Excelero?s NVMesh (note: not an endorsement since I can?t give such things) which can create logical volumes across all your NVMe drives. The product has erasure coding on their roadmap. I?m not sure if they?ve released that feature yet but in theory it will give better fault tolerance *and* you?ll get more efficient usage of your SSDs. I?m sure there are other ways to skin this cat too. -Aaron On March 12, 2018 at 10:59:35 EDT, Lukas Hejtmanek wrote: Hello, I'm thinking about the following setup: ~ 60 nodes, each with two enterprise NVMe SSDs, FDR IB interconnected I would like to setup shared scratch area using GPFS and those NVMe SSDs. Each SSDs as on NSD. I don't think like 5 or more data/metadata replicas are practical here. On the other hand, multiple node failures is something really expected. Is there a way to instrument that local NSD is strongly preferred to store data? I.e. node failure most probably does not result in unavailable data for the other nodes? Or is there any other recommendation/solution to build shared scratch with GPFS in such setup? (Do not do it including.) -- Luk?? Hejtm?nek _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From valdis.kletnieks at vt.edu Mon Mar 12 15:18:40 2018 From: valdis.kletnieks at vt.edu (valdis.kletnieks at vt.edu) Date: Mon, 12 Mar 2018 11:18:40 -0400 Subject: [gpfsug-discuss] Preferred NSD In-Reply-To: <20180312145105.yb24bzo6chlpthwm@ics.muni.cz> References: <20180312145105.yb24bzo6chlpthwm@ics.muni.cz> Message-ID: <188417.1520867920@turing-police.cc.vt.edu> On Mon, 12 Mar 2018 15:51:05 +0100, Lukas Hejtmanek said: > I don't think like 5 or more data/metadata replicas are practical here. On the > other hand, multiple node failures is something really expected. Umm.. do I want to ask *why*, out of only 60 nodes, multiple node failures are an expected event - to the point that you're thinking about needing 5 replicas to keep things running? From xhejtman at ics.muni.cz Mon Mar 12 15:23:17 2018 From: xhejtman at ics.muni.cz (Lukas Hejtmanek) Date: Mon, 12 Mar 2018 16:23:17 +0100 Subject: [gpfsug-discuss] Preferred NSD In-Reply-To: <188417.1520867920@turing-police.cc.vt.edu> References: <20180312145105.yb24bzo6chlpthwm@ics.muni.cz> <188417.1520867920@turing-police.cc.vt.edu> Message-ID: <20180312152317.mfc6xhsvthticrlh@ics.muni.cz> On Mon, Mar 12, 2018 at 11:18:40AM -0400, valdis.kletnieks at vt.edu wrote: > On Mon, 12 Mar 2018 15:51:05 +0100, Lukas Hejtmanek said: > > I don't think like 5 or more data/metadata replicas are practical here. On the > > other hand, multiple node failures is something really expected. > > Umm.. do I want to ask *why*, out of only 60 nodes, multiple node > failures are an expected event - to the point that you're thinking > about needing 5 replicas to keep things running? as of my experience with cluster management, we have multiple nodes down on regular basis. (HW failure, SW maintenance and so on.) I'm basically thinking that 2-3 replicas might not be enough while 5 or more are becoming too expensive (both disk space and required bandwidth being scratch space - high i/o load expected). -- Luk?? Hejtm?nek Linux Administrator only because Full Time Multitasking Ninja is not an official job title From mnaineni at in.ibm.com Mon Mar 12 17:41:41 2018 From: mnaineni at in.ibm.com (Malahal R Naineni) Date: Mon, 12 Mar 2018 17:41:41 +0000 Subject: [gpfsug-discuss] asking for your vote for an RFE to support NFS V4.1 In-Reply-To: <364c69aa-3964-0c7c-21ae-6f4fd36ab9e8@ugent.be> References: <364c69aa-3964-0c7c-21ae-6f4fd36ab9e8@ugent.be>, <8e5684b5-8981-0710-c5ca-693f6ad56890@ugent.be> Message-ID: An HTML attachment was scrubbed... URL: From Philipp.Rehs at uni-duesseldorf.de Mon Mar 12 20:09:14 2018 From: Philipp.Rehs at uni-duesseldorf.de (Philipp Helo Rehs) Date: Mon, 12 Mar 2018 21:09:14 +0100 Subject: [gpfsug-discuss] RDMA read error IBV_WC_RETRY_EXC_ERR Message-ID: <4c6f6655-5712-663e-d551-375a42d562d8@uni-duesseldorf.de> Hello, I am reading your mailing-list since some weeks and I am quiete impressed about the knowledge and shared information here. We have a gpfs cluster with 4 nsds and 120 clients on Infiniband. Our NSD-Server have two infiniband ports on seperate cards mlx5_0 and mlx5_1. We have RDMA-CM enabled and ipv6 enabled on all nodes. We have added an IPoIB IP to all interfaces. 
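A quick way to double-check that each HCA port really has a usable IPoIB interface behind it is
something like this (illustrative, assumes the Mellanox OFED tools are installed):

   ibdev2netdev                                  # every verbsPorts entry should map to an "Up" ib interface
   ibstat mlx5_1 1                               # port State should be Active, with a LID on the fabric
   ping <ipoib address of a peer on that port>   # basic reachability over the second port
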
But when we enable the second interface we get the following error from all nodes: 2018-03-12_20:49:38.923+0100: [E] VERBS RDMA closed connection to 10.100.0.83 (hilbert83-ib) on mlx5_1 port 1 fabnum 0 due to RDMA read error IBV_WC_RETRY_EXC_ERR index 45 2018-03-12_20:49:38.923+0100: [E] VERBS RDMA rdma read error IBV_WC_RETRY_EXC_ERR to 10.100.0.129 (hilbert129-ib) on mlx5_1 port 1 fabnum 0 vendor_err 129 2018-03-12_20:49:38.923+0100: [E] VERBS RDMA closed connection to 10.100.0.129 (hilbert129-ib) on mlx5_1 port 1 fabnum 0 due to RDMA read error IBV_WC_RETRY_EXC_ERR index 31 2018-03-12_20:49:38.923+0100: [E] VERBS RDMA rdma read error IBV_WC_RETRY_EXC_ERR to 10.100.0.134 (hilbert134-ib) on mlx5_1 port 1 fabnum 0 vendor_err 129 I have read that this issue can happen when verbsRdmasPerConnection is to low. We tried to increase the value and it got better but the problem is not fixed. Current config: minReleaseLevel 4.2.3.0 maxblocksize 16m cipherList AUTHONLY cesSharedRoot /ces ccrEnabled yes failureDetectionTime 40 leaseRecoveryWait 40 [hilbert1-ib,hilbert2-ib] worker1Threads 256 maxReceiverThreads 256 [common] tiebreakerDisks vd3;vd5;vd7 minQuorumNodes 2 verbsLibName libibverbs.so.1 verbsRdma enable verbsRdmasPerNode 256 verbsRdmaSend no scatterBufferSize 262144 pagepool 16g verbsPorts mlx4_0/1 [nsdNodes] verbsPorts mlx5_0/1 mlx5_1/1 [hilbert200-ib,hilbert201-ib,hilbert202-ib,hilbert203-ib,hilbert204-ib,hilbert205-ib,hilbert206-ib] verbsPorts mlx4_0/1 mlx4_1/1 [common] maxMBpS 11200 [common] verbsRdmaCm enable verbsRdmasPerConnection 14 adminMode central Kind regards Philipp Rehs --------------------------- Zentrum f?r Informations- und Medientechnologie Kompetenzzentrum f?r wissenschaftliches Rechnen und Speichern Heinrich-Heine-Universit?t D?sseldorf Universit?tsstr. 1 Raum 25.41.00.51 40225 D?sseldorf / Germany Tel: +49-211-81-15557 From zmance at ucar.edu Mon Mar 12 22:10:06 2018 From: zmance at ucar.edu (Zachary Mance) Date: Mon, 12 Mar 2018 16:10:06 -0600 Subject: [gpfsug-discuss] Problems with remote mount via routed IB In-Reply-To: References: <471B111F-5DAA-4912-829C-9AA75DCB76FA@kit.edu> Message-ID: Since I am testing out remote mounting with EDR IB routers, I'll add to the discussion. In my lab environment I was seeing the same rdma connections being established and then disconnected shortly after. The remote filesystem would eventually mount on the clients, but it look a quite a while (~2mins). Even after mounting, accessing files or any metadata operations would take a while to execute, but eventually it happened. After enabling verbsRdmaCm, everything mounted just fine and in a timely manner. Spectrum Scale was using the librdmacm.so library. I would first double check that you have both clusters able to talk to each other on their IPoIB address, then make sure you enable verbsRdmaCm on both clusters. --------------------------------------------------------------------------------------------------------------- Zach Mance zmance at ucar.edu (303) 497-1883 HPC Data Infrastructure Group / CISL / NCAR --------------------------------------------------------------------------------------------------------------- On Thu, Mar 1, 2018 at 1:41 AM, John Hearns wrote: > In reply to Stuart, > our setup is entirely Infiniband. We boot and install over IB, and rely > heavily on IP over Infiniband. > > As for users being 'confused' due to multiple IPs, I would appreciate some > more depth on that one. > Sure, all batch systems are sensitive to hostnames (as I know to my cost!) 
> but once you get that straightened out why should users care? > I am not being aggressive, just keen to find out more. > > > > -----Original Message----- > From: gpfsug-discuss-bounces at spectrumscale.org [mailto:gpfsug-discuss- > bounces at spectrumscale.org] On Behalf Of Stuart Barkley > Sent: Wednesday, February 28, 2018 6:50 PM > To: gpfsug main discussion list > Subject: Re: [gpfsug-discuss] Problems with remote mount via routed IB > > The problem with CM is that it seems to require configuring IP over > Infiniband. > > I'm rather strongly opposed to IP over IB. We did run IPoIB years ago, > but pulled it out of our environment as adding unneeded complexity. It > requires provisioning IP addresses across the Infiniband infrastructure and > possibly adding routers to other portions of the IP infrastructure. It was > also confusing some users due to multiple IPs on the compute infrastructure. > > We have recently been in discussions with a vendor about their support for > GPFS over IB and they kept directing us to using CM (which still didn't > work). CM wasn't necessary once we found out about the actual problem (we > needed the undocumented verbsRdmaUseGidIndexZero configuration option among > other things due to their use of SR-IOV based virtual IB interfaces). > > We don't use routed Infiniband and it might be that CM and IPoIB is > required for IB routing, but I doubt it. It sounds like the OP is keeping > IB and IP infrastructure separate. > > Stuart Barkley > > On Mon, 26 Feb 2018 at 14:16 -0000, Aaron Knister wrote: > > > Date: Mon, 26 Feb 2018 14:16:34 > > From: Aaron Knister > > Reply-To: gpfsug main discussion list > > > > To: gpfsug-discuss at spectrumscale.org > > Subject: Re: [gpfsug-discuss] Problems with remote mount via routed IB > > > > Hi Jan Erik, > > > > It was my understanding that the IB hardware router required RDMA CM to > work. > > By default GPFS doesn't use the RDMA Connection Manager but it can be > > enabled (e.g. verbsRdmaCm=enable). I think this requires a restart on > > clients/servers (in both clusters) to take effect. Maybe someone else > > on the list can comment in more detail-- I've been told folks have > > successfully deployed IB routers with GPFS. > > > > -Aaron > > > > On 2/26/18 11:38 AM, Sundermann, Jan Erik (SCC) wrote: > > > > > > Dear all > > > > > > we are currently trying to remote mount a file system in a routed > > > Infiniband test setup and face problems with dropped RDMA > > > connections. The setup is the > > > following: > > > > > > - Spectrum Scale Cluster 1 is setup on four servers which are > > > connected to the same infiniband network. Additionally they are > > > connected to a fast ethernet providing ip communication in the network > 192.168.11.0/24. > > > > > > - Spectrum Scale Cluster 2 is setup on four additional servers which > > > are connected to a second infiniband network. These servers have IPs > > > on their IB interfaces in the network 192.168.12.0/24. > > > > > > - IP is routed between 192.168.11.0/24 and 192.168.12.0/24 on a > > > dedicated machine. > > > > > > - We have a dedicated IB hardware router connected to both IB subnets. > > > > > > > > > We tested that the routing, both IP and IB, is working between the > > > two clusters without problems and that RDMA is working fine both for > > > internal communication inside cluster 1 and cluster 2 > > > > > > When trying to remote mount a file system from cluster 1 in cluster > > > 2, RDMA communication is not working as expected. 
Instead we see > > > error messages on the remote host (cluster 2) > > > > > > > > > 2018-02-23_13:48:47.037+0100: [I] VERBS RDMA connecting to > > > 192.168.11.4 (iccn004-gpfs in gpfsstorage.localdomain) on mlx4_0 > > > port 1 fabnum 0 index 2 > > > 2018-02-23_13:48:49.890+0100: [I] VERBS RDMA connected to > > > 192.168.11.4 (iccn004-gpfs in gpfsstorage.localdomain) on mlx4_0 > > > port 1 fabnum 0 sl 0 index 2 > > > 2018-02-23_13:48:53.138+0100: [E] VERBS RDMA closed connection to > > > 192.168.11.1 (iccn001-gpfs in gpfsstorage.localdomain) on mlx4_0 > > > port 1 fabnum 0 error 733 index 3 > > > 2018-02-23_13:48:53.854+0100: [I] VERBS RDMA connecting to > > > 192.168.11.1 (iccn001-gpfs in gpfsstorage.localdomain) on mlx4_0 > > > port 1 fabnum 0 index 3 > > > 2018-02-23_13:48:54.954+0100: [E] VERBS RDMA closed connection to > > > 192.168.11.3 (iccn003-gpfs in gpfsstorage.localdomain) on mlx4_0 > > > port 1 fabnum 0 error 733 index 1 > > > 2018-02-23_13:48:55.601+0100: [I] VERBS RDMA connected to > > > 192.168.11.1 (iccn001-gpfs in gpfsstorage.localdomain) on mlx4_0 > > > port 1 fabnum 0 sl 0 index 3 > > > 2018-02-23_13:48:57.775+0100: [I] VERBS RDMA connecting to > > > 192.168.11.3 (iccn003-gpfs in gpfsstorage.localdomain) on mlx4_0 > > > port 1 fabnum 0 index 1 > > > 2018-02-23_13:48:59.557+0100: [I] VERBS RDMA connected to > > > 192.168.11.3 (iccn003-gpfs in gpfsstorage.localdomain) on mlx4_0 > > > port 1 fabnum 0 sl 0 index 1 > > > 2018-02-23_13:48:59.876+0100: [E] VERBS RDMA closed connection to > > > 192.168.11.2 (iccn002-gpfs in gpfsstorage.localdomain) on mlx4_0 > > > port 1 fabnum 0 error 733 index 0 > > > 2018-02-23_13:49:02.020+0100: [I] VERBS RDMA connecting to > > > 192.168.11.2 (iccn002-gpfs in gpfsstorage.localdomain) on mlx4_0 > > > port 1 fabnum 0 index 0 > > > 2018-02-23_13:49:03.477+0100: [I] VERBS RDMA connected to > > > 192.168.11.2 (iccn002-gpfs in gpfsstorage.localdomain) on mlx4_0 > > > port 1 fabnum 0 sl 0 index 0 > > > 2018-02-23_13:49:05.119+0100: [E] VERBS RDMA closed connection to > > > 192.168.11.4 (iccn004-gpfs in gpfsstorage.localdomain) on mlx4_0 > > > port 1 fabnum 0 error 733 index 2 > > > 2018-02-23_13:49:06.191+0100: [I] VERBS RDMA connecting to > > > 192.168.11.4 (iccn004-gpfs in gpfsstorage.localdomain) on mlx4_0 > > > port 1 fabnum 0 index 2 > > > 2018-02-23_13:49:06.548+0100: [I] VERBS RDMA connected to > > > 192.168.11.4 (iccn004-gpfs in gpfsstorage.localdomain) on mlx4_0 > > > port 1 fabnum 0 sl 0 index 2 > > > 2018-02-23_13:49:11.578+0100: [E] VERBS RDMA closed connection to > > > 192.168.11.1 (iccn001-gpfs in gpfsstorage.localdomain) on mlx4_0 > > > port 1 fabnum 0 error 733 index 3 > > > 2018-02-23_13:49:11.937+0100: [I] VERBS RDMA connecting to > > > 192.168.11.1 (iccn001-gpfs in gpfsstorage.localdomain) on mlx4_0 > > > port 1 fabnum 0 index 3 > > > 2018-02-23_13:49:11.939+0100: [I] VERBS RDMA connected to > > > 192.168.11.1 (iccn001-gpfs in gpfsstorage.localdomain) on mlx4_0 > > > port 1 fabnum 0 sl 0 index 3 > > > > > > > > > and in the cluster with the file system (cluster 1) > > > > > > 2018-02-23_13:47:36.112+0100: [E] VERBS RDMA rdma read error > > > IBV_WC_RETRY_EXC_ERR to 192.168.12.5 (iccn005-ib in > > > gpfsremoteclients.localdomain) on mlx4_0 port 1 fabnum 0 vendor_err > > > 129 > > > 2018-02-23_13:47:36.112+0100: [E] VERBS RDMA closed connection to > > > 192.168.12.5 (iccn005-ib in gpfsremoteclients.localdomain) on mlx4_0 > > > port 1 fabnum 0 due to RDMA read error IBV_WC_RETRY_EXC_ERR index 3 > > > 2018-02-23_13:47:47.161+0100: 
[I] VERBS RDMA accepted and connected > > > to > > > 192.168.12.5 (iccn005-ib in gpfsremoteclients.localdomain) on mlx4_0 > > > port 1 fabnum 0 sl 0 index 3 > > > 2018-02-23_13:48:04.317+0100: [E] VERBS RDMA rdma read error > > > IBV_WC_RETRY_EXC_ERR to 192.168.12.5 (iccn005-ib in > > > gpfsremoteclients.localdomain) on mlx4_0 port 1 fabnum 0 vendor_err > > > 129 > > > 2018-02-23_13:48:04.317+0100: [E] VERBS RDMA closed connection to > > > 192.168.12.5 (iccn005-ib in gpfsremoteclients.localdomain) on mlx4_0 > > > port 1 fabnum 0 due to RDMA read error IBV_WC_RETRY_EXC_ERR index 3 > > > 2018-02-23_13:48:11.560+0100: [I] VERBS RDMA accepted and connected > > > to > > > 192.168.12.5 (iccn005-ib in gpfsremoteclients.localdomain) on mlx4_0 > > > port 1 fabnum 0 sl 0 index 3 > > > 2018-02-23_13:48:32.523+0100: [E] VERBS RDMA rdma read error > > > IBV_WC_RETRY_EXC_ERR to 192.168.12.5 (iccn005-ib in > > > gpfsremoteclients.localdomain) on mlx4_0 port 1 fabnum 0 vendor_err > > > 129 > > > 2018-02-23_13:48:32.523+0100: [E] VERBS RDMA closed connection to > > > 192.168.12.5 (iccn005-ib in gpfsremoteclients.localdomain) on mlx4_0 > > > port 1 fabnum 0 due to RDMA read error IBV_WC_RETRY_EXC_ERR index 3 > > > 2018-02-23_13:48:35.398+0100: [I] VERBS RDMA accepted and connected > > > to > > > 192.168.12.5 (iccn005-ib in gpfsremoteclients.localdomain) on mlx4_0 > > > port 1 fabnum 0 sl 0 index 3 > > > 2018-02-23_13:48:53.135+0100: [E] VERBS RDMA rdma read error > > > IBV_WC_RETRY_EXC_ERR to 192.168.12.5 (iccn005-ib in > > > gpfsremoteclients.localdomain) on mlx4_0 port 1 fabnum 0 vendor_err > > > 129 > > > 2018-02-23_13:48:53.135+0100: [E] VERBS RDMA closed connection to > > > 192.168.12.5 (iccn005-ib in gpfsremoteclients.localdomain) on mlx4_0 > > > port 1 fabnum 0 due to RDMA read error IBV_WC_RETRY_EXC_ERR index 3 > > > 2018-02-23_13:48:55.600+0100: [I] VERBS RDMA accepted and connected > > > to > > > 192.168.12.5 (iccn005-ib in gpfsremoteclients.localdomain) on mlx4_0 > > > port 1 fabnum 0 sl 0 index 3 > > > 2018-02-23_13:49:11.577+0100: [E] VERBS RDMA rdma read error > > > IBV_WC_RETRY_EXC_ERR to 192.168.12.5 (iccn005-ib in > > > gpfsremoteclients.localdomain) on mlx4_0 port 1 fabnum 0 vendor_err > > > 129 > > > 2018-02-23_13:49:11.577+0100: [E] VERBS RDMA closed connection to > > > 192.168.12.5 (iccn005-ib in gpfsremoteclients.localdomain) on mlx4_0 > > > port 1 fabnum 0 due to RDMA read error IBV_WC_RETRY_EXC_ERR index 3 > > > 2018-02-23_13:49:11.939+0100: [I] VERBS RDMA accepted and connected > > > to > > > 192.168.12.5 (iccn005-ib in gpfsremoteclients.localdomain) on mlx4_0 > > > port 1 fabnum 0 sl 0 index 3 > > > > > > > > > > > > Any advice on how to configure the setup in a way that would allow > > > the remote mount via routed IB would be very appreciated. 
> > > > > > > > > Thank you and best regards > > > Jan Erik > > > > > > > > > > > > > > > _______________________________________________ > > > gpfsug-discuss mailing list > > > gpfsug-discuss at spectrumscale.org > > > https://emea01.safelinks.protection.outlook.com/?url=http%3A%2F%2Fgp > > > fsug.org%2Fmailman%2Flistinfo%2Fgpfsug-discuss&data=01%7C01%7Cjohn.h > > > earns%40asml.com%7Ce40045550fc3467dd62808d57ed4d0d9%7Caf73baa8f5944e > > > b2a39d93e96cad61fc%7C1&sdata=v%2F35G6ZnlHFBm%2BfVddvcuraFd9FRChyOSRE > > > YpqcNNP8%3D&reserved=0 > > > > > > > -- > > Aaron Knister > > NASA Center for Climate Simulation (Code 606.2) Goddard Space Flight > > Center > > (301) 286-2776 > > _______________________________________________ > > gpfsug-discuss mailing list > > gpfsug-discuss at spectrumscale.org > > https://emea01.safelinks.protection.outlook.com/?url=http%3A%2F%2Fgpfs > > ug.org%2Fmailman%2Flistinfo%2Fgpfsug-discuss&data=01%7C01%7Cjohn.hearn > > s%40asml.com%7Ce40045550fc3467dd62808d57ed4d0d9%7Caf73baa8f5944eb2a39d > > 93e96cad61fc%7C1&sdata=v%2F35G6ZnlHFBm%2BfVddvcuraFd9FRChyOSREYpqcNNP8 > > %3D&reserved=0 > > > > -- > I've never been lost; I was once bewildered for three days, but never lost! > -- Daniel Boone > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > https://emea01.safelinks.protection.outlook.com/?url= > http%3A%2F%2Fgpfsug.org%2Fmailman%2Flistinfo%2Fgpfsug- > discuss&data=01%7C01%7Cjohn.hearns%40asml.com% > 7Ce40045550fc3467dd62808d57ed4d0d9%7Caf73baa8f5944eb2a39d93e96cad > 61fc%7C1&sdata=v%2F35G6ZnlHFBm%2BfVddvcuraFd9FRChyOSREYpqcNNP > 8%3D&reserved=0 > -- The information contained in this communication and any attachments is > confidential and may be privileged, and is for the sole use of the intended > recipient(s). Any unauthorized review, use, disclosure or distribution is > prohibited. Unless explicitly stated otherwise in the body of this > communication or the attachment thereto (if any), the information is > provided on an AS-IS basis without any express or implied warranties or > liabilities. To the extent you are relying on this information, you are > doing so at your own risk. If you are not the intended recipient, please > notify the sender immediately by replying to this message and destroy all > copies of this message and any attachments. Neither the sender nor the > company/group of companies he or she represents shall be liable for the > proper and complete transmission of the information contained in this > communication, or for any delay in its receipt. > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > -------------- next part -------------- An HTML attachment was scrubbed... URL: From Wei1.Guo at UTSouthwestern.edu Tue Mar 13 03:06:34 2018 From: Wei1.Guo at UTSouthwestern.edu (Wei Guo) Date: Tue, 13 Mar 2018 03:06:34 +0000 Subject: [gpfsug-discuss] RDMA read error IBV_WC_RETRY_EXC_ERR (Philipp Helo Rehs) Message-ID: <7b8dd0540c4542668f24c1a20c7aee76@SWMS13MAIL10.swmed.org> Hi, Philipp, FYI. We had exactly the same IBV_WC_RETRY_EXC_ERR error message in our gpfs client log along with other client error kernel: ib0: ipoib_cm_handle_tx_wc_rss: failed cm send event (status=12, wrid=83 vend_err 81) in the syslog. The root cause was a bad IB cable connecting a leaf switch to the core switch where the client used as route. 
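A fabric-wide sweep of the port error counters is a quick way to track down a link like that --
something like this with the infiniband-diags tools (illustrative only):

   ibqueryerrors            # reports switch/HCA ports with non-zero error counters
   perfquery <lid> <port>   # then read the suspect port's counters directly

A marginal cable typically shows up as SymbolErrorCounter or LinkErrorRecovery climbing on one
switch port.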
When we changed a new cable, the problem was solved and no more errors. We don't really have ipoib setup. The problem might be different from yours, but does the error message suggest that when your gpfs daemon tries to use mlx5_1, the packets are discarded so no connection? Did you do an IB bonding? Wei Guo HPC Administrator UTSW Message: 1 Date: Mon, 12 Mar 2018 21:09:14 +0100 From: Philipp Helo Rehs To: gpfsug-discuss at spectrumscale.org Subject: [gpfsug-discuss] RDMA read error IBV_WC_RETRY_EXC_ERR Message-ID: <4c6f6655-5712-663e-d551-375a42d562d8 at uni-duesseldorf.de> Content-Type: text/plain; charset=utf-8 Hello, I am reading your mailing-list since some weeks and I am quiete impressed about the knowledge and shared information here. We have a gpfs cluster with 4 nsds and 120 clients on Infiniband. Our NSD-Server have two infiniband ports on seperate cards mlx5_0 and mlx5_1. We have RDMA-CM enabled and ipv6 enabled on all nodes. We have added an IPoIB IP to all interfaces. But when we enable the second interface we get the following error from all nodes: 2018-03-12_20:49:38.923+0100: [E] VERBS RDMA closed connection to 10.100.0.83 (hilbert83-ib) on mlx5_1 port 1 fabnum 0 due to RDMA read error IBV_WC_RETRY_EXC_ERR index 45 2018-03-12_20:49:38.923+0100: [E] VERBS RDMA rdma read error IBV_WC_RETRY_EXC_ERR to 10.100.0.129 (hilbert129-ib) on mlx5_1 port 1 fabnum 0 vendor_err 129 2018-03-12_20:49:38.923+0100: [E] VERBS RDMA closed connection to 10.100.0.129 (hilbert129-ib) on mlx5_1 port 1 fabnum 0 due to RDMA read error IBV_WC_RETRY_EXC_ERR index 31 2018-03-12_20:49:38.923+0100: [E] VERBS RDMA rdma read error IBV_WC_RETRY_EXC_ERR to 10.100.0.134 (hilbert134-ib) on mlx5_1 port 1 fabnum 0 vendor_err 129 I have read that this issue can happen when verbsRdmasPerConnection is to low. We tried to increase the value and it got better but the problem is not fixed. Current config: minReleaseLevel 4.2.3.0 maxblocksize 16m cipherList AUTHONLY cesSharedRoot /ces ccrEnabled yes failureDetectionTime 40 leaseRecoveryWait 40 [hilbert1-ib,hilbert2-ib] worker1Threads 256 maxReceiverThreads 256 [common] tiebreakerDisks vd3;vd5;vd7 minQuorumNodes 2 verbsLibName libibverbs.so.1 verbsRdma enable verbsRdmasPerNode 256 verbsRdmaSend no scatterBufferSize 262144 pagepool 16g verbsPorts mlx4_0/1 [nsdNodes] verbsPorts mlx5_0/1 mlx5_1/1 [hilbert200-ib,hilbert201-ib,hilbert202-ib,hilbert203-ib,hilbert204-ib,hilbert205-ib,hilbert206-ib] verbsPorts mlx4_0/1 mlx4_1/1 [common] maxMBpS 11200 [common] verbsRdmaCm enable verbsRdmasPerConnection 14 adminMode central Kind regards Philipp Rehs --------------------------- Zentrum f?r Informations- und Medientechnologie Kompetenzzentrum f?r wissenschaftliches Rechnen und Speichern Heinrich-Heine-Universit?t D?sseldorf Universit?tsstr. 1 Raum 25.41.00.51 40225 D?sseldorf / Germany Tel: +49-211-81-15557 ________________________________ UT Southwestern Medical Center The future of medicine, today. From aaron.s.knister at nasa.gov Tue Mar 13 04:49:33 2018 From: aaron.s.knister at nasa.gov (Aaron Knister) Date: Tue, 13 Mar 2018 00:49:33 -0400 Subject: [gpfsug-discuss] spontaneous tracing? In-Reply-To: References: <3925216a-3146-357a-9aa0-0acc84bdb7b2@nasa.gov> Message-ID: Thanks! I admit, I'm confused because "/usr/lpp/mmfs/bin/mmcommon notifyOverload" does in fact start tracing for me on one of our clusters (technically 2, one in dev, one in prod). It did *not* start it on another test cluster. It looks to me like the difference is the mmsdrservport settings. 
On clusters where it's set to 0 tracing *does* start. On clusters where it's set to the default of 1191 (didn't try any other value) tracing *does not* start. I can toggle the behavior by changing the value of mmsdrservport back and forth. I do have a PMR open for this so I'll follow up there too. Thanks again for the help. -Aaron On 3/12/18 11:13 AM, IBM Spectrum Scale wrote: > /usr/lpp/mmfs/bin/mmcommon notifyOverload will not cause tracing to be > started. ?One can verify that using the underlying command being called > as shown in the following example with /tmp/n containing node names one > each line that will get the notification and the IP address being the > file system manager from which the command is issued. > > */usr/lpp/mmfs/bin/mmsdrcli notifyOverload /tmp/n 1191 192.168.117.131 3 8* > > The only case that deadlock detection code will initiate tracing is that > debugDataControl is set to "heavy" and tracing is not started. Then on > deadlock detection tracing is turned on for 20 seconds and turned off. > > That can be tested using command like > */usr/lpp/mmfs/bin/mmsdrcli notifyDeadlock /tmp/n 1191 192.168.117.131 3 8* > > And then mmfs.log will tell you what's going on. That's not a silent action. > > *2018-03-12_10:16:11.243-0400: [N] sdrServ: Received deadlock > notification from 192.168.117.131* > *2018-03-12_10:16:11.243-0400: [N] GPFS will attempt to collect debug > data on this node.* > *2018-03-12_10:16:11.953-0400: [I] Tracing in overwrite mode <== tracing > started* > *Trace started: Wait 20 seconds before cut and stop trace* > *2018-03-12_10:16:37.147-0400: [I] Tracing disabled <== tracing stopped > 20 seconds later* > *mmtrace: move /tmp/mmfs/lxtrace.trc.c69bc2xn01.cpu0 > /tmp/mmfs/trcfile.2018-03-12_10.16.11.2982.deadlock.c69bc2xn01.cpu0* > *mmtrace: formatting > /tmp/mmfs/trcfile.2018-03-12_10.16.11.2982.deadlock.c69bc2xn01 to > /tmp/mmfs/trcrpt.2018-03-12_10.16.11.2982.deadlock.c69bc2xn01.gz* > > > What's odd is there are no log events to indicate an overload occurred. > > Overload msg is only seen in mmfs.log when debugDataControl is "heavy". > mmdiag --deadlock shows overload related info starting from 4.2.3. > > *# mmdiag --deadlock* > > *=== mmdiag: deadlock ===* > > *Effective deadlock detection threshold on c69bc2xn01 is 1800 seconds* > *Effective deadlock detection threshold on c69bc2xn01 is 360 seconds for > short waiters* > > *Cluster c69bc2xn01.gpfs.net is overloaded. The overload index on > c69bc2xn01 is 0.01812 <==* > > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > -- Aaron Knister NASA Center for Climate Simulation (Code 606.2) Goddard Space Flight Center (301) 286-2776 From john.hearns at asml.com Tue Mar 13 10:37:43 2018 From: john.hearns at asml.com (John Hearns) Date: Tue, 13 Mar 2018 10:37:43 +0000 Subject: [gpfsug-discuss] Preferred NSD In-Reply-To: References: <20180312145105.yb24bzo6chlpthwm@ics.muni.cz> Message-ID: Lukas, It looks like you are proposing a setup which uses your compute servers as storage servers also? * I'm thinking about the following setup: ~ 60 nodes, each with two enterprise NVMe SSDs, FDR IB interconnected There is nothing wrong with this concept, for instance see https://www.beegfs.io/wiki/BeeOND I have an NVMe filesystem which uses 60 drives, but there are 10 servers. You should look at "failure zones" also. 
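In GPFS terms that usually means giving each server's NSDs their own failure group and creating
the filesystem with two copies of data and metadata, so the replicas never sit on the same node.
A rough, untested sketch (node and device names are made up):

   %nsd: device=/dev/nvme0n1 nsd=node01_nvme0 servers=node01 usage=dataAndMetadata failureGroup=1 pool=system
   %nsd: device=/dev/nvme1n1 nsd=node01_nvme1 servers=node01 usage=dataAndMetadata failureGroup=1 pool=system
   %nsd: device=/dev/nvme0n1 nsd=node02_nvme0 servers=node02 usage=dataAndMetadata failureGroup=2 pool=system
   # ...one pair of stanzas per node, failureGroup matching the node...

and then roughly:

   mmcrnsd -F nvme_stanzas.txt
   mmcrfs scratch -F nvme_stanzas.txt -m 2 -M 2 -r 2 -R 2

Losing one node then costs you a replica rather than the filesystem, although with only two
copies a second node failure in the wrong place can still leave blocks without a live copy.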
From: gpfsug-discuss-bounces at spectrumscale.org [mailto:gpfsug-discuss-bounces at spectrumscale.org] On Behalf Of Knister, Aaron S. (GSFC-606.2)[COMPUTER SCIENCE CORP] Sent: Monday, March 12, 2018 4:14 PM To: gpfsug main discussion list Subject: Re: [gpfsug-discuss] Preferred NSD Hi Lukas, Check out FPO mode. That mimics Hadoop's data placement features. You can have up to 3 replicas both data and metadata but still the downside, though, as you say is the wrong node failures will take your cluster down. You might want to check out something like Excelero's NVMesh (note: not an endorsement since I can't give such things) which can create logical volumes across all your NVMe drives. The product has erasure coding on their roadmap. I'm not sure if they've released that feature yet but in theory it will give better fault tolerance *and* you'll get more efficient usage of your SSDs. I'm sure there are other ways to skin this cat too. -Aaron On March 12, 2018 at 10:59:35 EDT, Lukas Hejtmanek > wrote: Hello, I'm thinking about the following setup: ~ 60 nodes, each with two enterprise NVMe SSDs, FDR IB interconnected I would like to setup shared scratch area using GPFS and those NVMe SSDs. Each SSDs as on NSD. I don't think like 5 or more data/metadata replicas are practical here. On the other hand, multiple node failures is something really expected. Is there a way to instrument that local NSD is strongly preferred to store data? I.e. node failure most probably does not result in unavailable data for the other nodes? Or is there any other recommendation/solution to build shared scratch with GPFS in such setup? (Do not do it including.) -- Luk?? Hejtm?nek _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss -- The information contained in this communication and any attachments is confidential and may be privileged, and is for the sole use of the intended recipient(s). Any unauthorized review, use, disclosure or distribution is prohibited. Unless explicitly stated otherwise in the body of this communication or the attachment thereto (if any), the information is provided on an AS-IS basis without any express or implied warranties or liabilities. To the extent you are relying on this information, you are doing so at your own risk. If you are not the intended recipient, please notify the sender immediately by replying to this message and destroy all copies of this message and any attachments. Neither the sender nor the company/group of companies he or she represents shall be liable for the proper and complete transmission of the information contained in this communication, or for any delay in its receipt. -------------- next part -------------- An HTML attachment was scrubbed... URL: From xhejtman at ics.muni.cz Tue Mar 13 14:16:30 2018 From: xhejtman at ics.muni.cz (Lukas Hejtmanek) Date: Tue, 13 Mar 2018 15:16:30 +0100 Subject: [gpfsug-discuss] Preferred NSD In-Reply-To: References: <20180312145105.yb24bzo6chlpthwm@ics.muni.cz> Message-ID: <20180313141630.k2lqvcndz7mrakgs@ics.muni.cz> On Tue, Mar 13, 2018 at 10:37:43AM +0000, John Hearns wrote: > Lukas, > It looks like you are proposing a setup which uses your compute servers as storage servers also? yes, exactly. I would like to utilise NVMe SSDs that are in every compute servers.. Using them as a shared scratch area with GPFS is one of the options. 
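If I go the FPO route mentioned earlier in the thread, my understanding is that the local-write
behaviour is switched on per storage pool when the filesystem is created, roughly like this
(untested sketch, the values are only examples):

   %pool:
     pool=data
     blockSize=2M
     layoutMap=cluster
     allowWriteAffinity=yes
     writeAffinityDepth=1
     blockGroupFactor=128

so the first copy of a write lands on the local NSDs and the replicas cover node failures.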
> > * I'm thinking about the following setup: > ~ 60 nodes, each with two enterprise NVMe SSDs, FDR IB interconnected > > There is nothing wrong with this concept, for instance see > https://www.beegfs.io/wiki/BeeOND > > I have an NVMe filesystem which uses 60 drives, but there are 10 servers. > You should look at "failure zones" also. you still need the storage servers and local SSDs to use only for caching, do I understand correctly? > > From: gpfsug-discuss-bounces at spectrumscale.org [mailto:gpfsug-discuss-bounces at spectrumscale.org] On Behalf Of Knister, Aaron S. (GSFC-606.2)[COMPUTER SCIENCE CORP] > Sent: Monday, March 12, 2018 4:14 PM > To: gpfsug main discussion list > Subject: Re: [gpfsug-discuss] Preferred NSD > > Hi Lukas, > > Check out FPO mode. That mimics Hadoop's data placement features. You can have up to 3 replicas both data and metadata but still the downside, though, as you say is the wrong node failures will take your cluster down. > > You might want to check out something like Excelero's NVMesh (note: not an endorsement since I can't give such things) which can create logical volumes across all your NVMe drives. The product has erasure coding on their roadmap. I'm not sure if they've released that feature yet but in theory it will give better fault tolerance *and* you'll get more efficient usage of your SSDs. > > I'm sure there are other ways to skin this cat too. > > -Aaron > > > > On March 12, 2018 at 10:59:35 EDT, Lukas Hejtmanek > wrote: > Hello, > > I'm thinking about the following setup: > ~ 60 nodes, each with two enterprise NVMe SSDs, FDR IB interconnected > > I would like to setup shared scratch area using GPFS and those NVMe SSDs. Each > SSDs as on NSD. > > I don't think like 5 or more data/metadata replicas are practical here. On the > other hand, multiple node failures is something really expected. > > Is there a way to instrument that local NSD is strongly preferred to store > data? I.e. node failure most probably does not result in unavailable data for > the other nodes? > > Or is there any other recommendation/solution to build shared scratch with > GPFS in such setup? (Do not do it including.) > > -- > Luk?? Hejtm?nek > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > -- The information contained in this communication and any attachments is confidential and may be privileged, and is for the sole use of the intended recipient(s). Any unauthorized review, use, disclosure or distribution is prohibited. Unless explicitly stated otherwise in the body of this communication or the attachment thereto (if any), the information is provided on an AS-IS basis without any express or implied warranties or liabilities. To the extent you are relying on this information, you are doing so at your own risk. If you are not the intended recipient, please notify the sender immediately by replying to this message and destroy all copies of this message and any attachments. Neither the sender nor the company/group of companies he or she represents shall be liable for the proper and complete transmission of the information contained in this communication, or for any delay in its receipt. > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss -- Luk?? 
Hejtm?nek Linux Administrator only because Full Time Multitasking Ninja is not an official job title From jan.sundermann at kit.edu Tue Mar 13 14:35:36 2018 From: jan.sundermann at kit.edu (Jan Erik Sundermann) Date: Tue, 13 Mar 2018 15:35:36 +0100 Subject: [gpfsug-discuss] Problems with remote mount via routed IB In-Reply-To: References: <471B111F-5DAA-4912-829C-9AA75DCB76FA@kit.edu> Message-ID: Hi John We try to route infiniband traffic. The IP traffic is routed separately. The two clusters we try to connect are configured differently, one with IP over IB the other one with dedicated ethernet adapters. Jan Erik On 02/27/2018 10:17 AM, John Hearns wrote: > Jan Erik, > Can you clarify if you are routing IP traffic between the two Infiniband networks. > Or are you routing Infiniband traffic? > > > If I can be of help I manage an Infiniband network which connects to other IP networks using Mellanox VPI gateways, which proxy arp between IB and Ethernet. But I am not running GPFS traffic over these. > > > > -----Original Message----- > From: gpfsug-discuss-bounces at spectrumscale.org [mailto:gpfsug-discuss-bounces at spectrumscale.org] On Behalf Of Sundermann, Jan Erik (SCC) > Sent: Monday, February 26, 2018 5:39 PM > To: gpfsug-discuss at spectrumscale.org > Subject: [gpfsug-discuss] Problems with remote mount via routed IB > > > Dear all > > we are currently trying to remote mount a file system in a routed Infiniband test setup and face problems with dropped RDMA connections. The setup is the following: > > - Spectrum Scale Cluster 1 is setup on four servers which are connected to the same infiniband network. Additionally they are connected to a fast ethernet providing ip communication in the network 192.168.11.0/24. > > - Spectrum Scale Cluster 2 is setup on four additional servers which are connected to a second infiniband network. These servers have IPs on their IB interfaces in the network 192.168.12.0/24. > > - IP is routed between 192.168.11.0/24 and 192.168.12.0/24 on a dedicated machine. > > - We have a dedicated IB hardware router connected to both IB subnets. > > > We tested that the routing, both IP and IB, is working between the two clusters without problems and that RDMA is working fine both for internal communication inside cluster 1 and cluster 2 > > When trying to remote mount a file system from cluster 1 in cluster 2, RDMA communication is not working as expected. 
Instead we see error messages on the remote host (cluster 2) > > > 2018-02-23_13:48:47.037+0100: [I] VERBS RDMA connecting to 192.168.11.4 (iccn004-gpfs in gpfsstorage.localdomain) on mlx4_0 port 1 fabnum 0 index 2 > 2018-02-23_13:48:49.890+0100: [I] VERBS RDMA connected to 192.168.11.4 (iccn004-gpfs in gpfsstorage.localdomain) on mlx4_0 port 1 fabnum 0 sl 0 index 2 > 2018-02-23_13:48:53.138+0100: [E] VERBS RDMA closed connection to 192.168.11.1 (iccn001-gpfs in gpfsstorage.localdomain) on mlx4_0 port 1 fabnum 0 error 733 index 3 > 2018-02-23_13:48:53.854+0100: [I] VERBS RDMA connecting to 192.168.11.1 (iccn001-gpfs in gpfsstorage.localdomain) on mlx4_0 port 1 fabnum 0 index 3 > 2018-02-23_13:48:54.954+0100: [E] VERBS RDMA closed connection to 192.168.11.3 (iccn003-gpfs in gpfsstorage.localdomain) on mlx4_0 port 1 fabnum 0 error 733 index 1 > 2018-02-23_13:48:55.601+0100: [I] VERBS RDMA connected to 192.168.11.1 (iccn001-gpfs in gpfsstorage.localdomain) on mlx4_0 port 1 fabnum 0 sl 0 index 3 > 2018-02-23_13:48:57.775+0100: [I] VERBS RDMA connecting to 192.168.11.3 (iccn003-gpfs in gpfsstorage.localdomain) on mlx4_0 port 1 fabnum 0 index 1 > 2018-02-23_13:48:59.557+0100: [I] VERBS RDMA connected to 192.168.11.3 (iccn003-gpfs in gpfsstorage.localdomain) on mlx4_0 port 1 fabnum 0 sl 0 index 1 > 2018-02-23_13:48:59.876+0100: [E] VERBS RDMA closed connection to 192.168.11.2 (iccn002-gpfs in gpfsstorage.localdomain) on mlx4_0 port 1 fabnum 0 error 733 index 0 > 2018-02-23_13:49:02.020+0100: [I] VERBS RDMA connecting to 192.168.11.2 (iccn002-gpfs in gpfsstorage.localdomain) on mlx4_0 port 1 fabnum 0 index 0 > 2018-02-23_13:49:03.477+0100: [I] VERBS RDMA connected to 192.168.11.2 (iccn002-gpfs in gpfsstorage.localdomain) on mlx4_0 port 1 fabnum 0 sl 0 index 0 > 2018-02-23_13:49:05.119+0100: [E] VERBS RDMA closed connection to 192.168.11.4 (iccn004-gpfs in gpfsstorage.localdomain) on mlx4_0 port 1 fabnum 0 error 733 index 2 > 2018-02-23_13:49:06.191+0100: [I] VERBS RDMA connecting to 192.168.11.4 (iccn004-gpfs in gpfsstorage.localdomain) on mlx4_0 port 1 fabnum 0 index 2 > 2018-02-23_13:49:06.548+0100: [I] VERBS RDMA connected to 192.168.11.4 (iccn004-gpfs in gpfsstorage.localdomain) on mlx4_0 port 1 fabnum 0 sl 0 index 2 > 2018-02-23_13:49:11.578+0100: [E] VERBS RDMA closed connection to 192.168.11.1 (iccn001-gpfs in gpfsstorage.localdomain) on mlx4_0 port 1 fabnum 0 error 733 index 3 > 2018-02-23_13:49:11.937+0100: [I] VERBS RDMA connecting to 192.168.11.1 (iccn001-gpfs in gpfsstorage.localdomain) on mlx4_0 port 1 fabnum 0 index 3 > 2018-02-23_13:49:11.939+0100: [I] VERBS RDMA connected to 192.168.11.1 (iccn001-gpfs in gpfsstorage.localdomain) on mlx4_0 port 1 fabnum 0 sl 0 index 3 > > > and in the cluster with the file system (cluster 1) > > 2018-02-23_13:47:36.112+0100: [E] VERBS RDMA rdma read error IBV_WC_RETRY_EXC_ERR to 192.168.12.5 (iccn005-ib in gpfsremoteclients.localdomain) on mlx4_0 port 1 fabnum 0 vendor_err 129 > 2018-02-23_13:47:36.112+0100: [E] VERBS RDMA closed connection to 192.168.12.5 (iccn005-ib in gpfsremoteclients.localdomain) on mlx4_0 port 1 fabnum 0 due to RDMA read error IBV_WC_RETRY_EXC_ERR index 3 > 2018-02-23_13:47:47.161+0100: [I] VERBS RDMA accepted and connected to 192.168.12.5 (iccn005-ib in gpfsremoteclients.localdomain) on mlx4_0 port 1 fabnum 0 sl 0 index 3 > 2018-02-23_13:48:04.317+0100: [E] VERBS RDMA rdma read error IBV_WC_RETRY_EXC_ERR to 192.168.12.5 (iccn005-ib in gpfsremoteclients.localdomain) on mlx4_0 port 1 fabnum 0 vendor_err 129 > 
2018-02-23_13:48:04.317+0100: [E] VERBS RDMA closed connection to 192.168.12.5 (iccn005-ib in gpfsremoteclients.localdomain) on mlx4_0 port 1 fabnum 0 due to RDMA read error IBV_WC_RETRY_EXC_ERR index 3 > 2018-02-23_13:48:11.560+0100: [I] VERBS RDMA accepted and connected to 192.168.12.5 (iccn005-ib in gpfsremoteclients.localdomain) on mlx4_0 port 1 fabnum 0 sl 0 index 3 > 2018-02-23_13:48:32.523+0100: [E] VERBS RDMA rdma read error IBV_WC_RETRY_EXC_ERR to 192.168.12.5 (iccn005-ib in gpfsremoteclients.localdomain) on mlx4_0 port 1 fabnum 0 vendor_err 129 > 2018-02-23_13:48:32.523+0100: [E] VERBS RDMA closed connection to 192.168.12.5 (iccn005-ib in gpfsremoteclients.localdomain) on mlx4_0 port 1 fabnum 0 due to RDMA read error IBV_WC_RETRY_EXC_ERR index 3 > 2018-02-23_13:48:35.398+0100: [I] VERBS RDMA accepted and connected to 192.168.12.5 (iccn005-ib in gpfsremoteclients.localdomain) on mlx4_0 port 1 fabnum 0 sl 0 index 3 > 2018-02-23_13:48:53.135+0100: [E] VERBS RDMA rdma read error IBV_WC_RETRY_EXC_ERR to 192.168.12.5 (iccn005-ib in gpfsremoteclients.localdomain) on mlx4_0 port 1 fabnum 0 vendor_err 129 > 2018-02-23_13:48:53.135+0100: [E] VERBS RDMA closed connection to 192.168.12.5 (iccn005-ib in gpfsremoteclients.localdomain) on mlx4_0 port 1 fabnum 0 due to RDMA read error IBV_WC_RETRY_EXC_ERR index 3 > 2018-02-23_13:48:55.600+0100: [I] VERBS RDMA accepted and connected to 192.168.12.5 (iccn005-ib in gpfsremoteclients.localdomain) on mlx4_0 port 1 fabnum 0 sl 0 index 3 > 2018-02-23_13:49:11.577+0100: [E] VERBS RDMA rdma read error IBV_WC_RETRY_EXC_ERR to 192.168.12.5 (iccn005-ib in gpfsremoteclients.localdomain) on mlx4_0 port 1 fabnum 0 vendor_err 129 > 2018-02-23_13:49:11.577+0100: [E] VERBS RDMA closed connection to 192.168.12.5 (iccn005-ib in gpfsremoteclients.localdomain) on mlx4_0 port 1 fabnum 0 due to RDMA read error IBV_WC_RETRY_EXC_ERR index 3 > 2018-02-23_13:49:11.939+0100: [I] VERBS RDMA accepted and connected to 192.168.12.5 (iccn005-ib in gpfsremoteclients.localdomain) on mlx4_0 port 1 fabnum 0 sl 0 index 3 > > > > Any advice on how to configure the setup in a way that would allow the remote mount via routed IB would be very appreciated. > > > Thank you and best regards > Jan Erik > > > -- The information contained in this communication and any attachments is confidential and may be privileged, and is for the sole use of the intended recipient(s). Any unauthorized review, use, disclosure or distribution is prohibited. Unless explicitly stated otherwise in the body of this communication or the attachment thereto (if any), the information is provided on an AS-IS basis without any express or implied warranties or liabilities. To the extent you are relying on this information, you are doing so at your own risk. If you are not the intended recipient, please notify the sender immediately by replying to this message and destroy all copies of this message and any attachments. Neither the sender nor the company/group of companies he or she represents shall be liable for the proper and complete transmission of the information contained in this communication, or for any delay in its receipt. 
> _______________________________________________
> gpfsug-discuss mailing list
> gpfsug-discuss at spectrumscale.org
> http://gpfsug.org/mailman/listinfo/gpfsug-discuss
>

--

Karlsruhe Institute of Technology (KIT)
Steinbuch Centre for Computing (SCC)

Jan Erik Sundermann

Hermann-von-Helmholtz-Platz 1, Building 449, Room 226
D-76344 Eggenstein-Leopoldshafen

Tel: +49 721 608 26191
Email: jan.sundermann at kit.edu
www.scc.kit.edu

KIT – The Research University in the Helmholtz Association

Since 2010, KIT has been certified as a family-friendly university.

-------------- next part --------------
A non-text attachment was scrubbed...
Name: smime.p7s
Type: application/pkcs7-signature
Size: 5382 bytes
Desc: S/MIME Cryptographic Signature
URL:

From Robert.Oesterlin at nuance.com Tue Mar 13 14:42:24 2018
From: Robert.Oesterlin at nuance.com (Oesterlin, Robert)
Date: Tue, 13 Mar 2018 14:42:24 +0000
Subject: [gpfsug-discuss] SSUG USA Spring Meeting - Registration and call for speakers is now open!
Message-ID: <1289B944-B4F5-40E8-861C-33423B318457@nuance.com>

The registration for the Spring meeting of the SSUG-USA is now open. You can register here:

https://www.eventbrite.com/e/spectrum-scale-gpfs-user-group-us-spring-2018-meeting-tickets-43662759489

DATE AND TIME
Wed, May 16, 2018, 9:00 AM – Thu, May 17, 2018, 5:00 PM EDT

LOCATION
IBM Cambridge Innovation Center
One Rogers Street
Cambridge, MA 02142-1203

Please note that we have limited meeting space, so please register only if you're sure you can attend. A detailed agenda will be published in the coming weeks.

If you are interested in presenting, please contact me. I do have several speakers lined up already, but we can use a few more.

Bob Oesterlin
Sr Principal Storage Engineer, Nuance

-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From jan.sundermann at kit.edu Tue Mar 13 15:24:13 2018
From: jan.sundermann at kit.edu (Jan Erik Sundermann)
Date: Tue, 13 Mar 2018 16:24:13 +0100
Subject: [gpfsug-discuss] Problems with remote mount via routed IB
In-Reply-To: 
References: <471B111F-5DAA-4912-829C-9AA75DCB76FA@kit.edu>
Message-ID: 

Hello Zachary

We are currently changing our setup to have IP over IB on all machines to be able to enable verbsRdmaCm.

According to Mellanox (https://community.mellanox.com/docs/DOC-2384) ibacm requires pre-populated caches to be distributed to all end hosts with the mapping of IP to the routable GIDs (of both IB subnets). Was this also required in your successful deployment?

Best
Jan Erik


On 03/12/2018 11:10 PM, Zachary Mance wrote:
> Since I am testing out remote mounting with EDR IB routers, I'll add to
> the discussion.
>
> In my lab environment I was seeing the same rdma connections being
> established and then disconnected shortly after. The remote filesystem
> would eventually mount on the clients, but it took quite a while
> (~2mins). Even after mounting, accessing files or any metadata
> operations would take a while to execute, but eventually it happened.
>
> After enabling verbsRdmaCm, everything mounted just fine and in a timely
> manner. Spectrum Scale was using the librdmacm.so library.
>
> I would first double check that you have both clusters able to talk to
> each other on their IPoIB address, then make sure you enable verbsRdmaCm
> on both clusters.
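(For reference, enabling RDMA CM in Spectrum Scale comes down to something like the commands below, run against both the owning and the remote cluster. This is only a sketch: check whether your code level really needs a full daemon restart before scheduling one.)

    # enable the RDMA connection manager cluster-wide
    mmchconfig verbsRdmaCm=enable
    # the setting only takes effect once GPFS is restarted on the affected nodes
    mmshutdown -a
    mmstartup -a
    # confirm the value that is actually in effect
    mmdiag --config | grep -i verbsRdmaCm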
> > > --------------------------------------------------------------------------------------------------------------- > Zach Mance zmance at ucar.edu ?(303) 497-1883 > HPC Data Infrastructure Group?/ CISL / NCAR > --------------------------------------------------------------------------------------------------------------- > > > On Thu, Mar 1, 2018 at 1:41 AM, John Hearns > wrote: > > In reply to Stuart, > our setup is entirely Infiniband. We boot and install over IB, and > rely heavily on IP over Infiniband. > > As for users being 'confused' due to multiple IPs, I would > appreciate some more depth on that one. > Sure, all batch systems are sensitive to hostnames (as I know to my > cost!) but once you get that straightened out why should users care? > I am not being aggressive, just keen to find out more. > > > > -----Original Message----- > From: gpfsug-discuss-bounces at spectrumscale.org > > [mailto:gpfsug-discuss-bounces at spectrumscale.org > ] On Behalf Of > Stuart Barkley > Sent: Wednesday, February 28, 2018 6:50 PM > To: gpfsug main discussion list > > Subject: Re: [gpfsug-discuss] Problems with remote mount via routed IB > > The problem with CM is that it seems to require configuring IP over > Infiniband. > > I'm rather strongly opposed to IP over IB.? We did run IPoIB years > ago, but pulled it out of our environment as adding unneeded > complexity.? It requires provisioning IP addresses across the > Infiniband infrastructure and possibly adding routers to other > portions of the IP infrastructure.? It was also confusing some users > due to multiple IPs on the compute infrastructure. > > We have recently been in discussions with a vendor about their > support for GPFS over IB and they kept directing us to using CM > (which still didn't work).? CM wasn't necessary once we found out > about the actual problem (we needed the undocumented > verbsRdmaUseGidIndexZero configuration option among other things due > to their use of SR-IOV based virtual IB interfaces). > > We don't use routed Infiniband and it might be that CM and IPoIB is > required for IB routing, but I doubt it.? It sounds like the OP is > keeping IB and IP infrastructure separate. > > Stuart Barkley > > On Mon, 26 Feb 2018 at 14:16 -0000, Aaron Knister wrote: > > > Date: Mon, 26 Feb 2018 14:16:34 > > From: Aaron Knister > > > Reply-To: gpfsug main discussion list > > > > > To: gpfsug-discuss at spectrumscale.org > > > Subject: Re: [gpfsug-discuss] Problems with remote mount via > routed IB > > > > Hi Jan Erik, > > > > It was my understanding that the IB hardware router required RDMA > CM to work. > > By default GPFS doesn't use the RDMA Connection Manager but it can be > > enabled (e.g. verbsRdmaCm=enable). I think this requires a restart on > > clients/servers (in both clusters) to take effect. Maybe someone else > > on the list can comment in more detail-- I've been told folks have > > successfully deployed IB routers with GPFS. > > > > -Aaron > > > > On 2/26/18 11:38 AM, Sundermann, Jan Erik (SCC) wrote: > > > > > > Dear all > > > > > > we are currently trying to remote mount a file system in a routed > > > Infiniband test setup and face problems with dropped RDMA > > > connections. The setup is the > > > following: > > > > > > - Spectrum Scale Cluster 1 is setup on four servers which are > > > connected to the same infiniband network. Additionally they are > > > connected to a fast ethernet providing ip communication in the > network 192.168.11.0/24 . 
> > > > > > - Spectrum Scale Cluster 2 is setup on four additional servers > which > > > are connected to a second infiniband network. These servers > have IPs > > > on their IB interfaces in the network 192.168.12.0/24 > . > > > > > > - IP is routed between 192.168.11.0/24 > and 192.168.12.0/24 on a > > > dedicated machine. > > > > > > - We have a dedicated IB hardware router connected to both IB > subnets. > > > > > > > > > We tested that the routing, both IP and IB, is working between the > > > two clusters without problems and that RDMA is working fine > both for > > > internal communication inside cluster 1 and cluster 2 > > > > > > When trying to remote mount a file system from cluster 1 in cluster > > > 2, RDMA communication is not working as expected. Instead we see > > > error messages on the remote host (cluster 2) > > > > > > > > > 2018-02-23_13:48:47.037+0100: [I] VERBS RDMA connecting to > > > 192.168.11.4 (iccn004-gpfs in gpfsstorage.localdomain) on mlx4_0 > > > port 1 fabnum 0 index 2 > > > 2018-02-23_13:48:49.890+0100: [I] VERBS RDMA connected to > > > 192.168.11.4 (iccn004-gpfs in gpfsstorage.localdomain) on mlx4_0 > > > port 1 fabnum 0 sl 0 index 2 > > > 2018-02-23_13:48:53.138+0100: [E] VERBS RDMA closed connection to > > > 192.168.11.1 (iccn001-gpfs in gpfsstorage.localdomain) on mlx4_0 > > > port 1 fabnum 0 error 733 index 3 > > > 2018-02-23_13:48:53.854+0100: [I] VERBS RDMA connecting to > > > 192.168.11.1 (iccn001-gpfs in gpfsstorage.localdomain) on mlx4_0 > > > port 1 fabnum 0 index 3 > > > 2018-02-23_13:48:54.954+0100: [E] VERBS RDMA closed connection to > > > 192.168.11.3 (iccn003-gpfs in gpfsstorage.localdomain) on mlx4_0 > > > port 1 fabnum 0 error 733 index 1 > > > 2018-02-23_13:48:55.601+0100: [I] VERBS RDMA connected to > > > 192.168.11.1 (iccn001-gpfs in gpfsstorage.localdomain) on mlx4_0 > > > port 1 fabnum 0 sl 0 index 3 > > > 2018-02-23_13:48:57.775+0100: [I] VERBS RDMA connecting to > > > 192.168.11.3 (iccn003-gpfs in gpfsstorage.localdomain) on mlx4_0 > > > port 1 fabnum 0 index 1 > > > 2018-02-23_13:48:59.557+0100: [I] VERBS RDMA connected to > > > 192.168.11.3 (iccn003-gpfs in gpfsstorage.localdomain) on mlx4_0 > > > port 1 fabnum 0 sl 0 index 1 > > > 2018-02-23_13:48:59.876+0100: [E] VERBS RDMA closed connection to > > > 192.168.11.2 (iccn002-gpfs in gpfsstorage.localdomain) on mlx4_0 > > > port 1 fabnum 0 error 733 index 0 > > > 2018-02-23_13:49:02.020+0100: [I] VERBS RDMA connecting to > > > 192.168.11.2 (iccn002-gpfs in gpfsstorage.localdomain) on mlx4_0 > > > port 1 fabnum 0 index 0 > > > 2018-02-23_13:49:03.477+0100: [I] VERBS RDMA connected to > > > 192.168.11.2 (iccn002-gpfs in gpfsstorage.localdomain) on mlx4_0 > > > port 1 fabnum 0 sl 0 index 0 > > > 2018-02-23_13:49:05.119+0100: [E] VERBS RDMA closed connection to > > > 192.168.11.4 (iccn004-gpfs in gpfsstorage.localdomain) on mlx4_0 > > > port 1 fabnum 0 error 733 index 2 > > > 2018-02-23_13:49:06.191+0100: [I] VERBS RDMA connecting to > > > 192.168.11.4 (iccn004-gpfs in gpfsstorage.localdomain) on mlx4_0 > > > port 1 fabnum 0 index 2 > > > 2018-02-23_13:49:06.548+0100: [I] VERBS RDMA connected to > > > 192.168.11.4 (iccn004-gpfs in gpfsstorage.localdomain) on mlx4_0 > > > port 1 fabnum 0 sl 0 index 2 > > > 2018-02-23_13:49:11.578+0100: [E] VERBS RDMA closed connection to > > > 192.168.11.1 (iccn001-gpfs in gpfsstorage.localdomain) on mlx4_0 > > > port 1 fabnum 0 error 733 index 3 > > > 2018-02-23_13:49:11.937+0100: [I] VERBS RDMA connecting to > > > 192.168.11.1 (iccn001-gpfs in 
gpfsstorage.localdomain) on mlx4_0 > > > port 1 fabnum 0 index 3 > > > 2018-02-23_13:49:11.939+0100: [I] VERBS RDMA connected to > > > 192.168.11.1 (iccn001-gpfs in gpfsstorage.localdomain) on mlx4_0 > > > port 1 fabnum 0 sl 0 index 3 > > > > > > > > > and in the cluster with the file system (cluster 1) > > > > > > 2018-02-23_13:47:36.112+0100: [E] VERBS RDMA rdma read error > > > IBV_WC_RETRY_EXC_ERR to 192.168.12.5 (iccn005-ib in > > > gpfsremoteclients.localdomain) on mlx4_0 port 1 fabnum 0 vendor_err > > > 129 > > > 2018-02-23_13:47:36.112+0100: [E] VERBS RDMA closed connection to > > > 192.168.12.5 (iccn005-ib in gpfsremoteclients.localdomain) on > mlx4_0 > > > port 1 fabnum 0 due to RDMA read error IBV_WC_RETRY_EXC_ERR index 3 > > > 2018-02-23_13:47:47.161+0100: [I] VERBS RDMA accepted and connected > > > to > > > 192.168.12.5 (iccn005-ib in gpfsremoteclients.localdomain) on > mlx4_0 > > > port 1 fabnum 0 sl 0 index 3 > > > 2018-02-23_13:48:04.317+0100: [E] VERBS RDMA rdma read error > > > IBV_WC_RETRY_EXC_ERR to 192.168.12.5 (iccn005-ib in > > > gpfsremoteclients.localdomain) on mlx4_0 port 1 fabnum 0 vendor_err > > > 129 > > > 2018-02-23_13:48:04.317+0100: [E] VERBS RDMA closed connection to > > > 192.168.12.5 (iccn005-ib in gpfsremoteclients.localdomain) on > mlx4_0 > > > port 1 fabnum 0 due to RDMA read error IBV_WC_RETRY_EXC_ERR index 3 > > > 2018-02-23_13:48:11.560+0100: [I] VERBS RDMA accepted and connected > > > to > > > 192.168.12.5 (iccn005-ib in gpfsremoteclients.localdomain) on > mlx4_0 > > > port 1 fabnum 0 sl 0 index 3 > > > 2018-02-23_13:48:32.523+0100: [E] VERBS RDMA rdma read error > > > IBV_WC_RETRY_EXC_ERR to 192.168.12.5 (iccn005-ib in > > > gpfsremoteclients.localdomain) on mlx4_0 port 1 fabnum 0 vendor_err > > > 129 > > > 2018-02-23_13:48:32.523+0100: [E] VERBS RDMA closed connection to > > > 192.168.12.5 (iccn005-ib in gpfsremoteclients.localdomain) on > mlx4_0 > > > port 1 fabnum 0 due to RDMA read error IBV_WC_RETRY_EXC_ERR index 3 > > > 2018-02-23_13:48:35.398+0100: [I] VERBS RDMA accepted and connected > > > to > > > 192.168.12.5 (iccn005-ib in gpfsremoteclients.localdomain) on > mlx4_0 > > > port 1 fabnum 0 sl 0 index 3 > > > 2018-02-23_13:48:53.135+0100: [E] VERBS RDMA rdma read error > > > IBV_WC_RETRY_EXC_ERR to 192.168.12.5 (iccn005-ib in > > > gpfsremoteclients.localdomain) on mlx4_0 port 1 fabnum 0 vendor_err > > > 129 > > > 2018-02-23_13:48:53.135+0100: [E] VERBS RDMA closed connection to > > > 192.168.12.5 (iccn005-ib in gpfsremoteclients.localdomain) on > mlx4_0 > > > port 1 fabnum 0 due to RDMA read error IBV_WC_RETRY_EXC_ERR index 3 > > > 2018-02-23_13:48:55.600+0100: [I] VERBS RDMA accepted and connected > > > to > > > 192.168.12.5 (iccn005-ib in gpfsremoteclients.localdomain) on > mlx4_0 > > > port 1 fabnum 0 sl 0 index 3 > > > 2018-02-23_13:49:11.577+0100: [E] VERBS RDMA rdma read error > > > IBV_WC_RETRY_EXC_ERR to 192.168.12.5 (iccn005-ib in > > > gpfsremoteclients.localdomain) on mlx4_0 port 1 fabnum 0 vendor_err > > > 129 > > > 2018-02-23_13:49:11.577+0100: [E] VERBS RDMA closed connection to > > > 192.168.12.5 (iccn005-ib in gpfsremoteclients.localdomain) on > mlx4_0 > > > port 1 fabnum 0 due to RDMA read error IBV_WC_RETRY_EXC_ERR index 3 > > > 2018-02-23_13:49:11.939+0100: [I] VERBS RDMA accepted and connected > > > to > > > 192.168.12.5 (iccn005-ib in gpfsremoteclients.localdomain) on > mlx4_0 > > > port 1 fabnum 0 sl 0 index 3 > > > > > > > > > > > > Any advice on how to configure the setup in a way that would allow > > > the remote 
mount via routed IB would be very appreciated. > > > > > > > > > Thank you and best regards > > > Jan Erik > > > > > > > > > > > > > > > _______________________________________________ > > > gpfsug-discuss mailing list > > > gpfsug-discuss at spectrumscale.org > > > > https://emea01.safelinks.protection.outlook.com/?url=http%3A%2F%2Fgp > > > > fsug.org > %2Fmailman%2Flistinfo%2Fgpfsug-discuss&data=01%7C01%7Cjohn.h > > > earns%40asml.com > %7Ce40045550fc3467dd62808d57ed4d0d9%7Caf73baa8f5944e > > > > b2a39d93e96cad61fc%7C1&sdata=v%2F35G6ZnlHFBm%2BfVddvcuraFd9FRChyOSRE > > > YpqcNNP8%3D&reserved=0 > > > > > > > -- > > Aaron Knister > > NASA Center for Climate Simulation (Code 606.2) Goddard Space Flight > > Center > > (301) 286-2776 > > _______________________________________________ > > gpfsug-discuss mailing list > > gpfsug-discuss at spectrumscale.org > > > https://emea01.safelinks.protection.outlook.com/?url=http%3A%2F%2Fgpfs > > > ug.org > %2Fmailman%2Flistinfo%2Fgpfsug-discuss&data=01%7C01%7Cjohn.hearn > > s%40asml.com > %7Ce40045550fc3467dd62808d57ed4d0d9%7Caf73baa8f5944eb2a39d > > > 93e96cad61fc%7C1&sdata=v%2F35G6ZnlHFBm%2BfVddvcuraFd9FRChyOSREYpqcNNP8 > > %3D&reserved=0 > > > > -- > I've never been lost; I was once bewildered for three days, but > never lost! > ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? --? Daniel Boone > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > https://emea01.safelinks.protection.outlook.com/?url=http%3A%2F%2Fgpfsug.org%2Fmailman%2Flistinfo%2Fgpfsug-discuss&data=01%7C01%7Cjohn.hearns%40asml.com%7Ce40045550fc3467dd62808d57ed4d0d9%7Caf73baa8f5944eb2a39d93e96cad61fc%7C1&sdata=v%2F35G6ZnlHFBm%2BfVddvcuraFd9FRChyOSREYpqcNNP8%3D&reserved=0 > > -- The information contained in this communication and any > attachments is confidential and may be privileged, and is for the > sole use of the intended recipient(s). Any unauthorized review, use, > disclosure or distribution is prohibited. Unless explicitly stated > otherwise in the body of this communication or the attachment > thereto (if any), the information is provided on an AS-IS basis > without any express or implied warranties or liabilities. To the > extent you are relying on this information, you are doing so at your > own risk. If you are not the intended recipient, please notify the > sender immediately by replying to this message and destroy all > copies of this message and any attachments. Neither the sender nor > the company/group of companies he or she represents shall be liable > for the proper and complete transmission of the information > contained in this communication, or for any delay in its receipt. > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > > > > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > -- Karlsruhe Institute of Technology (KIT) Steinbuch Centre for Computing (SCC) Jan Erik Sundermann Hermann-von-Helmholtz-Platz 1, Building 449, Room 226 D-76344 Eggenstein-Leopoldshafen Tel: +49 721 608 26191 Email: jan.sundermann at kit.edu www.scc.kit.edu KIT ? The Research University in the Helmholtz Association Since 2010, KIT has been certified as a family-friendly university. -------------- next part -------------- A non-text attachment was scrubbed... 
Name: smime.p7s Type: application/pkcs7-signature Size: 5382 bytes Desc: S/MIME Cryptographic Signature URL: From alex at calicolabs.com Tue Mar 13 17:48:21 2018 From: alex at calicolabs.com (Alex Chekholko) Date: Tue, 13 Mar 2018 10:48:21 -0700 Subject: [gpfsug-discuss] Preferred NSD In-Reply-To: <20180313141630.k2lqvcndz7mrakgs@ics.muni.cz> References: <20180312145105.yb24bzo6chlpthwm@ics.muni.cz> <20180313141630.k2lqvcndz7mrakgs@ics.muni.cz> Message-ID: Hi Lukas, I would like to discourage you from building a large distributed clustered filesystem made of many unreliable components. You will need to overprovision your interconnect and will also spend a lot of time in "healing" or "degraded" state. It is typically cheaper to centralize the storage into a subset of nodes and configure those to be more highly available. E.g. of your 60 nodes, take 8 and put all the storage into those and make that a dedicated GPFS cluster with no compute jobs on those nodes. Again, you'll still need really beefy and reliable interconnect to make this work. Stepping back; what is the actual problem you're trying to solve? I have certainly been in that situation before, where the problem is more like: "I have a fixed hardware configuration that I can't change, and I want to try to shoehorn a parallel filesystem onto that." I would recommend looking closer at your actual workloads. If this is a "scratch" filesystem and file access is mostly from one node at a time, it's not very useful to make two additional copies of that data on other nodes, and it will only slow you down. Regards, Alex On Tue, Mar 13, 2018 at 7:16 AM, Lukas Hejtmanek wrote: > On Tue, Mar 13, 2018 at 10:37:43AM +0000, John Hearns wrote: > > Lukas, > > It looks like you are proposing a setup which uses your compute servers > as storage servers also? > > yes, exactly. I would like to utilise NVMe SSDs that are in every compute > servers.. Using them as a shared scratch area with GPFS is one of the > options. > > > > > * I'm thinking about the following setup: > > ~ 60 nodes, each with two enterprise NVMe SSDs, FDR IB interconnected > > > > There is nothing wrong with this concept, for instance see > > https://www.beegfs.io/wiki/BeeOND > > > > I have an NVMe filesystem which uses 60 drives, but there are 10 servers. > > You should look at "failure zones" also. > > you still need the storage servers and local SSDs to use only for caching, > do > I understand correctly? > > > > > From: gpfsug-discuss-bounces at spectrumscale.org [mailto:gpfsug-discuss- > bounces at spectrumscale.org] On Behalf Of Knister, Aaron S. > (GSFC-606.2)[COMPUTER SCIENCE CORP] > > Sent: Monday, March 12, 2018 4:14 PM > > To: gpfsug main discussion list > > Subject: Re: [gpfsug-discuss] Preferred NSD > > > > Hi Lukas, > > > > Check out FPO mode. That mimics Hadoop's data placement features. You > can have up to 3 replicas both data and metadata but still the downside, > though, as you say is the wrong node failures will take your cluster down. > > > > You might want to check out something like Excelero's NVMesh (note: not > an endorsement since I can't give such things) which can create logical > volumes across all your NVMe drives. The product has erasure coding on > their roadmap. I'm not sure if they've released that feature yet but in > theory it will give better fault tolerance *and* you'll get more efficient > usage of your SSDs. > > > > I'm sure there are other ways to skin this cat too. 
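(A side note for anyone who wants to try the FPO route described above: write affinity is switched on per storage pool in the NSD stanza file. The sketch below uses invented pool, device, node and file system names and should be checked against the FPO documentation before use. With writeAffinityDepth=1 the first replica of each block lands on the NSD local to the writing node, which is essentially the "preferred local NSD" behaviour being asked about.)

    # hypothetical stanza file fpo_disks.stanza
    %pool:
      pool=fpodata
      blockSize=2M
      layoutMap=cluster
      allowWriteAffinity=yes
      writeAffinityDepth=1
      blockGroupFactor=128

    # one NSD per local NVMe device; the failure group is a rack,position,node vector
    %nsd:
      device=/dev/nvme0n1
      nsd=node01_nvme0
      servers=node01
      usage=dataOnly
      failureGroup=1,0,1
      pool=fpodata

    # create the NSDs and add them to the (hypothetical) file system
    mmcrnsd -F fpo_disks.stanza
    mmadddisk gpfs_scratch -F fpo_disks.stanza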
> > > > -Aaron > > > > > > > > On March 12, 2018 at 10:59:35 EDT, Lukas Hejtmanek > wrote: > > Hello, > > > > I'm thinking about the following setup: > > ~ 60 nodes, each with two enterprise NVMe SSDs, FDR IB interconnected > > > > I would like to setup shared scratch area using GPFS and those NVMe > SSDs. Each > > SSDs as on NSD. > > > > I don't think like 5 or more data/metadata replicas are practical here. > On the > > other hand, multiple node failures is something really expected. > > > > Is there a way to instrument that local NSD is strongly preferred to > store > > data? I.e. node failure most probably does not result in unavailable > data for > > the other nodes? > > > > Or is there any other recommendation/solution to build shared scratch > with > > GPFS in such setup? (Do not do it including.) > > > > -- > > Luk?? Hejtm?nek > > _______________________________________________ > > gpfsug-discuss mailing list > > gpfsug-discuss at spectrumscale.org > > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > -- The information contained in this communication and any attachments > is confidential and may be privileged, and is for the sole use of the > intended recipient(s). Any unauthorized review, use, disclosure or > distribution is prohibited. Unless explicitly stated otherwise in the body > of this communication or the attachment thereto (if any), the information > is provided on an AS-IS basis without any express or implied warranties or > liabilities. To the extent you are relying on this information, you are > doing so at your own risk. If you are not the intended recipient, please > notify the sender immediately by replying to this message and destroy all > copies of this message and any attachments. Neither the sender nor the > company/group of companies he or she represents shall be liable for the > proper and complete transmission of the information contained in this > communication, or for any delay in its receipt. > > > _______________________________________________ > > gpfsug-discuss mailing list > > gpfsug-discuss at spectrumscale.org > > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > > -- > Luk?? Hejtm?nek > > Linux Administrator only because > Full Time Multitasking Ninja > is not an official job title > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > -------------- next part -------------- An HTML attachment was scrubbed... URL: From zmance at ucar.edu Tue Mar 13 19:38:48 2018 From: zmance at ucar.edu (Zachary Mance) Date: Tue, 13 Mar 2018 13:38:48 -0600 Subject: [gpfsug-discuss] Problems with remote mount via routed IB In-Reply-To: References: <471B111F-5DAA-4912-829C-9AA75DCB76FA@kit.edu> Message-ID: Hi Jan, I am NOT using the pre-populated cache that mellanox refers to in it's documentation. After chatting with support, I don't believe that's necessary anymore (I didn't get a straight answer out of them). For the subnet prefix, make sure to use one from the range 0xfec0000000000000-0xfec000000000001f. 
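(To make that concrete: the subnet prefix is set in the subnet manager of each fabric. Assuming OpenSM is used, the relevant opensm.conf entries would look roughly like the lines below, with a distinct prefix from that range on each IB subnet. Treat this as a sketch rather than a verified configuration.)

    # opensm.conf on the subnet manager of the storage-side fabric (prefix chosen for illustration)
    subnet_prefix 0xfec0000000000000

    # opensm.conf on the subnet manager of the client-side fabric
    subnet_prefix 0xfec0000000000001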
--------------------------------------------------------------------------------------------------------------- Zach Mance zmance at ucar.edu (303) 497-1883 HPC Data Infrastructure Group / CISL / NCAR --------------------------------------------------------------------------------------------------------------- On Tue, Mar 13, 2018 at 9:24 AM, Jan Erik Sundermann wrote: > Hello Zachary > > We are currently changing out setup to have IP over IB on all machines to > be able to enable verbsRdmaCm. > > According to Mellanox (https://community.mellanox.com/docs/DOC-2384) > ibacm requires pre-populated caches to be distributed to all end hosts with > the mapping of IP to the routable GIDs (of both IB subnets). Was this also > required in your successful deployment? > > Best > Jan Erik > > > > On 03/12/2018 11:10 PM, Zachary Mance wrote: > >> Since I am testing out remote mounting with EDR IB routers, I'll add to >> the discussion. >> >> In my lab environment I was seeing the same rdma connections being >> established and then disconnected shortly after. The remote filesystem >> would eventually mount on the clients, but it look a quite a while >> (~2mins). Even after mounting, accessing files or any metadata operations >> would take a while to execute, but eventually it happened. >> >> After enabling verbsRdmaCm, everything mounted just fine and in a timely >> manner. Spectrum Scale was using the librdmacm.so library. >> >> I would first double check that you have both clusters able to talk to >> each other on their IPoIB address, then make sure you enable verbsRdmaCm on >> both clusters. >> >> >> ------------------------------------------------------------ >> --------------------------------------------------- >> Zach Mance zmance at ucar.edu (303) 497-1883 >> HPC Data Infrastructure Group / CISL / NCAR >> ------------------------------------------------------------ >> --------------------------------------------------- >> >> On Thu, Mar 1, 2018 at 1:41 AM, John Hearns > > wrote: >> >> In reply to Stuart, >> our setup is entirely Infiniband. We boot and install over IB, and >> rely heavily on IP over Infiniband. >> >> As for users being 'confused' due to multiple IPs, I would >> appreciate some more depth on that one. >> Sure, all batch systems are sensitive to hostnames (as I know to my >> cost!) but once you get that straightened out why should users care? >> I am not being aggressive, just keen to find out more. >> >> >> >> -----Original Message----- >> From: gpfsug-discuss-bounces at spectrumscale.org >> >> [mailto:gpfsug-discuss-bounces at spectrumscale.org >> ] On Behalf Of >> Stuart Barkley >> Sent: Wednesday, February 28, 2018 6:50 PM >> To: gpfsug main discussion list > > >> Subject: Re: [gpfsug-discuss] Problems with remote mount via routed IB >> >> The problem with CM is that it seems to require configuring IP over >> Infiniband. >> >> I'm rather strongly opposed to IP over IB. We did run IPoIB years >> ago, but pulled it out of our environment as adding unneeded >> complexity. It requires provisioning IP addresses across the >> Infiniband infrastructure and possibly adding routers to other >> portions of the IP infrastructure. It was also confusing some users >> due to multiple IPs on the compute infrastructure. >> >> We have recently been in discussions with a vendor about their >> support for GPFS over IB and they kept directing us to using CM >> (which still didn't work). 
CM wasn't necessary once we found out >> about the actual problem (we needed the undocumented >> verbsRdmaUseGidIndexZero configuration option among other things due >> to their use of SR-IOV based virtual IB interfaces). >> >> We don't use routed Infiniband and it might be that CM and IPoIB is >> required for IB routing, but I doubt it. It sounds like the OP is >> keeping IB and IP infrastructure separate. >> >> Stuart Barkley >> >> On Mon, 26 Feb 2018 at 14:16 -0000, Aaron Knister wrote: >> >> > Date: Mon, 26 Feb 2018 14:16:34 >> > From: Aaron Knister > > >> > Reply-To: gpfsug main discussion list >> > > > >> > To: gpfsug-discuss at spectrumscale.org >> >> > Subject: Re: [gpfsug-discuss] Problems with remote mount via >> routed IB >> > >> > Hi Jan Erik, >> > >> > It was my understanding that the IB hardware router required RDMA >> CM to work. >> > By default GPFS doesn't use the RDMA Connection Manager but it can >> be >> > enabled (e.g. verbsRdmaCm=enable). I think this requires a restart >> on >> > clients/servers (in both clusters) to take effect. Maybe someone >> else >> > on the list can comment in more detail-- I've been told folks have >> > successfully deployed IB routers with GPFS. >> > >> > -Aaron >> > >> > On 2/26/18 11:38 AM, Sundermann, Jan Erik (SCC) wrote: >> > > >> > > Dear all >> > > >> > > we are currently trying to remote mount a file system in a routed >> > > Infiniband test setup and face problems with dropped RDMA >> > > connections. The setup is the >> > > following: >> > > >> > > - Spectrum Scale Cluster 1 is setup on four servers which are >> > > connected to the same infiniband network. Additionally they are >> > > connected to a fast ethernet providing ip communication in the >> network 192.168.11.0/24 . >> > > >> > > - Spectrum Scale Cluster 2 is setup on four additional servers >> which >> > > are connected to a second infiniband network. These servers >> have IPs >> > > on their IB interfaces in the network 192.168.12.0/24 >> . >> > > >> > > - IP is routed between 192.168.11.0/24 >> and 192.168.12.0/24 on a >> >> > > dedicated machine. >> > > >> > > - We have a dedicated IB hardware router connected to both IB >> subnets. >> > > >> > > >> > > We tested that the routing, both IP and IB, is working between >> the >> > > two clusters without problems and that RDMA is working fine >> both for >> > > internal communication inside cluster 1 and cluster 2 >> > > >> > > When trying to remote mount a file system from cluster 1 in >> cluster >> > > 2, RDMA communication is not working as expected. 
Instead we see >> > > error messages on the remote host (cluster 2) >> > > >> > > >> > > 2018-02-23_13:48:47.037+0100: [I] VERBS RDMA connecting to >> > > 192.168.11.4 (iccn004-gpfs in gpfsstorage.localdomain) on mlx4_0 >> > > port 1 fabnum 0 index 2 >> > > 2018-02-23_13:48:49.890+0100: [I] VERBS RDMA connected to >> > > 192.168.11.4 (iccn004-gpfs in gpfsstorage.localdomain) on mlx4_0 >> > > port 1 fabnum 0 sl 0 index 2 >> > > 2018-02-23_13:48:53.138+0100: [E] VERBS RDMA closed connection to >> > > 192.168.11.1 (iccn001-gpfs in gpfsstorage.localdomain) on mlx4_0 >> > > port 1 fabnum 0 error 733 index 3 >> > > 2018-02-23_13:48:53.854+0100: [I] VERBS RDMA connecting to >> > > 192.168.11.1 (iccn001-gpfs in gpfsstorage.localdomain) on mlx4_0 >> > > port 1 fabnum 0 index 3 >> > > 2018-02-23_13:48:54.954+0100: [E] VERBS RDMA closed connection to >> > > 192.168.11.3 (iccn003-gpfs in gpfsstorage.localdomain) on mlx4_0 >> > > port 1 fabnum 0 error 733 index 1 >> > > 2018-02-23_13:48:55.601+0100: [I] VERBS RDMA connected to >> > > 192.168.11.1 (iccn001-gpfs in gpfsstorage.localdomain) on mlx4_0 >> > > port 1 fabnum 0 sl 0 index 3 >> > > 2018-02-23_13:48:57.775+0100: [I] VERBS RDMA connecting to >> > > 192.168.11.3 (iccn003-gpfs in gpfsstorage.localdomain) on mlx4_0 >> > > port 1 fabnum 0 index 1 >> > > 2018-02-23_13:48:59.557+0100: [I] VERBS RDMA connected to >> > > 192.168.11.3 (iccn003-gpfs in gpfsstorage.localdomain) on mlx4_0 >> > > port 1 fabnum 0 sl 0 index 1 >> > > 2018-02-23_13:48:59.876+0100: [E] VERBS RDMA closed connection to >> > > 192.168.11.2 (iccn002-gpfs in gpfsstorage.localdomain) on mlx4_0 >> > > port 1 fabnum 0 error 733 index 0 >> > > 2018-02-23_13:49:02.020+0100: [I] VERBS RDMA connecting to >> > > 192.168.11.2 (iccn002-gpfs in gpfsstorage.localdomain) on mlx4_0 >> > > port 1 fabnum 0 index 0 >> > > 2018-02-23_13:49:03.477+0100: [I] VERBS RDMA connected to >> > > 192.168.11.2 (iccn002-gpfs in gpfsstorage.localdomain) on mlx4_0 >> > > port 1 fabnum 0 sl 0 index 0 >> > > 2018-02-23_13:49:05.119+0100: [E] VERBS RDMA closed connection to >> > > 192.168.11.4 (iccn004-gpfs in gpfsstorage.localdomain) on mlx4_0 >> > > port 1 fabnum 0 error 733 index 2 >> > > 2018-02-23_13:49:06.191+0100: [I] VERBS RDMA connecting to >> > > 192.168.11.4 (iccn004-gpfs in gpfsstorage.localdomain) on mlx4_0 >> > > port 1 fabnum 0 index 2 >> > > 2018-02-23_13:49:06.548+0100: [I] VERBS RDMA connected to >> > > 192.168.11.4 (iccn004-gpfs in gpfsstorage.localdomain) on mlx4_0 >> > > port 1 fabnum 0 sl 0 index 2 >> > > 2018-02-23_13:49:11.578+0100: [E] VERBS RDMA closed connection to >> > > 192.168.11.1 (iccn001-gpfs in gpfsstorage.localdomain) on mlx4_0 >> > > port 1 fabnum 0 error 733 index 3 >> > > 2018-02-23_13:49:11.937+0100: [I] VERBS RDMA connecting to >> > > 192.168.11.1 (iccn001-gpfs in gpfsstorage.localdomain) on mlx4_0 >> > > port 1 fabnum 0 index 3 >> > > 2018-02-23_13:49:11.939+0100: [I] VERBS RDMA connected to >> > > 192.168.11.1 (iccn001-gpfs in gpfsstorage.localdomain) on mlx4_0 >> > > port 1 fabnum 0 sl 0 index 3 >> > > >> > > >> > > and in the cluster with the file system (cluster 1) >> > > >> > > 2018-02-23_13:47:36.112+0100: [E] VERBS RDMA rdma read error >> > > IBV_WC_RETRY_EXC_ERR to 192.168.12.5 (iccn005-ib in >> > > gpfsremoteclients.localdomain) on mlx4_0 port 1 fabnum 0 >> vendor_err >> > > 129 >> > > 2018-02-23_13:47:36.112+0100: [E] VERBS RDMA closed connection to >> > > 192.168.12.5 (iccn005-ib in gpfsremoteclients.localdomain) on >> mlx4_0 >> > > port 1 fabnum 0 due to RDMA read 
error IBV_WC_RETRY_EXC_ERR >> index 3 >> > > 2018-02-23_13:47:47.161+0100: [I] VERBS RDMA accepted and >> connected >> > > to >> > > 192.168.12.5 (iccn005-ib in gpfsremoteclients.localdomain) on >> mlx4_0 >> > > port 1 fabnum 0 sl 0 index 3 >> > > 2018-02-23_13:48:04.317+0100: [E] VERBS RDMA rdma read error >> > > IBV_WC_RETRY_EXC_ERR to 192.168.12.5 (iccn005-ib in >> > > gpfsremoteclients.localdomain) on mlx4_0 port 1 fabnum 0 >> vendor_err >> > > 129 >> > > 2018-02-23_13:48:04.317+0100: [E] VERBS RDMA closed connection to >> > > 192.168.12.5 (iccn005-ib in gpfsremoteclients.localdomain) on >> mlx4_0 >> > > port 1 fabnum 0 due to RDMA read error IBV_WC_RETRY_EXC_ERR >> index 3 >> > > 2018-02-23_13:48:11.560+0100: [I] VERBS RDMA accepted and >> connected >> > > to >> > > 192.168.12.5 (iccn005-ib in gpfsremoteclients.localdomain) on >> mlx4_0 >> > > port 1 fabnum 0 sl 0 index 3 >> > > 2018-02-23_13:48:32.523+0100: [E] VERBS RDMA rdma read error >> > > IBV_WC_RETRY_EXC_ERR to 192.168.12.5 (iccn005-ib in >> > > gpfsremoteclients.localdomain) on mlx4_0 port 1 fabnum 0 >> vendor_err >> > > 129 >> > > 2018-02-23_13:48:32.523+0100: [E] VERBS RDMA closed connection to >> > > 192.168.12.5 (iccn005-ib in gpfsremoteclients.localdomain) on >> mlx4_0 >> > > port 1 fabnum 0 due to RDMA read error IBV_WC_RETRY_EXC_ERR >> index 3 >> > > 2018-02-23_13:48:35.398+0100: [I] VERBS RDMA accepted and >> connected >> > > to >> > > 192.168.12.5 (iccn005-ib in gpfsremoteclients.localdomain) on >> mlx4_0 >> > > port 1 fabnum 0 sl 0 index 3 >> > > 2018-02-23_13:48:53.135+0100: [E] VERBS RDMA rdma read error >> > > IBV_WC_RETRY_EXC_ERR to 192.168.12.5 (iccn005-ib in >> > > gpfsremoteclients.localdomain) on mlx4_0 port 1 fabnum 0 >> vendor_err >> > > 129 >> > > 2018-02-23_13:48:53.135+0100: [E] VERBS RDMA closed connection to >> > > 192.168.12.5 (iccn005-ib in gpfsremoteclients.localdomain) on >> mlx4_0 >> > > port 1 fabnum 0 due to RDMA read error IBV_WC_RETRY_EXC_ERR >> index 3 >> > > 2018-02-23_13:48:55.600+0100: [I] VERBS RDMA accepted and >> connected >> > > to >> > > 192.168.12.5 (iccn005-ib in gpfsremoteclients.localdomain) on >> mlx4_0 >> > > port 1 fabnum 0 sl 0 index 3 >> > > 2018-02-23_13:49:11.577+0100: [E] VERBS RDMA rdma read error >> > > IBV_WC_RETRY_EXC_ERR to 192.168.12.5 (iccn005-ib in >> > > gpfsremoteclients.localdomain) on mlx4_0 port 1 fabnum 0 >> vendor_err >> > > 129 >> > > 2018-02-23_13:49:11.577+0100: [E] VERBS RDMA closed connection to >> > > 192.168.12.5 (iccn005-ib in gpfsremoteclients.localdomain) on >> mlx4_0 >> > > port 1 fabnum 0 due to RDMA read error IBV_WC_RETRY_EXC_ERR >> index 3 >> > > 2018-02-23_13:49:11.939+0100: [I] VERBS RDMA accepted and >> connected >> > > to >> > > 192.168.12.5 (iccn005-ib in gpfsremoteclients.localdomain) on >> mlx4_0 >> > > port 1 fabnum 0 sl 0 index 3 >> > > >> > > >> > > >> > > Any advice on how to configure the setup in a way that would >> allow >> > > the remote mount via routed IB would be very appreciated. 
>> > > >> > > >> > > Thank you and best regards >> > > Jan Erik >> > > >> > > >> > > >> > > >> > > _______________________________________________ >> > > gpfsug-discuss mailing list >> > > gpfsug-discuss at spectrumscale.org >> > > >> https://emea01.safelinks.protection.outlook.com/?url=http%3A%2F%2Fgp >> > > >> > > fsug.org >> %2Fmailman%2Flistinfo%2Fgpfsug-discuss&data >> =01%7C01%7Cjohn.h >> > > earns%40asml.com >> %7Ce40045550fc3467dd62808d57ed4d0d9% >> 7Caf73baa8f5944e >> > > >> b2a39d93e96cad61fc%7C1&sdata=v%2F35G6ZnlHFBm%2BfVddvcuraFd9FRChyOSRE >> > > YpqcNNP8%3D&reserved=0 >> > > >> > >> > -- >> > Aaron Knister >> > NASA Center for Climate Simulation (Code 606.2) Goddard Space Flight >> > Center >> > (301) 286-2776 >> > _______________________________________________ >> > gpfsug-discuss mailing list >> > gpfsug-discuss at spectrumscale.org >> > >> https://emea01.safelinks.protection.outlook.com/?url=http% >> 3A%2F%2Fgpfs >> > 3A%2F%2Fgpfs> >> > ug.org >> %2Fmailman%2Flistinfo%2Fgpfsug-discuss&data= >> 01%7C01%7Cjohn.hearn >> > s%40asml.com >> %7Ce40045550fc3467dd62808d57ed4d0d9% >> 7Caf73baa8f5944eb2a39d >> > >> 93e96cad61fc%7C1&sdata=v%2F35G6ZnlHFBm%2BfVddvcuraFd9FRChyOS >> REYpqcNNP8 >> > %3D&reserved=0 >> > >> >> -- >> I've never been lost; I was once bewildered for three days, but >> never lost! >> -- Daniel Boone >> _______________________________________________ >> gpfsug-discuss mailing list >> gpfsug-discuss at spectrumscale.org >> https://emea01.safelinks.protection.outlook.com/?url=http% >> 3A%2F%2Fgpfsug.org%2Fmailman%2Flistinfo%2Fgpfsug-discuss& >> data=01%7C01%7Cjohn.hearns%40asml.com%7Ce40045550fc3467dd >> 62808d57ed4d0d9%7Caf73baa8f5944eb2a39d93e96cad61fc%7C1& >> sdata=v%2F35G6ZnlHFBm%2BfVddvcuraFd9FRChyOSREYpqcNNP8%3D&reserved=0 >> > 3A%2F%2Fgpfsug.org%2Fmailman%2Flistinfo%2Fgpfsug-discuss& >> data=01%7C01%7Cjohn.hearns%40asml.com%7Ce40045550fc3467dd >> 62808d57ed4d0d9%7Caf73baa8f5944eb2a39d93e96cad61fc%7C1& >> sdata=v%2F35G6ZnlHFBm%2BfVddvcuraFd9FRChyOSREYpqcNNP8%3D&reserved=0> >> -- The information contained in this communication and any >> attachments is confidential and may be privileged, and is for the >> sole use of the intended recipient(s). Any unauthorized review, use, >> disclosure or distribution is prohibited. Unless explicitly stated >> otherwise in the body of this communication or the attachment >> thereto (if any), the information is provided on an AS-IS basis >> without any express or implied warranties or liabilities. To the >> extent you are relying on this information, you are doing so at your >> own risk. If you are not the intended recipient, please notify the >> sender immediately by replying to this message and destroy all >> copies of this message and any attachments. Neither the sender nor >> the company/group of companies he or she represents shall be liable >> for the proper and complete transmission of the information >> contained in this communication, or for any delay in its receipt. 
>> _______________________________________________ >> gpfsug-discuss mailing list >> gpfsug-discuss at spectrumscale.org >> http://gpfsug.org/mailman/listinfo/gpfsug-discuss >> >> >> >> >> >> _______________________________________________ >> gpfsug-discuss mailing list >> gpfsug-discuss at spectrumscale.org >> http://gpfsug.org/mailman/listinfo/gpfsug-discuss >> >> > -- > > Karlsruhe Institute of Technology (KIT) > Steinbuch Centre for Computing (SCC) > > Jan Erik Sundermann > > Hermann-von-Helmholtz-Platz 1, Building 449, Room 226 > D-76344 Eggenstein-Leopoldshafen > > Tel: +49 721 608 26191 > Email: jan.sundermann at kit.edu > www.scc.kit.edu > > KIT ? The Research University in the Helmholtz Association > > Since 2010, KIT has been certified as a family-friendly university. > > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From xhejtman at ics.muni.cz Wed Mar 14 09:28:15 2018 From: xhejtman at ics.muni.cz (Lukas Hejtmanek) Date: Wed, 14 Mar 2018 10:28:15 +0100 Subject: [gpfsug-discuss] Preferred NSD In-Reply-To: References: <20180312145105.yb24bzo6chlpthwm@ics.muni.cz> <20180313141630.k2lqvcndz7mrakgs@ics.muni.cz> Message-ID: <20180314092815.k7thtymu33wg65xt@ics.muni.cz> Hello, thank you for insight. Well, the point is, that I will get ~60 with 120 NVMe disks in it, each about 2TB size. It means that I will have 240TB in NVMe SSD that could build nice shared scratch. Moreover, I have no different HW or place to put these SSDs into. They have to be in the compute nodes. On Tue, Mar 13, 2018 at 10:48:21AM -0700, Alex Chekholko wrote: > I would like to discourage you from building a large distributed clustered > filesystem made of many unreliable components. You will need to > overprovision your interconnect and will also spend a lot of time in > "healing" or "degraded" state. > > It is typically cheaper to centralize the storage into a subset of nodes > and configure those to be more highly available. E.g. of your 60 nodes, > take 8 and put all the storage into those and make that a dedicated GPFS > cluster with no compute jobs on those nodes. Again, you'll still need > really beefy and reliable interconnect to make this work. > > Stepping back; what is the actual problem you're trying to solve? I have > certainly been in that situation before, where the problem is more like: "I > have a fixed hardware configuration that I can't change, and I want to try > to shoehorn a parallel filesystem onto that." > > I would recommend looking closer at your actual workloads. If this is a > "scratch" filesystem and file access is mostly from one node at a time, > it's not very useful to make two additional copies of that data on other > nodes, and it will only slow you down. > > Regards, > Alex > > On Tue, Mar 13, 2018 at 7:16 AM, Lukas Hejtmanek > wrote: > > > On Tue, Mar 13, 2018 at 10:37:43AM +0000, John Hearns wrote: > > > Lukas, > > > It looks like you are proposing a setup which uses your compute servers > > as storage servers also? > > > > yes, exactly. I would like to utilise NVMe SSDs that are in every compute > > servers.. Using them as a shared scratch area with GPFS is one of the > > options. 
> > > > > > > > * I'm thinking about the following setup: > > > ~ 60 nodes, each with two enterprise NVMe SSDs, FDR IB interconnected > > > > > > There is nothing wrong with this concept, for instance see > > > https://www.beegfs.io/wiki/BeeOND > > > > > > I have an NVMe filesystem which uses 60 drives, but there are 10 servers. > > > You should look at "failure zones" also. > > > > you still need the storage servers and local SSDs to use only for caching, > > do > > I understand correctly? > > > > > > > > From: gpfsug-discuss-bounces at spectrumscale.org [mailto:gpfsug-discuss- > > bounces at spectrumscale.org] On Behalf Of Knister, Aaron S. > > (GSFC-606.2)[COMPUTER SCIENCE CORP] > > > Sent: Monday, March 12, 2018 4:14 PM > > > To: gpfsug main discussion list > > > Subject: Re: [gpfsug-discuss] Preferred NSD > > > > > > Hi Lukas, > > > > > > Check out FPO mode. That mimics Hadoop's data placement features. You > > can have up to 3 replicas both data and metadata but still the downside, > > though, as you say is the wrong node failures will take your cluster down. > > > > > > You might want to check out something like Excelero's NVMesh (note: not > > an endorsement since I can't give such things) which can create logical > > volumes across all your NVMe drives. The product has erasure coding on > > their roadmap. I'm not sure if they've released that feature yet but in > > theory it will give better fault tolerance *and* you'll get more efficient > > usage of your SSDs. > > > > > > I'm sure there are other ways to skin this cat too. > > > > > > -Aaron > > > > > > > > > > > > On March 12, 2018 at 10:59:35 EDT, Lukas Hejtmanek > > wrote: > > > Hello, > > > > > > I'm thinking about the following setup: > > > ~ 60 nodes, each with two enterprise NVMe SSDs, FDR IB interconnected > > > > > > I would like to setup shared scratch area using GPFS and those NVMe > > SSDs. Each > > > SSDs as on NSD. > > > > > > I don't think like 5 or more data/metadata replicas are practical here. > > On the > > > other hand, multiple node failures is something really expected. > > > > > > Is there a way to instrument that local NSD is strongly preferred to > > store > > > data? I.e. node failure most probably does not result in unavailable > > data for > > > the other nodes? > > > > > > Or is there any other recommendation/solution to build shared scratch > > with > > > GPFS in such setup? (Do not do it including.) > > > > > > -- > > > Luk?? Hejtm?nek > > > _______________________________________________ > > > gpfsug-discuss mailing list > > > gpfsug-discuss at spectrumscale.org > > > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > > -- The information contained in this communication and any attachments > > is confidential and may be privileged, and is for the sole use of the > > intended recipient(s). Any unauthorized review, use, disclosure or > > distribution is prohibited. Unless explicitly stated otherwise in the body > > of this communication or the attachment thereto (if any), the information > > is provided on an AS-IS basis without any express or implied warranties or > > liabilities. To the extent you are relying on this information, you are > > doing so at your own risk. If you are not the intended recipient, please > > notify the sender immediately by replying to this message and destroy all > > copies of this message and any attachments. 
Neither the sender nor the > > company/group of companies he or she represents shall be liable for the > > proper and complete transmission of the information contained in this > > communication, or for any delay in its receipt. > > > > > _______________________________________________ > > > gpfsug-discuss mailing list > > > gpfsug-discuss at spectrumscale.org > > > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > > > > > -- > > Luk?? Hejtm?nek > > > > Linux Administrator only because > > Full Time Multitasking Ninja > > is not an official job title > > _______________________________________________ > > gpfsug-discuss mailing list > > gpfsug-discuss at spectrumscale.org > > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss -- Luk?? Hejtm?nek Linux Administrator only because Full Time Multitasking Ninja is not an official job title From luis.bolinches at fi.ibm.com Wed Mar 14 10:11:31 2018 From: luis.bolinches at fi.ibm.com (Luis Bolinches) Date: Wed, 14 Mar 2018 10:11:31 +0000 Subject: [gpfsug-discuss] Preferred NSD In-Reply-To: <20180314092815.k7thtymu33wg65xt@ics.muni.cz> Message-ID: Hi For reads only have you look at possibility of using LROC? For writes in the setup you mention you are down to maximum of half your network speed (best case) assuming no restripes no reboots on going at any given time. -- Yst?v?llisin terveisin / Kind regards / Saludos cordiales / Salutations Luis Bolinches Consultant IT Specialist Mobile Phone: +358503112585 https://www.youracclaim.com/user/luis-bolinches "If you always give you will always have" -- Anonymous > On 14 Mar 2018, at 5.28, Lukas Hejtmanek wrote: > > Hello, > > thank you for insight. Well, the point is, that I will get ~60 with 120 NVMe > disks in it, each about 2TB size. It means that I will have 240TB in NVMe SSD > that could build nice shared scratch. Moreover, I have no different HW or place > to put these SSDs into. They have to be in the compute nodes. > >> On Tue, Mar 13, 2018 at 10:48:21AM -0700, Alex Chekholko wrote: >> I would like to discourage you from building a large distributed clustered >> filesystem made of many unreliable components. You will need to >> overprovision your interconnect and will also spend a lot of time in >> "healing" or "degraded" state. >> >> It is typically cheaper to centralize the storage into a subset of nodes >> and configure those to be more highly available. E.g. of your 60 nodes, >> take 8 and put all the storage into those and make that a dedicated GPFS >> cluster with no compute jobs on those nodes. Again, you'll still need >> really beefy and reliable interconnect to make this work. >> >> Stepping back; what is the actual problem you're trying to solve? I have >> certainly been in that situation before, where the problem is more like: "I >> have a fixed hardware configuration that I can't change, and I want to try >> to shoehorn a parallel filesystem onto that." >> >> I would recommend looking closer at your actual workloads. If this is a >> "scratch" filesystem and file access is mostly from one node at a time, >> it's not very useful to make two additional copies of that data on other >> nodes, and it will only slow you down. 
>> >> Regards, >> Alex >> >> On Tue, Mar 13, 2018 at 7:16 AM, Lukas Hejtmanek >> wrote: >> >>>> On Tue, Mar 13, 2018 at 10:37:43AM +0000, John Hearns wrote: >>>> Lukas, >>>> It looks like you are proposing a setup which uses your compute servers >>> as storage servers also? >>> >>> yes, exactly. I would like to utilise NVMe SSDs that are in every compute >>> servers.. Using them as a shared scratch area with GPFS is one of the >>> options. >>> >>>> >>>> * I'm thinking about the following setup: >>>> ~ 60 nodes, each with two enterprise NVMe SSDs, FDR IB interconnected >>>> >>>> There is nothing wrong with this concept, for instance see >>>> https://urldefense.proofpoint.com/v2/url?u=https-3A__www.beegfs.io_wiki_BeeOND&d=DwIFBA&c=jf_iaSHvJObTbx-siA1ZOg&r=1mZ896psa5caYzBeaugTlc7TtRejJp3uvKYxas3S7Xc&m=clDRNGIKhf6SYQ2ZZpZvBniUiqx1GU1bYEUbcbCunuo&s=ZUDwVonh6dmGRFw0n9p9QPC2-DFuVyY75gOuD02c07I&e= >>>> >>>> I have an NVMe filesystem which uses 60 drives, but there are 10 servers. >>>> You should look at "failure zones" also. >>> >>> you still need the storage servers and local SSDs to use only for caching, >>> do >>> I understand correctly? >>> >>>> >>>> From: gpfsug-discuss-bounces at spectrumscale.org [mailto:gpfsug-discuss- >>> bounces at spectrumscale.org] On Behalf Of Knister, Aaron S. >>> (GSFC-606.2)[COMPUTER SCIENCE CORP] >>>> Sent: Monday, March 12, 2018 4:14 PM >>>> To: gpfsug main discussion list >>>> Subject: Re: [gpfsug-discuss] Preferred NSD >>>> >>>> Hi Lukas, >>>> >>>> Check out FPO mode. That mimics Hadoop's data placement features. You >>> can have up to 3 replicas both data and metadata but still the downside, >>> though, as you say is the wrong node failures will take your cluster down. >>>> >>>> You might want to check out something like Excelero's NVMesh (note: not >>> an endorsement since I can't give such things) which can create logical >>> volumes across all your NVMe drives. The product has erasure coding on >>> their roadmap. I'm not sure if they've released that feature yet but in >>> theory it will give better fault tolerance *and* you'll get more efficient >>> usage of your SSDs. >>>> >>>> I'm sure there are other ways to skin this cat too. >>>> >>>> -Aaron >>>> >>>> >>>> >>>> On March 12, 2018 at 10:59:35 EDT, Lukas Hejtmanek >> > wrote: >>>> Hello, >>>> >>>> I'm thinking about the following setup: >>>> ~ 60 nodes, each with two enterprise NVMe SSDs, FDR IB interconnected >>>> >>>> I would like to setup shared scratch area using GPFS and those NVMe >>> SSDs. Each >>>> SSDs as on NSD. >>>> >>>> I don't think like 5 or more data/metadata replicas are practical here. >>> On the >>>> other hand, multiple node failures is something really expected. >>>> >>>> Is there a way to instrument that local NSD is strongly preferred to >>> store >>>> data? I.e. node failure most probably does not result in unavailable >>> data for >>>> the other nodes? >>>> >>>> Or is there any other recommendation/solution to build shared scratch >>> with >>>> GPFS in such setup? (Do not do it including.) >>>> >>>> -- >>>> Luk?? 
Hejtm?nek >>>> _______________________________________________ >>>> gpfsug-discuss mailing list >>>> gpfsug-discuss at spectrumscale.org >>>> https://urldefense.proofpoint.com/v2/url?u=http-3A__gpfsug.org_mailman_listinfo_gpfsug-2Ddiscuss&d=DwIFBA&c=jf_iaSHvJObTbx-siA1ZOg&r=1mZ896psa5caYzBeaugTlc7TtRejJp3uvKYxas3S7Xc&m=clDRNGIKhf6SYQ2ZZpZvBniUiqx1GU1bYEUbcbCunuo&s=ZLEoHFOFkjfuvNw57WqNn6-EVjHbASRHgnmRc2YYXpM&e= >>>> -- The information contained in this communication and any attachments >>> is confidential and may be privileged, and is for the sole use of the >>> intended recipient(s). Any unauthorized review, use, disclosure or >>> distribution is prohibited. Unless explicitly stated otherwise in the body >>> of this communication or the attachment thereto (if any), the information >>> is provided on an AS-IS basis without any express or implied warranties or >>> liabilities. To the extent you are relying on this information, you are >>> doing so at your own risk. If you are not the intended recipient, please >>> notify the sender immediately by replying to this message and destroy all >>> copies of this message and any attachments. Neither the sender nor the >>> company/group of companies he or she represents shall be liable for the >>> proper and complete transmission of the information contained in this >>> communication, or for any delay in its receipt. >>> >>>> _______________________________________________ >>>> gpfsug-discuss mailing list >>>> gpfsug-discuss at spectrumscale.org >>>> https://urldefense.proofpoint.com/v2/url?u=http-3A__gpfsug.org_mailman_listinfo_gpfsug-2Ddiscuss&d=DwIFBA&c=jf_iaSHvJObTbx-siA1ZOg&r=1mZ896psa5caYzBeaugTlc7TtRejJp3uvKYxas3S7Xc&m=clDRNGIKhf6SYQ2ZZpZvBniUiqx1GU1bYEUbcbCunuo&s=ZLEoHFOFkjfuvNw57WqNn6-EVjHbASRHgnmRc2YYXpM&e= >>> >>> >>> -- >>> Luk?? Hejtm?nek >>> >>> Linux Administrator only because >>> Full Time Multitasking Ninja >>> is not an official job title >>> _______________________________________________ >>> gpfsug-discuss mailing list >>> gpfsug-discuss at spectrumscale.org >>> https://urldefense.proofpoint.com/v2/url?u=http-3A__gpfsug.org_mailman_listinfo_gpfsug-2Ddiscuss&d=DwIFBA&c=jf_iaSHvJObTbx-siA1ZOg&r=1mZ896psa5caYzBeaugTlc7TtRejJp3uvKYxas3S7Xc&m=clDRNGIKhf6SYQ2ZZpZvBniUiqx1GU1bYEUbcbCunuo&s=ZLEoHFOFkjfuvNw57WqNn6-EVjHbASRHgnmRc2YYXpM&e= >>> > >> _______________________________________________ >> gpfsug-discuss mailing list >> gpfsug-discuss at spectrumscale.org >> https://urldefense.proofpoint. com/v2/url?u=http-3A__gpfsug.org_mailman_listinfo_gpfsug-2Ddiscuss&d=DwIFBA&c=jf_iaSHvJObTbx-siA1ZOg&r=1mZ896psa5caYzBeaugTlc7TtRejJp3uvKYxas3S7Xc&m=clDRNGIKhf6SYQ2ZZpZvBniUiqx1GU1bYEUbcbCunuo&s=ZLEoHFOFkjfuvNw57WqNn6-EVjHbASRHgnmRc2YYXpM&e= > > > -- > Luk?? Hejtm?nek > > Linux Administrator only because > Full Time Multitasking Ninja > is not an official job title > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > https://urldefense.proofpoint.com/v2/url?u=http-3A__gpfsug.org_mailman_listinfo_gpfsug-2Ddiscuss&d=DwIFBA&c=jf_iaSHvJObTbx-siA1ZOg&r=1mZ896psa5caYzBeaugTlc7TtRejJp3uvKYxas3S7Xc&m=clDRNGIKhf6SYQ2ZZpZvBniUiqx1GU1bYEUbcbCunuo&s=ZLEoHFOFkjfuvNw57WqNn6-EVjHbASRHgnmRc2YYXpM&e= > Ellei edell? ole toisin mainittu: / Unless stated otherwise above: Oy IBM Finland Ab PL 265, 00101 Helsinki, Finland Business ID, Y-tunnus: 0195876-3 Registered in Finland -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From S.J.Thompson at bham.ac.uk Wed Mar 14 10:24:39 2018 From: S.J.Thompson at bham.ac.uk (Simon Thompson (IT Research Support)) Date: Wed, 14 Mar 2018 10:24:39 +0000 Subject: [gpfsug-discuss] Preferred NSD In-Reply-To: <20180314092815.k7thtymu33wg65xt@ics.muni.cz> References: <20180312145105.yb24bzo6chlpthwm@ics.muni.cz> <20180313141630.k2lqvcndz7mrakgs@ics.muni.cz> <20180314092815.k7thtymu33wg65xt@ics.muni.cz> Message-ID: I would look at using LROC and possibly using HAWC ... Note you need to be a bit careful with HAWC client side and failure group placement. Simon ?On 14/03/2018, 09:28, "gpfsug-discuss-bounces at spectrumscale.org on behalf of xhejtman at ics.muni.cz" wrote: Hello, thank you for insight. Well, the point is, that I will get ~60 with 120 NVMe disks in it, each about 2TB size. It means that I will have 240TB in NVMe SSD that could build nice shared scratch. Moreover, I have no different HW or place to put these SSDs into. They have to be in the compute nodes. On Tue, Mar 13, 2018 at 10:48:21AM -0700, Alex Chekholko wrote: > I would like to discourage you from building a large distributed clustered > filesystem made of many unreliable components. You will need to > overprovision your interconnect and will also spend a lot of time in > "healing" or "degraded" state. > > It is typically cheaper to centralize the storage into a subset of nodes > and configure those to be more highly available. E.g. of your 60 nodes, > take 8 and put all the storage into those and make that a dedicated GPFS > cluster with no compute jobs on those nodes. Again, you'll still need > really beefy and reliable interconnect to make this work. > > Stepping back; what is the actual problem you're trying to solve? I have > certainly been in that situation before, where the problem is more like: "I > have a fixed hardware configuration that I can't change, and I want to try > to shoehorn a parallel filesystem onto that." > > I would recommend looking closer at your actual workloads. If this is a > "scratch" filesystem and file access is mostly from one node at a time, > it's not very useful to make two additional copies of that data on other > nodes, and it will only slow you down. > > Regards, > Alex > > On Tue, Mar 13, 2018 at 7:16 AM, Lukas Hejtmanek > wrote: > > > On Tue, Mar 13, 2018 at 10:37:43AM +0000, John Hearns wrote: > > > Lukas, > > > It looks like you are proposing a setup which uses your compute servers > > as storage servers also? > > > > yes, exactly. I would like to utilise NVMe SSDs that are in every compute > > servers.. Using them as a shared scratch area with GPFS is one of the > > options. > > > > > > > > * I'm thinking about the following setup: > > > ~ 60 nodes, each with two enterprise NVMe SSDs, FDR IB interconnected > > > > > > There is nothing wrong with this concept, for instance see > > > https://www.beegfs.io/wiki/BeeOND > > > > > > I have an NVMe filesystem which uses 60 drives, but there are 10 servers. > > > You should look at "failure zones" also. > > > > you still need the storage servers and local SSDs to use only for caching, > > do > > I understand correctly? > > > > > > > > From: gpfsug-discuss-bounces at spectrumscale.org [mailto:gpfsug-discuss- > > bounces at spectrumscale.org] On Behalf Of Knister, Aaron S. > > (GSFC-606.2)[COMPUTER SCIENCE CORP] > > > Sent: Monday, March 12, 2018 4:14 PM > > > To: gpfsug main discussion list > > > Subject: Re: [gpfsug-discuss] Preferred NSD > > > > > > Hi Lukas, > > > > > > Check out FPO mode. 
That mimics Hadoop's data placement features. You > > can have up to 3 replicas both data and metadata but still the downside, > > though, as you say is the wrong node failures will take your cluster down. > > > > > > You might want to check out something like Excelero's NVMesh (note: not > > an endorsement since I can't give such things) which can create logical > > volumes across all your NVMe drives. The product has erasure coding on > > their roadmap. I'm not sure if they've released that feature yet but in > > theory it will give better fault tolerance *and* you'll get more efficient > > usage of your SSDs. > > > > > > I'm sure there are other ways to skin this cat too. > > > > > > -Aaron > > > > > > > > > > > > On March 12, 2018 at 10:59:35 EDT, Lukas Hejtmanek > > wrote: > > > Hello, > > > > > > I'm thinking about the following setup: > > > ~ 60 nodes, each with two enterprise NVMe SSDs, FDR IB interconnected > > > > > > I would like to setup shared scratch area using GPFS and those NVMe > > SSDs. Each > > > SSDs as on NSD. > > > > > > I don't think like 5 or more data/metadata replicas are practical here. > > On the > > > other hand, multiple node failures is something really expected. > > > > > > Is there a way to instrument that local NSD is strongly preferred to > > store > > > data? I.e. node failure most probably does not result in unavailable > > data for > > > the other nodes? > > > > > > Or is there any other recommendation/solution to build shared scratch > > with > > > GPFS in such setup? (Do not do it including.) > > > > > > -- > > > Luk?? Hejtm?nek > > > _______________________________________________ > > > gpfsug-discuss mailing list > > > gpfsug-discuss at spectrumscale.org > > > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > > -- The information contained in this communication and any attachments > > is confidential and may be privileged, and is for the sole use of the > > intended recipient(s). Any unauthorized review, use, disclosure or > > distribution is prohibited. Unless explicitly stated otherwise in the body > > of this communication or the attachment thereto (if any), the information > > is provided on an AS-IS basis without any express or implied warranties or > > liabilities. To the extent you are relying on this information, you are > > doing so at your own risk. If you are not the intended recipient, please > > notify the sender immediately by replying to this message and destroy all > > copies of this message and any attachments. Neither the sender nor the > > company/group of companies he or she represents shall be liable for the > > proper and complete transmission of the information contained in this > > communication, or for any delay in its receipt. > > > > > _______________________________________________ > > > gpfsug-discuss mailing list > > > gpfsug-discuss at spectrumscale.org > > > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > > > > > -- > > Luk?? Hejtm?nek > > > > Linux Administrator only because > > Full Time Multitasking Ninja > > is not an official job title > > _______________________________________________ > > gpfsug-discuss mailing list > > gpfsug-discuss at spectrumscale.org > > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss -- Luk?? 
Hejtmánek
Linux Administrator only because
Full Time Multitasking Ninja
is not an official job title
_______________________________________________
gpfsug-discuss mailing list
gpfsug-discuss at spectrumscale.org
http://gpfsug.org/mailman/listinfo/gpfsug-discuss

From zacekm at img.cas.cz  Wed Mar 14 10:57:36 2018
From: zacekm at img.cas.cz (Michal Zacek)
Date: Wed, 14 Mar 2018 11:57:36 +0100
Subject: [gpfsug-discuss] Preferred NSD
In-Reply-To: <20180312152317.mfc6xhsvthticrlh@ics.muni.cz>
References: <20180312145105.yb24bzo6chlpthwm@ics.muni.cz> <188417.1520867920@turing-police.cc.vt.edu> <20180312152317.mfc6xhsvthticrlh@ics.muni.cz>
Message-ID: <1e9df22c-5a50-5e16-f2e2-ec0ef5427712@img.cas.cz>

Hi,

I don't think GPFS is a good choice for your setup. Did you consider GlusterFS? It's used at the Max Planck Institute at Dresden for HPC computing of Molecular Biology data. They have a similar setup, tens (hundreds) of computers with shared local storage in GlusterFS. But you will need a 10Gb network.

Michal

On 12.3.2018 at 16:23, Lukas Hejtmanek wrote:
> On Mon, Mar 12, 2018 at 11:18:40AM -0400, valdis.kletnieks at vt.edu wrote:
>> On Mon, 12 Mar 2018 15:51:05 +0100, Lukas Hejtmanek said:
>>> I don't think like 5 or more data/metadata replicas are practical here. On the
>>> other hand, multiple node failures is something really expected.
>> Umm.. do I want to ask *why*, out of only 60 nodes, multiple node
>> failures are an expected event - to the point that you're thinking
>> about needing 5 replicas to keep things running?
> as of my experience with cluster management, we have multiple nodes down on
> regular basis. (HW failure, SW maintenance and so on.)
>
> I'm basically thinking that 2-3 replicas might not be enough while 5 or more
> are becoming too expensive (both disk space and required bandwidth being
> scratch space - high i/o load expected).
>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: smime.p7s
Type: application/pkcs7-signature
Size: 3776 bytes
Desc: S/MIME electronic signature
URL: 

From aaron.s.knister at nasa.gov  Wed Mar 14 15:28:53 2018
From: aaron.s.knister at nasa.gov (Aaron Knister)
Date: Wed, 14 Mar 2018 11:28:53 -0400
Subject: [gpfsug-discuss] Preferred NSD
In-Reply-To: <1e9df22c-5a50-5e16-f2e2-ec0ef5427712@img.cas.cz>
References: <20180312145105.yb24bzo6chlpthwm@ics.muni.cz> <188417.1520867920@turing-police.cc.vt.edu> <20180312152317.mfc6xhsvthticrlh@ics.muni.cz> <1e9df22c-5a50-5e16-f2e2-ec0ef5427712@img.cas.cz>
Message-ID: <7797cde3-d4be-7f37-315d-c4e672c3a49f@nasa.gov>

I don't want to start a religious filesystem war, but I'd give pause to GlusterFS based on a number of operational issues I've personally experienced and seen others experience with it.

I'm curious how glusterfs would resolve the issue here of multiple clients failing simultaneously (unless you're talking about using disperse volumes)? That does, actually, bring up an interesting question to IBM which is -- when will mestor see the light of day? This is admittedly something other filesystems can do that GPFS cannot.

-Aaron

On 3/14/18 6:57 AM, Michal Zacek wrote:
> Hi,
>
> I don't think GPFS is a good choice for your setup. Did you consider
> GlusterFS? It's used at the Max Planck Institute at Dresden for HPC
> computing of Molecular Biology data. They have a similar setup, tens
> (hundreds) of computers with shared local storage in GlusterFS. But you
> will need a 10Gb network.
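On the dispersed-volume question raised above: purely as an illustration (not an endorsement, and all host and brick names invented), a Gluster erasure-coded scratch volume that tolerates two simultaneous node failures could be created roughly like this, assuming one NVMe brick per node and the peers already probed into one trusted pool.

    # 6 bricks, any 2 of which may be lost (4 data + 2 redundancy)
    gluster volume create scratch disperse 6 redundancy 2 \
        node0{1..6}:/bricks/nvme0/scratch
    gluster volume start scratch

    # clients mount it over the native FUSE client
    mount -t glusterfs node01:/scratch /scratch

Whether that behaves well under an HPC scratch workload is exactly the operational question being debated in this thread, so treat it as a sketch only.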
> > Michal > > > Dne 12.3.2018 v 16:23 Lukas Hejtmanek napsal(a): >> On Mon, Mar 12, 2018 at 11:18:40AM -0400, valdis.kletnieks at vt.edu wrote: >>> On Mon, 12 Mar 2018 15:51:05 +0100, Lukas Hejtmanek said: >>>> I don't think like 5 or more data/metadata replicas are practical here. On the >>>> other hand, multiple node failures is something really expected. >>> Umm.. do I want to ask *why*, out of only 60 nodes, multiple node >>> failures are an expected event - to the point that you're thinking >>> about needing 5 replicas to keep things running? >> as of my experience with cluster management, we have multiple nodes down on >> regular basis. (HW failure, SW maintenance and so on.) >> >> I'm basically thinking that 2-3 replicas might not be enough while 5 or more >> are becoming too expensive (both disk space and required bandwidth being >> scratch space - high i/o load expected). >> > > > > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > -- Aaron Knister NASA Center for Climate Simulation (Code 606.2) Goddard Space Flight Center (301) 286-2776 From skylar2 at u.washington.edu Wed Mar 14 15:42:37 2018 From: skylar2 at u.washington.edu (Skylar Thompson) Date: Wed, 14 Mar 2018 15:42:37 +0000 Subject: [gpfsug-discuss] Preferred NSD In-Reply-To: <7797cde3-d4be-7f37-315d-c4e672c3a49f@nasa.gov> References: <20180312145105.yb24bzo6chlpthwm@ics.muni.cz> <188417.1520867920@turing-police.cc.vt.edu> <20180312152317.mfc6xhsvthticrlh@ics.muni.cz> <1e9df22c-5a50-5e16-f2e2-ec0ef5427712@img.cas.cz> <7797cde3-d4be-7f37-315d-c4e672c3a49f@nasa.gov> Message-ID: <20180314154237.u4d3hqraqcn6a4xl@utumno.gs.washington.edu> I agree. We have a small Gluster filesystem we use to perform failover of our job scheduler, but it predates our use of GPFS. We've run into a number of strange failures and "soft failures" (i.e. filesystem admin tools don't work but the filesystem is available), and the logging is much more cryptic and jumbled than mmfs.log. We'll soon be retiring it in favor of GPFS. On Wed, Mar 14, 2018 at 11:28:53AM -0400, Aaron Knister wrote: > I don't want to start a religious filesystem war, but I'd give pause to > GlusterFS based on a number of operational issues I've personally > experienced and seen others experience with it. > > I'm curious how glusterfs would resolve the issue here of multiple clients > failing simultaneously (unless you're talking about using disperse volumes)? > That does, actually, bring up an interesting question to IBM which is -- > when will mestor see the light of day? This is admittedly something other > filesystems can do that GPFS cannot. > > -Aaron > > On 3/14/18 6:57 AM, Michal Zacek wrote: > > Hi, > > > > I don't think the GPFS is good choice for your setup. Did you consider > > GlusterFS? It's used at Max Planck Institute at Dresden for HPC > > computing of? Molecular Biology data. They have similar setup,? tens > > (hundreds) of computers with shared local storage in glusterfs. But you > > will need 10Gb network. > > > > Michal > > > > > > Dne 12.3.2018 v 16:23 Lukas Hejtmanek napsal(a): > > > On Mon, Mar 12, 2018 at 11:18:40AM -0400, valdis.kletnieks at vt.edu wrote: > > > > On Mon, 12 Mar 2018 15:51:05 +0100, Lukas Hejtmanek said: > > > > > I don't think like 5 or more data/metadata replicas are practical here. On the > > > > > other hand, multiple node failures is something really expected. > > > > Umm.. 
do I want to ask *why*, out of only 60 nodes, multiple node > > > > failures are an expected event - to the point that you're thinking > > > > about needing 5 replicas to keep things running? > > > as of my experience with cluster management, we have multiple nodes down on > > > regular basis. (HW failure, SW maintenance and so on.) > > > > > > I'm basically thinking that 2-3 replicas might not be enough while 5 or more > > > are becoming too expensive (both disk space and required bandwidth being > > > scratch space - high i/o load expected). > > > > > > > > > > > > > _______________________________________________ > > gpfsug-discuss mailing list > > gpfsug-discuss at spectrumscale.org > > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > > > -- > Aaron Knister > NASA Center for Climate Simulation (Code 606.2) > Goddard Space Flight Center > (301) 286-2776 > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss -- -- Skylar Thompson (skylar2 at u.washington.edu) -- Genome Sciences Department, System Administrator -- Foege Building S046, (206)-685-7354 -- University of Washington School of Medicine From JRLang at uwyo.edu Wed Mar 14 14:11:35 2018 From: JRLang at uwyo.edu (Jeffrey R. Lang) Date: Wed, 14 Mar 2018 14:11:35 +0000 Subject: [gpfsug-discuss] Preferred NSD In-Reply-To: <20180314092815.k7thtymu33wg65xt@ics.muni.cz> References: <20180312145105.yb24bzo6chlpthwm@ics.muni.cz> <20180313141630.k2lqvcndz7mrakgs@ics.muni.cz> <20180314092815.k7thtymu33wg65xt@ics.muni.cz> Message-ID: Something I haven't heard in this discussion, it that of licensing of GPFS. I believe that once you export disks from a node it then becomes a server node and the license may need to be changed, from client to server. There goes the budget. -----Original Message----- From: gpfsug-discuss-bounces at spectrumscale.org On Behalf Of Lukas Hejtmanek Sent: Wednesday, March 14, 2018 4:28 AM To: gpfsug main discussion list Subject: Re: [gpfsug-discuss] Preferred NSD Hello, thank you for insight. Well, the point is, that I will get ~60 with 120 NVMe disks in it, each about 2TB size. It means that I will have 240TB in NVMe SSD that could build nice shared scratch. Moreover, I have no different HW or place to put these SSDs into. They have to be in the compute nodes. On Tue, Mar 13, 2018 at 10:48:21AM -0700, Alex Chekholko wrote: > I would like to discourage you from building a large distributed > clustered filesystem made of many unreliable components. You will > need to overprovision your interconnect and will also spend a lot of > time in "healing" or "degraded" state. > > It is typically cheaper to centralize the storage into a subset of > nodes and configure those to be more highly available. E.g. of your > 60 nodes, take 8 and put all the storage into those and make that a > dedicated GPFS cluster with no compute jobs on those nodes. Again, > you'll still need really beefy and reliable interconnect to make this work. > > Stepping back; what is the actual problem you're trying to solve? I > have certainly been in that situation before, where the problem is > more like: "I have a fixed hardware configuration that I can't change, > and I want to try to shoehorn a parallel filesystem onto that." > > I would recommend looking closer at your actual workloads. 
If this is > a "scratch" filesystem and file access is mostly from one node at a > time, it's not very useful to make two additional copies of that data > on other nodes, and it will only slow you down. > > Regards, > Alex > > On Tue, Mar 13, 2018 at 7:16 AM, Lukas Hejtmanek > > wrote: > > > On Tue, Mar 13, 2018 at 10:37:43AM +0000, John Hearns wrote: > > > Lukas, > > > It looks like you are proposing a setup which uses your compute > > > servers > > as storage servers also? > > > > yes, exactly. I would like to utilise NVMe SSDs that are in every > > compute servers.. Using them as a shared scratch area with GPFS is > > one of the options. > > > > > > > > * I'm thinking about the following setup: > > > ~ 60 nodes, each with two enterprise NVMe SSDs, FDR IB > > > interconnected > > > > > > There is nothing wrong with this concept, for instance see > > > https://www.beegfs.io/wiki/BeeOND > > > > > > I have an NVMe filesystem which uses 60 drives, but there are 10 servers. > > > You should look at "failure zones" also. > > > > you still need the storage servers and local SSDs to use only for > > caching, do I understand correctly? > > > > > > > > From: gpfsug-discuss-bounces at spectrumscale.org > > > [mailto:gpfsug-discuss- > > bounces at spectrumscale.org] On Behalf Of Knister, Aaron S. > > (GSFC-606.2)[COMPUTER SCIENCE CORP] > > > Sent: Monday, March 12, 2018 4:14 PM > > > To: gpfsug main discussion list > > > Subject: Re: [gpfsug-discuss] Preferred NSD > > > > > > Hi Lukas, > > > > > > Check out FPO mode. That mimics Hadoop's data placement features. > > > You > > can have up to 3 replicas both data and metadata but still the > > downside, though, as you say is the wrong node failures will take your cluster down. > > > > > > You might want to check out something like Excelero's NVMesh > > > (note: not > > an endorsement since I can't give such things) which can create > > logical volumes across all your NVMe drives. The product has erasure > > coding on their roadmap. I'm not sure if they've released that > > feature yet but in theory it will give better fault tolerance *and* > > you'll get more efficient usage of your SSDs. > > > > > > I'm sure there are other ways to skin this cat too. > > > > > > -Aaron > > > > > > > > > > > > On March 12, 2018 at 10:59:35 EDT, Lukas Hejtmanek > > > > > wrote: > > > Hello, > > > > > > I'm thinking about the following setup: > > > ~ 60 nodes, each with two enterprise NVMe SSDs, FDR IB > > > interconnected > > > > > > I would like to setup shared scratch area using GPFS and those > > > NVMe > > SSDs. Each > > > SSDs as on NSD. > > > > > > I don't think like 5 or more data/metadata replicas are practical here. > > On the > > > other hand, multiple node failures is something really expected. > > > > > > Is there a way to instrument that local NSD is strongly preferred > > > to > > store > > > data? I.e. node failure most probably does not result in > > > unavailable > > data for > > > the other nodes? > > > > > > Or is there any other recommendation/solution to build shared > > > scratch > > with > > > GPFS in such setup? (Do not do it including.) > > > > > > -- > > > Luk?? Hejtm?nek > > > _______________________________________________ > > > gpfsug-discuss mailing list > > > gpfsug-discuss at spectrumscale.org > > > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > > -- The information contained in this communication and any > > > attachments > > is confidential and may be privileged, and is for the sole use of > > the intended recipient(s). 
Any unauthorized review, use, disclosure > > or distribution is prohibited. Unless explicitly stated otherwise in > > the body of this communication or the attachment thereto (if any), > > the information is provided on an AS-IS basis without any express or > > implied warranties or liabilities. To the extent you are relying on > > this information, you are doing so at your own risk. If you are not > > the intended recipient, please notify the sender immediately by > > replying to this message and destroy all copies of this message and > > any attachments. Neither the sender nor the company/group of > > companies he or she represents shall be liable for the proper and > > complete transmission of the information contained in this communication, or for any delay in its receipt. > > > > > _______________________________________________ > > > gpfsug-discuss mailing list > > > gpfsug-discuss at spectrumscale.org > > > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > > > > > -- > > Luk?? Hejtm?nek > > > > Linux Administrator only because > > Full Time Multitasking Ninja > > is not an official job title > > _______________________________________________ > > gpfsug-discuss mailing list > > gpfsug-discuss at spectrumscale.org > > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss -- Luk?? Hejtm?nek Linux Administrator only because Full Time Multitasking Ninja is not an official job title _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss From S.J.Thompson at bham.ac.uk Wed Mar 14 16:54:16 2018 From: S.J.Thompson at bham.ac.uk (Simon Thompson (IT Research Support)) Date: Wed, 14 Mar 2018 16:54:16 +0000 Subject: [gpfsug-discuss] Preferred NSD In-Reply-To: References: <20180312145105.yb24bzo6chlpthwm@ics.muni.cz> <20180313141630.k2lqvcndz7mrakgs@ics.muni.cz> <20180314092815.k7thtymu33wg65xt@ics.muni.cz>, Message-ID: Not always true. 1. Use them with socket licenses as HAWC or LROC is OK on a client. 2. Have data management edition and capacity license the amount of storage. Simon ________________________________________ From: gpfsug-discuss-bounces at spectrumscale.org [gpfsug-discuss-bounces at spectrumscale.org] on behalf of Jeffrey R. Lang [JRLang at uwyo.edu] Sent: 14 March 2018 14:11 To: gpfsug main discussion list Subject: Re: [gpfsug-discuss] Preferred NSD Something I haven't heard in this discussion, it that of licensing of GPFS. I believe that once you export disks from a node it then becomes a server node and the license may need to be changed, from client to server. There goes the budget. -----Original Message----- From: gpfsug-discuss-bounces at spectrumscale.org On Behalf Of Lukas Hejtmanek Sent: Wednesday, March 14, 2018 4:28 AM To: gpfsug main discussion list Subject: Re: [gpfsug-discuss] Preferred NSD Hello, thank you for insight. Well, the point is, that I will get ~60 with 120 NVMe disks in it, each about 2TB size. It means that I will have 240TB in NVMe SSD that could build nice shared scratch. Moreover, I have no different HW or place to put these SSDs into. They have to be in the compute nodes. On Tue, Mar 13, 2018 at 10:48:21AM -0700, Alex Chekholko wrote: > I would like to discourage you from building a large distributed > clustered filesystem made of many unreliable components. 
You will > need to overprovision your interconnect and will also spend a lot of > time in "healing" or "degraded" state. > > It is typically cheaper to centralize the storage into a subset of > nodes and configure those to be more highly available. E.g. of your > 60 nodes, take 8 and put all the storage into those and make that a > dedicated GPFS cluster with no compute jobs on those nodes. Again, > you'll still need really beefy and reliable interconnect to make this work. > > Stepping back; what is the actual problem you're trying to solve? I > have certainly been in that situation before, where the problem is > more like: "I have a fixed hardware configuration that I can't change, > and I want to try to shoehorn a parallel filesystem onto that." > > I would recommend looking closer at your actual workloads. If this is > a "scratch" filesystem and file access is mostly from one node at a > time, it's not very useful to make two additional copies of that data > on other nodes, and it will only slow you down. > > Regards, > Alex > > On Tue, Mar 13, 2018 at 7:16 AM, Lukas Hejtmanek > > wrote: > > > On Tue, Mar 13, 2018 at 10:37:43AM +0000, John Hearns wrote: > > > Lukas, > > > It looks like you are proposing a setup which uses your compute > > > servers > > as storage servers also? > > > > yes, exactly. I would like to utilise NVMe SSDs that are in every > > compute servers.. Using them as a shared scratch area with GPFS is > > one of the options. > > > > > > > > * I'm thinking about the following setup: > > > ~ 60 nodes, each with two enterprise NVMe SSDs, FDR IB > > > interconnected > > > > > > There is nothing wrong with this concept, for instance see > > > https://www.beegfs.io/wiki/BeeOND > > > > > > I have an NVMe filesystem which uses 60 drives, but there are 10 servers. > > > You should look at "failure zones" also. > > > > you still need the storage servers and local SSDs to use only for > > caching, do I understand correctly? > > > > > > > > From: gpfsug-discuss-bounces at spectrumscale.org > > > [mailto:gpfsug-discuss- > > bounces at spectrumscale.org] On Behalf Of Knister, Aaron S. > > (GSFC-606.2)[COMPUTER SCIENCE CORP] > > > Sent: Monday, March 12, 2018 4:14 PM > > > To: gpfsug main discussion list > > > Subject: Re: [gpfsug-discuss] Preferred NSD > > > > > > Hi Lukas, > > > > > > Check out FPO mode. That mimics Hadoop's data placement features. > > > You > > can have up to 3 replicas both data and metadata but still the > > downside, though, as you say is the wrong node failures will take your cluster down. > > > > > > You might want to check out something like Excelero's NVMesh > > > (note: not > > an endorsement since I can't give such things) which can create > > logical volumes across all your NVMe drives. The product has erasure > > coding on their roadmap. I'm not sure if they've released that > > feature yet but in theory it will give better fault tolerance *and* > > you'll get more efficient usage of your SSDs. > > > > > > I'm sure there are other ways to skin this cat too. > > > > > > -Aaron > > > > > > > > > > > > On March 12, 2018 at 10:59:35 EDT, Lukas Hejtmanek > > > > > wrote: > > > Hello, > > > > > > I'm thinking about the following setup: > > > ~ 60 nodes, each with two enterprise NVMe SSDs, FDR IB > > > interconnected > > > > > > I would like to setup shared scratch area using GPFS and those > > > NVMe > > SSDs. Each > > > SSDs as on NSD. > > > > > > I don't think like 5 or more data/metadata replicas are practical here. 
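Tying Alex's "take a handful of nodes and make them dedicated storage servers" suggestion to the replica question quoted here: a hedged sketch of what that layout looks like in GPFS terms, with hostnames, devices and the file system name invented. Two failure groups and two copies of data and metadata mean a server, or a whole failure group, can be lost without losing access.

    # one %nsd line per disk; only two shown, a real eight-server setup
    # would simply have more lines spread over the failure groups
    echo '%nsd: nsd=nsd01 device=/dev/nvme0n1 servers=nsd-a1,nsd-a2 usage=dataAndMetadata failureGroup=1 pool=system' > nsd.stanza
    echo '%nsd: nsd=nsd02 device=/dev/nvme0n1 servers=nsd-b1,nsd-b2 usage=dataAndMetadata failureGroup=2 pool=system' >> nsd.stanza
    mmcrnsd -F nsd.stanza

    # two copies of data and metadata, placed across the failure groups
    mmcrfs scratch -F nsd.stanza -m 2 -M 2 -r 2 -R 2 -T /gpfs/scratch

Quorum layout and descOnly disks are deliberately left out of the sketch; they matter just as much for surviving a failure group outage.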
> > On the > > > other hand, multiple node failures is something really expected. > > > > > > Is there a way to instrument that local NSD is strongly preferred > > > to > > store > > > data? I.e. node failure most probably does not result in > > > unavailable > > data for > > > the other nodes? > > > > > > Or is there any other recommendation/solution to build shared > > > scratch > > with > > > GPFS in such setup? (Do not do it including.) > > > > > > -- > > > Luk?? Hejtm?nek > > > _______________________________________________ > > > gpfsug-discuss mailing list > > > gpfsug-discuss at spectrumscale.org > > > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > > -- The information contained in this communication and any > > > attachments > > is confidential and may be privileged, and is for the sole use of > > the intended recipient(s). Any unauthorized review, use, disclosure > > or distribution is prohibited. Unless explicitly stated otherwise in > > the body of this communication or the attachment thereto (if any), > > the information is provided on an AS-IS basis without any express or > > implied warranties or liabilities. To the extent you are relying on > > this information, you are doing so at your own risk. If you are not > > the intended recipient, please notify the sender immediately by > > replying to this message and destroy all copies of this message and > > any attachments. Neither the sender nor the company/group of > > companies he or she represents shall be liable for the proper and > > complete transmission of the information contained in this communication, or for any delay in its receipt. > > > > > _______________________________________________ > > > gpfsug-discuss mailing list > > > gpfsug-discuss at spectrumscale.org > > > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > > > > > -- > > Luk?? Hejtm?nek > > > > Linux Administrator only because > > Full Time Multitasking Ninja > > is not an official job title > > _______________________________________________ > > gpfsug-discuss mailing list > > gpfsug-discuss at spectrumscale.org > > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss -- Luk?? Hejtm?nek Linux Administrator only because Full Time Multitasking Ninja is not an official job title _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss From r.sobey at imperial.ac.uk Wed Mar 14 17:33:02 2018 From: r.sobey at imperial.ac.uk (Sobey, Richard A) Date: Wed, 14 Mar 2018 17:33:02 +0000 Subject: [gpfsug-discuss] Preferred NSD In-Reply-To: References: <20180312145105.yb24bzo6chlpthwm@ics.muni.cz> <20180313141630.k2lqvcndz7mrakgs@ics.muni.cz> <20180314092815.k7thtymu33wg65xt@ics.muni.cz>, Message-ID: >> 2. Have data management edition and capacity license the amount of storage. There goes the budget ? Richard -----Original Message----- From: gpfsug-discuss-bounces at spectrumscale.org On Behalf Of Simon Thompson (IT Research Support) Sent: 14 March 2018 16:54 To: gpfsug main discussion list Subject: Re: [gpfsug-discuss] Preferred NSD Not always true. 1. Use them with socket licenses as HAWC or LROC is OK on a client. 2. 
Have data management edition and capacity license the amount of storage. Simon ________________________________________ From: gpfsug-discuss-bounces at spectrumscale.org [gpfsug-discuss-bounces at spectrumscale.org] on behalf of Jeffrey R. Lang [JRLang at uwyo.edu] Sent: 14 March 2018 14:11 To: gpfsug main discussion list Subject: Re: [gpfsug-discuss] Preferred NSD Something I haven't heard in this discussion, it that of licensing of GPFS. I believe that once you export disks from a node it then becomes a server node and the license may need to be changed, from client to server. There goes the budget. -----Original Message----- From: gpfsug-discuss-bounces at spectrumscale.org On Behalf Of Lukas Hejtmanek Sent: Wednesday, March 14, 2018 4:28 AM To: gpfsug main discussion list Subject: Re: [gpfsug-discuss] Preferred NSD Hello, thank you for insight. Well, the point is, that I will get ~60 with 120 NVMe disks in it, each about 2TB size. It means that I will have 240TB in NVMe SSD that could build nice shared scratch. Moreover, I have no different HW or place to put these SSDs into. They have to be in the compute nodes. On Tue, Mar 13, 2018 at 10:48:21AM -0700, Alex Chekholko wrote: > I would like to discourage you from building a large distributed > clustered filesystem made of many unreliable components. You will > need to overprovision your interconnect and will also spend a lot of > time in "healing" or "degraded" state. > > It is typically cheaper to centralize the storage into a subset of > nodes and configure those to be more highly available. E.g. of your > 60 nodes, take 8 and put all the storage into those and make that a > dedicated GPFS cluster with no compute jobs on those nodes. Again, > you'll still need really beefy and reliable interconnect to make this work. > > Stepping back; what is the actual problem you're trying to solve? I > have certainly been in that situation before, where the problem is > more like: "I have a fixed hardware configuration that I can't change, > and I want to try to shoehorn a parallel filesystem onto that." > > I would recommend looking closer at your actual workloads. If this is > a "scratch" filesystem and file access is mostly from one node at a > time, it's not very useful to make two additional copies of that data > on other nodes, and it will only slow you down. > > Regards, > Alex > > On Tue, Mar 13, 2018 at 7:16 AM, Lukas Hejtmanek > > wrote: > > > On Tue, Mar 13, 2018 at 10:37:43AM +0000, John Hearns wrote: > > > Lukas, > > > It looks like you are proposing a setup which uses your compute > > > servers > > as storage servers also? > > > > yes, exactly. I would like to utilise NVMe SSDs that are in every > > compute servers.. Using them as a shared scratch area with GPFS is > > one of the options. > > > > > > > > * I'm thinking about the following setup: > > > ~ 60 nodes, each with two enterprise NVMe SSDs, FDR IB > > > interconnected > > > > > > There is nothing wrong with this concept, for instance see > > > https://www.beegfs.io/wiki/BeeOND > > > > > > I have an NVMe filesystem which uses 60 drives, but there are 10 servers. > > > You should look at "failure zones" also. > > > > you still need the storage servers and local SSDs to use only for > > caching, do I understand correctly? > > > > > > > > From: gpfsug-discuss-bounces at spectrumscale.org > > > [mailto:gpfsug-discuss- > > bounces at spectrumscale.org] On Behalf Of Knister, Aaron S. 
> > (GSFC-606.2)[COMPUTER SCIENCE CORP] > > > Sent: Monday, March 12, 2018 4:14 PM > > > To: gpfsug main discussion list > > > Subject: Re: [gpfsug-discuss] Preferred NSD > > > > > > Hi Lukas, > > > > > > Check out FPO mode. That mimics Hadoop's data placement features. > > > You > > can have up to 3 replicas both data and metadata but still the > > downside, though, as you say is the wrong node failures will take your cluster down. > > > > > > You might want to check out something like Excelero's NVMesh > > > (note: not > > an endorsement since I can't give such things) which can create > > logical volumes across all your NVMe drives. The product has erasure > > coding on their roadmap. I'm not sure if they've released that > > feature yet but in theory it will give better fault tolerance *and* > > you'll get more efficient usage of your SSDs. > > > > > > I'm sure there are other ways to skin this cat too. > > > > > > -Aaron > > > > > > > > > > > > On March 12, 2018 at 10:59:35 EDT, Lukas Hejtmanek > > > > > wrote: > > > Hello, > > > > > > I'm thinking about the following setup: > > > ~ 60 nodes, each with two enterprise NVMe SSDs, FDR IB > > > interconnected > > > > > > I would like to setup shared scratch area using GPFS and those > > > NVMe > > SSDs. Each > > > SSDs as on NSD. > > > > > > I don't think like 5 or more data/metadata replicas are practical here. > > On the > > > other hand, multiple node failures is something really expected. > > > > > > Is there a way to instrument that local NSD is strongly preferred > > > to > > store > > > data? I.e. node failure most probably does not result in > > > unavailable > > data for > > > the other nodes? > > > > > > Or is there any other recommendation/solution to build shared > > > scratch > > with > > > GPFS in such setup? (Do not do it including.) > > > > > > -- > > > Luk?? Hejtm?nek > > > _______________________________________________ > > > gpfsug-discuss mailing list > > > gpfsug-discuss at spectrumscale.org > > > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > > -- The information contained in this communication and any > > > attachments > > is confidential and may be privileged, and is for the sole use of > > the intended recipient(s). Any unauthorized review, use, disclosure > > or distribution is prohibited. Unless explicitly stated otherwise in > > the body of this communication or the attachment thereto (if any), > > the information is provided on an AS-IS basis without any express or > > implied warranties or liabilities. To the extent you are relying on > > this information, you are doing so at your own risk. If you are not > > the intended recipient, please notify the sender immediately by > > replying to this message and destroy all copies of this message and > > any attachments. Neither the sender nor the company/group of > > companies he or she represents shall be liable for the proper and > > complete transmission of the information contained in this communication, or for any delay in its receipt. > > > > > _______________________________________________ > > > gpfsug-discuss mailing list > > > gpfsug-discuss at spectrumscale.org > > > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > > > > > -- > > Luk?? 
Hejtm?nek > > > > Linux Administrator only because > > Full Time Multitasking Ninja > > is not an official job title > > _______________________________________________ > > gpfsug-discuss mailing list > > gpfsug-discuss at spectrumscale.org > > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss -- Luk?? Hejtm?nek Linux Administrator only because Full Time Multitasking Ninja is not an official job title _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss From ulmer at ulmer.org Wed Mar 14 18:59:29 2018 From: ulmer at ulmer.org (Stephen Ulmer) Date: Wed, 14 Mar 2018 14:59:29 -0400 Subject: [gpfsug-discuss] Preferred NSD In-Reply-To: References: <20180312145105.yb24bzo6chlpthwm@ics.muni.cz> <20180313141630.k2lqvcndz7mrakgs@ics.muni.cz> <20180314092815.k7thtymu33wg65xt@ics.muni.cz> Message-ID: <11B42072-E3CA-4FCE-BD1C-1E3DCA16626A@ulmer.org> Depending on the size... I just quoted something both ways and DME (which is Advanced Edition equivalent) was about $400K cheaper than Standard Edition socket pricing for this particular customer and use case. It all depends. Also, for the case where the OP wants to distribute the file system around on NVMe in *every* node, there is always the FPO license. The FPO license can share NSDs with other FPO licensed nodes and servers (just not clients). -- Stephen > On Mar 14, 2018, at 1:33 PM, Sobey, Richard A > wrote: > >>> 2. Have data management edition and capacity license the amount of storage. > There goes the budget ? > > Richard > > -----Original Message----- > From: gpfsug-discuss-bounces at spectrumscale.org > On Behalf Of Simon Thompson (IT Research Support) > Sent: 14 March 2018 16:54 > To: gpfsug main discussion list > > Subject: Re: [gpfsug-discuss] Preferred NSD > > Not always true. > > 1. Use them with socket licenses as HAWC or LROC is OK on a client. > 2. Have data management edition and capacity license the amount of storage. > > Simon > ________________________________________ > From: gpfsug-discuss-bounces at spectrumscale.org [gpfsug-discuss-bounces at spectrumscale.org ] on behalf of Jeffrey R. Lang [JRLang at uwyo.edu ] > Sent: 14 March 2018 14:11 > To: gpfsug main discussion list > Subject: Re: [gpfsug-discuss] Preferred NSD > > Something I haven't heard in this discussion, it that of licensing of GPFS. > > I believe that once you export disks from a node it then becomes a server node and the license may need to be changed, from client to server. There goes the budget. > > > > -----Original Message----- > From: gpfsug-discuss-bounces at spectrumscale.org > On Behalf Of Lukas Hejtmanek > Sent: Wednesday, March 14, 2018 4:28 AM > To: gpfsug main discussion list > > Subject: Re: [gpfsug-discuss] Preferred NSD > > Hello, > > thank you for insight. Well, the point is, that I will get ~60 with 120 NVMe disks in it, each about 2TB size. It means that I will have 240TB in NVMe SSD that could build nice shared scratch. 
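For what it is worth, the rough arithmetic behind that figure: 120 drives x 2 TB is about 240 TB of raw NVMe. With two-way GPFS data replication that is on the order of 240 / 2 = 120 TB of usable scratch, and with three-way replication roughly 240 / 3 = 80 TB, before metadata and other overheads, which is presumably why five or more replicas were dismissed as impractical earlier in the thread.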
Moreover, I have no different HW or place to put these SSDs into. They have to be in the compute nodes. > > On Tue, Mar 13, 2018 at 10:48:21AM -0700, Alex Chekholko wrote: >> I would like to discourage you from building a large distributed >> clustered filesystem made of many unreliable components. You will >> need to overprovision your interconnect and will also spend a lot of >> time in "healing" or "degraded" state. >> >> It is typically cheaper to centralize the storage into a subset of >> nodes and configure those to be more highly available. E.g. of your >> 60 nodes, take 8 and put all the storage into those and make that a >> dedicated GPFS cluster with no compute jobs on those nodes. Again, >> you'll still need really beefy and reliable interconnect to make this work. >> >> Stepping back; what is the actual problem you're trying to solve? I >> have certainly been in that situation before, where the problem is >> more like: "I have a fixed hardware configuration that I can't change, >> and I want to try to shoehorn a parallel filesystem onto that." >> >> I would recommend looking closer at your actual workloads. If this is >> a "scratch" filesystem and file access is mostly from one node at a >> time, it's not very useful to make two additional copies of that data >> on other nodes, and it will only slow you down. >> >> Regards, >> Alex >> >> On Tue, Mar 13, 2018 at 7:16 AM, Lukas Hejtmanek >> > >> wrote: >> >>> On Tue, Mar 13, 2018 at 10:37:43AM +0000, John Hearns wrote: >>>> Lukas, >>>> It looks like you are proposing a setup which uses your compute >>>> servers >>> as storage servers also? >>> >>> yes, exactly. I would like to utilise NVMe SSDs that are in every >>> compute servers.. Using them as a shared scratch area with GPFS is >>> one of the options. >>> >>>> >>>> * I'm thinking about the following setup: >>>> ~ 60 nodes, each with two enterprise NVMe SSDs, FDR IB >>>> interconnected >>>> >>>> There is nothing wrong with this concept, for instance see >>>> https://www.beegfs.io/wiki/BeeOND >>>> >>>> I have an NVMe filesystem which uses 60 drives, but there are 10 servers. >>>> You should look at "failure zones" also. >>> >>> you still need the storage servers and local SSDs to use only for >>> caching, do I understand correctly? >>> >>>> >>>> From: gpfsug-discuss-bounces at spectrumscale.org >>>> [mailto:gpfsug-discuss- >>> bounces at spectrumscale.org ] On Behalf Of Knister, Aaron S. >>> (GSFC-606.2)[COMPUTER SCIENCE CORP] >>>> Sent: Monday, March 12, 2018 4:14 PM >>>> To: gpfsug main discussion list > >>>> Subject: Re: [gpfsug-discuss] Preferred NSD >>>> >>>> Hi Lukas, >>>> >>>> Check out FPO mode. That mimics Hadoop's data placement features. >>>> You >>> can have up to 3 replicas both data and metadata but still the >>> downside, though, as you say is the wrong node failures will take your cluster down. >>>> >>>> You might want to check out something like Excelero's NVMesh >>>> (note: not >>> an endorsement since I can't give such things) which can create >>> logical volumes across all your NVMe drives. The product has erasure >>> coding on their roadmap. I'm not sure if they've released that >>> feature yet but in theory it will give better fault tolerance *and* >>> you'll get more efficient usage of your SSDs. >>>> >>>> I'm sure there are other ways to skin this cat too. 
>>>> >>>> -Aaron >>>> >>>> >>>> >>>> On March 12, 2018 at 10:59:35 EDT, Lukas Hejtmanek >>>> >>> >> wrote: >>>> Hello, >>>> >>>> I'm thinking about the following setup: >>>> ~ 60 nodes, each with two enterprise NVMe SSDs, FDR IB >>>> interconnected >>>> >>>> I would like to setup shared scratch area using GPFS and those >>>> NVMe >>> SSDs. Each >>>> SSDs as on NSD. >>>> >>>> I don't think like 5 or more data/metadata replicas are practical here. >>> On the >>>> other hand, multiple node failures is something really expected. >>>> >>>> Is there a way to instrument that local NSD is strongly preferred >>>> to >>> store >>>> data? I.e. node failure most probably does not result in >>>> unavailable >>> data for >>>> the other nodes? >>>> >>>> Or is there any other recommendation/solution to build shared >>>> scratch >>> with >>>> GPFS in such setup? (Do not do it including.) >>>> >>>> -- >>>> Luk?? Hejtm?nek >>>> _______________________________________________ >>>> gpfsug-discuss mailing list >>>> gpfsug-discuss at spectrumscale.org >>>> http://gpfsug.org/mailman/listinfo/gpfsug-discuss >>>> -- The information contained in this communication and any >>>> attachments >>> is confidential and may be privileged, and is for the sole use of >>> the intended recipient(s). Any unauthorized review, use, disclosure >>> or distribution is prohibited. Unless explicitly stated otherwise in >>> the body of this communication or the attachment thereto (if any), >>> the information is provided on an AS-IS basis without any express or >>> implied warranties or liabilities. To the extent you are relying on >>> this information, you are doing so at your own risk. If you are not >>> the intended recipient, please notify the sender immediately by >>> replying to this message and destroy all copies of this message and >>> any attachments. Neither the sender nor the company/group of >>> companies he or she represents shall be liable for the proper and >>> complete transmission of the information contained in this communication, or for any delay in its receipt. >>> >>>> _______________________________________________ >>>> gpfsug-discuss mailing list >>>> gpfsug-discuss at spectrumscale.org >>>> http://gpfsug.org/mailman/listinfo/gpfsug-discuss >>> >>> >>> -- >>> Luk?? Hejtm?nek >>> >>> Linux Administrator only because >>> Full Time Multitasking Ninja >>> is not an official job title >>> _______________________________________________ >>> gpfsug-discuss mailing list >>> gpfsug-discuss at spectrumscale.org >>> http://gpfsug.org/mailman/listinfo/gpfsug-discuss >>> > >> _______________________________________________ >> gpfsug-discuss mailing list >> gpfsug-discuss at spectrumscale.org >> http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > > -- > Luk?? 
Hejtm?nek > > Linux Administrator only because > Full Time Multitasking Ninja > is not an official job title > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss -------------- next part -------------- An HTML attachment was scrubbed... URL: From stockf at us.ibm.com Wed Mar 14 19:23:18 2018 From: stockf at us.ibm.com (Frederick Stock) Date: Wed, 14 Mar 2018 14:23:18 -0500 Subject: [gpfsug-discuss] Preferred NSD In-Reply-To: <11B42072-E3CA-4FCE-BD1C-1E3DCA16626A@ulmer.org> References: <20180312145105.yb24bzo6chlpthwm@ics.muni.cz><20180313141630.k2lqvcndz7mrakgs@ics.muni.cz><20180314092815.k7thtymu33wg65xt@ics.muni.cz> <11B42072-E3CA-4FCE-BD1C-1E3DCA16626A@ulmer.org> Message-ID: My understanding is that with Spectrum Scale 5.0 there is no longer a standard edition, only data management and advanced, and the pricing is all done via storage not sockets. Now there may be some grandfathering for those with existing socket licenses but I really do not know. My point is that data management is not the same as advanced edition. Again I could be wrong because I tend not to concern myself with how the product is licensed. Fred __________________________________________________ Fred Stock | IBM Pittsburgh Lab | 720-430-8821 stockf at us.ibm.com From: Stephen Ulmer To: gpfsug main discussion list Date: 03/14/2018 03:06 PM Subject: Re: [gpfsug-discuss] Preferred NSD Sent by: gpfsug-discuss-bounces at spectrumscale.org Depending on the size... I just quoted something both ways and DME (which is Advanced Edition equivalent) was about $400K cheaper than Standard Edition socket pricing for this particular customer and use case. It all depends. Also, for the case where the OP wants to distribute the file system around on NVMe in *every* node, there is always the FPO license. The FPO license can share NSDs with other FPO licensed nodes and servers (just not clients). -- Stephen On Mar 14, 2018, at 1:33 PM, Sobey, Richard A wrote: 2. Have data management edition and capacity license the amount of storage. There goes the budget ? Richard -----Original Message----- From: gpfsug-discuss-bounces at spectrumscale.org < gpfsug-discuss-bounces at spectrumscale.org> On Behalf Of Simon Thompson (IT Research Support) Sent: 14 March 2018 16:54 To: gpfsug main discussion list Subject: Re: [gpfsug-discuss] Preferred NSD Not always true. 1. Use them with socket licenses as HAWC or LROC is OK on a client. 2. Have data management edition and capacity license the amount of storage. Simon ________________________________________ From: gpfsug-discuss-bounces at spectrumscale.org [ gpfsug-discuss-bounces at spectrumscale.org] on behalf of Jeffrey R. Lang [ JRLang at uwyo.edu] Sent: 14 March 2018 14:11 To: gpfsug main discussion list Subject: Re: [gpfsug-discuss] Preferred NSD Something I haven't heard in this discussion, it that of licensing of GPFS. 
I believe that once you export disks from a node it then becomes a server node and the license may need to be changed, from client to server. There goes the budget. -----Original Message----- From: gpfsug-discuss-bounces at spectrumscale.org < gpfsug-discuss-bounces at spectrumscale.org> On Behalf Of Lukas Hejtmanek Sent: Wednesday, March 14, 2018 4:28 AM To: gpfsug main discussion list Subject: Re: [gpfsug-discuss] Preferred NSD Hello, thank you for insight. Well, the point is, that I will get ~60 with 120 NVMe disks in it, each about 2TB size. It means that I will have 240TB in NVMe SSD that could build nice shared scratch. Moreover, I have no different HW or place to put these SSDs into. They have to be in the compute nodes. On Tue, Mar 13, 2018 at 10:48:21AM -0700, Alex Chekholko wrote: I would like to discourage you from building a large distributed clustered filesystem made of many unreliable components. You will need to overprovision your interconnect and will also spend a lot of time in "healing" or "degraded" state. It is typically cheaper to centralize the storage into a subset of nodes and configure those to be more highly available. E.g. of your 60 nodes, take 8 and put all the storage into those and make that a dedicated GPFS cluster with no compute jobs on those nodes. Again, you'll still need really beefy and reliable interconnect to make this work. Stepping back; what is the actual problem you're trying to solve? I have certainly been in that situation before, where the problem is more like: "I have a fixed hardware configuration that I can't change, and I want to try to shoehorn a parallel filesystem onto that." I would recommend looking closer at your actual workloads. If this is a "scratch" filesystem and file access is mostly from one node at a time, it's not very useful to make two additional copies of that data on other nodes, and it will only slow you down. Regards, Alex On Tue, Mar 13, 2018 at 7:16 AM, Lukas Hejtmanek wrote: On Tue, Mar 13, 2018 at 10:37:43AM +0000, John Hearns wrote: Lukas, It looks like you are proposing a setup which uses your compute servers as storage servers also? yes, exactly. I would like to utilise NVMe SSDs that are in every compute servers.. Using them as a shared scratch area with GPFS is one of the options. * I'm thinking about the following setup: ~ 60 nodes, each with two enterprise NVMe SSDs, FDR IB interconnected There is nothing wrong with this concept, for instance see https://www.beegfs.io/wiki/BeeOND I have an NVMe filesystem which uses 60 drives, but there are 10 servers. You should look at "failure zones" also. you still need the storage servers and local SSDs to use only for caching, do I understand correctly? From: gpfsug-discuss-bounces at spectrumscale.org [mailto:gpfsug-discuss- bounces at spectrumscale.org] On Behalf Of Knister, Aaron S. (GSFC-606.2)[COMPUTER SCIENCE CORP] Sent: Monday, March 12, 2018 4:14 PM To: gpfsug main discussion list Subject: Re: [gpfsug-discuss] Preferred NSD Hi Lukas, Check out FPO mode. That mimics Hadoop's data placement features. You can have up to 3 replicas both data and metadata but still the downside, though, as you say is the wrong node failures will take your cluster down. You might want to check out something like Excelero's NVMesh (note: not an endorsement since I can't give such things) which can create logical volumes across all your NVMe drives. The product has erasure coding on their roadmap. 
I'm not sure if they've released that feature yet but in theory it will give better fault tolerance *and* you'll get more efficient usage of your SSDs. I'm sure there are other ways to skin this cat too. -Aaron On March 12, 2018 at 10:59:35 EDT, Lukas Hejtmanek > wrote: Hello, I'm thinking about the following setup: ~ 60 nodes, each with two enterprise NVMe SSDs, FDR IB interconnected I would like to setup shared scratch area using GPFS and those NVMe SSDs. Each SSDs as on NSD. I don't think like 5 or more data/metadata replicas are practical here. On the other hand, multiple node failures is something really expected. Is there a way to instrument that local NSD is strongly preferred to store data? I.e. node failure most probably does not result in unavailable data for the other nodes? Or is there any other recommendation/solution to build shared scratch with GPFS in such setup? (Do not do it including.) -- Luk?? Hejtm?nek _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss -- The information contained in this communication and any attachments is confidential and may be privileged, and is for the sole use of the intended recipient(s). Any unauthorized review, use, disclosure or distribution is prohibited. Unless explicitly stated otherwise in the body of this communication or the attachment thereto (if any), the information is provided on an AS-IS basis without any express or implied warranties or liabilities. To the extent you are relying on this information, you are doing so at your own risk. If you are not the intended recipient, please notify the sender immediately by replying to this message and destroy all copies of this message and any attachments. Neither the sender nor the company/group of companies he or she represents shall be liable for the proper and complete transmission of the information contained in this communication, or for any delay in its receipt. _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss -- Luk?? Hejtm?nek Linux Administrator only because Full Time Multitasking Ninja is not an official job title _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss -- Luk?? 
Hejtm?nek Linux Administrator only because Full Time Multitasking Ninja is not an official job title _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org https://urldefense.proofpoint.com/v2/url?u=http-3A__gpfsug.org_mailman_listinfo_gpfsug-2Ddiscuss&d=DwICAg&c=jf_iaSHvJObTbx-siA1ZOg&r=p_1XEUyoJ7-VJxF_w8h9gJh8_Wj0Pey73LCLLoxodpw&m=kB88vNQV9x5UFOu3tBxpRKmS3rSCi68KIBxOa_D5ji8&s=R9wxUL1IMkjtWZsFkSAXRUmuKi8uS1jpQRYVTvOYq3g&e= -------------- next part -------------- An HTML attachment was scrubbed... URL: From S.J.Thompson at bham.ac.uk Wed Mar 14 19:27:57 2018 From: S.J.Thompson at bham.ac.uk (Simon Thompson (IT Research Support)) Date: Wed, 14 Mar 2018 19:27:57 +0000 Subject: [gpfsug-discuss] Preferred NSD In-Reply-To: References: <20180312145105.yb24bzo6chlpthwm@ics.muni.cz><20180313141630.k2lqvcndz7mrakgs@ics.muni.cz><20180314092815.k7thtymu33wg65xt@ics.muni.cz> <11B42072-E3CA-4FCE-BD1C-1E3DCA16626A@ulmer.org>, Message-ID: I don't think this is correct. My understanding is: There is no longer express edition. Grand fathered to standard. Standard edition (sockets) remains. Advanced edition (sockets) is available for existing advanced customers only. Grand fathering to DME available. Data management (mostly capacity but per disk in ESS and DSS-G configs, different cost for flash or spinning drives). I'm sure Carl can correct me if I'm wrong here. Simon ________________________________________ From: gpfsug-discuss-bounces at spectrumscale.org [gpfsug-discuss-bounces at spectrumscale.org] on behalf of stockf at us.ibm.com [stockf at us.ibm.com] Sent: 14 March 2018 19:23 To: gpfsug main discussion list Subject: Re: [gpfsug-discuss] Preferred NSD My understanding is that with Spectrum Scale 5.0 there is no longer a standard edition, only data management and advanced, and the pricing is all done via storage not sockets. Now there may be some grandfathering for those with existing socket licenses but I really do not know. My point is that data management is not the same as advanced edition. Again I could be wrong because I tend not to concern myself with how the product is licensed. Fred __________________________________________________ Fred Stock | IBM Pittsburgh Lab | 720-430-8821 stockf at us.ibm.com From: Stephen Ulmer To: gpfsug main discussion list Date: 03/14/2018 03:06 PM Subject: Re: [gpfsug-discuss] Preferred NSD Sent by: gpfsug-discuss-bounces at spectrumscale.org ________________________________ Depending on the size... I just quoted something both ways and DME (which is Advanced Edition equivalent) was about $400K cheaper than Standard Edition socket pricing for this particular customer and use case. It all depends. Also, for the case where the OP wants to distribute the file system around on NVMe in *every* node, there is always the FPO license. 
The FPO license can share NSDs with other FPO licensed nodes and servers (just not clients). -- Stephen On Mar 14, 2018, at 1:33 PM, Sobey, Richard A > wrote: 2. Have data management edition and capacity license the amount of storage. There goes the budget ? Richard -----Original Message----- From: gpfsug-discuss-bounces at spectrumscale.org> On Behalf Of Simon Thompson (IT Research Support) Sent: 14 March 2018 16:54 To: gpfsug main discussion list > Subject: Re: [gpfsug-discuss] Preferred NSD Not always true. 1. Use them with socket licenses as HAWC or LROC is OK on a client. 2. Have data management edition and capacity license the amount of storage. Simon ________________________________________ From: gpfsug-discuss-bounces at spectrumscale.org[gpfsug-discuss-bounces at spectrumscale.org] on behalf of Jeffrey R. Lang [JRLang at uwyo.edu] Sent: 14 March 2018 14:11 To: gpfsug main discussion list Subject: Re: [gpfsug-discuss] Preferred NSD Something I haven't heard in this discussion, it that of licensing of GPFS. I believe that once you export disks from a node it then becomes a server node and the license may need to be changed, from client to server. There goes the budget. -----Original Message----- From: gpfsug-discuss-bounces at spectrumscale.org> On Behalf Of Lukas Hejtmanek Sent: Wednesday, March 14, 2018 4:28 AM To: gpfsug main discussion list > Subject: Re: [gpfsug-discuss] Preferred NSD Hello, thank you for insight. Well, the point is, that I will get ~60 with 120 NVMe disks in it, each about 2TB size. It means that I will have 240TB in NVMe SSD that could build nice shared scratch. Moreover, I have no different HW or place to put these SSDs into. They have to be in the compute nodes. On Tue, Mar 13, 2018 at 10:48:21AM -0700, Alex Chekholko wrote: I would like to discourage you from building a large distributed clustered filesystem made of many unreliable components. You will need to overprovision your interconnect and will also spend a lot of time in "healing" or "degraded" state. It is typically cheaper to centralize the storage into a subset of nodes and configure those to be more highly available. E.g. of your 60 nodes, take 8 and put all the storage into those and make that a dedicated GPFS cluster with no compute jobs on those nodes. Again, you'll still need really beefy and reliable interconnect to make this work. Stepping back; what is the actual problem you're trying to solve? I have certainly been in that situation before, where the problem is more like: "I have a fixed hardware configuration that I can't change, and I want to try to shoehorn a parallel filesystem onto that." I would recommend looking closer at your actual workloads. If this is a "scratch" filesystem and file access is mostly from one node at a time, it's not very useful to make two additional copies of that data on other nodes, and it will only slow you down. Regards, Alex On Tue, Mar 13, 2018 at 7:16 AM, Lukas Hejtmanek > wrote: On Tue, Mar 13, 2018 at 10:37:43AM +0000, John Hearns wrote: Lukas, It looks like you are proposing a setup which uses your compute servers as storage servers also? yes, exactly. I would like to utilise NVMe SSDs that are in every compute servers.. Using them as a shared scratch area with GPFS is one of the options. 
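For completeness: the mechanism GPFS itself offers for "prefer the local NSD" placement is an FPO-style pool with write affinity, where the first replica of a block is kept on a disk served by the writing node. The sketch below is purely illustrative -- node names, device paths and every tuning value are invented, and the stanza syntax should be checked against the FPO chapter of the administration guide for the release in use.

    # Hypothetical sketch only -- nothing here comes from this thread.
    # (Optional) designate the compute nodes with FPO licenses:
    #   mmchlicense fpo --accept -N node01,node02

    cat > /tmp/scratch.stanza <<'EOF'
    %pool:
      pool=system
      blockSize=1M
      layoutMap=cluster
      allowWriteAffinity=yes
      writeAffinityDepth=1
      blockGroupFactor=128

    %nsd: nsd=node01_nvme0 device=/dev/nvme0n1 servers=node01 usage=dataAndMetadata failureGroup=1001 pool=system
    %nsd: nsd=node02_nvme0 device=/dev/nvme0n1 servers=node02 usage=dataAndMetadata failureGroup=1002 pool=system
    EOF

    mmcrnsd -F /tmp/scratch.stanza
    # Two data and metadata replicas so a single node failure does not take data offline:
    mmcrfs scratchfs -F /tmp/scratch.stanza -m 2 -M 2 -r 2 -R 2 -A yes -T /gpfs/scratch

With two replicas, a single node failure leaves data readable from the second copy; simultaneous failures in two different failure groups can still take blocks offline, which is the trade-off the rest of this thread is discussing.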
* I'm thinking about the following setup: ~ 60 nodes, each with two enterprise NVMe SSDs, FDR IB interconnected There is nothing wrong with this concept, for instance see https://www.beegfs.io/wiki/BeeOND I have an NVMe filesystem which uses 60 drives, but there are 10 servers. You should look at "failure zones" also. you still need the storage servers and local SSDs to use only for caching, do I understand correctly? From: gpfsug-discuss-bounces at spectrumscale.org [mailto:gpfsug-discuss- bounces at spectrumscale.org] On Behalf Of Knister, Aaron S. (GSFC-606.2)[COMPUTER SCIENCE CORP] Sent: Monday, March 12, 2018 4:14 PM To: gpfsug main discussion list > Subject: Re: [gpfsug-discuss] Preferred NSD Hi Lukas, Check out FPO mode. That mimics Hadoop's data placement features. You can have up to 3 replicas both data and metadata but still the downside, though, as you say is the wrong node failures will take your cluster down. You might want to check out something like Excelero's NVMesh (note: not an endorsement since I can't give such things) which can create logical volumes across all your NVMe drives. The product has erasure coding on their roadmap. I'm not sure if they've released that feature yet but in theory it will give better fault tolerance *and* you'll get more efficient usage of your SSDs. I'm sure there are other ways to skin this cat too. -Aaron On March 12, 2018 at 10:59:35 EDT, Lukas Hejtmanek > wrote: Hello, I'm thinking about the following setup: ~ 60 nodes, each with two enterprise NVMe SSDs, FDR IB interconnected I would like to setup shared scratch area using GPFS and those NVMe SSDs. Each SSDs as on NSD. I don't think like 5 or more data/metadata replicas are practical here. On the other hand, multiple node failures is something really expected. Is there a way to instrument that local NSD is strongly preferred to store data? I.e. node failure most probably does not result in unavailable data for the other nodes? Or is there any other recommendation/solution to build shared scratch with GPFS in such setup? (Do not do it including.) -- Luk?? Hejtm?nek _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss -- The information contained in this communication and any attachments is confidential and may be privileged, and is for the sole use of the intended recipient(s). Any unauthorized review, use, disclosure or distribution is prohibited. Unless explicitly stated otherwise in the body of this communication or the attachment thereto (if any), the information is provided on an AS-IS basis without any express or implied warranties or liabilities. To the extent you are relying on this information, you are doing so at your own risk. If you are not the intended recipient, please notify the sender immediately by replying to this message and destroy all copies of this message and any attachments. Neither the sender nor the company/group of companies he or she represents shall be liable for the proper and complete transmission of the information contained in this communication, or for any delay in its receipt. _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss -- Luk?? 
Hejtm?nek Linux Administrator only because Full Time Multitasking Ninja is not an official job title _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss -- Luk?? Hejtm?nek Linux Administrator only because Full Time Multitasking Ninja is not an official job title _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org https://urldefense.proofpoint.com/v2/url?u=http-3A__gpfsug.org_mailman_listinfo_gpfsug-2Ddiscuss&d=DwICAg&c=jf_iaSHvJObTbx-siA1ZOg&r=p_1XEUyoJ7-VJxF_w8h9gJh8_Wj0Pey73LCLLoxodpw&m=kB88vNQV9x5UFOu3tBxpRKmS3rSCi68KIBxOa_D5ji8&s=R9wxUL1IMkjtWZsFkSAXRUmuKi8uS1jpQRYVTvOYq3g&e= From makaplan at us.ibm.com Wed Mar 14 20:02:15 2018 From: makaplan at us.ibm.com (Marc A Kaplan) Date: Wed, 14 Mar 2018 15:02:15 -0500 Subject: [gpfsug-discuss] Editions and Licensing / Preferred NSD In-Reply-To: <11B42072-E3CA-4FCE-BD1C-1E3DCA16626A@ulmer.org> References: <20180312145105.yb24bzo6chlpthwm@ics.muni.cz><20180313141630.k2lqvcndz7mrakgs@ics.muni.cz><20180314092815.k7thtymu33wg65xt@ics.muni.cz> <11B42072-E3CA-4FCE-BD1C-1E3DCA16626A@ulmer.org> Message-ID: Thread seems to have gone off on a product editions and Licensing tangents -- refer to IBM website for official statements: https://www.ibm.com/support/knowledgecenter/en/STXKQY_5.0.0/com.ibm.spectrum.scale.v5r00.doc/bl1in_IntroducingIBMSpectrumScale.htm -------------- next part -------------- An HTML attachment was scrubbed... URL: From Mark.Bush at siriuscom.com Wed Mar 14 15:36:32 2018 From: Mark.Bush at siriuscom.com (Mark Bush) Date: Wed, 14 Mar 2018 15:36:32 +0000 Subject: [gpfsug-discuss] Underlying LUN mirroring NSD impact Message-ID: Is it possible (albeit not advisable) to mirror LUNs that are NSD's to another storage array in another site basically for DR purposes? Once it's mirrored to a new cluster elsewhere what would be the step to get the filesystem back up and running. I know that AFM-DR is meant for this but in this case my client only has Standard edition and has mirroring software purchased with the underlying disk array. Is this even doable? Mark -------------- next part -------------- An HTML attachment was scrubbed... URL: From carlz at us.ibm.com Wed Mar 14 20:31:01 2018 From: carlz at us.ibm.com (Carl Zetie) Date: Wed, 14 Mar 2018 20:31:01 +0000 Subject: [gpfsug-discuss] Editions and Licensing / Preferred NSD In-Reply-To: References: Message-ID: Simon's description is correct. For those who don't have it readily to hand I'll reiterate it here (in my own words): We discontinued Express a while back; everybody on that edition got a free upgrade to Standard. 
Standard continues to be licensed on sockets. This has certain advantages (clients and FPOs nodes are cheap, but as noted in the thread if you need to change them to servers, they get more expensive) Advanced was retired; those already on it were "grandfathered in" can continue to buy it, so no forced conversion. But no new customers. In place of Advanced, Data Management Edition is licensed by the TiB. This has the advantage of simplicity -- it is completely flat regardless of topology. It also allows you to add and subtract nodes, including clients, or change a client node to a server node, at will without having to go through a licensing transaction or keep count of clients or pay a penalty for putting clients in a separate compute cluster or ... BTW, I'll be at the UG in London and (probably) in Boston, if anybody wants to talk licensing... Carl Zetie Offering Manager for Spectrum Scale, IBM (540) 882 9353 ][ Research Triangle Park carlz at us.ibm.com ********************************************** From olaf.weiser at de.ibm.com Wed Mar 14 23:19:03 2018 From: olaf.weiser at de.ibm.com (Olaf Weiser) Date: Thu, 15 Mar 2018 00:19:03 +0100 Subject: [gpfsug-discuss] Underlying LUN mirroring NSD impact In-Reply-To: References: Message-ID: An HTML attachment was scrubbed... URL: From secretary at gpfsug.org Thu Mar 15 10:00:08 2018 From: secretary at gpfsug.org (Secretary GPFS UG) Date: Thu, 15 Mar 2018 10:00:08 +0000 Subject: [gpfsug-discuss] Meetup at the IBM System Z Technical University Message-ID: <738c1046e602fb96e1dc6e5772c0a65a@webmail.gpfsug.org> Dear members, We have another meet up opportunity for you! There's a Spectrum Scale Meet Up taking place at the System Z Technical University on 14th May in London. It's free to attend and is an ideal opportunity to learn about Spectrum Scale on IBM Z in particular and hear from the UK Met Office. Please email your registration to Par Hettinga par at nl.ibm.com and if you have any questions, please contact Par. Date: Monday 14th May 2018 Time: 4.15pm - 6:15 PM Agenda: 16.15 - Welcome & Introductions 16.25 - IBM Spectrum Scale and Industry Use Cases for IBM System Z 17.10 - UK Met Office - Why IBM Spectrum Scale with System Z 17.40 - Spectrum Scale on IBM Z 18.10 - Questions & Close 18.15 - Drinks & Networking Location: Room B4 Beaujolais Novotel London West 1 Shortlands London W6 8DR United Kingdom 020 7660 0680 Thanks, -- Claire O'Toole Spectrum Scale/GPFS User Group Secretary +44 (0)7508 033896 www.spectrumscaleug.org -------------- next part -------------- An HTML attachment was scrubbed... URL: From makaplan at us.ibm.com Thu Mar 15 14:57:41 2018 From: makaplan at us.ibm.com (Marc A Kaplan) Date: Thu, 15 Mar 2018 09:57:41 -0500 Subject: [gpfsug-discuss] Underlying LUN mirroring NSD impact In-Reply-To: References: Message-ID: Does the mirrored-storage vendor guarantee the sequence of all writes to all the LUNs at the remote-site exactly matches the sequence of writes to the local site....? If not.. the file system on the remote-site could be left in an inconsistent state when the communications line is cut... Guaranteeing sequencing to each LUN is not sufficient, because a typical GPFS file system has its data and metadata spread over several LUNs. From: "Olaf Weiser" To: gpfsug main discussion list Date: 03/14/2018 07:19 PM Subject: Re: [gpfsug-discuss] Underlying LUN mirroring NSD impact Sent by: gpfsug-discuss-bounces at spectrumscale.org HI Mark.. yes.. that's possible... at least , I'm sure.. 
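In practice that means the array-side mirror has to be taken from a single consistency group spanning every NSD LUN (data, metadata and descriptor disks alike); only then does the recovery-site copy look to GPFS like a file system that merely lost power. A very rough, untested outline of the supporting commands -- device and file names are invented, and the authoritative steps are in the storage-based-replication chapter of the administration guide:

    # While both sites are healthy: keep the recovery cluster's view of the
    # file system configuration in sync whenever it changes (remote.nodes is
    # an assumed file listing contact nodes in the recovery cluster).
    mmfsctl gpfs0 syncFSconfig -n /var/mmfs/etc/remote.nodes

    # After a failover, on the recovery cluster, once the mirrored LUNs are writable:
    mmlsnsd -X          # check that the mirrored NSDs are visible under their NSD names
    mmfsck gpfs0 -y     # only if the mirror is suspected not to be crash-consistent
    mmmount gpfs0 -a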
there was a chapter in the former advanced admin guide of older releases with PPRC .. how to do that.. similar to PPRC , you might use other methods , but from gpfs perspective this should'nt make a difference.. and I had have a german customer, who was doing this for years... (but it is some years back meanwhile ... hihi time flies...) From: Mark Bush To: "gpfsug-discuss at spectrumscale.org" Date: 03/14/2018 09:11 PM Subject: [gpfsug-discuss] Underlying LUN mirroring NSD impact Sent by: gpfsug-discuss-bounces at spectrumscale.org Is it possible (albeit not advisable) to mirror LUNs that are NSD?s to another storage array in another site basically for DR purposes? Once it?s mirrored to a new cluster elsewhere what would be the step to get the filesystem back up and running. I know that AFM-DR is meant for this but in this case my client only has Standard edition and has mirroring software purchased with the underlying disk array. Is this even doable? Mark _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org https://urldefense.proofpoint.com/v2/url?u=http-3A__gpfsug.org_mailman_listinfo_gpfsug-2Ddiscuss&d=DwICAg&c=jf_iaSHvJObTbx-siA1ZOg&r=cvpnBBH0j41aQy0RPiG2xRL_M8mTc1izuQD3_PmtjZ8&m=vq-nGaYTObfhVeW9E8fpLCJ9MIi9SNCiO5yYfXwJWhY&s=9o--h1_iFfwOmI2jRmxRjZSJX7IfQSFwUi6AfFhEas0&e= -------------- next part -------------- An HTML attachment was scrubbed... URL: From valdis.kletnieks at vt.edu Thu Mar 15 15:07:30 2018 From: valdis.kletnieks at vt.edu (valdis.kletnieks at vt.edu) Date: Thu, 15 Mar 2018 11:07:30 -0400 Subject: [gpfsug-discuss] Underlying LUN mirroring NSD impact In-Reply-To: References: Message-ID: <26547.1521126450@turing-police.cc.vt.edu> On Wed, 14 Mar 2018 15:36:32 -0000, Mark Bush said: > Is it possible (albeit not advisable) to mirror LUNs that are NSD's to > another storage array in another site basically for DR purposes? Once it's > mirrored to a new cluster elsewhere what would be the step to get the > filesystem back up and running. I know that AFM-DR is meant for this but in > this case my client only has Standard edition and has mirroring software > purchased with the underlying disk array. > Is this even doable? We had a discussion on the list about this recently. The upshot is that it's sort of doable, but depends on what failure modes you're trying to protect against. The basic problem is that if you're doing mirroring at the array level, there's a certain amount of skew delay where GPFS has written stuff on the local disk and it hasn't been copied to the remote disk (basically the same reason why running fsck on a mounted disk partition can be problematic). There's also issues if things are scribbling on the local file system and generating enough traffic to saturate the network link you're doing the mirroring over, for a long enough time to overwhelm the mirroring mechanism (both sync and async mirroring have their good and bad sides in that scenario) We're using a stretch cluster with GPFS replication to storage about 95 cable miles away - that has the advantage that then GPFS knows there's a remote replica and can take more steps to make sure the remote copy is consistent. 
In particular, if it knows there's replication that needs to be done and it's getting backlogged, it can present a slow-down to the local writers and ensure that the remote set of disks don't fall too far behind.... (There's some funkyness having to do with quorum - it's *really* hard to set up so you have both protection against split-brain and the ability to start up the remote site stand-alone - mostly because from the remote point of view, starting up stand-alone after the main site fails looks identical to split-brain) -------------- next part -------------- A non-text attachment was scrubbed... Name: not available Type: application/pgp-signature Size: 486 bytes Desc: not available URL: From janfrode at tanso.net Thu Mar 15 17:12:23 2018 From: janfrode at tanso.net (Jan-Frode Myklebust) Date: Thu, 15 Mar 2018 18:12:23 +0100 Subject: [gpfsug-discuss] GPFS autoload - wait for IB ports to becomeactive In-Reply-To: References: <0081EB235765E14395278B9AE1DF341846510A@MBX214.d.ethz.ch> <4AD44D34-5275-4ADB-8CC7-8E80170DDA7F@brown.edu> Message-ID: I found some discussion on this at https://www.ibm.com/developerworks/community/forums/html/threadTopic?id=77777777-0000-0000-0000-000014471957&ps=25 and there it's claimed that none of the callback events are early enough to resolve this. That we need a pre-preStartup trigger. Any idea if this has changed -- or is the callback option then only to do a "--onerror shutdown" if it has failed to connect IB ? On Thu, Mar 8, 2018 at 1:42 PM, Frederick Stock wrote: > You could also use the GPFS prestartup callback (mmaddcallback) to execute > a script synchronously that waits for the IB ports to become available > before returning and allowing GPFS to continue. Not systemd integrated but > it should work. > > Fred > __________________________________________________ > Fred Stock | IBM Pittsburgh Lab | 720-430-8821 <(720)%20430-8821> > stockf at us.ibm.com > > > > From: david_johnson at brown.edu > To: gpfsug main discussion list > Date: 03/08/2018 07:34 AM > Subject: Re: [gpfsug-discuss] GPFS autoload - wait for IB ports to > become active > Sent by: gpfsug-discuss-bounces at spectrumscale.org > ------------------------------ > > > > Until IBM provides a solution, here is my workaround. Add it so it runs > before the gpfs script, I call it from our custom xcat diskless boot > scripts. Based on rhel7, not fully systemd integrated. YMMV! > > Regards, > ? ddj > ??- > [ddj at storage041 ~]$ cat /etc/init.d/ibready > #! /bin/bash > # > # chkconfig: 2345 06 94 > # /etc/rc.d/init.d/ibready > # written in 2016 David D Johnson (ddj *brown.edu* > > ) > # > ### BEGIN INIT INFO > # Provides: ibready > # Required-Start: > # Required-Stop: > # Default-Stop: > # Description: Block until infiniband is ready > # Short-Description: Block until infiniband is ready > ### END INIT INFO > > RETVAL=0 > if [[ -d /sys/class/infiniband ]] > then > IBDEVICE=$(dirname $(grep -il infiniband > /sys/class/infiniband/*/ports/1/link* | head -n 1)) > fi > # See how we were called. > case "$1" in > start) > if [[ -n $IBDEVICE && -f $IBDEVICE/state ]] > then > echo -n "Polling for InfiniBand link up: " > for (( count = 60; count > 0; count-- )) > do > if grep -q ACTIVE $IBDEVICE/state > then > echo ACTIVE > break > fi > echo -n "." 
> sleep 5 > done > if (( count <= 0 )) > then > echo DOWN - $0 timed out > fi > fi > ;; > stop|restart|reload|force-reload|condrestart|try-restart) > ;; > status) > if [[ -n $IBDEVICE && -f $IBDEVICE/state ]] > then > echo "$IBDEVICE is $(< $IBDEVICE/state) $(< > $IBDEVICE/rate)" > else > echo "No IBDEVICE found" > fi > ;; > *) > echo "Usage: ibready {start|stop|status|restart| > reload|force-reload|condrestart|try-restart}" > exit 2 > esac > exit ${RETVAL} > ???? > > -- ddj > Dave Johnson > > On Mar 8, 2018, at 6:10 AM, Caubet Serrabou Marc (PSI) < > *marc.caubet at psi.ch* > wrote: > > Hi all, > > with autoload = yes we do not ensure that GPFS will be started after the > IB link becomes up. Is there a way to force GPFS waiting to start until IB > ports are up? This can be probably done by adding something like > After=network-online.target and Wants=network-online.target in the systemd > file but I would like to know if this is natively possible from the GPFS > configuration. > > Thanks a lot, > Marc > _________________________________________ > Paul Scherrer Institut > High Performance Computing > Marc Caubet Serrabou > WHGA/036 > 5232 Villigen PSI > Switzerland > > Telephone: +41 56 310 46 67 <+41%2056%20310%2046%2067> > E-Mail: *marc.caubet at psi.ch* > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at *spectrumscale.org* > > *http://gpfsug.org/mailman/listinfo/gpfsug-discuss* > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > https://urldefense.proofpoint.com/v2/url?u=http-3A__gpfsug. > org_mailman_listinfo_gpfsug-2Ddiscuss&d=DwICAg&c=jf_ > iaSHvJObTbx-siA1ZOg&r=p_1XEUyoJ7-VJxF_w8h9gJh8_ > Wj0Pey73LCLLoxodpw&m=u-EMob09-dkE6jZbD3dTjBi3vWhmDXtxiOK3nqF > yIgY&s=JCfJgq6pZnKUI6d-rIgJXVcdZh7vmA5ypB1_goP_FFA&e= > > > > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From stockf at us.ibm.com Thu Mar 15 17:23:38 2018 From: stockf at us.ibm.com (Frederick Stock) Date: Thu, 15 Mar 2018 12:23:38 -0500 Subject: [gpfsug-discuss] GPFS autoload - wait for IB ports tobecomeactive In-Reply-To: References: <0081EB235765E14395278B9AE1DF341846510A@MBX214.d.ethz.ch><4AD44D34-5275-4ADB-8CC7-8E80170DDA7F@brown.edu> Message-ID: The callback is the only way I know to use the "--onerror shutdown" option. Fred __________________________________________________ Fred Stock | IBM Pittsburgh Lab | 720-430-8821 stockf at us.ibm.com From: Jan-Frode Myklebust To: gpfsug main discussion list Date: 03/15/2018 01:14 PM Subject: Re: [gpfsug-discuss] GPFS autoload - wait for IB ports to becomeactive Sent by: gpfsug-discuss-bounces at spectrumscale.org I found some discussion on this at https://www.ibm.com/developerworks/community/forums/html/threadTopic?id=77777777-0000-0000-0000-000014471957&ps=25 and there it's claimed that none of the callback events are early enough to resolve this. That we need a pre-preStartup trigger. Any idea if this has changed -- or is the callback option then only to do a "--onerror shutdown" if it has failed to connect IB ? On Thu, Mar 8, 2018 at 1:42 PM, Frederick Stock wrote: You could also use the GPFS prestartup callback (mmaddcallback) to execute a script synchronously that waits for the IB ports to become available before returning and allowing GPFS to continue. 
Not systemd integrated but it should work. Fred __________________________________________________ Fred Stock | IBM Pittsburgh Lab | 720-430-8821 stockf at us.ibm.com From: david_johnson at brown.edu To: gpfsug main discussion list Date: 03/08/2018 07:34 AM Subject: Re: [gpfsug-discuss] GPFS autoload - wait for IB ports to become active Sent by: gpfsug-discuss-bounces at spectrumscale.org Until IBM provides a solution, here is my workaround. Add it so it runs before the gpfs script, I call it from our custom xcat diskless boot scripts. Based on rhel7, not fully systemd integrated. YMMV! Regards, ? ddj ??- [ddj at storage041 ~]$ cat /etc/init.d/ibready #! /bin/bash # # chkconfig: 2345 06 94 # /etc/rc.d/init.d/ibready # written in 2016 David D Johnson (ddj brown.edu) # ### BEGIN INIT INFO # Provides: ibready # Required-Start: # Required-Stop: # Default-Stop: # Description: Block until infiniband is ready # Short-Description: Block until infiniband is ready ### END INIT INFO RETVAL=0 if [[ -d /sys/class/infiniband ]] then IBDEVICE=$(dirname $(grep -il infiniband /sys/class/infiniband/*/ports/1/link* | head -n 1)) fi # See how we were called. case "$1" in start) if [[ -n $IBDEVICE && -f $IBDEVICE/state ]] then echo -n "Polling for InfiniBand link up: " for (( count = 60; count > 0; count-- )) do if grep -q ACTIVE $IBDEVICE/state then echo ACTIVE break fi echo -n "." sleep 5 done if (( count <= 0 )) then echo DOWN - $0 timed out fi fi ;; stop|restart|reload|force-reload|condrestart|try-restart) ;; status) if [[ -n $IBDEVICE && -f $IBDEVICE/state ]] then echo "$IBDEVICE is $(< $IBDEVICE/state) $(< $IBDEVICE/rate)" else echo "No IBDEVICE found" fi ;; *) echo "Usage: ibready {start|stop|status|restart|reload|force-reload|condrestart|try-restart}" exit 2 esac exit ${RETVAL} ???? -- ddj Dave Johnson On Mar 8, 2018, at 6:10 AM, Caubet Serrabou Marc (PSI) wrote: Hi all, with autoload = yes we do not ensure that GPFS will be started after the IB link becomes up. Is there a way to force GPFS waiting to start until IB ports are up? This can be probably done by adding something like After=network-online.target and Wants=network-online.target in the systemd file but I would like to know if this is natively possible from the GPFS configuration. 
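The systemd side of that idea would be a drop-in along the following lines -- untested, and it assumes the unit installed by current packages is called gpfs.service. Note that network-online.target does not necessarily wait for an IB port to reach ACTIVE, which is why the callback approaches discussed here exist:

    # Hypothetical drop-in; survives package updates better than editing the unit itself.
    mkdir -p /etc/systemd/system/gpfs.service.d
    cat > /etc/systemd/system/gpfs.service.d/wait-for-network.conf <<'EOF'
    [Unit]
    Wants=network-online.target
    After=network-online.target
    EOF
    systemctl daemon-reload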
Thanks a lot, Marc _________________________________________ Paul Scherrer Institut High Performance Computing Marc Caubet Serrabou WHGA/036 5232 Villigen PSI Switzerland Telephone: +41 56 310 46 67 E-Mail: marc.caubet at psi.ch _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org https://urldefense.proofpoint.com/v2/url?u=http-3A__gpfsug.org_mailman_listinfo_gpfsug-2Ddiscuss&d=DwICAg&c=jf_iaSHvJObTbx-siA1ZOg&r=p_1XEUyoJ7-VJxF_w8h9gJh8_Wj0Pey73LCLLoxodpw&m=u-EMob09-dkE6jZbD3dTjBi3vWhmDXtxiOK3nqFyIgY&s=JCfJgq6pZnKUI6d-rIgJXVcdZh7vmA5ypB1_goP_FFA&e= _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org https://urldefense.proofpoint.com/v2/url?u=http-3A__gpfsug.org_mailman_listinfo_gpfsug-2Ddiscuss&d=DwICAg&c=jf_iaSHvJObTbx-siA1ZOg&r=p_1XEUyoJ7-VJxF_w8h9gJh8_Wj0Pey73LCLLoxodpw&m=79jdzLLNtYEi36P6EifUd1cEI2GcLu2QWCwYwln12xg&s=AgoxRgQ2Ht0ZWCfogYsyg72RZn33CfTEyW7h1JQWRrM&e= -------------- next part -------------- An HTML attachment was scrubbed... URL: From olaf.weiser at de.ibm.com Thu Mar 15 17:30:49 2018 From: olaf.weiser at de.ibm.com (Olaf Weiser) Date: Thu, 15 Mar 2018 18:30:49 +0100 Subject: [gpfsug-discuss] GPFS autoload - wait for IB ports tobecomeactive In-Reply-To: References: <0081EB235765E14395278B9AE1DF341846510A@MBX214.d.ethz.ch><4AD44D34-5275-4ADB-8CC7-8E80170DDA7F@brown.edu> Message-ID: An HTML attachment was scrubbed... URL: From chris.schlipalius at pawsey.org.au Fri Mar 16 06:11:39 2018 From: chris.schlipalius at pawsey.org.au (Chris Schlipalius) Date: Fri, 16 Mar 2018 14:11:39 +0800 Subject: [gpfsug-discuss] Reminder, 2018 March 26th Singapore Spectrum Scale User Group event is on soon. In-Reply-To: <54D1282A-175F-42AD-8BCA-DDA48326B80C@pawsey.org.au> References: <54D1282A-175F-42AD-8BCA-DDA48326B80C@pawsey.org.au> Message-ID: <988B0149-D942-41AD-93B9-E9A0ACAF7D9F@pawsey.org.au> Hello, This is a reminder for the the inaugural Spectrum Scale Usergroup Singapore on Monday 26th March 2018, Sentosa, Singapore. This event occurs just before SCA18 starts and is being held in conjunction with SCA18 https://sc-asia.org/ All current Singapore Spectrum Scale User Group event details can be found here: http://goo.gl/dXtqvS Feel free to circulate this event link to all that may need it. Please reserve your tickets now as tickets for places will close soon. There are some great speakers and topics, for details please see the agenda on Eventbrite. We are looking forwards to a great new Usergroup in a fabulous venue. Thanks again to NSCC and IBM for helping to arrange the venue and event booking. 
Regards, Chris Schlipalius IBM Champion 2018 Team Lead, Storage Infrastructure, Data & Visualisation, The Pawsey Supercomputing Centre (CSIRO) 13 Burvill Court Kensington WA 6151 Australia Tel +61 8 6436 8815 Email chris.schlipalius at pawsey.org.au Web www.pawsey.org.au From janfrode at tanso.net Fri Mar 16 08:29:59 2018 From: janfrode at tanso.net (Jan-Frode Myklebust) Date: Fri, 16 Mar 2018 09:29:59 +0100 Subject: [gpfsug-discuss] GPFS autoload - wait for IB ports tobecomeactive In-Reply-To: References: <0081EB235765E14395278B9AE1DF341846510A@MBX214.d.ethz.ch> <4AD44D34-5275-4ADB-8CC7-8E80170DDA7F@brown.edu> Message-ID: Thanks Olaf, but we don't use NetworkManager on this cluster.. I now created this simple script: ------------------------------------------------------------------------------------------------------------------------------------------------------------- #! /bin/bash - # # Fail mmstartup if not all configured IB ports are active. # # Install with: # # mmaddcallback fail-if-ibfail --command /var/mmfs/etc/fail-if-ibfail --event preStartup --sync --onerror shutdown # for port in $(/usr/lpp/mmfs/bin/mmdiag --config|grep verbsPorts | cut -f 4- -d " ") do grep -q ACTIVE /sys/class/infiniband/${port%/*}/ports/${port##*/}/state || exit 1 done ------------------------------------------------------------------------------------------------------------------------------------------------------------- which I haven't tested, but assume should work. Suggestions for improvements would be much appreciated! -jf On Thu, Mar 15, 2018 at 6:30 PM, Olaf Weiser wrote: > > you can try : > systemctl enable NetworkManager-wait-online > ln -s '/usr/lib/systemd/system/NetworkManager-wait-online.service' > '/etc/systemd/system/multi-user.target.wants/NetworkManager-wait-online. > service' > > in many cases .. it helps .. > > > > > > From: Jan-Frode Myklebust > To: gpfsug main discussion list > Date: 03/15/2018 06:18 PM > Subject: Re: [gpfsug-discuss] GPFS autoload - wait for IB ports to > becomeactive > Sent by: gpfsug-discuss-bounces at spectrumscale.org > ------------------------------ > > > > I found some discussion on this at > *https://www.ibm.com/developerworks/community/forums/html/threadTopic?id=77777777-0000-0000-0000-000014471957&ps=25* > and > there it's claimed that none of the callback events are early enough to > resolve this. That we need a pre-preStartup trigger. Any idea if this has > changed -- or is the callback option then only to do a "--onerror > shutdown" if it has failed to connect IB ? > > > On Thu, Mar 8, 2018 at 1:42 PM, Frederick Stock <*stockf at us.ibm.com* > > wrote: > You could also use the GPFS prestartup callback (mmaddcallback) to execute > a script synchronously that waits for the IB ports to become available > before returning and allowing GPFS to continue. Not systemd integrated but > it should work. > > Fred > __________________________________________________ > Fred Stock | IBM Pittsburgh Lab | *720-430-8821* <(720)%20430-8821> > *stockf at us.ibm.com* > > > > From: *david_johnson at brown.edu* > To: gpfsug main discussion list <*gpfsug-discuss at spectrumscale.org* > > > Date: 03/08/2018 07:34 AM > Subject: Re: [gpfsug-discuss] GPFS autoload - wait for IB ports to > become active > Sent by: *gpfsug-discuss-bounces at spectrumscale.org* > > ------------------------------ > > > > > Until IBM provides a solution, here is my workaround. Add it so it runs > before the gpfs script, I call it from our custom xcat diskless boot > scripts. 
Based on rhel7, not fully systemd integrated. YMMV! > > Regards, > ? ddj > ??- > [ddj at storage041 ~]$ cat /etc/init.d/ibready > #! /bin/bash > # > # chkconfig: 2345 06 94 > # /etc/rc.d/init.d/ibready > # written in 2016 David D Johnson (ddj *brown.edu* > > ) > # > ### BEGIN INIT INFO > # Provides: ibready > # Required-Start: > # Required-Stop: > # Default-Stop: > # Description: Block until infiniband is ready > # Short-Description: Block until infiniband is ready > ### END INIT INFO > > RETVAL=0 > if [[ -d /sys/class/infiniband ]] > then > IBDEVICE=$(dirname $(grep -il infiniband > /sys/class/infiniband/*/ports/1/link* | head -n 1)) > fi > # See how we were called. > case "$1" in > start) > if [[ -n $IBDEVICE && -f $IBDEVICE/state ]] > then > echo -n "Polling for InfiniBand link up: " > for (( count = 60; count > 0; count-- )) > do > if grep -q ACTIVE $IBDEVICE/state > then > echo ACTIVE > break > fi > echo -n "." > sleep 5 > done > if (( count <= 0 )) > then > echo DOWN - $0 timed out > fi > fi > ;; > stop|restart|reload|force-reload|condrestart|try-restart) > ;; > status) > if [[ -n $IBDEVICE && -f $IBDEVICE/state ]] > then > echo "$IBDEVICE is $(< $IBDEVICE/state) $(< > $IBDEVICE/rate)" > else > echo "No IBDEVICE found" > fi > ;; > *) > echo "Usage: ibready {start|stop|status|restart| > reload|force-reload|condrestart|try-restart}" > exit 2 > esac > exit ${RETVAL} > ???? > > -- ddj > Dave Johnson > > On Mar 8, 2018, at 6:10 AM, Caubet Serrabou Marc (PSI) < > *marc.caubet at psi.ch* > wrote: > > Hi all, > > with autoload = yes we do not ensure that GPFS will be started after the > IB link becomes up. Is there a way to force GPFS waiting to start until IB > ports are up? This can be probably done by adding something like > After=network-online.target and Wants=network-online.target in the systemd > file but I would like to know if this is natively possible from the GPFS > configuration. > > Thanks a lot, > Marc > _________________________________________ > Paul Scherrer Institut > High Performance Computing > Marc Caubet Serrabou > WHGA/036 > 5232 Villigen PSI > Switzerland > > Telephone: *+41 56 310 46 67* <+41%2056%20310%2046%2067> > E-Mail: *marc.caubet at psi.ch* > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at *spectrumscale.org* > > *http://gpfsug.org/mailman/listinfo/gpfsug-discuss* > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at *spectrumscale.org* > > *https://urldefense.proofpoint.com/v2/url?u=http-3A__gpfsug.org_mailman_listinfo_gpfsug-2Ddiscuss&d=DwICAg&c=jf_iaSHvJObTbx-siA1ZOg&r=p_1XEUyoJ7-VJxF_w8h9gJh8_Wj0Pey73LCLLoxodpw&m=u-EMob09-dkE6jZbD3dTjBi3vWhmDXtxiOK3nqFyIgY&s=JCfJgq6pZnKUI6d-rIgJXVcdZh7vmA5ypB1_goP_FFA&e=* > > > > > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at *spectrumscale.org* > *http://gpfsug.org/mailman/listinfo/gpfsug-discuss* > > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > > > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From YARD at il.ibm.com Fri Mar 16 08:46:37 2018 From: YARD at il.ibm.com (Yaron Daniel) Date: Fri, 16 Mar 2018 10:46:37 +0200 Subject: [gpfsug-discuss] Underlying LUN mirroring NSD impact In-Reply-To: References: Message-ID: Hi You can have few options: 1) Active/Active GPFS sites - with sync replication of the storage - take into account the latency you have. 2) Active/StandBy Gpfs sites- with a-sync replication of the storage. All info can be found at : https://www.ibm.com/support/knowledgecenter/en/STXKQY_4.2.1/com.ibm.spectrum.scale.v4r21.doc/bl1adv_continous_replication_SSdata.htm Synchronous mirroring with GPFS replication In a configuration utilizing GPFS? replication, a single GPFS cluster is defined over three geographically-separate sites consisting of two production sites and a third tiebreaker site. One or more file systems are created, mounted, and accessed concurrently from the two active production sites. Synchronous mirroring utilizing storage based replication This topic describes synchronous mirroring utilizing storage-based replication. Point In Time Copy of IBM Spectrum Scale data Most storage systems provides functionality to make a point-in-time copy of data as an online backup mechanism. This function provides an instantaneous copy of the original data on the target disk, while the actual copy of data takes place asynchronously and is fully transparent to the user. Regards Yaron Daniel 94 Em Ha'Moshavot Rd Storage Architect Petach Tiqva, 49527 IBM Global Markets, Systems HW Sales Israel Phone: +972-3-916-5672 Fax: +972-3-916-5672 Mobile: +972-52-8395593 e-mail: yard at il.ibm.com IBM Israel From: Mark Bush To: "gpfsug-discuss at spectrumscale.org" Date: 03/14/2018 10:10 PM Subject: [gpfsug-discuss] Underlying LUN mirroring NSD impact Sent by: gpfsug-discuss-bounces at spectrumscale.org Is it possible (albeit not advisable) to mirror LUNs that are NSD?s to another storage array in another site basically for DR purposes? Once it?s mirrored to a new cluster elsewhere what would be the step to get the filesystem back up and running. I know that AFM-DR is meant for this but in this case my client only has Standard edition and has mirroring software purchased with the underlying disk array. Is this even doable? Mark _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org https://urldefense.proofpoint.com/v2/url?u=http-3A__gpfsug.org_mailman_listinfo_gpfsug-2Ddiscuss&d=DwICAg&c=jf_iaSHvJObTbx-siA1ZOg&r=Bn1XE9uK2a9CZQ8qKnJE3Q&m=c9HNr6pLit8n4hQKpcYyyRg9ZnITpo_2OiEx6hbukYA&s=qFgC1ebi1SJvnCRlc92cI4hZqZYpK7EneZ0Sati5s5E&e= -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: not available Type: image/gif Size: 1851 bytes Desc: not available URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: not available Type: image/gif Size: 4376 bytes Desc: not available URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: not available Type: image/gif Size: 5093 bytes Desc: not available URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: not available Type: image/gif Size: 4746 bytes Desc: not available URL: -------------- next part -------------- A non-text attachment was scrubbed... 
Name: not available Type: image/gif Size: 4557 bytes Desc: not available URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: not available Type: image/jpeg Size: 11294 bytes Desc: not available URL: From stockf at us.ibm.com Fri Mar 16 12:05:29 2018 From: stockf at us.ibm.com (Frederick Stock) Date: Fri, 16 Mar 2018 07:05:29 -0500 Subject: [gpfsug-discuss] GPFS autoload - wait for IB portstobecomeactive In-Reply-To: References: <0081EB235765E14395278B9AE1DF341846510A@MBX214.d.ethz.ch><4AD44D34-5275-4ADB-8CC7-8E80170DDA7F@brown.edu> Message-ID: I have my doubts that mmdiag can be used in this script. In general the guidance is to avoid or be very careful with mm* commands in a callback due to the potential for deadlock. Fred __________________________________________________ Fred Stock | IBM Pittsburgh Lab | 720-430-8821 stockf at us.ibm.com From: Jan-Frode Myklebust To: gpfsug main discussion list Date: 03/16/2018 04:30 AM Subject: Re: [gpfsug-discuss] GPFS autoload - wait for IB ports tobecomeactive Sent by: gpfsug-discuss-bounces at spectrumscale.org Thanks Olaf, but we don't use NetworkManager on this cluster.. I now created this simple script: ------------------------------------------------------------------------------------------------------------------------------------------------------------- #! /bin/bash - # # Fail mmstartup if not all configured IB ports are active. # # Install with: # # mmaddcallback fail-if-ibfail --command /var/mmfs/etc/fail-if-ibfail --event preStartup --sync --onerror shutdown # for port in $(/usr/lpp/mmfs/bin/mmdiag --config|grep verbsPorts | cut -f 4- -d " ") do grep -q ACTIVE /sys/class/infiniband/${port%/*}/ports/${port##*/}/state || exit 1 done ------------------------------------------------------------------------------------------------------------------------------------------------------------- which I haven't tested, but assume should work. Suggestions for improvements would be much appreciated! -jf On Thu, Mar 15, 2018 at 6:30 PM, Olaf Weiser wrote: you can try : systemctl enable NetworkManager-wait-online ln -s '/usr/lib/systemd/system/NetworkManager-wait-online.service' '/etc/systemd/system/multi-user.target.wants/NetworkManager-wait-online.service' in many cases .. it helps .. From: Jan-Frode Myklebust To: gpfsug main discussion list Date: 03/15/2018 06:18 PM Subject: Re: [gpfsug-discuss] GPFS autoload - wait for IB ports to becomeactive Sent by: gpfsug-discuss-bounces at spectrumscale.org I found some discussion on this at https://www.ibm.com/developerworks/community/forums/html/threadTopic?id=77777777-0000-0000-0000-000014471957&ps=25 and there it's claimed that none of the callback events are early enough to resolve this. That we need a pre-preStartup trigger. Any idea if this has changed -- or is the callback option then only to do a "--onerror shutdown" if it has failed to connect IB ? On Thu, Mar 8, 2018 at 1:42 PM, Frederick Stock wrote: You could also use the GPFS prestartup callback (mmaddcallback) to execute a script synchronously that waits for the IB ports to become available before returning and allowing GPFS to continue. Not systemd integrated but it should work. 
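Given Fred's doubts above about calling mm* commands such as mmdiag from inside a callback, a variant that stays out of GPFS entirely and simply requires every InfiniBand port present on the node to be ACTIVE might look like this (untested sketch; it assumes all ports under /sys/class/infiniband are ones GPFS will actually use):

    #!/bin/bash
    # /var/mmfs/etc/fail-if-ibfail -- hedged variant that avoids mmdiag.
    # Install with:
    #   mmaddcallback fail-if-ibfail --command /var/mmfs/etc/fail-if-ibfail \
    #       --event preStartup --sync --onerror shutdown
    for state in /sys/class/infiniband/*/ports/*/state
    do
        [ -e "$state" ] || continue          # no IB hardware: nothing to check
        grep -q ACTIVE "$state" || exit 1    # any non-ACTIVE port aborts startup
    done
    exit 0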
Fred __________________________________________________ Fred Stock | IBM Pittsburgh Lab | 720-430-8821 stockf at us.ibm.com From: david_johnson at brown.edu To: gpfsug main discussion list Date: 03/08/2018 07:34 AM Subject: Re: [gpfsug-discuss] GPFS autoload - wait for IB ports to become active Sent by: gpfsug-discuss-bounces at spectrumscale.org Until IBM provides a solution, here is my workaround. Add it so it runs before the gpfs script, I call it from our custom xcat diskless boot scripts. Based on rhel7, not fully systemd integrated. YMMV! Regards, ? ddj ??- [ddj at storage041 ~]$ cat /etc/init.d/ibready #! /bin/bash # # chkconfig: 2345 06 94 # /etc/rc.d/init.d/ibready # written in 2016 David D Johnson (ddj brown.edu) # ### BEGIN INIT INFO # Provides: ibready # Required-Start: # Required-Stop: # Default-Stop: # Description: Block until infiniband is ready # Short-Description: Block until infiniband is ready ### END INIT INFO RETVAL=0 if [[ -d /sys/class/infiniband ]] then IBDEVICE=$(dirname $(grep -il infiniband /sys/class/infiniband/*/ports/1/link* | head -n 1)) fi # See how we were called. case "$1" in start) if [[ -n $IBDEVICE && -f $IBDEVICE/state ]] then echo -n "Polling for InfiniBand link up: " for (( count = 60; count > 0; count-- )) do if grep -q ACTIVE $IBDEVICE/state then echo ACTIVE break fi echo -n "." sleep 5 done if (( count <= 0 )) then echo DOWN - $0 timed out fi fi ;; stop|restart|reload|force-reload|condrestart|try-restart) ;; status) if [[ -n $IBDEVICE && -f $IBDEVICE/state ]] then echo "$IBDEVICE is $(< $IBDEVICE/state) $(< $IBDEVICE/rate)" else echo "No IBDEVICE found" fi ;; *) echo "Usage: ibready {start|stop|status|restart|reload|force-reload|condrestart|try-restart}" exit 2 esac exit ${RETVAL} ???? -- ddj Dave Johnson On Mar 8, 2018, at 6:10 AM, Caubet Serrabou Marc (PSI) wrote: Hi all, with autoload = yes we do not ensure that GPFS will be started after the IB link becomes up. Is there a way to force GPFS waiting to start until IB ports are up? This can be probably done by adding something like After=network-online.target and Wants=network-online.target in the systemd file but I would like to know if this is natively possible from the GPFS configuration. 
Thanks a lot, Marc _________________________________________ Paul Scherrer Institut High Performance Computing Marc Caubet Serrabou WHGA/036 5232 Villigen PSI Switzerland Telephone: +41 56 310 46 67 E-Mail: marc.caubet at psi.ch _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org https://urldefense.proofpoint.com/v2/url?u=http-3A__gpfsug.org_mailman_listinfo_gpfsug-2Ddiscuss&d=DwICAg&c=jf_iaSHvJObTbx-siA1ZOg&r=p_1XEUyoJ7-VJxF_w8h9gJh8_Wj0Pey73LCLLoxodpw&m=u-EMob09-dkE6jZbD3dTjBi3vWhmDXtxiOK3nqFyIgY&s=JCfJgq6pZnKUI6d-rIgJXVcdZh7vmA5ypB1_goP_FFA&e= _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org https://urldefense.proofpoint.com/v2/url?u=http-3A__gpfsug.org_mailman_listinfo_gpfsug-2Ddiscuss&d=DwICAg&c=jf_iaSHvJObTbx-siA1ZOg&r=p_1XEUyoJ7-VJxF_w8h9gJh8_Wj0Pey73LCLLoxodpw&m=xImYTxt4pm1o5znVn5Vdoka2uxgsTRpmlCGdEWhB9vw&s=veOZZz80aBzoCTKusx6WOpVlYs64eNkp5pM9kbHgvic&e= -------------- next part -------------- An HTML attachment was scrubbed... URL: From Tomasz.Wolski at ts.fujitsu.com Fri Mar 16 14:25:52 2018 From: Tomasz.Wolski at ts.fujitsu.com (Tomasz.Wolski at ts.fujitsu.com) Date: Fri, 16 Mar 2018 14:25:52 +0000 Subject: [gpfsug-discuss] [item 435]: GPFS fails to start - NSD thread configuration needs more threads Message-ID: <1843bf34db594a0b8f7c85d52b1e0e28@R01UKEXCASM223.r01.fujitsu.local> Hello GPFS Team, We are observing strange behavior of GPFS during startup on SLES12 node. In our test cluster, we reinstalled VLP1 node with SLES 12 SP3 as a base and when GPFS starts for the first time on this node, it complains about too little NSD threads: .. 2018-03-16_13:11:28.947+0100: GPFS: 6027-310 [I] mmfsd initializing. {Version: 4.2.3.7 Built: Feb 15 2018 11:38:38} ... 2018-03-16_13:11:28.947+0100: [I] Cleaning old shared memory ... 2018-03-16_13:11:28.947+0100: [I] First pass parsing mmfs.cfg ... .. 2018-03-16_13:11:29.375+0100: [I] Initializing the cluster manager ... 2018-03-16_13:11:29.523+0100: [I] Initializing the token manager ... 2018-03-16_13:11:29.524+0100: [I] Initializing network shared disks ... 2018-03-16_13:11:29.626+0100: [E] NSD thread configuration needs 413 more threads, exceeds max thread count 1024 2018-03-16_13:11:29.628+0100: GPFS: 6027-311 [N] mmfsd is shutting down. 
2018-03-16_13:11:29.628+0100: [N] Reason for shutdown: Could not initialize network shared disks
2018-03-16_13:11:29.633+0100: [E] processStart: fork: err 11
2018-03-16_13:11:30.701+0100: runmmfs starting
Removing old /var/adm/ras/mmfs.log.* files:
2018-03-16_13:11:30.713+0100 runmmfs: respawn 32 waiting 336 seconds before restarting mmfsd
2018-03-16_13:13:13.298+0100: [I] Calling user exit script mmSdrBackup: event mmSdrBackup, async command /var/mmfs/etc/mmsdrbackup

GPFS then loops and tries to respawn mmfsd periodically:

2018-03-16_13:11:30.713+0100 runmmfs: respawn 32 waiting 336 seconds before restarting mmfsd

It seems that this issue can be resolved by doing mmshutdown. Later, when we manually perform mmstartup, the problem is gone.

We are running GPFS 4.2.3.7 and all nodes except VLP1 are running SLES11 SP4. Only on VLP1 have we installed SLES12 SP3. The test cluster looks as below:

Node  Daemon node name  IP address       Admin node name  Designation
-----------------------------------------------------------------------
   1   VLP0.cs-intern    192.168.101.210  VLP0.cs-intern   quorum-manager-snmp_collector
   2   VLP1.cs-intern    192.168.101.211  VLP1.cs-intern   quorum-manager
   3   TBP0.cs-intern    192.168.101.215  TBP0.cs-intern   quorum
   4   IDP0.cs-intern    192.168.101.110  IDP0.cs-intern
   5   IDP1.cs-intern    192.168.101.111  IDP1.cs-intern
   6   IDP2.cs-intern    192.168.101.112  IDP2.cs-intern
   7   IDP3.cs-intern    192.168.101.113  IDP3.cs-intern
   8   ICP0.cs-intern    192.168.101.10   ICP0.cs-intern
   9   ICP1.cs-intern    192.168.101.11   ICP1.cs-intern
  10   ICP2.cs-intern    192.168.101.12   ICP2.cs-intern
  11   ICP3.cs-intern    192.168.101.13   ICP3.cs-intern
  12   ICP4.cs-intern    192.168.101.14   ICP4.cs-intern
  13   ICP5.cs-intern    192.168.101.15   ICP5.cs-intern

We have enabled traces and reproduced the issue as follows:

1. When the GPFS daemon was in a respawn loop, we started traces; all files from this period can be found in the uploaded archive under the 1_nsd_threads_problem directory.
2. We manually stopped the "respawn" loop on VLP1 by executing mmshutdown and started GPFS manually with mmstartup. All traces from this run can be found in the archive under the 2_mmshutdown_mmstartup directory.

All data related to this problem is uploaded to our ftp as file: ftp.ts.fujitsu.com/CS-Diagnose/IBM, (fe_cs_oem, 12Monkeys) item435_nsd_threads.tar.gz

Could you please have a look at this problem?

Best regards,
Tomasz Wolski

-------------- next part --------------
An HTML attachment was scrubbed...
URL: 
From aaron.s.knister at nasa.gov  Fri Mar 16 14:52:11 2018
From: aaron.s.knister at nasa.gov (Aaron Knister)
Date: Fri, 16 Mar 2018 10:52:11 -0400
Subject: [gpfsug-discuss] [item 435]: GPFS fails to start - NSD thread configuration needs more threads
In-Reply-To: <1843bf34db594a0b8f7c85d52b1e0e28@R01UKEXCASM223.r01.fujitsu.local>
References: <1843bf34db594a0b8f7c85d52b1e0e28@R01UKEXCASM223.r01.fujitsu.local>
Message-ID: <46b43ff1-a8b4-649f-0304-ccae73d5851b@nasa.gov>

Ah. You, my friend, have been struck by a smooth criminal. And by smooth criminal I mean systemd. I ran into this last week and spent many hours banging my head against the wall trying to figure it out.

systemd by default limits cgroups to (I think) 512 tasks, and since a thread counts as a task that's likely what you're running into.

Try setting DefaultTasksMax=infinity in /etc/systemd/system.conf and then reboot (and yes, I mean reboot; changing it live doesn't seem possible because of the infinite wisdom of the systemd developers).
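For the record, the knobs involved look roughly like this on a systemd distribution such as SLES12 SP3. The global setting is the one suggested above; the per-unit drop-in is an assumed alternative (and assumes the unit is named gpfs.service), so treat it as a sketch rather than a recommendation:

    # Inspect the limit the GPFS unit currently runs with:
    systemctl show -p TasksMax gpfs.service

    # Option 1 (as suggested above): raise the global default, then reboot.
    grep -q '^DefaultTasksMax=' /etc/systemd/system.conf \
        || echo 'DefaultTasksMax=infinity' >> /etc/systemd/system.conf

    # Option 2 (assumed alternative): raise it only for the GPFS unit.
    mkdir -p /etc/systemd/system/gpfs.service.d
    cat > /etc/systemd/system/gpfs.service.d/tasksmax.conf <<'EOF'
    [Service]
    TasksMax=infinity
    EOF
    systemctl daemon-reload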
The pid limit of a given slice/unit cgroup may already be overriden to something more reasonable than the 512 default so if, for example, you were logging in and startng it via ssh the limit may be different than if its started from the gpfs.service unit because mmfsd effectively is running in different cgroups in each case. Hope that helps! -Aaron On 3/16/18 10:25 AM, Tomasz.Wolski at ts.fujitsu.com wrote: > Hello GPFS Team, > > We are observing strange behavior of GPFS during startup on SLES12 node. > > In our test cluster, we reinstalled VLP1 node with SLES 12 SP3 as a base > and when GPFS starts for the first time on this node, it complains about > > too little NSD threads: > > .. > > 2018-03-16_13:11:28.947+0100: GPFS: 6027-310 [I] mmfsd initializing. > {Version: 4.2.3.7?? Built: Feb 15 2018 11:38:38} ... > > 2018-03-16_13:11:28.947+0100: [I] Cleaning old shared memory ... > > 2018-03-16_13:11:28.947+0100: [I] First pass parsing mmfs.cfg ... > > .. > > 2018-03-16_13:11:29.375+0100: [I] Initializing the cluster manager ... > > 2018-03-16_13:11:29.523+0100: [I] Initializing the token manager ... > > 2018-03-16_13:11:29.524+0100: [I] Initializing network shared disks ... > > *_2018-03-16_13:11:29.626+0100: [E] NSD thread configuration needs 413 > more threads, exceeds max thread count 1024_* > > 2018-03-16_13:11:29.628+0100: GPFS: 6027-311 [N] mmfsd is shutting down. > > 2018-03-16_13:11:29.628+0100: [N] Reason for shutdown: Could not > initialize network shared disks > > 2018-03-16_13:11:29.633+0100: [E] processStart: fork: err 11 > > 2018-03-16_13:11:30.701+0100: runmmfs starting > > Removing old /var/adm/ras/mmfs.log.* files: > > 2018-03-16_13:11:30.713+0100 runmmfs: respawn 32 waiting 336 seconds > before restarting mmfsd > > 2018-03-16_13:13:13.298+0100: [I] Calling user exit script mmSdrBackup: > event mmSdrBackup, async command /var/mmfs/etc/mmsdrbackup > > GPFS starts loop and tries to respawn mmfsd periodically: > > *_2018-03-16_13:11:30.713+0100 runmmfs: respawn 32 waiting 336 seconds > before restarting mmfsd_* > > It seems that this issue can be resolved by doing mmshutdown. Later, > when we manually perform mmstartup the problem is gone. > > We are running GPFS 4.2.3.7 and all nodes except VLP1 are running SLES11 > SP4. Only on VLP1 we installed SLES12 SP3. > > The test cluster looks as below: > > Node? Daemon node name? IP address?????? Admin node name? Designation > > ----------------------------------------------------------------------- > > ?? 1?? VLP0.cs-intern??? 192.168.101.210? VLP0.cs-intern > quorum-manager-snmp_collector > > ?? 2?? VLP1.cs-intern??? 192.168.101.211? VLP1.cs-intern?? quorum-manager > > ?? 3?? TBP0.cs-intern??? 192.168.101.215? TBP0.cs-intern?? quorum > > ?? 4?? IDP0.cs-intern??? 192.168.101.110? IDP0.cs-intern > > ?? 5?? IDP1.cs-intern??? 192.168.101.111? IDP1.cs-intern > > ?? 6?? IDP2.cs-intern??? 192.168.101.112? IDP2.cs-intern > > ?? 7?? IDP3.cs-intern??? 192.168.101.113? IDP3.cs-intern > > ?? 8?? ICP0.cs-intern??? 192.168.101.10?? ICP0.cs-intern > > ?? 9?? ICP1.cs-intern??? 192.168.101.11?? ICP1.cs-intern > > ? 10?? ICP2.cs-intern??? 192.168.101.12?? ICP2.cs-intern > > ? 11?? ICP3.cs-intern??? 192.168.101.13?? ICP3.cs-intern > > ? 12?? ICP4.cs-intern??? 192.168.101.14?? ICP4.cs-intern > > ? 13?? ICP5.cs-intern??? 192.168.101.15?? 
ICP5.cs-intern > > We have enabled traces and reproduced the issue as follows: > > 1.When GPFS daemon was in a respawn loop, we have started traces, all > files from this period you can find in uploaded archive under > *_1_nsd_threads_problem_* directory > > 2.We have manually stopped the ?respawn? loop on VLP1 by executing > mmshutdown and start GPFS manually by mmstartup. All traces from this > execution can be found in archive file under *_2_mmshutdown_mmstartup > _*directory > > All data related to this problem is uploaded to our ftp to file: > > ftp.ts.fujitsu.com/CS-Diagnose/IBM > , (fe_cs_oem, 12Monkeys) > item435_nsd_threads.tar.gz > > Could you please have a look at this problem? > > Best regards, > > Tomasz Wolski > > > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > -- Aaron Knister NASA Center for Climate Simulation (Code 606.2) Goddard Space Flight Center (301) 286-2776 From Tomasz.Wolski at ts.fujitsu.com Fri Mar 16 15:01:08 2018 From: Tomasz.Wolski at ts.fujitsu.com (Tomasz.Wolski at ts.fujitsu.com) Date: Fri, 16 Mar 2018 15:01:08 +0000 Subject: [gpfsug-discuss] [item 435]: GPFS fails to start - NSD thread configuration needs more threads In-Reply-To: <46b43ff1-a8b4-649f-0304-ccae73d5851b@nasa.gov> References: <1843bf34db594a0b8f7c85d52b1e0e28@R01UKEXCASM223.r01.fujitsu.local> <46b43ff1-a8b4-649f-0304-ccae73d5851b@nasa.gov> Message-ID: <679be18ca4ea4a29b0ba8cb5f49d0f1b@R01UKEXCASM223.r01.fujitsu.local> Hi Aaron, Thanks for the hint! :) Best regards, Tomasz Wolski > -----Original Message----- > From: gpfsug-discuss-bounces at spectrumscale.org [mailto:gpfsug-discuss- > bounces at spectrumscale.org] On Behalf Of Aaron Knister > Sent: Friday, March 16, 2018 3:52 PM > To: gpfsug-discuss at spectrumscale.org > Subject: Re: [gpfsug-discuss] [item 435]: GPFS fails to start - NSD thread > configuration needs more threads > > Ah. You, my friend, have been struck by a smooth criminal. And by smooth > criminal I mean systemd. I ran into this last week and spent many hours > banging my head against the wall trying to figure it out. > > systemd by default limits cgroups to I think 512 tasks and since a thread > counts as a task that's likely what you're running into. > > Try setting DefaultTasksMax=infinity in /etc/systemd/system.conf and then > reboot (and yes, I mean reboot. changing it live doesn't seem possible > because of the infinite wisdom of the systemd developers). > > The pid limit of a given slice/unit cgroup may already be overriden to > something more reasonable than the 512 default so if, for example, you > were logging in and startng it via ssh the limit may be different than if its > started from the gpfs.service unit because mmfsd effectively is running in > different cgroups in each case. > > Hope that helps! > > -Aaron > > On 3/16/18 10:25 AM, Tomasz.Wolski at ts.fujitsu.com wrote: > > Hello GPFS Team, > > > > We are observing strange behavior of GPFS during startup on SLES12 node. > > > > In our test cluster, we reinstalled VLP1 node with SLES 12 SP3 as a > > base and when GPFS starts for the first time on this node, it > > complains about > > > > too little NSD threads: > > > > .. > > > > 2018-03-16_13:11:28.947+0100: GPFS: 6027-310 [I] mmfsd initializing. > > {Version: 4.2.3.7?? Built: Feb 15 2018 11:38:38} ... > > > > 2018-03-16_13:11:28.947+0100: [I] Cleaning old shared memory ... 
> > > > 2018-03-16_13:11:28.947+0100: [I] First pass parsing mmfs.cfg ... > > > > .. > > > > 2018-03-16_13:11:29.375+0100: [I] Initializing the cluster manager ... > > > > 2018-03-16_13:11:29.523+0100: [I] Initializing the token manager ... > > > > 2018-03-16_13:11:29.524+0100: [I] Initializing network shared disks ... > > > > *_2018-03-16_13:11:29.626+0100: [E] NSD thread configuration needs 413 > > more threads, exceeds max thread count 1024_* > > > > 2018-03-16_13:11:29.628+0100: GPFS: 6027-311 [N] mmfsd is shutting > down. > > > > 2018-03-16_13:11:29.628+0100: [N] Reason for shutdown: Could not > > initialize network shared disks > > > > 2018-03-16_13:11:29.633+0100: [E] processStart: fork: err 11 > > > > 2018-03-16_13:11:30.701+0100: runmmfs starting > > > > Removing old /var/adm/ras/mmfs.log.* files: > > > > 2018-03-16_13:11:30.713+0100 runmmfs: respawn 32 waiting 336 seconds > > before restarting mmfsd > > > > 2018-03-16_13:13:13.298+0100: [I] Calling user exit script mmSdrBackup: > > event mmSdrBackup, async command /var/mmfs/etc/mmsdrbackup > > > > GPFS starts loop and tries to respawn mmfsd periodically: > > > > *_2018-03-16_13:11:30.713+0100 runmmfs: respawn 32 waiting 336 > seconds > > before restarting mmfsd_* > > > > It seems that this issue can be resolved by doing mmshutdown. Later, > > when we manually perform mmstartup the problem is gone. > > > > We are running GPFS 4.2.3.7 and all nodes except VLP1 are running > > SLES11 SP4. Only on VLP1 we installed SLES12 SP3. > > > > The test cluster looks as below: > > > > Node? Daemon node name? IP address?????? Admin node name? Designation > > > > ---------------------------------------------------------------------- > > - > > > > ?? 1?? VLP0.cs-intern??? 192.168.101.210? VLP0.cs-intern > > quorum-manager-snmp_collector > > > > ?? 2?? VLP1.cs-intern??? 192.168.101.211? VLP1.cs-intern > > quorum-manager > > > > ?? 3?? TBP0.cs-intern??? 192.168.101.215? TBP0.cs-intern?? quorum > > > > ?? 4?? IDP0.cs-intern??? 192.168.101.110? IDP0.cs-intern > > > > ?? 5?? IDP1.cs-intern??? 192.168.101.111? IDP1.cs-intern > > > > ?? 6?? IDP2.cs-intern??? 192.168.101.112? IDP2.cs-intern > > > > ?? 7?? IDP3.cs-intern??? 192.168.101.113? IDP3.cs-intern > > > > ?? 8?? ICP0.cs-intern??? 192.168.101.10?? ICP0.cs-intern > > > > ?? 9?? ICP1.cs-intern??? 192.168.101.11?? ICP1.cs-intern > > > > ? 10?? ICP2.cs-intern??? 192.168.101.12?? ICP2.cs-intern > > > > ? 11?? ICP3.cs-intern??? 192.168.101.13?? ICP3.cs-intern > > > > ? 12?? ICP4.cs-intern??? 192.168.101.14?? ICP4.cs-intern > > > > ? 13?? ICP5.cs-intern??? 192.168.101.15?? ICP5.cs-intern > > > > We have enabled traces and reproduced the issue as follows: > > > > 1.When GPFS daemon was in a respawn loop, we have started traces, all > > files from this period you can find in uploaded archive under > > *_1_nsd_threads_problem_* directory > > > > 2.We have manually stopped the ?respawn? loop on VLP1 by executing > > mmshutdown and start GPFS manually by mmstartup. All traces from this > > execution can be found in archive file under > *_2_mmshutdown_mmstartup > > _*directory > > > > All data related to this problem is uploaded to our ftp to file: > > > > ftp.ts.fujitsu.com/CS-Diagnose/IBM > > , (fe_cs_oem, 12Monkeys) > > item435_nsd_threads.tar.gz > > > > Could you please have a look at this problem? 
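For reference, a quick way to check and apply the task-limit change Aaron describes (a sketch only, assuming a systemd-based node such as SLES 12 SP3 and the stock gpfs.service unit name; the cgroup path below is an assumption and varies by distribution):

# task limit currently applied to the GPFS unit (systemd 227 or later)
systemctl show -p TasksMax gpfs.service

# pid limit of the cgroup mmfsd is actually running in (cgroup v1 path assumed)
cat /sys/fs/cgroup/pids/system.slice/gpfs.service/pids.max

# raise the default for every unit, then reboot as described above:
#   /etc/systemd/system.conf  ->  DefaultTasksMax=infinity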
> > > > Best regards, > > > > Tomasz Wolski > > > > > > > > _______________________________________________ > > gpfsug-discuss mailing list > > gpfsug-discuss at spectrumscale.org > > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > > > -- > Aaron Knister > NASA Center for Climate Simulation (Code 606.2) Goddard Space Flight > Center > (301) 286-2776 > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss From secretary at gpfsug.org Tue Mar 20 08:48:19 2018 From: secretary at gpfsug.org (Secretary GPFS UG) Date: Tue, 20 Mar 2018 08:48:19 +0000 Subject: [gpfsug-discuss] Upcoming meetings Message-ID: <785558aa15b26dbd44c9e22de3b13ef9@webmail.gpfsug.org> Dear members, There are a number of opportunities over the coming weeks for you to meet face to face with other group members and hear from Spectrum Scale experts. We'd love to see you at one of the events! If you plan to attend, please register: Spectrum Scale Usergroup, Singapore, March 26, https://www.eventbrite.com/e/spectrum-scale-user-group-singapore-march-2018-tickets-40429354287 [1] UK 2018 User Group Event, London, April 18 - April 19, https://www.eventbrite.com/e/spectrum-scale-gpfs-user-group-2018-registration-41489952565?aff=MailingList [2] IBM Technical University: Spectrum Scale Meet Up, London, May 14 Please email Par Hettinga par at nl.ibm.com USA 2018 Spectrum Scale User Group, Boston, May 16 - May 17, https://www.eventbrite.com/e/spectrum-scale-gpfs-user-group-us-spring-2018-meeting-tickets-43662759489?aff=mailinglist [3] Thanks for your support, Claire -- Claire O'Toole Spectrum Scale/GPFS User Group Secretary +44 (0)7508 033896 www.spectrumscaleug.org Links: ------ [1] https://www.eventbrite.com/e/spectrum-scale-user-group-singapore-march-2018-tickets-40429354287 [2] https://www.eventbrite.com/e/spectrum-scale-gpfs-user-group-2018-registration-41489952565?aff=MailingList [3] https://www.eventbrite.com/e/spectrum-scale-gpfs-user-group-us-spring-2018-meeting-tickets-43662759489?aff=mailinglist -------------- next part -------------- An HTML attachment was scrubbed... URL: From willi.engeli at id.ethz.ch Wed Mar 21 16:04:10 2018 From: willi.engeli at id.ethz.ch (Engeli Willi (ID SD)) Date: Wed, 21 Mar 2018 16:04:10 +0000 Subject: [gpfsug-discuss] CTDB RFE opened @ IBM Would like to ask for your votes Message-ID: Dear Collegues, [WE] I have missed the discussion on the CTDB upgradeability with interruption free methods. However, I hit this topic as well and some of our users where hit by the short interruption badly because of the kind of work they had running. This motivated me to open an Request for Enhancement for CTDB to support in a future release the interruption-less Upgrade. Here is the Link for the RFE: http://www.ibm.com/developerworks/rfe/execute?use_case=viewRfe&CR_ID=117919 I hope this time it works at 1. Place...... Thanks in advance Willi -------------- next part -------------- A non-text attachment was scrubbed... Name: smime.p7s Type: application/pkcs7-signature Size: 5461 bytes Desc: not available URL: From puthuppu at iu.edu Wed Mar 21 17:30:19 2018 From: puthuppu at iu.edu (Uthuppuru, Peter K) Date: Wed, 21 Mar 2018 17:30:19 +0000 Subject: [gpfsug-discuss] Hello Message-ID: <857be7f3815441c0a8e55816e61b6735@BL-CCI-D2S08.ads.iu.edu> Hello all, My name is Peter Uthuppuru and I work at Indiana University on the Research Storage team. I'm new to GPFS, HPC, etc. so I'm excited to learn more. 
Thanks, Peter -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: smime.p7s Type: application/pkcs7-signature Size: 5615 bytes Desc: not available URL: From Robert.Oesterlin at nuance.com Fri Mar 23 12:59:51 2018 From: Robert.Oesterlin at nuance.com (Oesterlin, Robert) Date: Fri, 23 Mar 2018 12:59:51 +0000 Subject: [gpfsug-discuss] Reminder - SSUG-US Spring meeting - Call for Speakers and Registration Message-ID: <28FF6725-9698-4F68-BC4E-BBCD6164BB3D@nuance.com> Reminder: The registration for the Spring meeting of the SSUG-USA is now open. This is a Free two-day and will include a large number of Spectrum Scale updates and breakout tracks. Please note that we have limited meeting space so please register early if you plan on attending. If you are interested in presenting, please contact me. We do have a few more slots for user presentations ? these do not need to be long. You can register here: https://www.eventbrite.com/e/spectrum-scale-gpfs-user-group-us-spring-2018-meeting-tickets-43662759489 DATE AND TIME Wed, May 16, 2018, 9:00 AM ? Thu, May 17, 2018, 5:00 PM EDT LOCATION IBM Cambridge Innovation Center One Rogers Street Cambridge, MA 02142-1203 Bob Oesterlin Sr Principal Storage Engineer, Nuance -------------- next part -------------- An HTML attachment was scrubbed... URL: From Paul.Caron at sig.com Fri Mar 23 20:10:05 2018 From: Paul.Caron at sig.com (Caron, Paul) Date: Fri, 23 Mar 2018 20:10:05 +0000 Subject: [gpfsug-discuss] Pool layoutMap option changes following GPFS upgrades Message-ID: <980f9a27290342e28197154b924d0abf@msx.bala.susq.com> Hi, Has anyone run into a situation where the layoutMap option for a pool changes from "scatter" to "cluster" following a GPFS software upgrade? We recently upgraded a file system from 3.5.0.21, to 4.1.1.17, and finally to 4.2.3.6. We noticed that the layoutMap option for two of our pools changed following the upgrades. We didn't recreate the file system or any of the pools. Further lab testing has revealed that the layoutMap option change actually occurred during the first upgrade to 4.1.1.17, and it was simply carried forward to 4.2.3.6. We have a PMR open with IBM on this problem, but they have told us that layoutMap option changes are impossible for existing pools, and that a software upgrade couldn't do this. I sent the results of my lab testing today, so I'm hoping to get a better response. We would rather not have to recreate all the pools, but it is starting to look like that may be the only option to fix this. Also, it's unclear if this could happen again during future upgrades. Here's some additional background. * The "-j" option for the file system is "cluster" * We have a pretty small cluster; just 13 nodes * When reproducing the problem, we noted that the layoutMap option didn't change until the final node was upgraded * The layoutMap option changed before running the "mmchconfig release=LATEST" and "mmchfs -V full" commands, so those don't seem to be related to the problem Thanks, Paul C. SIG ________________________________ IMPORTANT: The information contained in this email and/or its attachments is confidential. If you are not the intended recipient, please notify the sender immediately by reply and immediately delete this message and all its attachments. Any review, use, reproduction, disclosure or dissemination of this message or any attachment by an unintended recipient is strictly prohibited. 
Neither this message nor any attachment is intended as or should be construed as an offer, solicitation or recommendation to buy or sell any security or other financial instrument. Neither the sender, his or her employer nor any of their respective affiliates makes any warranties as to the completeness or accuracy of any of the information contained herein or that this message or any of its attachments is free of viruses. -------------- next part -------------- An HTML attachment was scrubbed... URL: From G.Horton at bham.ac.uk Mon Mar 26 12:25:26 2018 From: G.Horton at bham.ac.uk (Gareth Horton) Date: Mon, 26 Mar 2018 11:25:26 +0000 Subject: [gpfsug-discuss] GPFS Encryption Message-ID: <7F99F86D-9818-45FC-92AD-971CF44B0462@bham.ac.uk> Hi. All, I would be interested to hear if any members have experience implementing Encryption?, any gotchas, tips or any other information which may help with the preparation and implementation stages would be appreciated. I am currently reading through the documentation and reviewing the preparation steps, and with a scheduled maintenance window on the horizon it would be a good opportunity to carry out any preparatory steps requiring an outage. If there are any aspects of the configuration which in hindsight could have been done at the preparation stage this would be especially useful. Many Thanks Gareth ---------------------- On 24/03/2018, 12:00, "gpfsug-discuss-bounces at spectrumscale.org on behalf of gpfsug-discuss-request at spectrumscale.org" wrote: Send gpfsug-discuss mailing list submissions to gpfsug-discuss at spectrumscale.org To subscribe or unsubscribe via the World Wide Web, visit http://gpfsug.org/mailman/listinfo/gpfsug-discuss or, via email, send a message with subject or body 'help' to gpfsug-discuss-request at spectrumscale.org You can reach the person managing the list at gpfsug-discuss-owner at spectrumscale.org When replying, please edit your Subject line so it is more specific than "Re: Contents of gpfsug-discuss digest..." Today's Topics: 1. Reminder - SSUG-US Spring meeting - Call for Speakers and Registration (Oesterlin, Robert) 2. Pool layoutMap option changes following GPFS upgrades (Caron, Paul) ---------------------------------------------------------------------- Message: 1 Date: Fri, 23 Mar 2018 12:59:51 +0000 From: "Oesterlin, Robert" To: gpfsug main discussion list Subject: [gpfsug-discuss] Reminder - SSUG-US Spring meeting - Call for Speakers and Registration Message-ID: <28FF6725-9698-4F68-BC4E-BBCD6164BB3D at nuance.com> Content-Type: text/plain; charset="utf-8" Reminder: The registration for the Spring meeting of the SSUG-USA is now open. This is a Free two-day and will include a large number of Spectrum Scale updates and breakout tracks. Please note that we have limited meeting space so please register early if you plan on attending. If you are interested in presenting, please contact me. We do have a few more slots for user presentations ? these do not need to be long. You can register here: https://www.eventbrite.com/e/spectrum-scale-gpfs-user-group-us-spring-2018-meeting-tickets-43662759489 DATE AND TIME Wed, May 16, 2018, 9:00 AM ? Thu, May 17, 2018, 5:00 PM EDT LOCATION IBM Cambridge Innovation Center One Rogers Street Cambridge, MA 02142-1203 Bob Oesterlin Sr Principal Storage Engineer, Nuance -------------- next part -------------- An HTML attachment was scrubbed... 
URL: ------------------------------ Message: 2 Date: Fri, 23 Mar 2018 20:10:05 +0000 From: "Caron, Paul" To: "gpfsug-discuss at spectrumscale.org" Subject: [gpfsug-discuss] Pool layoutMap option changes following GPFS upgrades Message-ID: <980f9a27290342e28197154b924d0abf at msx.bala.susq.com> Content-Type: text/plain; charset="us-ascii" Hi, Has anyone run into a situation where the layoutMap option for a pool changes from "scatter" to "cluster" following a GPFS software upgrade? We recently upgraded a file system from 3.5.0.21, to 4.1.1.17, and finally to 4.2.3.6. We noticed that the layoutMap option for two of our pools changed following the upgrades. We didn't recreate the file system or any of the pools. Further lab testing has revealed that the layoutMap option change actually occurred during the first upgrade to 4.1.1.17, and it was simply carried forward to 4.2.3.6. We have a PMR open with IBM on this problem, but they have told us that layoutMap option changes are impossible for existing pools, and that a software upgrade couldn't do this. I sent the results of my lab testing today, so I'm hoping to get a better response. We would rather not have to recreate all the pools, but it is starting to look like that may be the only option to fix this. Also, it's unclear if this could happen again during future upgrades. Here's some additional background. * The "-j" option for the file system is "cluster" * We have a pretty small cluster; just 13 nodes * When reproducing the problem, we noted that the layoutMap option didn't change until the final node was upgraded * The layoutMap option changed before running the "mmchconfig release=LATEST" and "mmchfs -V full" commands, so those don't seem to be related to the problem Thanks, Paul C. SIG ________________________________ IMPORTANT: The information contained in this email and/or its attachments is confidential. If you are not the intended recipient, please notify the sender immediately by reply and immediately delete this message and all its attachments. Any review, use, reproduction, disclosure or dissemination of this message or any attachment by an unintended recipient is strictly prohibited. Neither this message nor any attachment is intended as or should be construed as an offer, solicitation or recommendation to buy or sell any security or other financial instrument. Neither the sender, his or her employer nor any of their respective affiliates makes any warranties as to the completeness or accuracy of any of the information contained herein or that this message or any of its attachments is free of viruses. -------------- next part -------------- An HTML attachment was scrubbed... URL: ------------------------------ _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss End of gpfsug-discuss Digest, Vol 74, Issue 45 ********************************************** From chair at spectrumscale.org Mon Mar 26 12:52:26 2018 From: chair at spectrumscale.org (Simon Thompson (Spectrum Scale User Group Chair)) Date: Mon, 26 Mar 2018 12:52:26 +0100 Subject: [gpfsug-discuss] RFE Process ... Burning Issues Message-ID: <563267E8-EAE7-4C73-BA54-266DDE94AB02@spectrumscale.org> Hi All, We?ve been talking with product management about the RFE process and have agreed that we?ll try out a community-voting process. First up, we are piloting this idea, hopefully it will work out, but it may also need tweaks as we move forward. 
One of the things we?ve been asking for is for a better way for the Spectrum Scale user group community to vote on RFEs. Sure we get people posting to the list, but we?re looking at if we can make it a better/more formal process to support this. Talking with IBM, we also recognise that with a large number of RFEs, it can be difficult for them to track work tasks being completed, but with the community RFEs, there is a commitment to try and track them closely and report back on progress later in the year. To submit an RFE using this process, you must complete the form available at: https://ibm.box.com/v/EnhBlitz (Enhancement Blitz template v1.pptx) The form provides some guidance on a good and bad RFE. Sure a lot of us are techie/engineers, so please try to explain what problem you are solving rather than trying to provide a solution. (i.e. leave the technical implementation details to those with the source code). Each site is limited to 2 submissions and they will be looked over by the Spectrum Scale community leaders, we may ask people to merge requests, send back for more info etc, or there may be some that we know will just never be progressed for various reasons. At the April user group in the UK, we have an RFE (Burning issues) session planned. Submitters of the RFE will be expected to provide a 1-3 minute pitch for their RFE. We?ve placed the session at the end of the day (UK time) to try and ensure USA people can participate. Remote presentation of your RFE is fine and we plan to live-stream the session. Each person will have 3 votes to choose what they think are their highest priority requests. Again remote voting is perfectly fine but only 3 votes per person. The requests with the highest number of votes will then be given a higher chance of being implemented. There?s a possibility that some may even make the winter release cycle. Either way, we plan to track the ?chosen? RFEs more closely and provide an update at the November USA meeting (likely the SC18 one). The submission and voting process is also planned to be run again in time for the November meeting. Anyone wanting to submit an RFE for consideration should submit the form by email to rfe at spectrumscaleug.org *before* 13th April. We?ll be posting the submitted RFEs up at the box site as well, you are encouraged to visit the site regularly and check the submissions as you may want to contact the author of an RFE to provide more information/support the RFE. Anything received after this date will be held over to the November cycle. The earlier you submit, the better chance it has of being included (we plan to limit the number to be considered) and will give us time to review the RFE and come back for more information/clarification if needed. You must also be prepared to provide a 1-3 minute pitch for your RFE (in person or remote) for the UK user group meeting. You are welcome to submit any RFE you have already put into the RFE portal for this process to garner community votes for it. There is space on the form to provide the existing RFE number. If you have any comments on the process, you can also email them to rfe at spectrumscaleug.org as well. Thanks to Carl Zeite for supporting this plan? Get submitting! Simon (UK Group Chair) -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From john.hearns at asml.com Mon Mar 26 13:14:35 2018 From: john.hearns at asml.com (John Hearns) Date: Mon, 26 Mar 2018 12:14:35 +0000 Subject: [gpfsug-discuss] GPFS Encryption In-Reply-To: <7F99F86D-9818-45FC-92AD-971CF44B0462@bham.ac.uk> References: <7F99F86D-9818-45FC-92AD-971CF44B0462@bham.ac.uk> Message-ID: Fbeel Tnergu. Pnaabg nqq nalguvta hfrshy urer. -----Original Message----- From: gpfsug-discuss-bounces at spectrumscale.org [mailto:gpfsug-discuss-bounces at spectrumscale.org] On Behalf Of Gareth Horton Sent: Monday, March 26, 2018 1:25 PM To: gpfsug-discuss at spectrumscale.org Subject: Re: [gpfsug-discuss] GPFS Encryption Hi. All, I would be interested to hear if any members have experience implementing Encryption?, any gotchas, tips or any other information which may help with the preparation and implementation stages would be appreciated. -- The information contained in this communication and any attachments is confidential and may be privileged, and is for the sole use of the intended recipient(s). Any unauthorized review, use, disclosure or distribution is prohibited. Unless explicitly stated otherwise in the body of this communication or the attachment thereto (if any), the information is provided on an AS-IS basis without any express or implied warranties or liabilities. To the extent you are relying on this information, you are doing so at your own risk. If you are not the intended recipient, please notify the sender immediately by replying to this message and destroy all copies of this message and any attachments. Neither the sender nor the company/group of companies he or she represents shall be liable for the proper and complete transmission of the information contained in this communication, or for any delay in its receipt. From S.J.Thompson at bham.ac.uk Mon Mar 26 13:46:47 2018 From: S.J.Thompson at bham.ac.uk (Simon Thompson (IT Research Support)) Date: Mon, 26 Mar 2018 12:46:47 +0000 Subject: [gpfsug-discuss] GPFS Encryption Message-ID: <45854D94-EE43-4931-BF07-E1BD6773EAAE@bham.ac.uk> John, I think we might need the decrypt key ... Simon ?On 26/03/2018, 13:29, "gpfsug-discuss-bounces at spectrumscale.org on behalf of john.hearns at asml.com" wrote: Fbeel Tnergu. Pnaabg nqq nalguvta hfrshy urer. From jtucker at pixitmedia.com Mon Mar 26 13:48:56 2018 From: jtucker at pixitmedia.com (Jez Tucker) Date: Mon, 26 Mar 2018 13:48:56 +0100 Subject: [gpfsug-discuss] GPFS Encryption In-Reply-To: <45854D94-EE43-4931-BF07-E1BD6773EAAE@bham.ac.uk> References: <45854D94-EE43-4931-BF07-E1BD6773EAAE@bham.ac.uk> Message-ID: <3c977368-266f-1ec3-bafc-03adf872bb4d@pixitmedia.com> Try.... http://www.rot13.com/ On 26/03/18 13:46, Simon Thompson (IT Research Support) wrote: > John, > > I think we might need the decrypt key ... > > Simon > > ?On 26/03/2018, 13:29, "gpfsug-discuss-bounces at spectrumscale.org on behalf of john.hearns at asml.com" wrote: > > Fbeel Tnergu. Pnaabg nqq nalguvta hfrshy urer. > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss -- *Jez Tucker* Head of Research and Development, Pixit Media 07764193820 | jtucker at pixitmedia.com www.pixitmedia.com | Tw:@pixitmedia.com -- This email is confidential in that it is intended for the exclusive attention of the addressee(s) indicated. If you are not the intended recipient, this email should not be read or disclosed to any other person. 
Please notify the sender immediately and delete this email from your computer system. Any opinions expressed are not necessarily those of the company from which this email was sent and, whilst to the best of our knowledge no viruses or defects exist, no responsibility can be accepted for any loss or damage arising from its receipt or subsequent use of this email. -------------- next part -------------- An HTML attachment was scrubbed... URL: From eric.wonderley at vt.edu Mon Mar 26 13:19:11 2018 From: eric.wonderley at vt.edu (J. Eric Wonderley) Date: Mon, 26 Mar 2018 08:19:11 -0400 Subject: [gpfsug-discuss] GPFS Encryption In-Reply-To: <7F99F86D-9818-45FC-92AD-971CF44B0462@bham.ac.uk> References: <7F99F86D-9818-45FC-92AD-971CF44B0462@bham.ac.uk> Message-ID: Hi Gareth: We have the spectrum archive product with encryption. It encrypts data on disk and tape...but not metadata. We originally had hoped to write small files with metadata...that does not happen with encryption. My guess is that the system pool(where metadata lives) cannot be encrypted. So you may pay a performance penalty for small files using encryption depending on what backends your data write policy. Eric On Mon, Mar 26, 2018 at 7:25 AM, Gareth Horton wrote: > Hi. All, > > I would be interested to hear if any members have experience implementing > Encryption?, any gotchas, tips or any other information which may help with > the preparation and implementation stages would be appreciated. > > I am currently reading through the documentation and reviewing the > preparation steps, and with a scheduled maintenance window on the horizon > it would be a good opportunity to carry out any preparatory steps requiring > an outage. > > If there are any aspects of the configuration which in hindsight could > have been done at the preparation stage this would be especially useful. > > Many Thanks > > Gareth > > ---------------------- > > On 24/03/2018, 12:00, "gpfsug-discuss-bounces at spectrumscale.org on behalf > of gpfsug-discuss-request at spectrumscale.org" spectrumscale.org on behalf of gpfsug-discuss-request at spectrumscale.org> > wrote: > > Send gpfsug-discuss mailing list submissions to > gpfsug-discuss at spectrumscale.org > > To subscribe or unsubscribe via the World Wide Web, visit > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > or, via email, send a message with subject or body 'help' to > gpfsug-discuss-request at spectrumscale.org > > You can reach the person managing the list at > gpfsug-discuss-owner at spectrumscale.org > > When replying, please edit your Subject line so it is more specific > than "Re: Contents of gpfsug-discuss digest..." > > > Today's Topics: > > 1. Reminder - SSUG-US Spring meeting - Call for Speakers and > Registration (Oesterlin, Robert) > 2. Pool layoutMap option changes following GPFS upgrades > (Caron, Paul) > > > ---------------------------------------------------------------------- > > Message: 1 > Date: Fri, 23 Mar 2018 12:59:51 +0000 > From: "Oesterlin, Robert" > To: gpfsug main discussion list > Subject: [gpfsug-discuss] Reminder - SSUG-US Spring meeting - Call for > Speakers and Registration > Message-ID: <28FF6725-9698-4F68-BC4E-BBCD6164BB3D at nuance.com> > Content-Type: text/plain; charset="utf-8" > > Reminder: The registration for the Spring meeting of the SSUG-USA is > now open. This is a Free two-day and will include a large number of > Spectrum Scale updates and breakout tracks. 
> > Please note that we have limited meeting space so please register > early if you plan on attending. If you are interested in presenting, please > contact me. We do have a few more slots for user presentations ? these do > not need to be long. > > You can register here: > > https://www.eventbrite.com/e/spectrum-scale-gpfs-user- > group-us-spring-2018-meeting-tickets-43662759489 > > DATE AND TIME > Wed, May 16, 2018, 9:00 AM ? > Thu, May 17, 2018, 5:00 PM EDT > > LOCATION > IBM Cambridge Innovation Center > One Rogers Street > Cambridge, MA 02142-1203 > > > Bob Oesterlin > Sr Principal Storage Engineer, Nuance > > -------------- next part -------------- > An HTML attachment was scrubbed... > URL: 20180323/824dbcdc/attachment-0001.html> > > ------------------------------ > > Message: 2 > Date: Fri, 23 Mar 2018 20:10:05 +0000 > From: "Caron, Paul" > To: "gpfsug-discuss at spectrumscale.org" > > Subject: [gpfsug-discuss] Pool layoutMap option changes following GPFS > upgrades > Message-ID: <980f9a27290342e28197154b924d0abf at msx.bala.susq.com> > Content-Type: text/plain; charset="us-ascii" > > Hi, > > Has anyone run into a situation where the layoutMap option for a pool > changes from "scatter" to "cluster" following a GPFS software upgrade? We > recently upgraded a file system from 3.5.0.21, to 4.1.1.17, and finally to > 4.2.3.6. We noticed that the layoutMap option for two of our pools changed > following the upgrades. We didn't recreate the file system or any of the > pools. Further lab testing has revealed that the layoutMap option change > actually occurred during the first upgrade to 4.1.1.17, and it was simply > carried forward to 4.2.3.6. We have a PMR open with IBM on this problem, > but they have told us that layoutMap option changes are impossible for > existing pools, and that a software upgrade couldn't do this. I sent the > results of my lab testing today, so I'm hoping to get a better response. > > We would rather not have to recreate all the pools, but it is starting > to look like that may be the only option to fix this. Also, it's unclear > if this could happen again during future upgrades. > > Here's some additional background. > > * The "-j" option for the file system is "cluster" > > * We have a pretty small cluster; just 13 nodes > > * When reproducing the problem, we noted that the layoutMap > option didn't change until the final node was upgraded > > * The layoutMap option changed before running the "mmchconfig > release=LATEST" and "mmchfs -V full" commands, so those don't seem to > be related to the problem > > Thanks, > > Paul C. > SIG > > > ________________________________ > > IMPORTANT: The information contained in this email and/or its > attachments is confidential. If you are not the intended recipient, please > notify the sender immediately by reply and immediately delete this message > and all its attachments. Any review, use, reproduction, disclosure or > dissemination of this message or any attachment by an unintended recipient > is strictly prohibited. Neither this message nor any attachment is intended > as or should be construed as an offer, solicitation or recommendation to > buy or sell any security or other financial instrument. Neither the sender, > his or her employer nor any of their respective affiliates makes any > warranties as to the completeness or accuracy of any of the information > contained herein or that this message or any of its attachments is free of > viruses. 
> -------------- next part -------------- > An HTML attachment was scrubbed... > URL: 20180323/181b0ac7/attachment-0001.html> > > ------------------------------ > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > > End of gpfsug-discuss Digest, Vol 74, Issue 45 > ********************************************** > > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > -------------- next part -------------- An HTML attachment was scrubbed... URL: From Paul.Caron at sig.com Mon Mar 26 16:43:24 2018 From: Paul.Caron at sig.com (Caron, Paul) Date: Mon, 26 Mar 2018 15:43:24 +0000 Subject: [gpfsug-discuss] Pool layoutMap option changes following GPFS upgrades Message-ID: <9b442159716e43f6a621c21f71067c0a@msx.bala.susq.com> By the way, the command to check the layoutMap option for your pools is "mmlspool all -L". Has anyone else noticed if this option changed during your GPFS software upgrades? Here's how our mmlspool output looked for our lab/test environment under GPFS Version 3.5.0-21: Pool: name = system poolID = 0 blockSize = 1024 KB usage = metadataOnly maxDiskSize = 16 TB layoutMap = cluster allowWriteAffinity = no writeAffinityDepth = 0 blockGroupFactor = 1 Pool: name = writecache poolID = 65537 blockSize = 1024 KB usage = dataOnly maxDiskSize = 8.0 TB layoutMap = scatter allowWriteAffinity = no writeAffinityDepth = 0 blockGroupFactor = 1 Pool: name = data poolID = 65538 blockSize = 1024 KB usage = dataOnly maxDiskSize = 8.2 TB layoutMap = scatter allowWriteAffinity = no writeAffinityDepth = 0 blockGroupFactor = 1 Here's the mmlspool output immediately after the upgrade to 4.1.1-17: Pool: name = system poolID = 0 blockSize = 1024 KB usage = metadataOnly maxDiskSize = 16 TB layoutMap = cluster allowWriteAffinity = no writeAffinityDepth = 0 blockGroupFactor = 1 Pool: name = writecache poolID = 65537 blockSize = 1024 KB usage = dataOnly maxDiskSize = 8.0 TB layoutMap = cluster allowWriteAffinity = no writeAffinityDepth = 0 blockGroupFactor = 1 Pool: name = data poolID = 65538 blockSize = 1024 KB usage = dataOnly maxDiskSize = 8.2 TB layoutMap = cluster allowWriteAffinity = no writeAffinityDepth = 0 blockGroupFactor = 1 We also determined the following: * The layoutMap option changes back to "scatter" if we revert back to 3.5.0.21. It only changes back after the last node is downgraded. * Restarting GPFS under 4.1.1-17 (via mmshutdown and mmstartup) has no effect on layoutMap in the lab (as expected). So, a simple restart doesn't fix the problem. Our production and lab deployments are using SLES 11, SP3 (3.0.101-0.47.71-default). Thanks, Paul C. SIG From: gpfsug-discuss-bounces at spectrumscale.org [mailto:gpfsug-discuss-bounces at spectrumscale.org] On Behalf Of Caron, Paul Sent: Friday, March 23, 2018 4:10 PM To: gpfsug-discuss at spectrumscale.org Subject: [gpfsug-discuss] Pool layoutMap option changes following GPFS upgrades Hi, Has anyone run into a situation where the layoutMap option for a pool changes from "scatter" to "cluster" following a GPFS software upgrade? We recently upgraded a file system from 3.5.0.21, to 4.1.1.17, and finally to 4.2.3.6. We noticed that the layoutMap option for two of our pools changed following the upgrades. We didn't recreate the file system or any of the pools. 
Further lab testing has revealed that the layoutMap option change actually occurred during the first upgrade to 4.1.1.17, and it was simply carried forward to 4.2.3.6. We have a PMR open with IBM on this problem, but they have told us that layoutMap option changes are impossible for existing pools, and that a software upgrade couldn't do this. I sent the results of my lab testing today, so I'm hoping to get a better response. We would rather not have to recreate all the pools, but it is starting to look like that may be the only option to fix this. Also, it's unclear if this could happen again during future upgrades. Here's some additional background. * The "-j" file system is "cluster" * We have a pretty option for the small cluster; just 13 nodes * When reproducing the problem, we noted that the layoutMap option didn't change until the final node was upgraded * The layoutMap option changed before running the "mmchconfig release=LATEST" and "mmchfs -V full" commands, so those don't seem to be related to the problem Thanks, Paul C. SIG ________________________________ IMPORTANT: The information contained in this email and/or its attachments is confidential. If you are not the intended recipient, please notify the sender immediately by reply and immediately delete this message and all its attachments. Any review, use, reproduction, disclosure or dissemination of this message or any attachment by an unintended recipient is strictly prohibited. Neither this message nor any attachment is intended as or should be construed as an offer, solicitation or recommendation to buy or sell any security or other financial instrument. Neither the sender, his or her employer nor any of their respective affiliates makes any warranties as to the completeness or accuracy of any of the information contained herein or that this message or any of its attachments is free of viruses. ________________________________ IMPORTANT: The information contained in this email and/or its attachments is confidential. If you are not the intended recipient, please notify the sender immediately by reply and immediately delete this message and all its attachments. Any review, use, reproduction, disclosure or dissemination of this message or any attachment by an unintended recipient is strictly prohibited. Neither this message nor any attachment is intended as or should be construed as an offer, solicitation or recommendation to buy or sell any security or other financial instrument. Neither the sender, his or her employer nor any of their respective affiliates makes any warranties as to the completeness or accuracy of any of the information contained herein or that this message or any of its attachments is free of viruses. -------------- next part -------------- An HTML attachment was scrubbed... URL: From JRLang at uwyo.edu Mon Mar 26 22:13:39 2018 From: JRLang at uwyo.edu (Jeffrey R. Lang) Date: Mon, 26 Mar 2018 21:13:39 +0000 Subject: [gpfsug-discuss] GPFS autoload - wait for IB portstobecomeactive In-Reply-To: References: <0081EB235765E14395278B9AE1DF341846510A@MBX214.d.ethz.ch><4AD44D34-5275-4ADB-8CC7-8E80170DDA7F@brown.edu> Message-ID: Can someone provide some clarification to this error message in my system logs: mmfs: [E] The on-disk StripeGroup descriptor of dcs3800u31b_lun7 sgId 0x0B00620A:9C84DF56 is not valid because of bad checksum: Mar 26 12:25:50 mmmnsd2 mmfs: 'mmfsadm writeDesc sg 0B00620A:9C84DF56 2 /var/mmfs/tmp/sg_gscratch_dcs3800u31b_lun7', where device is the device name of that NSD. 
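For anyone who hits the same message, a first-pass way to look at the mismatch (a sketch only; mmfsadm is a low-level service tool, so treat this as read-only and open a PMR before attempting any repair; the file system name gscratch is inferred from the descriptor path in the log line above):

# compare the in-memory stripe group descriptor state with what is on disk
mmfsadm dump nsdcksum

# confirm how the NSD maps to local devices on the NSD servers
mmlsnsd -X -d dcs3800u31b_lun7

# check whether the disk has been marked down or suspended in the file system
mmlsdisk gscratch -d dcs3800u31b_lun7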
I?ve been unable to find anything while googling that provides any details about the error. Anyone have any thoughts or commands? We are using GPFS 4.2.3.-6, under RedHat 6 and 7. The NSD nodes are all RHEL 6. Any help appreciated. Thanks Jeff -------------- next part -------------- An HTML attachment was scrubbed... URL: From Renar.Grunenberg at huk-coburg.de Tue Mar 27 07:29:06 2018 From: Renar.Grunenberg at huk-coburg.de (Grunenberg, Renar) Date: Tue, 27 Mar 2018 06:29:06 +0000 Subject: [gpfsug-discuss] GPFS autoload - wait for IB portstobecomeactive In-Reply-To: References: <0081EB235765E14395278B9AE1DF341846510A@MBX214.d.ethz.ch><4AD44D34-5275-4ADB-8CC7-8E80170DDA7F@brown.edu> Message-ID: <9a95b4b2c71748dfb4b39e23ffd4debf@SMXRF105.msg.hukrf.de> Hallo Jeff, you can check these with following cmd. mmfsadm dump nsdcksum Your in memory info is inconsistent with your descriptor structur on disk. The reason for this I had no idea. Renar Grunenberg Abteilung Informatik ? Betrieb HUK-COBURG Bahnhofsplatz 96444 Coburg Telefon: 09561 96-44110 Telefax: 09561 96-44104 E-Mail: Renar.Grunenberg at huk-coburg.de Internet: www.huk.de ________________________________ HUK-COBURG Haftpflicht-Unterst?tzungs-Kasse kraftfahrender Beamter Deutschlands a. G. in Coburg Reg.-Gericht Coburg HRB 100; St.-Nr. 9212/101/00021 Sitz der Gesellschaft: Bahnhofsplatz, 96444 Coburg Vorsitzender des Aufsichtsrats: Prof. Dr. Heinrich R. Schradin. Vorstand: Klaus-J?rgen Heitmann (Sprecher), Stefan Gronbach, Dr. Hans Olav Her?y, Dr. J?rg Rheinl?nder (stv.), Sarah R?ssler, Daniel Thomas. ________________________________ Diese Nachricht enth?lt vertrauliche und/oder rechtlich gesch?tzte Informationen. Wenn Sie nicht der richtige Adressat sind oder diese Nachricht irrt?mlich erhalten haben, informieren Sie bitte sofort den Absender und vernichten Sie diese Nachricht. Das unerlaubte Kopieren sowie die unbefugte Weitergabe dieser Nachricht ist nicht gestattet. This information may contain confidential and/or privileged information. If you are not the intended recipient (or have received this information in error) please notify the sender immediately and destroy this information. Any unauthorized copying, disclosure or distribution of the material in this information is strictly forbidden. ________________________________ Von: gpfsug-discuss-bounces at spectrumscale.org [mailto:gpfsug-discuss-bounces at spectrumscale.org] Im Auftrag von Jeffrey R. Lang Gesendet: Montag, 26. M?rz 2018 23:14 An: gpfsug main discussion list Betreff: Re: [gpfsug-discuss] GPFS autoload - wait for IB portstobecomeactive Can someone provide some clarification to this error message in my system logs: mmfs: [E] The on-disk StripeGroup descriptor of dcs3800u31b_lun7 sgId 0x0B00620A:9C84DF56 is not valid because of bad checksum: Mar 26 12:25:50 mmmnsd2 mmfs: 'mmfsadm writeDesc sg 0B00620A:9C84DF56 2 /var/mmfs/tmp/sg_gscratch_dcs3800u31b_lun7', where device is the device name of that NSD. I?ve been unable to find anything while googling that provides any details about the error. Anyone have any thoughts or commands? We are using GPFS 4.2.3.-6, under RedHat 6 and 7. The NSD nodes are all RHEL 6. Any help appreciated. Thanks Jeff -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From scale at us.ibm.com Tue Mar 27 07:44:29 2018 From: scale at us.ibm.com (IBM Spectrum Scale) Date: Tue, 27 Mar 2018 12:14:29 +0530 Subject: [gpfsug-discuss] GPFS autoload - wait for IB portstobecomeactive In-Reply-To: References: <0081EB235765E14395278B9AE1DF341846510A@MBX214.d.ethz.ch><4AD44D34-5275-4ADB-8CC7-8E80170DDA7F@brown.edu> Message-ID: This means that the stripe group descriptor on the disk dcs3800u31b_lun7 is corrupted. As we maintain copies of the stripe group descriptor on other disks as well we can copy the good descriptor from one of those disks to this one. Please open a PMR and work with IBM support to get this fixed. Regards, The Spectrum Scale (GPFS) team ------------------------------------------------------------------------------------------------------------------ If you feel that your question can benefit other users of Spectrum Scale (GPFS), then please post it to the public IBM developerWroks Forum at https://www.ibm.com/developerworks/community/forums/html/forum?id=11111111-0000-0000-0000-000000000479 . If your query concerns a potential software error in Spectrum Scale (GPFS) and you have an IBM software maintenance contract please contact 1-800-237-5511 in the United States or your local IBM Service Center in other countries. The forum is informally monitored as time permits and should not be used for priority messages to the Spectrum Scale (GPFS) team. From: "Jeffrey R. Lang" To: gpfsug main discussion list Date: 03/27/2018 04:15 AM Subject: Re: [gpfsug-discuss] GPFS autoload - wait for IB portstobecomeactive Sent by: gpfsug-discuss-bounces at spectrumscale.org Can someone provide some clarification to this error message in my system logs: mmfs: [E] The on-disk StripeGroup descriptor of dcs3800u31b_lun7 sgId 0x0B00620A:9C84DF56 is not valid because of bad checksum: Mar 26 12:25:50 mmmnsd2 mmfs: 'mmfsadm writeDesc sg 0B00620A:9C84DF56 2 /var/mmfs/tmp/sg_gscratch_dcs3800u31b_lun7', where device is the device name of that NSD. I?ve been unable to find anything while googling that provides any details about the error. Anyone have any thoughts or commands? We are using GPFS 4.2.3.-6, under RedHat 6 and 7. The NSD nodes are all RHEL 6. Any help appreciated. Thanks Jeff_______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org https://urldefense.proofpoint.com/v2/url?u=http-3A__gpfsug.org_mailman_listinfo_gpfsug-2Ddiscuss&d=DwICAg&c=jf_iaSHvJObTbx-siA1ZOg&r=IbxtjdkPAM2Sbon4Lbbi4w&m=3u8q7zs1oLvf23bMVLe5YO_0SFSILFiL1d85LRDp9aQ&s=lf2ivnySwvhLDS-AnJSbm6cWcpO2R-vdHOll5TvkBDU&e= -------------- next part -------------- An HTML attachment was scrubbed... URL: From sandeep.patil at in.ibm.com Tue Mar 27 12:53:50 2018 From: sandeep.patil at in.ibm.com (Sandeep Ramesh) Date: Tue, 27 Mar 2018 17:23:50 +0530 Subject: [gpfsug-discuss] Latest Technical Blogs on Spectrum Scale In-Reply-To: References: Message-ID: Dear User Group Members, In continuation , here are list of development blogs in the this quarter (Q1 2018). As discussed in User Groups, passing it along: GDPR Compliance and Unstructured Data Storage https://developer.ibm.com/storage/2018/03/27/gdpr-compliance-unstructure-data-storage/ IBM Spectrum Scale for Linux on IBM Z ? 
Release 5.0 features and highlights https://developer.ibm.com/storage/2018/03/09/ibm-spectrum-scale-linux-ibm-z-release-5-0-features-highlights/ Management GUI enhancements in IBM Spectrum Scale release 5.0.0 https://developer.ibm.com/storage/2018/01/18/gui-enhancements-in-spectrum-scale-release-5-0-0/ IBM Spectrum Scale 5.0.0 ? What?s new in NFS? https://developer.ibm.com/storage/2018/01/18/ibm-spectrum-scale-5-0-0-whats-new-nfs/ Benefits and implementation of Spectrum Scale sudo wrappers https://developer.ibm.com/storage/2018/01/15/benefits-implementation-spectrum-scale-sudo-wrappers/ IBM Spectrum Scale: Big Data and Analytics Solution Brief https://developer.ibm.com/storage/2018/01/15/ibm-spectrum-scale-big-data-analytics-solution-brief/ Variant Sub-blocks in Spectrum Scale 5.0 https://developer.ibm.com/storage/2018/01/11/spectrum-scale-variant-sub-blocks/ Compression support in Spectrum Scale 5.0.0 https://developer.ibm.com/storage/2018/01/11/compression-support-spectrum-scale-5-0-0/ IBM Spectrum Scale Versus Apache Hadoop HDFS https://developer.ibm.com/storage/2018/01/10/spectrumscale_vs_hdfs/ ESS Fault Tolerance https://developer.ibm.com/storage/2018/01/09/ess-fault-tolerance/ Genomic Workloads ? How To Get it Right From Infrastructure Point Of View. https://developer.ibm.com/storage/2018/01/06/genomic-workloads-get-right-infrastructure-point-view/ IBM Spectrum Scale On AWS Cloud : This video explains how to deploy IBM Spectrum Scale on AWS. This solution helps the users who require highly available access to a shared name space across multiple instances with good performance, without requiring an in-depth knowledge of IBM Spectrum Scale. Detailed Demo : https://www.youtube.com/watch?v=6j5Xj_d0bh4 Brief Demo : https://www.youtube.com/watch?v=-aMQKPW_RfY. For more : Search /browse here: https://developer.ibm.com/storage/blog Consolidation list: https://www.ibm.com/developerworks/community/wikis/home?lang=en#!/wiki/General%20Parallel%20File%20System%20(GPFS)/page/White%20Papers%20%26%20Media From: Sandeep Ramesh/India/IBM To: gpfsug-discuss at spectrumscale.org Cc: Doris Conti/Poughkeepsie/IBM at IBMUS Date: 01/10/2018 12:13 PM Subject: Re: Latest Technical Blogs on Spectrum Scale Dear User Group Members, Here are list of development blogs in the last quarter. Passing it to this email group as Doris had got a feedback in the UG meetings to notify the members with the latest updates periodically. Genomic Workloads ? How To Get it Right From Infrastructure Point Of View. https://developer.ibm.com/storage/2018/01/06/genomic-workloads-get-right-infrastructure-point-view/ IBM Spectrum Scale Versus Apache Hadoop HDFS https://developer.ibm.com/storage/2018/01/10/spectrumscale_vs_hdfs/ ESS Fault Tolerance https://developer.ibm.com/storage/2018/01/09/ess-fault-tolerance/ IBM Spectrum Scale MMFSCK ? Savvy Enhancements https://developer.ibm.com/storage/2018/01/05/ibm-spectrum-scale-mmfsck-savvy-enhancements/ ESS Disk Management https://developer.ibm.com/storage/2018/01/02/ess-disk-management/ IBM Spectrum Scale Object Protocol On Ubuntu https://developer.ibm.com/storage/2018/01/01/ibm-spectrum-scale-object-protocol-ubuntu/ IBM Spectrum Scale 5.0 ? Whats new in Unified File and Object https://developer.ibm.com/storage/2017/12/20/ibm-spectrum-scale-5-0-whats-new-object/ A Complete Guide to ? Protocol Problem Determination Guide for IBM Spectrum Scale? ? 
Part 1 https://developer.ibm.com/storage/2017/12/19/complete-guide-protocol-problem-determination-guide-ibm-spectrum-scale-1/ IBM Spectrum Scale installation toolkit ? enhancements over releases https://developer.ibm.com/storage/2017/12/15/ibm-spectrum-scale-installation-toolkit-enhancements-releases/ Network requirements in an Elastic Storage Server Setup https://developer.ibm.com/storage/2017/12/13/network-requirements-in-an-elastic-storage-server-setup/ Co-resident migration with Transparent cloud tierin https://developer.ibm.com/storage/2017/12/05/co-resident-migration-transparent-cloud-tierin/ IBM Spectrum Scale on Hortonworks HDP Hadoop clusters : A Complete Big Data Solution https://developer.ibm.com/storage/2017/12/05/ibm-spectrum-scale-hortonworks-hdp-hadoop-clusters-complete-big-data-solution/ Big data analytics with Spectrum Scale using remote cluster mount & multi-filesystem support https://developer.ibm.com/storage/2017/11/28/big-data-analytics-spectrum-scale-using-remote-cluster-mount-multi-filesystem-support/ IBM Spectrum Scale HDFS Transparency Short Circuit Write Support https://developer.ibm.com/storage/2017/11/28/ibm-spectrum-scale-hdfs-transparency-short-circuit-write-support/ IBM Spectrum Scale HDFS Transparency Federation Support https://developer.ibm.com/storage/2017/11/27/ibm-spectrum-scale-hdfs-transparency-federation-support/ How to configure and performance tuning different system workloads on IBM Spectrum Scale Sharing Nothing Cluster https://developer.ibm.com/storage/2017/11/27/configure-performance-tuning-different-system-workloads-ibm-spectrum-scale-sharing-nothing-cluster/ How to configure and performance tuning Spark workloads on IBM Spectrum Scale Sharing Nothing Cluster https://developer.ibm.com/storage/2017/11/27/configure-performance-tuning-spark-workloads-ibm-spectrum-scale-sharing-nothing-cluster/ How to configure and performance tuning database workloads on IBM Spectrum Scale Sharing Nothing Cluster https://developer.ibm.com/storage/2017/11/27/configure-performance-tuning-database-workloads-ibm-spectrum-scale-sharing-nothing-cluster/ How to configure and performance tuning Hadoop workloads on IBM Spectrum Scale Sharing Nothing Cluster https://developer.ibm.com/storage/2017/11/24/configure-performance-tuning-hadoop-workloads-ibm-spectrum-scale-sharing-nothing-cluster/ IBM Spectrum Scale Sharing Nothing Cluster Performance Tuning https://developer.ibm.com/storage/2017/11/24/ibm-spectrum-scale-sharing-nothing-cluster-performance-tuning/ How to Configure IBM Spectrum Scale? with NIS based Authentication. https://developer.ibm.com/storage/2017/11/21/configure-ibm-spectrum-scale-nis-based-authentication/ For more : Search /browse here: https://developer.ibm.com/storage/blog Consolidation list: https://www.ibm.com/developerworks/community/wikis/home?lang=en#!/wiki/General%20Parallel%20File%20System%20(GPFS)/page/White%20Papers%20%26%20Media From: Sandeep Ramesh/India/IBM To: gpfsug-discuss at spectrumscale.org Cc: Doris Conti/Poughkeepsie/IBM at IBMUS Date: 11/16/2017 08:15 PM Subject: Latest Technical Blogs on Spectrum Scale Dear User Group members, Here are the Development Blogs in last 3 months on Spectrum Scale Technical Topics. Spectrum Scale Monitoring ? Know More ? https://developer.ibm.com/storage/2017/11/16/spectrum-scale-monitoring-know/ IBM Spectrum Scale 5.0 Release ? What?s coming ! 
https://developer.ibm.com/storage/2017/11/14/ibm-spectrum-scale-5-0-release-whats-coming/ Four Essentials things to know for managing data ACLs on IBM Spectrum Scale? from Windows https://developer.ibm.com/storage/2017/11/13/four-essentials-things-know-managing-data-acls-ibm-spectrum-scale-windows/ GSSUTILS: A new way of running SSR, Deploying or Upgrading ESS Server https://developer.ibm.com/storage/2017/11/13/gssutils/ IBM Spectrum Scale Object Authentication https://developer.ibm.com/storage/2017/11/02/spectrum-scale-object-authentication/ Video Surveillance ? Choosing the right storage https://developer.ibm.com/storage/2017/11/02/video-surveillance-choosing-right-storage/ IBM Spectrum scale object deep dive training with problem determination https://www.slideshare.net/SmitaRaut/ibm-spectrum-scale-object-deep-dive-training Spectrum Scale as preferred software defined storage for Ubuntu OpenStack https://developer.ibm.com/storage/2017/09/29/spectrum-scale-preferred-software-defined-storage-ubuntu-openstack/ IBM Elastic Storage Server 2U24 Storage ? an All-Flash offering, a performance workhorse https://developer.ibm.com/storage/2017/10/06/ess-5-2-flash-storage/ A Complete Guide to Configure LDAP-based authentication with IBM Spectrum Scale? for File Access https://developer.ibm.com/storage/2017/09/21/complete-guide-configure-ldap-based-authentication-ibm-spectrum-scale-file-access/ Deploying IBM Spectrum Scale on AWS Quick Start https://developer.ibm.com/storage/2017/09/18/deploy-ibm-spectrum-scale-on-aws-quick-start/ Monitoring Spectrum Scale Object metrics https://developer.ibm.com/storage/2017/09/14/monitoring-spectrum-scale-object-metrics/ Tier your data with ease to Spectrum Scale Private Cloud(s) using Moonwalk Universal https://developer.ibm.com/storage/2017/09/14/tier-data-ease-spectrum-scale-private-clouds-using-moonwalk-universal/ Why do I see owner as ?Nobody? for my export mounted using NFSV4 Protocol on IBM Spectrum Scale?? https://developer.ibm.com/storage/2017/09/08/see-owner-nobody-export-mounted-using-nfsv4-protocol-ibm-spectrum-scale/ IBM Spectrum Scale? Authentication using Active Directory and LDAP https://developer.ibm.com/storage/2017/09/01/ibm-spectrum-scale-authentication-using-active-directory-ldap/ IBM Spectrum Scale? Authentication using Active Directory and RFC2307 https://developer.ibm.com/storage/2017/09/01/ibm-spectrum-scale-authentication-using-active-directory-rfc2307/ High Availability Implementation with IBM Spectrum Virtualize and IBM Spectrum Scale https://developer.ibm.com/storage/2017/08/30/high-availability-implementation-ibm-spectrum-virtualize-ibm-spectrum-scale/ 10 Frequently asked Questions on configuring Authentication using AD + AUTO ID mapping on IBM Spectrum Scale?. https://developer.ibm.com/storage/2017/08/04/10-frequently-asked-questions-configuring-authentication-using-ad-auto-id-mapping-ibm-spectrum-scale/ IBM Spectrum Scale? Authentication using Active Directory https://developer.ibm.com/storage/2017/07/30/ibm-spectrum-scale-auth-using-active-directory/ Five cool things that you didn?t know Transparent Cloud Tiering on Spectrum Scale can do https://developer.ibm.com/storage/2017/07/29/five-cool-things-didnt-know-transparent-cloud-tiering-spectrum-scale-can/ IBM Spectrum Scale GUI videos https://developer.ibm.com/storage/2017/07/25/ibm-spectrum-scale-gui-videos/ IBM Spectrum Scale? Authentication ? 
Planning for NFS Access https://developer.ibm.com/storage/2017/07/24/ibm-spectrum-scale-planning-nfs-access/ For more : Search /browse here: https://developer.ibm.com/storage/blog Consolidation list: https://www.ibm.com/developerworks/community/wikis/home?lang=en#!/wiki/General%20Parallel%20File%20System%20(GPFS)/page/White%20Papers%20%26%20Media -------------- next part -------------- An HTML attachment was scrubbed... URL: From bipcuds at gmail.com Tue Mar 27 23:26:16 2018 From: bipcuds at gmail.com (Keith Ball) Date: Tue, 27 Mar 2018 18:26:16 -0400 Subject: [gpfsug-discuss] Control of where mmcrsnapshot runs? Message-ID: Hi All, Two questions on snapshots: 1.) I note that neither "mmcrsnapshot" nor "mmdelsnapshot" appear to have an "-N" option as "PIT" commands typically do. Is there any way to control where threads for snapshot creation/deletion run? (I assume the filesystem manager will always be involved regardless). 2.) When mmdelsnapshot hangs or times out, the error messages tend to appear on client nodes, and not necessarily the node where mmdelsnapshot is run from, not the FS manager. Besides telling all users "don't use any I/O" when runnign these commands, are there ways that folks have found to avoid hangs and timeouts of mmdelsnapshot? FWIW our filesystem manager is probably overextended (replication factor 2 on data+MD, 30 daily snapshots kept, a number of client clusters served, plus the FS manager is also an NSD server). Many Thanks, Keith RedLine Performance Solutions LLC -------------- next part -------------- An HTML attachment was scrubbed... URL: From bbanister at jumptrading.com Wed Mar 28 00:44:33 2018 From: bbanister at jumptrading.com (Bryan Banister) Date: Tue, 27 Mar 2018 23:44:33 +0000 Subject: [gpfsug-discuss] Control of where mmcrsnapshot runs? In-Reply-To: References: Message-ID: <7ae89940fa234b79b3538be339109cba@jumptrading.com> What version of GPFS are you running Keith? All nodes mounting the file system must briefly quiesce I/O operations during the snapshot create operations, hence the ?Quiescing all file system operations.? message in the output. So don?t really see a way to specify a specific set of nodes for these operations. They have made updates in newer releases of GPFS to combine operations (e.g. create and delete snapshots at the same time) which IBM says ?system performance is increased by batching operations and reducing overhead.?. Trying to isolate GPFS resources (e.g. cgroups) on the clients (e.g. CPU and memory resources dedicated to GPFS/SSH/kernel/networking/etc) can help them respond more quickly to quiesce I/O requests. HTH, -Bryan From: gpfsug-discuss-bounces at spectrumscale.org [mailto:gpfsug-discuss-bounces at spectrumscale.org] On Behalf Of Keith Ball Sent: Tuesday, March 27, 2018 5:26 PM To: gpfsug-discuss at spectrumscale.org Subject: [gpfsug-discuss] Control of where mmcrsnapshot runs? Note: External Email ________________________________ Hi All, Two questions on snapshots: 1.) I note that neither "mmcrsnapshot" nor "mmdelsnapshot" appear to have an "-N" option as "PIT" commands typically do. Is there any way to control where threads for snapshot creation/deletion run? (I assume the filesystem manager will always be involved regardless). 2.) When mmdelsnapshot hangs or times out, the error messages tend to appear on client nodes, and not necessarily the node where mmdelsnapshot is run from, not the FS manager. 
Besides telling all users "don't use any I/O" when running these commands, are there ways that folks have found to avoid hangs and timeouts of mmdelsnapshot?
FWIW our filesystem manager is probably overextended (replication factor 2 on data+MD, 30 daily snapshots kept, a number of client clusters served, plus the FS manager is also an NSD server).
Many Thanks,
Keith
RedLine Performance Solutions LLC

________________________________

Note: This email is for the confidential use of the named addressee(s) only and may contain proprietary, confidential or privileged information. If you are not the intended recipient, you are hereby notified that any review, dissemination or copying of this email is strictly prohibited, and to please notify the sender immediately and destroy this email and any attachments. Email transmission cannot be guaranteed to be secure or error-free. The Company, therefore, does not make any guarantees as to the completeness or accuracy of this email or any attachments. This email is for informational purposes only and does not constitute a recommendation, offer, request or solicitation of any kind to buy, sell, subscribe, redeem or perform any type of transaction of a financial product.
-------------- next part --------------
An HTML attachment was scrubbed...
URL: 
From Dwayne.Hart at med.mun.ca Wed Mar 28 15:56:55 2018
From: Dwayne.Hart at med.mun.ca (Dwayne.Hart at med.mun.ca)
Date: Wed, 28 Mar 2018 14:56:55 +0000
Subject: [gpfsug-discuss] Introduction to the "gpfsug-discuss" mailing list
Message-ID: 

Hi,

My name is Dwayne Hart. I currently work for the Center for Health Informatics & Analytics (CHIA), Faculty of Medicine at Memorial University of Newfoundland, as a Systems/Network Security Administrator. In this role I am responsible for several HPC (Intel and Power) instances, an OpenStack cloud environment and research data. We leverage IBM Spectrum Scale Storage as our primary storage solution. I have been working with GPFS since 2015.

Best,
Dwayne
---
Systems Administrator
Center for Health Informatics & Analytics (CHIA)
Craig L. Dobbin Center for Genetics
Room 4M409
300 Prince Philip Dr.
St. John's, NL Canada A1B 3V6
Tel: (709) 864-6631
E Mail: dwayne.hart at med.mun.ca
-------------- next part --------------
An HTML attachment was scrubbed...
URL: 
From ingo.altenburger at id.ethz.ch Thu Mar 29 13:20:45 2018
From: ingo.altenburger at id.ethz.ch (Altenburger Ingo (ID SD))
Date: Thu, 29 Mar 2018 12:20:45 +0000
Subject: [gpfsug-discuss] REST API function for 'mmsmb exportacl list'
Message-ID: 

We were very hopeful that we could replace our CLI-based storage provisioning automation with the new functions provided in the REST API. Although it seems that almost all protocol-related commands are already implemented in the 5.0.0.1 REST interface, we have still not found an equivalent for 'mmsmb exportacl list' to get the share permissions of a share.

Does anybody know whether this is already in but not yet documented, or is it for sure still not under consideration?

Thanks,
Ingo
-------------- next part --------------
An HTML attachment was scrubbed...
URL: 
From delmard at br.ibm.com Thu Mar 29 14:41:53 2018
From: delmard at br.ibm.com (Delmar Demarchi)
Date: Thu, 29 Mar 2018 10:41:53 -0300
Subject: [gpfsug-discuss] AFM-DR Questions
Message-ID: 

Hello experts.

We have a Scale project with AFM-DR to be implemented, and after reading the Knowledge Center documentation we have the following questions:
- Do you know of any reason why the Recovery Point Objective (RPO) snapshot interval was changed from 15 to 720 minutes in version 5.0.0 of IBM Spectrum Scale AFM-DR?
- Can we use additional independent peer snapshots to reduce the RPO interval (720 minutes) of IBM Spectrum Scale AFM-DR?
- In addition to the above, can we use these snapshots to update the new primary site after a failover occurs, using the most up-to-date snapshot?
- According to the documentation, we are not able to replicate dependent filesets; but what if these dependent filesets are part of an existing independent fileset? Do you see any issues/concerns with this?

Thank you in advance.

Delmar Demarchi .'.
(delmard at br.ibm.com)
-------------- next part --------------
An HTML attachment was scrubbed...
URL: 
From daniel.kidger at uk.ibm.com Thu Mar 29 17:00:57 2018
From: daniel.kidger at uk.ibm.com (Daniel Kidger)
Date: Thu, 29 Mar 2018 16:00:57 +0000
Subject: [gpfsug-discuss] GPFS Encryption
In-Reply-To: <3c977368-266f-1ec3-bafc-03adf872bb4d@pixitmedia.com>
Message-ID: 

I tried a dictionary attack, but "nalguvta" was a typo. Should have been:
"Fbeel Tnergu. Pnaabg nqq nalguvat hfrshy urer"
John: anythign (sic) to add? :-)
Daniel

Dr Daniel Kidger
IBM Technical Sales Specialist
Software Defined Solution Sales
+44-(0)7818 522 266
daniel.kidger at uk.ibm.com

> On 26 Mar 2018, at 14:49, Jez Tucker wrote:
>
> Try.... http://www.rot13.com/
>
>> On 26/03/18 13:46, Simon Thompson (IT Research Support) wrote:
>> John,
>>
>> I think we might need the decrypt key ...
>>
>> Simon
>>
>> On 26/03/2018, 13:29, "gpfsug-discuss-bounces at spectrumscale.org on behalf of john.hearns at asml.com" wrote:
>>
>> Fbeel Tnergu. Pnaabg nqq nalguvta hfrshy urer.
>>
>> _______________________________________________
>> gpfsug-discuss mailing list
>> gpfsug-discuss at spectrumscale.org
>> http://gpfsug.org/mailman/listinfo/gpfsug-discuss
>
>
> --
> Jez Tucker
> Head of Research and Development, Pixit Media
> 07764193820 | jtucker at pixitmedia.com
> www.pixitmedia.com | Tw:@pixitmedia.com
>
>
> This email is confidential in that it is intended for the exclusive attention of the addressee(s) indicated. If you are not the intended recipient, this email should not be read or disclosed to any other person. Please notify the sender immediately and delete this email from your computer system. Any opinions expressed are not necessarily those of the company from which this email was sent and, whilst to the best of our knowledge no viruses or defects exist, no responsibility can be accepted for any loss or damage arising from its receipt or subsequent use of this email.

Unless stated otherwise above: IBM United Kingdom Limited - Registered in England and Wales with number 741598. Registered office: PO Box 41, North Harbour, Portsmouth, Hampshire PO6 3AU
-------------- next part --------------
An HTML attachment was scrubbed...
URL: 
From bipcuds at gmail.com Thu Mar 29 17:15:19 2018
From: bipcuds at gmail.com (Keith Ball)
Date: Thu, 29 Mar 2018 12:15:19 -0400
Subject: [gpfsug-discuss] Control of where mmcrsnapshot runs?
Message-ID: 

You're right, Bryan, the key load will be on the filesystem manager in any case, and as you say, all nodes must quiesce - it's not really an issue of where to run the command, like it would be for mmfsck, etc.

GPFS version is 3.5.0.26. We'll investigate upgrading to a later version that accommodates combined operations.
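For readers following the thread: the combined/batched snapshot syntax on later releases looks roughly like the sketch below. This is recalled from the 4.2.x/5.0 man pages rather than tested here, and the file system, fileset and node-class names are made up - verify against "man mmcrsnapshot" and "man mmdelsnapshot" on your own level before relying on it.

# Sketch only - names (gpfs1, fsetA, fsetB, snapnodes) are placeholders.
# One quiesce cycle can cover several snapshots by passing a comma-separated
# list of [Fileset]:Snapshot entries:
mmcrsnapshot gpfs1 daily_20180329,fsetA:daily_20180329,fsetB:daily_20180329

# Later levels of mmdelsnapshot also accept -N to choose which nodes take part
# in the background deletion work (the file system manager is still involved):
mmdelsnapshot gpfs1 daily_20180301 -N snapnodes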
I will also look into the cgroups approach - is this a documented thing, or just something that people have tinkered with/hand rolled? Thanks, Keith On Wed, Mar 28, 2018 at 7:00 AM, > > What version of GPFS are you running Keith? > > All nodes mounting the file system must briefly quiesce I/O operations > during the snapshot create operations, hence the ?Quiescing all file system > operations.? message in the output. So don?t really see a way to specify a > specific set of nodes for these operations. They have made updates in > newer releases of GPFS to combine operations (e.g. create and delete > snapshots at the same time) which IBM says ?system performance is increased > by batching operations and reducing overhead.?. > > Trying to isolate GPFS resources (e.g. cgroups) on the clients (e.g. CPU > and memory resources dedicated to GPFS/SSH/kernel/networking/etc) can > help them respond more quickly to quiesce I/O requests. > > HTH, > -Bryan > > From: gpfsug-discuss-bounces at spectrumscale.org [mailto:gpfsug-discuss- > bounces at spectrumscale.org] On Behalf Of Keith Ball > Sent: Tuesday, March 27, 2018 5:26 PM > To: gpfsug-discuss at spectrumscale.org > Subject: [gpfsug-discuss] Control of where mmcrsnapshot runs? > > Note: External Email > ________________________________ > Hi All, > Two questions on snapshots: > 1.) I note that neither "mmcrsnapshot" nor "mmdelsnapshot" appear to have > an "-N" option as "PIT" commands typically do. Is there any way to control > where threads for snapshot creation/deletion run? (I assume the filesystem > manager will always be involved regardless). > > 2.) When mmdelsnapshot hangs or times out, the error messages tend to > appear on client nodes, and not necessarily the node where mmdelsnapshot is > run from, not the FS manager. Besides telling all users "don't use any I/O" > when runnign these commands, are there ways that folks have found to avoid > hangs and timeouts of mmdelsnapshot? > FWIW our filesystem manager is probably overextended (replication factor 2 > on data+MD, 30 daily snapshots kept, a number of client clusters served, > plus the FS manager is also an NSD server). > > Many Thanks, > Keith > RedLine Performance Solutions LLC > > ________________________________ > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From bbanister at jumptrading.com Thu Mar 29 18:33:30 2018 From: bbanister at jumptrading.com (Bryan Banister) Date: Thu, 29 Mar 2018 17:33:30 +0000 Subject: [gpfsug-discuss] Control of where mmcrsnapshot runs? In-Reply-To: References: Message-ID: The cgroups are something we moved onto, which has helped a lot with GPFS Clients responding to necessary GPFS commands demanding a low latency response (e.g. mmcrsnapshots, byte range locks, quota reporting, etc). Good luck! -Bryan From: gpfsug-discuss-bounces at spectrumscale.org [mailto:gpfsug-discuss-bounces at spectrumscale.org] On Behalf Of Keith Ball Sent: Thursday, March 29, 2018 11:15 AM To: gpfsug-discuss at spectrumscale.org Subject: Re: [gpfsug-discuss] Control of where mmcrsnapshot runs? Note: External Email ________________________________ You're right, Brian, the key load will be on the filesystem manager in any case, and as you say, all nodes nodes must quiesce - it's not really an issue of where to run the command, like it would be for mmfsck, etc. GPFS version is 3.5.0.26. We'll investigate upgrade to later version that accommodates combined operations. 
I will also look into the cgroups approach - is this a documented thing, or just something that people have tinkered with/hand rolled? Thanks, Keith On Wed, Mar 28, 2018 at 7:00 AM, What version of GPFS are you running Keith? All nodes mounting the file system must briefly quiesce I/O operations during the snapshot create operations, hence the ?Quiescing all file system operations.? message in the output. So don?t really see a way to specify a specific set of nodes for these operations. They have made updates in newer releases of GPFS to combine operations (e.g. create and delete snapshots at the same time) which IBM says ?system performance is increased by batching operations and reducing overhead.?. Trying to isolate GPFS resources (e.g. cgroups) on the clients (e.g. CPU and memory resources dedicated to GPFS/SSH/kernel/networking/etc) can help them respond more quickly to quiesce I/O requests. HTH, -Bryan From: gpfsug-discuss-bounces at spectrumscale.org [mailto:gpfsug-discuss-bounces at spectrumscale.org] On Behalf Of Keith Ball Sent: Tuesday, March 27, 2018 5:26 PM To: gpfsug-discuss at spectrumscale.org Subject: [gpfsug-discuss] Control of where mmcrsnapshot runs? Note: External Email ________________________________ Hi All, Two questions on snapshots: 1.) I note that neither "mmcrsnapshot" nor "mmdelsnapshot" appear to have an "-N" option as "PIT" commands typically do. Is there any way to control where threads for snapshot creation/deletion run? (I assume the filesystem manager will always be involved regardless). 2.) When mmdelsnapshot hangs or times out, the error messages tend to appear on client nodes, and not necessarily the node where mmdelsnapshot is run from, not the FS manager. Besides telling all users "don't use any I/O" when runnign these commands, are there ways that folks have found to avoid hangs and timeouts of mmdelsnapshot? FWIW our filesystem manager is probably overextended (replication factor 2 on data+MD, 30 daily snapshots kept, a number of client clusters served, plus the FS manager is also an NSD server). Many Thanks, Keith RedLine Performance Solutions LLC ________________________________ ________________________________ Note: This email is for the confidential use of the named addressee(s) only and may contain proprietary, confidential or privileged information. If you are not the intended recipient, you are hereby notified that any review, dissemination or copying of this email is strictly prohibited, and to please notify the sender immediately and destroy this email and any attachments. Email transmission cannot be guaranteed to be secure or error-free. The Company, therefore, does not make any guarantees as to the completeness or accuracy of this email or any attachments. This email is for informational purposes only and does not constitute a recommendation, offer, request or solicitation of any kind to buy, sell, subscribe, redeem or perform any type of transaction of a financial product. -------------- next part -------------- An HTML attachment was scrubbed... 
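To make the cgroup isolation described above more concrete, here is a minimal illustrative sketch using the libcgroup tools. It is not the actual configuration referred to in the thread: the group names, core counts and the choice of pinning mmfsd and sshd into a reserved cpuset are assumptions to adapt, and the same effect can be achieved with systemd slices instead.

# Illustrative sketch only. Assumes libcgroup tools (cgcreate/cgset/cgclassify)
# and a client node where cores 0-1 are reserved for system daemons while
# cores 2-31 are handed to batch jobs.

# Reserved cpuset for GPFS/ssh/system services:
cgcreate -g cpuset:/sysservices
cgset -r cpuset.cpus=0-1 -r cpuset.mems=0 /sysservices

# Everything the batch system launches goes into a separate cpuset:
cgcreate -g cpuset:/userjobs
cgset -r cpuset.cpus=2-31 -r cpuset.mems=0 /userjobs

# Move the Spectrum Scale daemon and sshd into the reserved group so they can
# answer quiesce requests (e.g. during mmcrsnapshot) without competing with jobs:
cgclassify -g cpuset:/sysservices $(pidof mmfsd) $(pidof sshd)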
URL: From scale at us.ibm.com Fri Mar 30 08:35:33 2018 From: scale at us.ibm.com (IBM Spectrum Scale) Date: Fri, 30 Mar 2018 13:05:33 +0530 Subject: [gpfsug-discuss] AFM-DR Questions In-Reply-To: References: Message-ID: + Venkat to provide answers on AFM queries Regards, The Spectrum Scale (GPFS) team ------------------------------------------------------------------------------------------------------------------ If you feel that your question can benefit other users of Spectrum Scale (GPFS), then please post it to the public IBM developerWroks Forum at https://www.ibm.com/developerworks/community/forums/html/forum?id=11111111-0000-0000-0000-000000000479 . If your query concerns a potential software error in Spectrum Scale (GPFS) and you have an IBM software maintenance contract please contact 1-800-237-5511 in the United States or your local IBM Service Center in other countries. The forum is informally monitored as time permits and should not be used for priority messages to the Spectrum Scale (GPFS) team. From: "Delmar Demarchi" To: gpfsug-discuss at spectrumscale.org Date: 03/29/2018 07:12 PM Subject: [gpfsug-discuss] AFM-DR Questions Sent by: gpfsug-discuss-bounces at spectrumscale.org Hello experts. We have a Scale project with AFM-DR to be implemented and after read the KC documentation, we have some questions about. - Do you know any reason why we changed the Recovery point objective (RPO) snapshots by 15 to 720 minutes in the version 5.0.0 of IBM Spectrum Scale AFM-DR? - Can we use additional Independent Peer-snapshots to reduce the RPO interval (720 minutes) of IBM Spectrum Scale AFM-DR? - In addition to the above question, can we use these snapshots to update the new primary site after a failover occur for the most up to date snapshot? - According to the documentation, we are not able to replicate Dependent filesets, but if these dependents filesets are part of an existing Independent fileset. Do you see any issues/concerns with this? Thank you in advance. Delmar Demarchi .'. (delmard at br.ibm.com)_______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org https://urldefense.proofpoint.com/v2/url?u=http-3A__gpfsug.org_mailman_listinfo_gpfsug-2Ddiscuss&d=DwICAg&c=jf_iaSHvJObTbx-siA1ZOg&r=IbxtjdkPAM2Sbon4Lbbi4w&m=nBTENLroUhlIPgOEVV1rqTmcYxRh7ErhZ7jLWdpprlY&s=V0Xb_-yxttxff7X31CfkaegWKSGc-1ehsXrDpdO5dTI&e= -------------- next part -------------- An HTML attachment was scrubbed... URL: From Robert.Oesterlin at nuance.com Fri Mar 30 14:54:01 2018 From: Robert.Oesterlin at nuance.com (Oesterlin, Robert) Date: Fri, 30 Mar 2018 13:54:01 +0000 Subject: [gpfsug-discuss] Tentative Agenda - SSUG-US Spring Meeting - May 16/17, Cambridge MA Message-ID: Here is the Tentative Agenda for the upcoming SSUG-US meeting. It?s close to final. I do have one (possibly two) spots for customer talks still open. This is a fantastic agenda, and a big thanks to Ulf Troppens at IBM for pulling together all the IBM speakers. 
Register here: https://www.eventbrite.com/e/spectrum-scale-gpfs-user-group-us-spring-2018-meeting-tickets-43662759489?aff=mailinglist

Wednesday, May 16th
8:30 - 9:00 Registration and Networking
9:00 - 9:20 Welcome
9:20 - 9:45 Keynote: Cognitive Computing and Spectrum Scale
9:45 - 10:10 Spectrum Scale Big Data & Analytics Initiative
10:10 - 10:30 Customer Talk
10:30 - 10:45 Break
10:45 - 11:10 Spectrum Scale Cloud Initiative
11:10 - 11:35 Composable Infrastructure for Technical Computing
11:35 - 11:55 Customer Talk
11:55 - 12:00 Agenda
12:00 - 13:00 Lunch and Networking
13:00 - 13:30 What is new in Spectrum Scale
13:30 - 13:45 What is new in ESS?
13:45 - 14:15 File System Audit Log
14:15 - 14:45 Coffee and Networking
14:45 - 15:15 Lifting the 32 subblock limit
15:15 - 15:35 Customer Talk
15:35 - 16:05 Spectrum Scale CCR Internals
16:05 - 16:20 Break
16:20 - 16:40 Customer Talk
16:40 - 17:25 Field Update
17:25 - 18:15 Meet the Devs - Ask us Anything
Evening Networking Event - TBD

Thursday, May 17th
8:30 - 9:00 Kaffee und Networking
9:00 - 10:00 1) Life Science Track 2) System Health, Performance Monitoring & Call Home 3) Policy Engine Best Practices
10:00 - 11:00 1) Life Science Track 2) Big Data & Analytics 3) Multi-cloud with Transparent Cloud Tiering
11:00 - 12:00 1) Life Science Track 2) Cloud Deployments 3) Installation Best Practices
12:00 - 13:00 Lunch and Networking
13:00 - 13:20 Customer Talk
13:20 - 14:10 Network Best Practices
14:10 - 14:30 Customer Talk
14:30 - 15:00 Kaffee und Networking
15:00 - 16:00 Enhancements for CORAL

Bob Oesterlin
Sr Principal Storage Engineer, Nuance
-------------- next part --------------
An HTML attachment was scrubbed...
URL: 
From valleru at cbio.mskcc.org Fri Mar 30 17:15:13 2018
From: valleru at cbio.mskcc.org (valleru at cbio.mskcc.org)
Date: Fri, 30 Mar 2018 12:15:13 -0400
Subject: [gpfsug-discuss] sublocks per block in GPFS 5.0
Message-ID: <68905b2c-8b1a-4a3d-8ded-c5aa56b765aa@Spark>

Hello Everyone,

I am a little bit confused about the number of sub-blocks per block for a 16M block size in GPFS 5.0. The documentation below mentions that the number of sub-blocks per block is 16K, but "only for Spectrum Scale RAID":
https://developer.ibm.com/storage/2018/01/11/spectrum-scale-variant-sub-blocks/

However, when I created the file system without Spectrum Scale RAID, I still see that the number of sub-blocks per block is 1024:

mmlsfs --subblocks-per-full-block
flag                         value                    description
------------------- ------------------------ -----------------------------------
 --subblocks-per-full-block  1024                     Number of subblocks per full block

So may I know if the number of sub-blocks per block is really 16K, or am I missing something?

Regards,
Lohit
-------------- next part --------------
An HTML attachment was scrubbed...
URL: 
From makaplan at us.ibm.com Fri Mar 30 17:45:41 2018
From: makaplan at us.ibm.com (Marc A Kaplan)
Date: Fri, 30 Mar 2018 11:45:41 -0500
Subject: [gpfsug-discuss] sublocks per block in GPFS 5.0
In-Reply-To: <68905b2c-8b1a-4a3d-8ded-c5aa56b765aa@Spark>
References: <68905b2c-8b1a-4a3d-8ded-c5aa56b765aa@Spark>
Message-ID: 

Apparently, a small mistake in that developerWorks post. I always advise testing of new features on a scratchable system... Here's what I see on my test system:

# mmcrfs mak -F /vds/mn.nsd -A no -T /mak -B 16M -f 1K -i 1K
Value '1024' for option '-f' is out of range. Valid values are 4096 through 524288.
# mmcrfs mak -F /vds/mn.nsd -A no -T /mak -B 16M -f 4K -i 1K
(runs okay)

# mmlsfs mak
flag                value                    description
------------------- ------------------------ -----------------------------------
 -f                 4096                     Minimum fragment (subblock) size in bytes
 -i                 1024                     Inode size in bytes
 -I                 32768                    Indirect block size in bytes
...
 -B                 16777216                 Block size
...
 -V                 18.00 (5.0.0.0)          File system version
...
 --subblocks-per-full-block 4096             Number of subblocks per full block
...

From: valleru at cbio.mskcc.org
To: gpfsug main discussion list 
Date: 03/30/2018 12:21 PM
Subject: [gpfsug-discuss] sublocks per block in GPFS 5.0
Sent by: gpfsug-discuss-bounces at spectrumscale.org

Hello Everyone,

I am a little bit confused about the number of sub-blocks per block for a 16M block size in GPFS 5.0. The documentation below mentions that the number of sub-blocks per block is 16K, but "only for Spectrum Scale RAID":
https://developer.ibm.com/storage/2018/01/11/spectrum-scale-variant-sub-blocks/

However, when I created the file system without Spectrum Scale RAID, I still see that the number of sub-blocks per block is 1024:

mmlsfs --subblocks-per-full-block
flag                         value                    description
------------------- ------------------------ -----------------------------------
 --subblocks-per-full-block  1024                     Number of subblocks per full block

So may I know if the number of sub-blocks per block is really 16K, or am I missing something?

Regards,
Lohit
_______________________________________________
gpfsug-discuss mailing list
gpfsug-discuss at spectrumscale.org
http://gpfsug.org/mailman/listinfo/gpfsug-discuss
-------------- next part --------------
An HTML attachment was scrubbed...
URL: 
From valleru at cbio.mskcc.org Fri Mar 30 18:47:27 2018
From: valleru at cbio.mskcc.org (valleru at cbio.mskcc.org)
Date: Fri, 30 Mar 2018 13:47:27 -0400
Subject: [gpfsug-discuss] sublocks per block in GPFS 5.0
In-Reply-To: 
References: <68905b2c-8b1a-4a3d-8ded-c5aa56b765aa@Spark>
Message-ID: <18518530-0d1f-4937-b2ec-9c16c6c80995@Spark>

Thanks Marc,

I did not know we could explicitly specify the sub-block size when creating a file system. It is nowhere mentioned in "man mmcrfs". Is this a new GPFS 5.0 feature?

Also, I see from "man mmcrfs" that the default sub-block size for 8M and 16M block sizes is 16K:

+-------------------------------+----------------+
| Block size                    | Subblock size  |
+-------------------------------+----------------+
| 64 KiB                        | 2 KiB          |
+-------------------------------+----------------+
| 128 KiB                       | 4 KiB          |
+-------------------------------+----------------+
| 256 KiB, 512 KiB, 1 MiB,      | 8 KiB          |
| 2 MiB, 4 MiB                  |                |
+-------------------------------+----------------+
| 8 MiB, 16 MiB                 | 16 KiB         |
+-------------------------------+----------------+

And you could create more than 1024 sub-blocks per block? And 4K is the sub-block size for 16M? That is great, since 4K files will go into the data pool, and anything less than 4K will go into the system (metadata) pool?
Do you think there would be any performance degradation from reducing the sub-block size to 4K or 8K, from the default 16K, for a 16M file system?

If we are not losing any blocks by choosing a bigger block size (16M), why would we want to choose a smaller block size (4M) for the file system? What advantage would a smaller block size (4M) give compared to 16M with respect to performance, since a 16M file system can store and read small files too at their respective sizes? And near-line rotating disks would be happier with a bigger block size than a smaller one, I guess?

Regards,
Lohit

On Mar 30, 2018, 12:45 PM -0400, Marc A Kaplan , wrote:
> 
> subblock
-------------- next part --------------
An HTML attachment was scrubbed...
URL: 
From makaplan at us.ibm.com Fri Mar 30 19:47:47 2018
From: makaplan at us.ibm.com (Marc A Kaplan)
Date: Fri, 30 Mar 2018 13:47:47 -0500
Subject: [gpfsug-discuss] sublocks per block in GPFS 5.0
In-Reply-To: <18518530-0d1f-4937-b2ec-9c16c6c80995@Spark>
References: <68905b2c-8b1a-4a3d-8ded-c5aa56b765aa@Spark> <18518530-0d1f-4937-b2ec-9c16c6c80995@Spark>
Message-ID: 

Look at my example again, closely. I chose the block size as 16M, the subblock size as 4K, and the inode size as 1K....

developerWorks is a good resource, but articles you read there may be incomplete or contain mistakes. The official IBM Spectrum Scale command and administration guide documents are "trustworthy", but may not be perfect in all respects. "Trust but Verify" and YMMV. ;-)

As for why/how to choose "good sizes": that depends on what objectives you want to achieve, and "optimal" may depend on what hardware you are running. Run your own trials and/or ask performance experts. There are usually "tradeoffs", and OTOH when you get down to it, some choices may not be all that important in actual deployment and usage. That's why we have default values - try those first and leave the details and tweaking aside until you have a good reason ;-)
-------------- next part --------------
An HTML attachment was scrubbed...
URL:
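To close out the thread, here is a short runnable recap of the commands discussed above. The device name, NSD stanza file and mount point are placeholders (not from the thread), and -f 8K is just one example value inside the 4K-512K range that mmcrfs reported for a 16M block size; treat it as a sketch rather than a recommendation.

# Recap sketch - "fs1", "/tmp/nsd.stanza" and "/gpfs/fs1" are placeholders.
# Spectrum Scale 5.0.x: choose the subblock size explicitly at creation time.
mmcrfs fs1 -F /tmp/nsd.stanza -A no -T /gpfs/fs1 -B 16M -f 8K

# Confirm what the file system actually ended up with:
mmlsfs fs1 -f -B --subblocks-per-full-block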