From zander at ebi.ac.uk Fri Aug 1 14:44:49 2014 From: zander at ebi.ac.uk (Zander Mears) Date: Fri, 01 Aug 2014 14:44:49 +0100 Subject: [gpfsug-discuss] Hello! In-Reply-To: <53D981EF.3020000@gpfsug.org> References: <53D8C897.9000902@ebi.ac.uk> <53D981EF.3020000@gpfsug.org> Message-ID: <53DB99D1.8050304@ebi.ac.uk> Hi Jez We're just monitoring the standard OS stuff, some interface errors, throughput, number of network and gpfs connections due to previous issues. We don't really know as yet what is good to monitor GPFS wise. cheers Zander On 31/07/2014 00:38, Jez Tucker (Chair) wrote: > Hi Zander, > > We have a git repository. Would you be interested in adding any > Zabbix custom metrics gathering to GPFS to it? > > https://github.com/gpfsug/gpfsug-tools > > Best, > > Jez From sfadden at us.ibm.com Tue Aug 5 18:55:20 2014 From: sfadden at us.ibm.com (Scott Fadden) Date: Tue, 5 Aug 2014 10:55:20 -0700 Subject: [gpfsug-discuss] GPFS and Lustre on same node Message-ID: Is anyone running GPFS and Lustre on the same nodes. I have seen it work, I have heard people are doing it, I am looking for some confirmation. Thanks Scott Fadden GPFS Technical Marketing Phone: (503) 880-5833 sfadden at us.ibm.com http://www.ibm.com/systems/gpfs -------------- next part -------------- An HTML attachment was scrubbed... URL: From u.sibiller at science-computing.de Wed Aug 6 08:46:31 2014 From: u.sibiller at science-computing.de (Ulrich Sibiller) Date: Wed, 06 Aug 2014 09:46:31 +0200 Subject: [gpfsug-discuss] GPFS and Lustre on same node In-Reply-To: References: Message-ID: <53E1DD57.90103@science-computing.de> Am 05.08.2014 19:55, schrieb Scott Fadden: > Is anyone running GPFS and Lustre on the same nodes. I have seen it work, I have heard people are > doing it, I am looking for some confirmation. I have some nodes running lustre 2.1.6 or 2.5.58 and gpfs 3.5.0.17 on RHEL5.8 and RHEL6.5. None of them are servers. Kind regards, Ulrich Sibiller -- ______________________________________creating IT solutions Dipl.-Inf. Ulrich Sibiller science + computing ag System Administration Hagellocher Weg 73 mail nfz at science-computing.de 72070 Tuebingen, Germany hotline +49 7071 9457 674 http://www.science-computing.de -- Vorstandsvorsitzender/Chairman of the board of management: Gerd-Lothar Leonhart Vorstand/Board of Management: Dr. Bernd Finkbeiner, Michael Heinrichs, Dr. Arno Steitz Vorsitzender des Aufsichtsrats/ Chairman of the Supervisory Board: Philippe Miltin Sitz/Registered Office: Tuebingen Registergericht/Registration Court: Stuttgart Registernummer/Commercial Register No.: HRB 382196 From frederik.ferner at diamond.ac.uk Wed Aug 6 10:19:35 2014 From: frederik.ferner at diamond.ac.uk (Frederik Ferner) Date: Wed, 6 Aug 2014 10:19:35 +0100 Subject: [gpfsug-discuss] GPFS and Lustre on same node In-Reply-To: References: Message-ID: <53E1F327.1000605@diamond.ac.uk> On 05/08/14 18:55, Scott Fadden wrote: > Is anyone running GPFS and Lustre on the same nodes. I have seen it > work, I have heard people are doing it, I am looking for some confirmation. Most of our compute cluster nodes are clients for Lustre and GPFS at the same time. Lustre 1.8.9-wc1 and GPFS 3.5.0.11. Nothing shared on servers (GPFS NSD server or Lustre OSS/MDS servers). HTH, Frederik -- Frederik Ferner Senior Computer Systems Administrator phone: +44 1235 77 8624 Diamond Light Source Ltd. mob: +44 7917 08 5110 (Apologies in advance for the lines below. Some bits are a legal requirement and I have no control over them.) 
-- This e-mail and any attachments may contain confidential, copyright and or privileged material, and are for the use of the intended addressee only. If you are not the intended addressee or an authorised recipient of the addressee please notify us of receipt by returning the e-mail and do not use, copy, retain, distribute or disclose the information in or attached to the e-mail.
Any opinions expressed within this e-mail are those of the individual and not necessarily of Diamond Light Source Ltd.
Diamond Light Source Ltd. cannot guarantee that this e-mail or any attachments are free from viruses and we cannot accept liability for any damage which you may sustain as a result of software viruses which may be transmitted in or with the message.
Diamond Light Source Limited (company no. 4375679). Registered in England and Wales with its registered office at Diamond House, Harwell Science and Innovation Campus, Didcot, Oxfordshire, OX11 0DE, United Kingdom

From sdinardo at ebi.ac.uk Wed Aug 6 10:57:44 2014
From: sdinardo at ebi.ac.uk (Salvatore Di Nardo)
Date: Wed, 06 Aug 2014 10:57:44 +0100
Subject: [gpfsug-discuss] GPFS and Lustre on same node
In-Reply-To: <53E1F327.1000605@diamond.ac.uk>
References: <53E1F327.1000605@diamond.ac.uk>
Message-ID: <53E1FC18.6080707@ebi.ac.uk>

Sorry for this little OT, but recently I have been looking at Lustre to understand how it compares to GPFS in terms of performance, reliability and ease of use. Could anyone share their experience?

My company recently got its first GPFS system, based on IBM GSS, but while it is good performance-wise, there are a few unresolved problems and the IBM support is almost nonexistent, so I'm starting to wonder whether it is worth looking elsewhere for eventual future purchases.

Salvatore

On 06/08/14 10:19, Frederik Ferner wrote:
> On 05/08/14 18:55, Scott Fadden wrote:
>> Is anyone running GPFS and Lustre on the same nodes. I have seen it
>> work, I have heard people are doing it, I am looking for some
>> confirmation.
>
> Most of our compute cluster nodes are clients for Lustre and GPFS at
> the same time. Lustre 1.8.9-wc1 and GPFS 3.5.0.11. Nothing shared on
> servers (GPFS NSD server or Lustre OSS/MDS servers).
>
> HTH,
> Frederik
>

From chair at gpfsug.org Wed Aug 6 11:19:24 2014
From: chair at gpfsug.org (Jez Tucker (Chair))
Date: Wed, 06 Aug 2014 11:19:24 +0100
Subject: [gpfsug-discuss] GPFS and Lustre on same node
In-Reply-To: <53E1FC18.6080707@ebi.ac.uk>
References: <53E1F327.1000605@diamond.ac.uk> <53E1FC18.6080707@ebi.ac.uk>
Message-ID: <53E2012C.9040402@gpfsug.org>

"IBM support is almost nonexistent"

I don't find that at all.
Do you log directly via ESC or via your OEM/integrator, or are you only referring to GSS support rather than pure GPFS?

If you are having response issues, your IBM rep (or a few folks on here) can accelerate issues for you.

Jez

On 06/08/14 10:57, Salvatore Di Nardo wrote:
> Sorry for this little OT, but recently I have been looking at Lustre to
> understand how it compares to GPFS in terms of performance,
> reliability and ease of use.
> Could anyone share their experience?
>
> My company recently got its first GPFS system, based on IBM GSS,
> but while it is good performance-wise, there are a few unresolved problems
> and the IBM support is almost nonexistent, so I'm starting to wonder
> whether it is worth looking elsewhere for eventual future purchases.
>
>
> Salvatore
>
> On 06/08/14 10:19, Frederik Ferner wrote:
>> On 05/08/14 18:55, Scott Fadden wrote:
>>> Is anyone running GPFS and Lustre on the same nodes. I have seen it
>>> work, I have heard people are doing it, I am looking for some
>>> confirmation.
>>
>> Most of our compute cluster nodes are clients for Lustre and GPFS at
>> the same time. Lustre 1.8.9-wc1 and GPFS 3.5.0.11. Nothing shared on
>> servers (GPFS NSD server or Lustre OSS/MDS servers).
>>
>> HTH,
>> Frederik
>>
>
> _______________________________________________
> gpfsug-discuss mailing list
> gpfsug-discuss at gpfsug.org
> http://gpfsug.org/mailman/listinfo/gpfsug-discuss

From service at metamodul.com Wed Aug 6 14:26:47 2014
From: service at metamodul.com (service at metamodul.com)
Date: Wed, 6 Aug 2014 15:26:47 +0200 (CEST)
Subject: [gpfsug-discuss] Hi , i am new to this list
Message-ID: <1366482624.222989.1407331607965.open-xchange@oxbaltgw55.schlund.de>

Hi @ALL, I am Hajo Ehlers, an AIX and GPFS specialist (Unix System Engineer). You can find me at the IBM GPFS forum and sometimes at news:c.u.a, and I am addicted to cluster filesystems.

My latest idea is an SAP HANA-light system (a DBMS on an in-memory cluster POSIX FS) which could be extended to a "reinvented" cluster-based AS/400 ^_^

I also wrote a small script to do a sequential backup of GPFS filesystems, since I never got used to mmbackup - I named it "pdsmc" for "parallel dsmc".

Cheers
Hajo

BTW: Please let me know - service (at) metamodul (dot) com - in case somebody is looking for a GPFS specialist.
-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From sdinardo at ebi.ac.uk Fri Aug 8 10:53:36 2014
From: sdinardo at ebi.ac.uk (Salvatore Di Nardo)
Date: Fri, 08 Aug 2014 10:53:36 +0100
Subject: [gpfsug-discuss] GPFS and Lustre on same node
In-Reply-To: <53E2012C.9040402@gpfsug.org>
References: <53E1F327.1000605@diamond.ac.uk> <53E1FC18.6080707@ebi.ac.uk> <53E2012C.9040402@gpfsug.org>
Message-ID: <53E49E20.1090905@ebi.ac.uk>

Well, I didn't want to start a rant against IBM, and I'm referring specifically to GSS. Since GSS is an appliance, we have to go through GSS support for both hardware and software issues.

Hardware support is total crap. It took a month of chasing and shouting to get a replacement for a drawer that was causing some issues. Meanwhile 10 disks in that drawer went faulty. We finally got the drawer replaced, but the disks are still faulty. I have now spent three days trying to get them fixed or replaced (it's not clear whether the disks are actually broken or were just marked for replacement because of the drawer). Right now I don't have any answer on how to put them back online (mmchcarrier doesn't work because it recognizes that the disks were not replaced).

There are also a few other cases (GPFS related) open that are still unanswered. I have no experience with direct GPFS support, but if I open a case with GSS support for a GPFS problem, the case never seems to get an answer. The only reason that GSS is working at all is because *I* installed it, spending a few months studying GPFS. So now I'm wondering whether it is worth relying on the whole appliance concept at all in the future. I'm wondering whether in future it is better to just purchase the hardware and install GPFS on our own, or alternatively even try Lustre.

Now, skipping all this GSS rant, which has nothing to do with the file system anyway, and going back to my question:

Could someone point out the main differences between GPFS and Lustre? I found some documentation about Lustre and I'm going to have a look, but oddly enough I have not found any practical comparison between them.
On 06/08/14 11:19, Jez Tucker (Chair) wrote: > "IBM support is almost unexistent" > > I don't find that at all. > Do you log directly via ESC or via your OEM/integrator or are you only > referring to GSS support rather than pure GPFS? > > If you are having response issues, your IBM rep (or a few folks on > here) can accelerate issues for you. > > Jez > > > On 06/08/14 10:57, Salvatore Di Nardo wrote: >> Sorry for this little ot, but recetly i'm looking to Lustre to >> understand how it is comparable to GPFS in terms of performance, >> reliability and easy to use. >> Could anyone share their experience ? >> >> My company just recently got a first GPFS system , based on IBM GSS, >> but while its good performance wise, there are few unresolved >> problems and the IBM support is almost unexistent, so I'm starting to >> wonder if its work to look somewhere else eventual future purchases. >> >> >> Salvatore >> >> On 06/08/14 10:19, Frederik Ferner wrote: >>> On 05/08/14 18:55, Scott Fadden wrote: >>>> Is anyone running GPFS and Lustre on the same nodes. I have seen it >>>> work, I have heard people are doing it, I am looking for some >>>> confirmation. >>> >>> Most of our compute cluster nodes are clients for Lustre and GPFS at >>> the same time. Lustre 1.8.9-wc1 and GPFS 3.5.0.11. Nothing shared on >>> servers (GPFS NSD server or Lustre OSS/MDS servers). >>> >>> HTH, >>> Frederik >>> >> >> _______________________________________________ >> gpfsug-discuss mailing list >> gpfsug-discuss at gpfsug.org >> http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at gpfsug.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss -------------- next part -------------- An HTML attachment was scrubbed... URL: From jpro at bas.ac.uk Fri Aug 8 12:40:00 2014 From: jpro at bas.ac.uk (Jeremy Robst) Date: Fri, 8 Aug 2014 12:40:00 +0100 (BST) Subject: [gpfsug-discuss] GPFS and Lustre on same node In-Reply-To: <53E49E20.1090905@ebi.ac.uk> References: <53E1F327.1000605@diamond.ac.uk> <53E1FC18.6080707@ebi.ac.uk> <53E2012C.9040402@gpfsug.org> <53E49E20.1090905@ebi.ac.uk> Message-ID: On Fri, 8 Aug 2014, Salvatore Di Nardo wrote: > Now, skipping all this GSS rant, which have nothing to do with the file > system anyway? and? going back to my question: > > Could someone point the main differences between GPFS and Lustre? I'm looking at making the same decision here - to buy GPFS or to roll our own Lustre configuration. I'm in the process of setting up test systems, and so far the main difference seems to be in the that in GPFS each server sees the full filesystem, and so you can run other applications (e.g backup) on a GPFS server whereas the Luste OSS (object storage servers) see only a portion of the storage (the filesystem is striped across the OSSes), so you need a Lustre client to mount the full filesystem for things like backup. However I have very little practical experience of either and would also be interested in any comments. Thanks Jeremy -- jpro at bas.ac.uk | (work) 01223 221402 (fax) 01223 362616 Unix System Administrator - British Antarctic Survey #include From keith at ocf.co.uk Fri Aug 8 14:12:39 2014 From: keith at ocf.co.uk (Keith Vickers) Date: Fri, 8 Aug 2014 14:12:39 +0100 Subject: [gpfsug-discuss] GPFS and Lustre on same node Message-ID: http://www.pdsw.org/pdsw10/resources/posters/parallelNASFSs.pdf Has a good direct apples to apples comparison between Lustre and GPFS. 
It's pretty much abstractable from the hardware used.

Keith Vickers
Business Development Manager
OCF plc
Mobile: 07974 397863

From sergi.more at bsc.es Fri Aug 8 14:14:33 2014
From: sergi.more at bsc.es (=?ISO-8859-1?Q?Sergi_Mor=E9_Codina?=)
Date: Fri, 08 Aug 2014 15:14:33 +0200
Subject: [gpfsug-discuss] GPFS and Lustre on same node
In-Reply-To:
References: <53E1F327.1000605@diamond.ac.uk> <53E1FC18.6080707@ebi.ac.uk> <53E2012C.9040402@gpfsug.org> <53E49E20.1090905@ebi.ac.uk>
Message-ID: <53E4CD39.7080808@bsc.es>

Hi all,

Regarding the main differences between GPFS and Lustre, here are some bits from our experience:

-Reliability: GPFS has proven to be more stable and reliable. It also offers more flexibility in terms of fail-over, with no restriction on the number of servers. As far as I know, an NSD can have as many secondary servers as you want (we are using 8).

-Metadata: In Lustre each file system is restricted to two servers. No restriction in GPFS.

-Updates: In GPFS you can update the whole storage cluster without stopping production, one server at a time.

-Server/Client role: As Jeremy said, in GPFS every server can act as a client as well. Useful for administrative tasks.

-Troubleshooting: Problems with GPFS are easier to track down. The logs are clearer, and GPFS offers better tools than Lustre.

-Support: No problems at all with GPFS support. It is true that it can take time to work up through all the support levels, but we always got a good solution. Quite a different story in terms of hardware: IBM support quality has dropped a lot over the last year and a half. It is a really slow and tedious process to get replacements, and we keep receiving bad "certified reutilized parts" hardware, which slows the whole process down even more.

These are the main differences I would point out after some years of experience with both file systems, but do not take it as fact.

PS: Salvatore, I would suggest you contact Jordi Valls. He joined EBI a couple of months ago and has experience working with both file systems here at BSC.

Best Regards,
Sergi.

On 08/08/2014 01:40 PM, Jeremy Robst wrote:
> On Fri, 8 Aug 2014, Salvatore Di Nardo wrote:
>
>> Now, skipping all this GSS rant, which have nothing to do with the file
>> system anyway and going back to my question:
>>
>> Could someone point the main differences between GPFS and Lustre?
>
> I'm looking at making the same decision here - to buy GPFS or to roll
> our own Lustre configuration. I'm in the process of setting up test
> systems, and so far the main difference seems to be in the that in GPFS
> each server sees the full filesystem, and so you can run other
> applications (e.g backup) on a GPFS server whereas the Luste OSS (object
> storage servers) see only a portion of the storage (the filesystem is
> striped across the OSSes), so you need a Lustre client to mount the full
> filesystem for things like backup.
>
> However I have very little practical experience of either and would also
> be interested in any comments.
>
> Thanks
>
> Jeremy
>
>
> _______________________________________________
> gpfsug-discuss mailing list
> gpfsug-discuss at gpfsug.org
> http://gpfsug.org/mailman/listinfo/gpfsug-discuss
>

--

------------------------------------------------------------------------

Sergi More Codina
Barcelona Supercomputing Center
Centro Nacional de Supercomputacion
WWW: http://www.bsc.es Tel: +34-93-405 42 27
e-mail: sergi.more at bsc.es Fax: +34-93-413 77 21

------------------------------------------------------------------------

WARNING / LEGAL TEXT: This message is intended only for the use of the individual or entity to which it is addressed and may contain information which is privileged, confidential, proprietary, or exempt from disclosure under applicable law. If you are not the intended recipient or the person responsible for delivering the message to the intended recipient, you are strictly prohibited from disclosing, distributing, copying, or in any way using this message. If you have received this communication in error, please notify the sender and destroy and delete any copies you may have received.

http://www.bsc.es/disclaimer.htm

-------------- next part --------------
A non-text attachment was scrubbed...
Name: smime.p7s
Type: application/pkcs7-signature
Size: 3242 bytes
Desc: S/MIME Cryptographic Signature
URL:

From viccornell at gmail.com Fri Aug 8 18:15:30 2014
From: viccornell at gmail.com (Vic Cornell)
Date: Fri, 8 Aug 2014 18:15:30 +0100
Subject: [gpfsug-discuss] GPFS and Lustre on same node
In-Reply-To: <53E4CD39.7080808@bsc.es>
References: <53E1F327.1000605@diamond.ac.uk> <53E1FC18.6080707@ebi.ac.uk> <53E2012C.9040402@gpfsug.org> <53E49E20.1090905@ebi.ac.uk> <53E4CD39.7080808@bsc.es>
Message-ID: <4001D2D9-5E74-4EF9-908F-5B0E3443EA5B@gmail.com>

Disclaimers - I work for DDN - we sell lustre and GPFS. I know GPFS much better than I know Lustre.

The biggest difference we find between GPFS and Lustre is that GPFS can usually achieve 90% of the bandwidth available to a single client with a single thread. Lustre needs multiple parallel streams to saturate - say an Infiniband connection.

Lustre is often faster than GPFS and often has superior metadata performance - particularly where lots of files are created in a single directory.

GPFS can support Windows - Lustre cannot.

I think GPFS is better integrated and easier to deploy than Lustre - some people disagree with me.

Regards,

Vic

On 8 Aug 2014, at 14:14, Sergi Moré Codina wrote:

> Hi all,
>
> About main differences between GPFS and Lustre, here you have some bits from our experience:
>
> -Reliability: GPFS its been proved to be more stable and reliable. Also offers more flexibility in terms of fail-over. It have no restriction in number of servers. As far as I know, an NSD can have as many secondary servers as you want (we are using 8).
>
> -Metadata: In Lustre each file system is restricted to two servers. No restriction in GPFS.
>
> -Updates: In GPFS you can update the whole storage cluster without stopping production, one server at a time.
>
> -Server/Client role: As Jeremy said, in GPFS every server act as a client as well. Useful for administrative tasks.
>
> -Troubleshooting: Problems with GPFS are easier to track down. Logs are more clear, and offers better tools than Lustre.
>
> -Support: No problems at all with GPFS support. It is true that it could take time to go up within all support levels, but we always got a good solution. Quite different in terms of hardware.
IBM support quality has drop a lot since about last year an a half. Really slow and tedious process to get replacements. Moreover, we keep receiving bad "certified reutilitzed parts" hardware, which slow the whole process even more. > > > These are the main differences I would stand out after some years of experience with both file systems, but do not take it as a fact. > > PD: Salvatore, I would suggest you to contact Jordi Valls. He joined EBI a couple of months ago, and has experience working with both file systems here at BSC. > > Best Regards, > Sergi. > > > On 08/08/2014 01:40 PM, Jeremy Robst wrote: >> On Fri, 8 Aug 2014, Salvatore Di Nardo wrote: >> >>> Now, skipping all this GSS rant, which have nothing to do with the file >>> system anyway and going back to my question: >>> >>> Could someone point the main differences between GPFS and Lustre? >> >> I'm looking at making the same decision here - to buy GPFS or to roll >> our own Lustre configuration. I'm in the process of setting up test >> systems, and so far the main difference seems to be in the that in GPFS >> each server sees the full filesystem, and so you can run other >> applications (e.g backup) on a GPFS server whereas the Luste OSS (object >> storage servers) see only a portion of the storage (the filesystem is >> striped across the OSSes), so you need a Lustre client to mount the full >> filesystem for things like backup. >> >> However I have very little practical experience of either and would also >> be interested in any comments. >> >> Thanks >> >> Jeremy >> >> >> _______________________________________________ >> gpfsug-discuss mailing list >> gpfsug-discuss at gpfsug.org >> http://gpfsug.org/mailman/listinfo/gpfsug-discuss >> > > > -- > > ------------------------------------------------------------------------ > > Sergi More Codina > Barcelona Supercomputing Center > Centro Nacional de Supercomputacion > WWW: http://www.bsc.es Tel: +34-93-405 42 27 > e-mail: sergi.more at bsc.es Fax: +34-93-413 77 21 > > ------------------------------------------------------------------------ > > WARNING / LEGAL TEXT: This message is intended only for the use of the > individual or entity to which it is addressed and may contain > information which is privileged, confidential, proprietary, or exempt > from disclosure under applicable law. If you are not the intended > recipient or the person responsible for delivering the message to the > intended recipient, you are strictly prohibited from disclosing, > distributing, copying, or in any way using this message. If you have > received this communication in error, please notify the sender and > destroy and delete any copies you may have received. > > http://www.bsc.es/disclaimer.htm > > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at gpfsug.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss From oehmes at us.ibm.com Fri Aug 8 20:09:44 2014 From: oehmes at us.ibm.com (Sven Oehme) Date: Fri, 8 Aug 2014 12:09:44 -0700 Subject: [gpfsug-discuss] GPFS and Lustre on same node In-Reply-To: <4001D2D9-5E74-4EF9-908F-5B0E3443EA5B@gmail.com> References: <53E1F327.1000605@diamond.ac.uk> <53E1FC18.6080707@ebi.ac.uk> <53E2012C.9040402@gpfsug.org> <53E49E20.1090905@ebi.ac.uk> <53E4CD39.7080808@bsc.es> <4001D2D9-5E74-4EF9-908F-5B0E3443EA5B@gmail.com> Message-ID: Vic, Sergi, you can not compare Lustre and GPFS without providing a clear usecase as otherwise you compare apple with oranges. 
the reason for this is quite simple, Lustre plays well in pretty much one usecase - HPC, GPFS on the other hand is used in many forms of deployments from Storage for Virtual Machines, HPC, Scale-Out NAS, Solutions in digital media, to hosting some of the biggest, most business critical Transactional database installations in the world. you look at 2 products with completely different usability spectrum, functions and features unless as said above you narrow it down to a very specific usecase with a lot of details. even just HPC has a very large spectrum and not everybody is working in a single directory, which is the main scale point for Lustre compared to GPFS and the reason is obvious, if you have only 1 active metadata server (which is what 99% of all lustre systems run) some operations like single directory contention is simpler to make fast, but only up to the limit of your one node, but what happens when you need to go beyond that and only a real distributed architecture can support your workload ? for example look at most chip design workloads, which is a form of HPC, it is something thats extremely metadata and small file dominated, you talk about 100's of millions (in some cases even billions) of files, majority of them <4k, the rest larger files , majority of it with random access patterns that benefit from massive client side caching and distributed data coherency models supported by GPFS token manager infrastructure across 10's or 100's of metadata server and 1000's of compute nodes. you also need to look at the rich feature set GPFS provides, which not all may be important for some environments but are for others like Snapshot, Clones, Hierarchical Storage Management (ILM) , Local Cache acceleration (LROC), Global Namespace Wan Integration (AFM), Encryption, etc just to name a few. Sven ------------------------------------------ Sven Oehme Scalable Storage Research email: oehmes at us.ibm.com Phone: +1 (408) 824-8904 IBM Almaden Research Lab ------------------------------------------ From: Vic Cornell To: gpfsug main discussion list Date: 08/08/2014 10:16 AM Subject: Re: [gpfsug-discuss] GPFS and Lustre on same node Sent by: gpfsug-discuss-bounces at gpfsug.org Disclaimers - I work for DDN - we sell lustre and GPFS. I know GPFS much better than I know Lustre. The biggest difference we find between GPFS and Lustre is that GPFS - can usually achieve 90% of the bandwidth available to a single client with a single thread. Lustre needs multiple parallel streams to saturate - say an Infiniband connection. Lustre is often faster than GPFS and often has superior metadata performance - particularly where lots of files are created in a single directory. GPFS can support Windows - Lustre cannot. I think GPFS is better integrated and easier to deploy than Lustre - some people disagree with me. Regards, Vic On 8 Aug 2014, at 14:14, Sergi Mor? Codina wrote: > Hi all, > > About main differences between GPFS and Lustre, here you have some bits from our experience: > > -Reliability: GPFS its been proved to be more stable and reliable. Also offers more flexibility in terms of fail-over. It have no restriction in number of servers. As far as I know, an NSD can have as many secondary servers as you want (we are using 8). > > -Metadata: In Lustre each file system is restricted to two servers. No restriction in GPFS. > > -Updates: In GPFS you can update the whole storage cluster without stopping production, one server at a time. 
> > -Server/Client role: As Jeremy said, in GPFS every server act as a client as well. Useful for administrative tasks. > > -Troubleshooting: Problems with GPFS are easier to track down. Logs are more clear, and offers better tools than Lustre. > > -Support: No problems at all with GPFS support. It is true that it could take time to go up within all support levels, but we always got a good solution. Quite different in terms of hardware. IBM support quality has drop a lot since about last year an a half. Really slow and tedious process to get replacements. Moreover, we keep receiving bad "certified reutilitzed parts" hardware, which slow the whole process even more. > > > These are the main differences I would stand out after some years of experience with both file systems, but do not take it as a fact. > > PD: Salvatore, I would suggest you to contact Jordi Valls. He joined EBI a couple of months ago, and has experience working with both file systems here at BSC. > > Best Regards, > Sergi. > > > On 08/08/2014 01:40 PM, Jeremy Robst wrote: >> On Fri, 8 Aug 2014, Salvatore Di Nardo wrote: >> >>> Now, skipping all this GSS rant, which have nothing to do with the file >>> system anyway and going back to my question: >>> >>> Could someone point the main differences between GPFS and Lustre? >> >> I'm looking at making the same decision here - to buy GPFS or to roll >> our own Lustre configuration. I'm in the process of setting up test >> systems, and so far the main difference seems to be in the that in GPFS >> each server sees the full filesystem, and so you can run other >> applications (e.g backup) on a GPFS server whereas the Luste OSS (object >> storage servers) see only a portion of the storage (the filesystem is >> striped across the OSSes), so you need a Lustre client to mount the full >> filesystem for things like backup. >> >> However I have very little practical experience of either and would also >> be interested in any comments. >> >> Thanks >> >> Jeremy >> >> >> _______________________________________________ >> gpfsug-discuss mailing list >> gpfsug-discuss at gpfsug.org >> http://gpfsug.org/mailman/listinfo/gpfsug-discuss >> > > > -- > > ------------------------------------------------------------------------ > > Sergi More Codina > Barcelona Supercomputing Center > Centro Nacional de Supercomputacion > WWW: http://www.bsc.es Tel: +34-93-405 42 27 > e-mail: sergi.more at bsc.es Fax: +34-93-413 77 21 > > ------------------------------------------------------------------------ > > WARNING / LEGAL TEXT: This message is intended only for the use of the > individual or entity to which it is addressed and may contain > information which is privileged, confidential, proprietary, or exempt > from disclosure under applicable law. If you are not the intended > recipient or the person responsible for delivering the message to the > intended recipient, you are strictly prohibited from disclosing, > distributing, copying, or in any way using this message. If you have > received this communication in error, please notify the sender and > destroy and delete any copies you may have received. 
> > http://www.bsc.es/disclaimer.htm > > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at gpfsug.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at gpfsug.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss -------------- next part -------------- An HTML attachment was scrubbed... URL: From kraemerf at de.ibm.com Sat Aug 9 15:03:02 2014 From: kraemerf at de.ibm.com (Frank Kraemer) Date: Sat, 9 Aug 2014 16:03:02 +0200 Subject: [gpfsug-discuss] GPFS and Lustre In-Reply-To: References: Message-ID: Vic, Sergi, from my point of view for real High-End workloads the complete I/O stack needs to be fine tuned and well understood in order to provide a good system to the users. - Application(s) + I/O Lib(s) + MPI + Parallel Filesystem (e.g. GPFS) + Hardware (Networks, Servers, Disks, etc.) One of the best solutions to bring your application very efficently to work with a Parallel FS is Sionlib from FZ Juelich: Sionlib is a scalable I/O library for the parallel access to task-local files. The library not only supports writing and reading binary data to or from from several thousands of processors into a single or a small number of physical files but also provides for global open and close functions to access SIONlib file in parallel. SIONlib provides different interfaces: parallel access using MPI, OpenMp, or their combination and sequential access for post-processing utilities. http://www.fz-juelich.de/ias/jsc/EN/Expertise/Support/Software/SIONlib/_node.html http://apps.fz-juelich.de/jsc/sionlib/html/sionlib_tutorial_2013.pdf -frank- P.S. Nice blog from Nils https://www.ibm.com/developerworks/community/blogs/storageneers/entry/scale_out_backup_with_tsm_and_gss_performance_test_results?lang=en Frank Kraemer IBM Consulting IT Specialist / Client Technical Architect Hechtsheimer Str. 2, 55131 Mainz mailto:kraemerf at de.ibm.com voice: +49171-3043699 IBM Germany From ewahl at osc.edu Mon Aug 11 14:55:48 2014 From: ewahl at osc.edu (Ed Wahl) Date: Mon, 11 Aug 2014 13:55:48 +0000 Subject: [gpfsug-discuss] GPFS and Lustre In-Reply-To: References: , Message-ID: In a similar vein, IBM has an application transparent "File Cache Library" as well. I believe it IS licensed and the only requirement is that it is for use on IBM hardware only. Saw some presentations that mention it in some BioSci talks @SC13 and the numbers for a couple of selected small read applications were awesome. I probably have the contact info for it around here somewhere. In addition to the pdf/user manual. Ed Wahl Ohio Supercomputer Center ________________________________________ From: gpfsug-discuss-bounces at gpfsug.org [gpfsug-discuss-bounces at gpfsug.org] on behalf of Frank Kraemer [kraemerf at de.ibm.com] Sent: Saturday, August 09, 2014 10:03 AM To: gpfsug-discuss at gpfsug.org Subject: Re: [gpfsug-discuss] GPFS and Lustre Vic, Sergi, from my point of view for real High-End workloads the complete I/O stack needs to be fine tuned and well understood in order to provide a good system to the users. - Application(s) + I/O Lib(s) + MPI + Parallel Filesystem (e.g. GPFS) + Hardware (Networks, Servers, Disks, etc.) One of the best solutions to bring your application very efficently to work with a Parallel FS is Sionlib from FZ Juelich: Sionlib is a scalable I/O library for the parallel access to task-local files. 
The library not only supports writing and reading binary data to or from from several thousands of processors into a single or a small number of physical files but also provides for global open and close functions to access SIONlib file in parallel. SIONlib provides different interfaces: parallel access using MPI, OpenMp, or their combination and sequential access for post-processing utilities.

http://www.fz-juelich.de/ias/jsc/EN/Expertise/Support/Software/SIONlib/_node.html
http://apps.fz-juelich.de/jsc/sionlib/html/sionlib_tutorial_2013.pdf

-frank-

P.S. Nice blog from Nils
https://www.ibm.com/developerworks/community/blogs/storageneers/entry/scale_out_backup_with_tsm_and_gss_performance_test_results?lang=en

Frank Kraemer
IBM Consulting IT Specialist / Client Technical Architect
Hechtsheimer Str. 2, 55131 Mainz
mailto:kraemerf at de.ibm.com
voice: +49171-3043699
IBM Germany
_______________________________________________
gpfsug-discuss mailing list
gpfsug-discuss at gpfsug.org
http://gpfsug.org/mailman/listinfo/gpfsug-discuss

From sabujp at gmail.com Tue Aug 12 23:16:22 2014
From: sabujp at gmail.com (Sabuj Pattanayek)
Date: Tue, 12 Aug 2014 17:16:22 -0500
Subject: [gpfsug-discuss] reduce cnfs failover time to a few seconds
Message-ID:

Hi all,

Is there any way to reduce CNFS failover time to just a few seconds? Currently it seems like it's taking 5 - 10 minutes. We're using virtual IPs, i.e. interface bond1.1550:0 has one of the CNFS VIPs, so it should be fast, but it takes a long time and sometimes causes processes to crash due to NFS timeouts (some have 600 second soft mount timeouts). We've also noticed that it sometimes takes even longer unless the CNFS system on which we're calling mmshutdown is completely shut down and isn't returning pings. Even 1 minute seems too long. For comparison, I'm running ctdb + samba on the other NSDs and it's able to fail over in a few seconds after mmshutdown completes.

Thanks,
Sabuj
-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From sdinardo at ebi.ac.uk Fri Aug 15 14:31:29 2014
From: sdinardo at ebi.ac.uk (Salvatore Di Nardo)
Date: Fri, 15 Aug 2014 14:31:29 +0100
Subject: [gpfsug-discuss] gpfs client expels, fs hangind and waiters
Message-ID: <53EE0BB1.8000005@ebi.ac.uk>

Hello people,

I have been trying to solve a problem with our GPFS system for quite a while now, without much luck, so I think it's time to ask for some help.

First of all, a bit of introduction:

Our GPFS system is made of 3x GSS-26; in other words it consists of 6x servers (4x 10G links each) and several SAS-attached disk enclosures. The total amount of space is roughly 2PB, and the disks are SATA (except a few SSDs dedicated to the logtip). My metadata are on dedicated vdisks, but both data and metadata vdisks are in the same declustered arrays and recovery groups, so in the end they share the same spindles. The client side is an LSF farm configured as another cluster (standard multicluster configuration) of roughly 600 nodes.

The issue:

Recently we became aware that when some massive I/O requests are made we experience a lot of client expels. Here's an example from our logs:

Fri Aug 15 12:40:24.680 2014: Expel 10.7.28.34 (gss03a) request from 172.16.4.138 (ebi3-138 in ebi-cluster.ebi.ac.uk).
Expelling: 172.16.4.138 (ebi3-138 in ebi-cluster.ebi.ac.uk)
Fri Aug 15 12:40:41.652 2014: Expel 10.7.28.66 (gss02b) request from 10.7.34.38 (ebi5-037 in ebi-cluster.ebi.ac.uk).
Expelling: 10.7.34.38 (ebi5-037 in ebi-cluster.ebi.ac.uk) Fri Aug 15 12:40:45.754 2014: Expel 10.7.28.3 (gss01b) request from 172.16.4.58 (ebi3-058 in ebi-cluster.ebi.ac.uk). Expelling: 172.16.4.58 (ebi3-058 in ebi-cluster.ebi.ac.uk) Fri Aug 15 12:40:52.305 2014: Expel 10.7.28.66 (gss02b) request from 10.7.34.68 (ebi5-067 in ebi-cluster.ebi.ac.uk). Expelling: 10.7.34.68 (ebi5-067 in ebi-cluster.ebi.ac.uk) Fri Aug 15 12:41:17.069 2014: Expel 10.7.28.35 (gss03b) request from 172.16.4.161 (ebi3-161 in ebi-cluster.ebi.ac.uk). Expelling: 172.16.4.161 (ebi3-161 in ebi-cluster.ebi.ac.uk) Fri Aug 15 12:41:23.555 2014: Expel 10.7.28.67 (gss02a) request from 172.16.4.136 (ebi3-136 in ebi-cluster.ebi.ac.uk). Expelling: 172.16.4.136 (ebi3-136 in ebi-cluster.ebi.ac.uk) Fri Aug 15 12:41:54.258 2014: Expel 10.7.28.34 (gss03a) request from 10.7.34.22 (ebi5-021 in ebi-cluster.ebi.ac.uk). Expelling: 10.7.34.22 (ebi5-021 in ebi-cluster.ebi.ac.uk) Fri Aug 15 12:41:54.540 2014: Expel 10.7.28.66 (gss02b) request from 10.7.34.57 (ebi5-056 in ebi-cluster.ebi.ac.uk). Expelling: 10.7.34.57 (ebi5-056 in ebi-cluster.ebi.ac.uk) Fri Aug 15 12:42:57.288 2014: Expel 10.7.35.5 (ebi5-132 in ebi-cluster.ebi.ac.uk) request from 10.7.28.34 (gss03a). Expelling: 10.7.35.5 (ebi5-132 in ebi-cluster.ebi.ac.uk) Fri Aug 15 12:43:24.327 2014: Expel 10.7.28.34 (gss03a) request from 10.7.37.99 (ebi5-226 in ebi-cluster.ebi.ac.uk). Expelling: 10.7.37.99 (ebi5-226 in ebi-cluster.ebi.ac.uk) Fri Aug 15 12:44:54.202 2014: Expel 10.7.28.67 (gss02a) request from 172.16.4.165 (ebi3-165 in ebi-cluster.ebi.ac.uk). Expelling: 172.16.4.165 (ebi3-165 in ebi-cluster.ebi.ac.uk) Fri Aug 15 13:15:54.450 2014: Expel 10.7.28.34 (gss03a) request from 10.7.37.89 (ebi5-216 in ebi-cluster.ebi.ac.uk). Expelling: 10.7.37.89 (ebi5-216 in ebi-cluster.ebi.ac.uk) Fri Aug 15 13:20:16.524 2014: Expel 10.7.28.3 (gss01b) request from 172.16.4.55 (ebi3-055 in ebi-cluster.ebi.ac.uk). Expelling: 172.16.4.55 (ebi3-055 in ebi-cluster.ebi.ac.uk) Fri Aug 15 13:26:54.177 2014: Expel 10.7.28.34 (gss03a) request from 10.7.34.64 (ebi5-063 in ebi-cluster.ebi.ac.uk). Expelling: 10.7.34.64 (ebi5-063 in ebi-cluster.ebi.ac.uk) Fri Aug 15 13:27:53.900 2014: Expel 10.7.28.3 (gss01b) request from 10.7.35.15 (ebi5-142 in ebi-cluster.ebi.ac.uk). Expelling: 10.7.35.15 (ebi5-142 in ebi-cluster.ebi.ac.uk) Fri Aug 15 13:28:24.297 2014: Expel 10.7.28.67 (gss02a) request from 172.16.4.50 (ebi3-050 in ebi-cluster.ebi.ac.uk). Expelling: 172.16.4.50 (ebi3-050 in ebi-cluster.ebi.ac.uk) Fri Aug 15 13:29:23.913 2014: Expel 10.7.28.3 (gss01b) request from 172.16.4.156 (ebi3-156 in ebi-cluster.ebi.ac.uk). Expelling: 172.16.4.156 (ebi3-156 in ebi-cluster.ebi.ac.uk) at the same time we experience also long waiters queue (1000+ lines). 
An example in case of massive writes ( dd ) : 0x7F522E1EEF90 waiting 1.861233182 seconds, NSDThread: on ThCond 0x7F5158019B08 (0x7F5158019B08) (MsgRecordCondvar), reason 'RPC wait' for getData on node 10.7.37.101 0x7F522E1EC9B0 waiting 1.490567470 seconds, NSDThread: on ThCond 0x7F50F4038BA8 (0x7F50F4038BA8) (MsgRecordCondvar), reason 'RPC wait' for getData on node 10.7.34.45 0x7F522E1EB6C0 waiting 1.077098046 seconds, NSDThread: on ThCond 0x7F50B40011F8 (0x7F50B40011F8) (MsgRecordCondvar), reason 'RPC wait' for getData on node 172.16.4.156 0x7F522E1EA3D0 waiting 7.714968554 seconds, NSDThread: on ThCond 0x7F50BC0078B8 (0x7F50BC0078B8) (MsgRecordCondvar), reason 'RPC wait' for getData on node 10.7.34.107 0x7F522E1E90E0 waiting 4.774379417 seconds, NSDThread: on ThCond 0x7F506801B1F8 (0x7F506801B1F8) (MsgRecordCondvar), reason 'RPC wait' for getData on node 10.7.34.23 0x7F522E1E7DF0 waiting 0.746172444 seconds, NSDThread: on ThCond 0x7F5094007D78 (0x7F5094007D78) (MsgRecordCondvar), reason 'RPC wait' for getData on node 10.7.34.84 0x7F522E1E6B00 waiting 1.553030487 seconds, NSDThread: on ThCond 0x7F51C0004C78 (0x7F51C0004C78) (MsgRecordCondvar), reason 'RPC wait' for getData on node 10.7.34.63 0x7F522E1E5810 waiting 2.165307633 seconds, NSDThread: on ThCond 0x7F5178016A08 (0x7F5178016A08) (MsgRecordCondvar), reason 'RPC wait' for getData on node 10.7.34.29 0x7F522E1E4520 waiting 1.128089273 seconds, NSDThread: on ThCond 0x7F5074004D98 (0x7F5074004D98) (MsgRecordCondvar), reason 'RPC wait' for getData on node 10.7.34.61 0x7F522E1E3230 waiting 2.515214328 seconds, NSDThread: on ThCond 0x7F51F400EF08 (0x7F51F400EF08) (MsgRecordCondvar), reason 'RPC wait' for getData on node 10.7.37.90 0x7F522E1E1F40 waiting*162.966840834* seconds, NSDThread: on ThCond 0x7F51840207A8 (0x7F51840207A8) (MsgRecordCondvar), reason 'RPC wait' for getData on node 10.7.34.97 0x7F522E1E0C50 waiting 1.140787288 seconds, NSDThread: on ThCond 0x7F51AC005C08 (0x7F51AC005C08) (MsgRecordCondvar), reason 'RPC wait' for getData on node 10.7.37.94 0x7F522E1DF960 waiting 41.907415248 seconds, NSDThread: on ThCond 0x7F5160019038 (0x7F5160019038) (MsgRecordCondvar), reason 'RPC wait' for getData on node 172.16.4.143 0x7F522E1DE670 waiting 0.466560418 seconds, NSDThread: on ThCond 0x7F513802B258 (0x7F513802B258) (MsgRecordCondvar), reason 'RPC wait' for getData on node 172.16.4.168 0x7F522E1DD380 waiting 3.102803621 seconds, NSDThread: on ThCond 0x7F516C0106C8 (0x7F516C0106C8) (MsgRecordCondvar), reason 'RPC wait' for getData on node 10.7.37.91 0x7F522E1DC090 waiting 2.751614295 seconds, NSDThread: on ThCond 0x7F504C0011F8 (0x7F504C0011F8) (MsgRecordCondvar), reason 'RPC wait' for getData on node 10.7.35.25 0x7F522E1DADA0 waiting 5.083691891 seconds, NSDThread: on ThCond 0x7F507401BE88 (0x7F507401BE88) (MsgRecordCondvar), reason 'RPC wait' for getData on node 10.7.34.61 0x7F522E1D9AB0 waiting 2.263374184 seconds, NSDThread: on ThCond 0x7F5080003B98 (0x7F5080003B98) (MsgRecordCondvar), reason 'RPC wait' for getData on node 10.7.35.36 0x7F522E1D87C0 waiting 0.206989639 seconds, NSDThread: on ThCond 0x7F505801F0D8 (0x7F505801F0D8) (MsgRecordCondvar), reason 'RPC wait' for getData on node 10.7.34.55 0x7F522E1D74D0 waiting *41.841279897* seconds, NSDThread: on ThCond 0x7F5194008B88 (0x7F5194008B88) (MsgRecordCondvar), reason 'RPC wait' for getData on node 172.16.4.143 0x7F522E1D61E0 waiting 5.618652361 seconds, NSDThread: on ThCond 0x1BAB868 (0x1BAB868) (MsgRecordCondvar), reason 'RPC wait' for getData on node 10.7.35.59 0x7F522E1D4EF0 
waiting 6.185658427 seconds, NSDThread: on ThCond 0x7F513802AAE8 (0x7F513802AAE8) (MsgRecordCondvar), reason 'RPC wait' for getData on node 10.7.35.6 0x7F522E1D3C00 waiting 2.652370892 seconds, NSDThread: on ThCond 0x7F5130004C78 (0x7F5130004C78) (MsgRecordCondvar), reason 'RPC wait' for getData on node 10.7.34.45 0x7F522E1D2910 waiting 11.396142225 seconds, NSDThread: on ThCond 0x7F51A401C0C8 (0x7F51A401C0C8) (MsgRecordCondvar), reason 'RPC wait' for getData on node 172.16.4.169 0x7F522E1D1620 waiting 63.710723043 seconds, NSDThread: on ThCond 0x7F5038004D08 (0x7F5038004D08) (MsgRecordCondvar), reason 'RPC wait' for getData on node 10.7.37.120 or for massive reads: 0x7FBCE69A8C20 waiting 29.262629530 seconds, NSDThread: on ThCond 0x7FBBF0045D40 (0x7FBBF0045D40) (VdiskLogAppendCondvar), reason 'wait for permission to append to log' 0x7FBCE699CEC0 waiting 29.260869141 seconds, NSDThread: on ThCond 0x7FBBF0045D40 (0x7FBBF0045D40) (VdiskLogAppendCondvar), reason 'wait for permission to append to log' 0x7FBCE698C5A0 waiting 29.124824888 seconds, NSDThread: on ThCond 0x7FBBF0045D40 (0x7FBBF0045D40) (VdiskLogAppendCondvar), reason 'wait for permission to append to log' 0x7FBCE6984110 waiting 22.729479654 seconds, NSDThread: on ThCond 0x7FBBF0045D40 (0x7FBBF0045D40) (VdiskLogAppendCondvar), reason 'wait for permission to append to log' 0x7FBCE69512C0 waiting 29.272805926 seconds, NSDThread: on ThCond 0x7FBBF0045D40 (0x7FBBF0045D40) (VdiskLogAppendCondvar), reason 'wait for permission to append to log' 0x7FBCE69409A0 waiting 28.833650198 seconds, NSDThread: on ThCond 0x18033B74D48 (0xFFFFC90033B74D48) (LeaseWaitCondvar), reason 'Waiting to acquire disklease' 0x7FBCE6924320 waiting 29.237067128 seconds, NSDThread: on ThCond 0x7FBBF0045D40 (0x7FBBF0045D40) (VdiskLogAppendCondvar), reason 'wait for permission to append to log' 0x7FBCE6921D40 waiting 29.237953228 seconds, NSDThread: on ThCond 0x7FBBF0045D40 (0x7FBBF0045D40) (VdiskLogAppendCondvar), reason 'wait for permission to append to log' 0x7FBCE6915FE0 waiting 29.046721161 seconds, NSDThread: on ThCond 0x18033B74D48 (0xFFFFC90033B74D48) (LeaseWaitCondvar), reason 'Waiting to acquire disklease' 0x7FBCE6913A00 waiting 29.264534710 seconds, NSDThread: on ThCond 0x7FBBF0045D40 (0x7FBBF0045D40) (VdiskLogAppendCondvar), reason 'wait for permission to append to log' 0x7FBCE6900B00 waiting 29.267691105 seconds, NSDThread: on ThCond 0x7FBBF0045D40 (0x7FBBF0045D40) (VdiskLogAppendCondvar), reason 'wait for permission to append to log' 0x7FBCE68F7380 waiting 29.266402464 seconds, NSDThread: on ThCond 0x7FBBF0045D40 (0x7FBBF0045D40) (VdiskLogAppendCondvar), reason 'wait for permission to append to log' 0x7FBCE68D2870 waiting 29.276298231 seconds, NSDThread: on ThCond 0x7FBBF0045D40 (0x7FBBF0045D40) (VdiskLogAppendCondvar), reason 'wait for permission to append to log' 0x7FBCE68BADB0 waiting 28.665700576 seconds, NSDThread: on ThCond 0x7FBBF0045D40 (0x7FBBF0045D40) (VdiskLogAppendCondvar), reason 'wait for permission to append to log' 0x7FBCE68B61F0 waiting 29.236878611 seconds, NSDThread: on ThCond 0x7FBBF0045D40 (0x7FBBF0045D40) (VdiskLogAppendCondvar), reason 'wait for permission to append to log' 0x7FBCE6885980 waiting *144*.530487248 seconds, NSDThread: on ThMutex 0x1803396A670 (0xFFFFC9003396A670) (DiskSchedulingMutex) 0x7FBCE68833A0 waiting 29.231066610 seconds, NSDThread: on ThCond 0x7FBBF0045D40 (0x7FBBF0045D40) (VdiskLogAppendCondvar), reason 'wait for permission to append to log' 0x7FBCE68820B0 waiting 29.269954514 seconds, NSDThread: on ThCond 
0x7FBBF0045D40 (0x7FBBF0045D40) (VdiskLogAppendCondvar), reason 'wait for permission to append to log' 0x7FBCE686A5F0 waiting *140*.662994256 seconds, NSDThread: on ThMutex 0x180339A3140 (0xFFFFC900339A3140) (DiskSchedulingMutex) 0x7FBCE6864740 waiting 29.254180742 seconds, NSDThread: on ThCond 0x7FBBF0045D40 (0x7FBBF0045D40) (VdiskLogAppendCondvar), reason 'wait for permission to append to log' 0x7FBCE683FC30 waiting 29.271840565 seconds, NSDThread: on ThCond 0x7FBBF0045D40 (0x7FBBF0045D40) (VdiskLogAppendCondvar), reason 'wait for permission to append to log' 0x7FBCE682E020 waiting 29.200969209 seconds, NSDThread: on ThCond 0x7FBBF0045D40 (0x7FBBF0045D40) (VdiskLogAppendCondvar), reason 'wait for permission to append to log' 0x7FBCE6825B90 waiting 19.136732919 seconds, NSDThread: on ThCond 0x7FBBF0045D40 (0x7FBBF0045D40) (VdiskLogAppendCondvar), reason 'wait for permission to append to log' 0x7FBCE6805C40 waiting 29.236055550 seconds, NSDThread: on ThCond 0x7FBBF0045D40 (0x7FBBF0045D40) (VdiskLogAppendCondvar), reason 'wait for permission to append to log' 0x7FBCE67FEAA0 waiting 29.283264161 seconds, NSDThread: on ThCond 0x7FBBF0045D40 (0x7FBBF0045D40) (VdiskLogAppendCondvar), reason 'wait for permission to append to log' 0x7FBCE67FC4C0 waiting 29.268992663 seconds, NSDThread: on ThCond 0x7FBBF0045D40 (0x7FBBF0045D40) (VdiskLogAppendCondvar), reason 'wait for permission to append to log' 0x7FBCE67DFE40 waiting 29.150900786 seconds, NSDThread: on ThCond 0x7FBBF0045D40 (0x7FBBF0045D40) (VdiskLogAppendCondvar), reason 'wait for permission to append to log' 0x7FBCE67D2DF0 waiting 29.199058463 seconds, NSDThread: on ThCond 0x7FBBF0045D40 (0x7FBBF0045D40) (VdiskLogAppendCondvar), reason 'wait for permission to append to log' 0x7FBCE67D1B00 waiting 29.203199738 seconds, NSDThread: on ThCond 0x7FBBF0045D40 (0x7FBBF0045D40) (VdiskLogAppendCondvar), reason 'wait for permission to append to log' 0x7FBCE67768D0 waiting 29.208231742 seconds, NSDThread: on ThCond 0x7FBBF0045D40 (0x7FBBF0045D40) (VdiskLogAppendCondvar), reason 'wait for permission to append to log' 0x7FBCE6768590 waiting 5.228192589 seconds, NSDThread: on ThCond 0x7FBBF0045D40 (0x7FBBF0045D40) (VdiskLogAppendCondvar), reason 'wait for permission to append to log' 0x7FBCE67672A0 waiting 29.252839376 seconds, NSDThread: on ThCond 0x7FBBF0045D40 (0x7FBBF0045D40) (VdiskLogAppendCondvar), reason 'wait for permission to append to log' 0x7FBCE6757C70 waiting 28.869359044 seconds, NSDThread: on ThCond 0x7FBBF0045D40 (0x7FBBF0045D40) (VdiskLogAppendCondvar), reason 'wait for permission to append to log' 0x7FBCE6748640 waiting 29.289284179 seconds, NSDThread: on ThCond 0x7FBBF0045D40 (0x7FBBF0045D40) (VdiskLogAppendCondvar), reason 'wait for permission to append to log' 0x7FBCE6734450 waiting 29.253591817 seconds, NSDThread: on ThCond 0x7FBBF0045D40 (0x7FBBF0045D40) (VdiskLogAppendCondvar), reason 'wait for permission to append to log' 0x7FBCE6730B80 waiting 29.289987273 seconds, NSDThread: on ThCond 0x7FBBF0045D40 (0x7FBBF0045D40) (VdiskLogAppendCondvar), reason 'wait for permission to append to log' 0x7FBCE6720260 waiting 26.597589551 seconds, NSDThread: on ThCond 0x7FBBF0045D40 (0x7FBBF0045D40) (VdiskLogAppendCondvar), reason 'wait for permission to append to log' 0x7FBCE66F32C0 waiting 29.177692849 seconds, NSDThread: on ThCond 0x7FBBF0045D40 (0x7FBBF0045D40) (VdiskLogAppendCondvar), reason 'wait for permission to append to log' 0x7FBCE66E3C90 waiting 29.160268518 seconds, NSDThread: on ThCond 0x7FBBF0045D40 (0x7FBBF0045D40) 
(VdiskLogAppendCondvar), reason 'wait for permission to append to log' 0x7FBCE66CC1D0 waiting 5.334330188 seconds, NSDThread: on ThCond 0x7FBBF0045D40 (0x7FBBF0045D40) (VdiskLogAppendCondvar), reason 'wait for permission to append to log' 0x7FBCE66B3420 waiting 34.274433161 seconds, NSDThread: on ThMutex 0x180339A3140 (0xFFFFC900339A3140) (DiskSchedulingMutex) 0x7FBCE668E910 waiting 27.699999488 seconds, NSDThread: on ThCond 0x7FBBF0045D40 (0x7FBBF0045D40) (VdiskLogAppendCondvar), reason 'wait for permission to append to log' 0x7FBCE6689D50 waiting 34.279090465 seconds, NSDThread: on ThMutex 0x180339A3140 (0xFFFFC900339A3140) (DiskSchedulingMutex) 0x7FBCE66805D0 waiting 24.688626241 seconds, NSDThread: on ThCond 0x7FBBF0045D40 (0x7FBBF0045D40) (VdiskLogAppendCondvar), reason 'wait for permission to append to log' 0x7FBCE6675B60 waiting 35.367745840 seconds, NSDThread: on ThCond 0x18033B74D48 (0xFFFFC90033B74D48) (LeaseWaitCondvar), reason 'Waiting to acquire disklease' 0x7FBCE665E0A0 waiting 29.235994598 seconds, NSDThread: on ThCond 0x7FBBF0045D40 (0x7FBBF0045D40) (VdiskLogAppendCondvar), reason 'wait for permission to append to log' 0x7FBCE663CE60 waiting 29.162911979 seconds, NSDThread: on ThCond 0x7FBBF0045D40 (0x7FBBF0045D40) (VdiskLogAppendCondvar), reason 'wait for permission to append to log' Another example with mmfsadm in case of massive reads: [root at gss02b ~]# mmfsadm dump waiters 0x7F519000AEA0 waiting 28.915010347 seconds, replyCleanupThread: on ThCond 0x7F51101B27B8 (0x7F51101B27B8) (MsgRecordCondvar), reason 'RPC wait' 0x7F511C012A10 waiting 279.522206863 seconds, Msg handler commMsgCheckMessages: on ThCond 0x7F52000095F8 (0x7F52000095F8) (InuseCondvar), reason 'waiting for exclusive use of connection for sending msg' 0x7F5120000B80 waiting 279.524782437 seconds, Msg handler commMsgCheckMessages: on ThCond 0x7F5214000EE8 (0x7F5214000EE8) (InuseCondvar), reason 'waiting for exclusive use of connection for sending msg' 0x7F5154006310 waiting 138.164386224 seconds, Msg handler commMsgCheckMessages: on ThCond 0x7F5174003F08 (0x7F5174003F08) (InuseCondvar), reason 'waiting for exclusive use of connection for sending msg' 0x7F522E1EB6C0 waiting 23.060703000 seconds, NSDThread: for poll on sock 85 0x7F522E1E6B00 waiting 0.068456104 seconds, NSDThread: on ThCond 0x7F50CC00E478 (0x7F50CC00E478) (InuseCondvar), reason 'waiting for exclusive use of connection for sending msg' 0x7F522E1D0330 waiting 17.207907857 seconds, NSDThread: on ThCond 0x7F5078001688 (0x7F5078001688) (InuseCondvar), reason 'waiting for exclusive use of connection for sending msg' 0x7F522E1BFA10 waiting 0.181011711 seconds, NSDThread: on ThCond 0x7F504000E558 (0x7F504000E558) (InuseCondvar), reason 'waiting for exclusive use of connection for sending msg' 0x7F522E1B4FA0 waiting 0.021780338 seconds, NSDThread: on ThCond 0x7F522000E488 (0x7F522000E488) (InuseCondvar), reason 'waiting for exclusive use of connection for sending msg' 0x7F522E1B3CB0 waiting 0.794718000 seconds, NSDThread: for poll on sock 799 0x7F522E186D10 waiting 0.191606803 seconds, NSDThread: on ThCond 0x7F5184015D58 (0x7F5184015D58) (InuseCondvar), reason 'waiting for exclusive use of connection for sending msg' 0x7F522E184730 waiting 0.025562000 seconds, NSDThread: for poll on sock 867 0x7F522E12CDD0 waiting 0.008921000 seconds, NSDThread: for poll on sock 543 0x7F522E126F20 waiting 1.459531000 seconds, NSDThread: for poll on sock 983 0x7F522E10F460 waiting 17.177936972 seconds, NSDThread: on ThCond 0x7F51EC002CE8 (0x7F51EC002CE8) 
(InuseCondvar), reason 'waiting for exclusive use of connection for sending msg' 0x7F522E101120 waiting 17.232580316 seconds, NSDThread: on ThCond 0x7F51BC005BB8 (0x7F51BC005BB8) (InuseCondvar), reason 'waiting for exclusive use of connection for sending msg' 0x7F522E0F1AF0 waiting 438.556030000 seconds, NSDThread: for poll on sock 496 0x7F522E0E7080 waiting 393.702839774 seconds, NSDThread: on ThCond 0x7F5164013668 (0x7F5164013668) (InuseCondvar), reason 'waiting for exclusive use of connection for sending msg' 0x7F522E09DA60 waiting 52.746984660 seconds, NSDThread: on ThCond 0x7F506C008858 (0x7F506C008858) (InuseCondvar), reason 'waiting for exclusive use of connection for sending msg' 0x7F522E084CB0 waiting 23.096688206 seconds, NSDThread: on ThCond 0x7F521C008E18 (0x7F521C008E18) (InuseCondvar), reason 'waiting for exclusive use of connection for sending msg' 0x7F522E0839C0 waiting 0.093456000 seconds, NSDThread: for poll on sock 962 0x7F522E076970 waiting 2.236659731 seconds, NSDThread: on ThCond 0x7F51E0027538 (0x7F51E0027538) (InuseCondvar), reason 'waiting for exclusive use of connection for sending msg' 0x7F522E044E10 waiting 52.752497765 seconds, NSDThread: on ThCond 0x7F513802BDD8 (0x7F513802BDD8) (InuseCondvar), reason 'waiting for exclusive use of connection for sending msg' 0x7F522E033200 waiting 16.157355796 seconds, NSDThread: on ThCond 0x7F5104240D58 (0x7F5104240D58) (InuseCondvar), reason 'waiting for exclusive use of connection for sending msg' 0x7F522E02AD70 waiting 436.025203220 seconds, NSDThread: on ThCond 0x7F50E0016C28 (0x7F50E0016C28) (InuseCondvar), reason 'waiting for exclusive use of connection for sending msg' 0x7F522E01A450 waiting 393.673252777 seconds, NSDThread: on ThCond 0x7F50A8009C18 (0x7F50A8009C18) (InuseCondvar), reason 'waiting for exclusive use of connection for sending msg' 0x7F522DFE0460 waiting 1.781358358 seconds, NSDThread: on ThCond 0x7F51E0027638 (0x7F51E0027638) (InuseCondvar), reason 'waiting for exclusive use of connection for sending msg' 0x7F522DF99420 waiting 0.038405427 seconds, NSDThread: on ThCond 0x7F50F0172B18 (0x7F50F0172B18) (InuseCondvar), reason 'waiting for exclusive use of connection for sending msg' 0x7F522DF7CDA0 waiting 438.204625355 seconds, NSDThread: on ThCond 0x7F50900023D8 (0x7F50900023D8) (InuseCondvar), reason 'waiting for exclusive use of connection for sending msg' 0x7F522DF76EF0 waiting 435.903645734 seconds, NSDThread: on ThCond 0x7F5084004BC8 (0x7F5084004BC8) (InuseCondvar), reason 'waiting for exclusive use of connection for sending msg' 0x7F522DF74910 waiting 21.749325022 seconds, NSDThread: on ThCond 0x7F507C011F48 (0x7F507C011F48) (InuseCondvar), reason 'waiting for exclusive use of connection for sending msg' 0x7F522DF71040 waiting 1.027274000 seconds, NSDThread: for poll on sock 866 0x7F522DF536D0 waiting 52.953847324 seconds, NSDThread: on ThCond 0x7F5200006FF8 (0x7F5200006FF8) (InuseCondvar), reason 'waiting for exclusive use of connection for sending msg' 0x7F522DF510F0 waiting 0.039278000 seconds, NSDThread: for poll on sock 837 0x7F522DF4EB10 waiting 0.085745937 seconds, NSDThread: on ThCond 0x7F51F0006828 (0x7F51F0006828) (InuseCondvar), reason 'waiting for exclusive use of connection for sending msg' 0x7F522DF4C530 waiting 21.850733000 seconds, NSDThread: for poll on sock 986 0x7F522DF4B240 waiting 0.054739884 seconds, NSDThread: on ThCond 0x7F51EC0168D8 (0x7F51EC0168D8) (InuseCondvar), reason 'waiting for exclusive use of connection for sending msg' 0x7F522DF48C60 waiting 0.186409714 seconds, 
NSDThread: on ThCond 0x7F51E4000908 (0x7F51E4000908) (InuseCondvar), reason 'waiting for exclusive use of connection for sending msg' 0x7F522DF41AC0 waiting 438.942861290 seconds, NSDThread: on ThCond 0x7F51CC010168 (0x7F51CC010168) (InuseCondvar), reason 'waiting for exclusive use of connection for sending msg' 0x7F522DF3F4E0 waiting 0.060235106 seconds, NSDThread: on ThCond 0x7F51C400A438 (0x7F51C400A438) (InuseCondvar), reason 'waiting for exclusive use of connection for sending msg' 0x7F522DF22E60 waiting 0.361288000 seconds, NSDThread: for poll on sock 518 0x7F522DF21B70 waiting 0.060722464 seconds, NSDThread: on ThCond 0x7F51580162D8 (0x7F51580162D8) (InuseCondvar), reason 'waiting for exclusive use of connection for sending msg' 0x7F522DF12540 waiting 23.077564448 seconds, NSDThread: on ThCond 0x7F512C13E1E8 (0x7F512C13E1E8) (InuseCondvar), reason 'waiting for exclusive use of connection for sending msg' 0x7F522DEFD060 waiting 0.723370000 seconds, NSDThread: for poll on sock 503 0x7F522DEE09E0 waiting 1.565799175 seconds, NSDThread: on ThCond 0x7F5084004D58 (0x7F5084004D58) (InuseCondvar), reason 'waiting for exclusive use of connection for sending msg' 0x7F522DEDF6F0 waiting 22.063017342 seconds, NSDThread: on ThCond 0x7F5078003E08 (0x7F5078003E08) (InuseCondvar), reason 'waiting for exclusive use of connection for sending msg' 0x7F522DEDD110 waiting 0.049108780 seconds, NSDThread: on ThCond 0x7F5070001D78 (0x7F5070001D78) (InuseCondvar), reason 'waiting for exclusive use of connection for sending msg' 0x7F522DEDAB30 waiting 229.603224376 seconds, NSDThread: on ThCond 0x7F50680221B8 (0x7F50680221B8) (InuseCondvar), reason 'waiting for exclusive use of connection for sending msg' 0x7F522DED7260 waiting 0.071855457 seconds, NSDThread: on ThCond 0x7F506400A5A8 (0x7F506400A5A8) (InuseCondvar), reason 'waiting for exclusive use of connection for sending msg' 0x7F522DED5F70 waiting 0.648324000 seconds, NSDThread: for poll on sock 766 0x7F522DEC3070 waiting 1.809205756 seconds, NSDThread: on ThCond 0x7F522000E518 (0x7F522000E518) (InuseCondvar), reason 'waiting for exclusive use of connection for sending msg' 0x7F522DEB1460 waiting 436.017396645 seconds, NSDThread: on ThCond 0x7F51E4000978 (0x7F51E4000978) (InuseCondvar), reason 'waiting for exclusive use of connection for sending msg' 0x7F522DEAC8A0 waiting 393.734102000 seconds, NSDThread: for poll on sock 609 0x7F522DEA3120 waiting 17.960778837 seconds, NSDThread: on ThCond 0x7F51B4001708 (0x7F51B4001708) (InuseCondvar), reason 'waiting for exclusive use of connection for sending msg' 0x7F522DE86AA0 waiting 23.112060045 seconds, NSDThread: on ThCond 0x7F5154096118 (0x7F5154096118) (InuseCondvar), reason 'waiting for exclusive use of connection for sending msg' 0x7F522DE64570 waiting 0.076167410 seconds, NSDThread: on ThCond 0x7F50D8005EF8 (0x7F50D8005EF8) (InuseCondvar), reason 'waiting for exclusive use of connection for sending msg' 0x7F522DE1AF50 waiting 17.460836000 seconds, NSDThread: for poll on sock 737 0x7F522DE104E0 waiting 0.205037000 seconds, NSDThread: for poll on sock 865 0x7F522DDB8B80 waiting 0.106192000 seconds, NSDThread: for poll on sock 78 0x7F522DDA36A0 waiting 0.738921180 seconds, NSDThread: on ThCond 0x7F505400E048 (0x7F505400E048) (InuseCondvar), reason 'waiting for exclusive use of connection for sending msg' 0x7F522DD9C500 waiting 0.731118367 seconds, NSDThread: on ThCond 0x7F503C00B518 (0x7F503C00B518) (InuseCondvar), reason 'waiting for exclusive use of connection for sending msg' 0x7F522DD89600 waiting 
229.609363000 seconds, NSDThread: for poll on sock 515 0x7F522DD567B0 waiting 1.508489195 seconds, NSDThread: on ThCond 0x7F514C021F88 (0x7F514C021F88) (InuseCondvar), reason 'waiting for exclusive use of connection for sending msg' Another thing worth to mention is that the filesystem its totaly unresponsive. Even a simple "cd" to a directory or an ls to a directory just hangs for several minutes ( litterally). This happens also if i try from the NSD servers. *Few things i have looked into:* * Our network seems fine, there might be some bottleneck on part of them, and this could explain the waiters, but doesnt explain why ad some poit those client ask to expel the NSD servers. THis also doesn't justify why the FS is slow even on NSD itself. * Disk bottleneck? i dont think so. NSD servers have cpu usage (and io wait ) very low. Also mmdiag --iohist seems condirming that the operation on the disks are reasonable fast: === mmdiag: iohist === I/O history: I/O start time RW Buf type disk:sectorNum nSec time ms Type Device/NSD ID NSD server --------------- -- ----------- ----------------- ----- ------- ---- ------------------ --------------- 13:54:29.209276 W data 34:5066338808 2056 88.307 lcl sdtu 13:54:29.209277 W data 55:5095698936 2056 27.592 lcl sdaab 13:54:29.209278 W data 171:5104087544 2056 22.801 lcl sdtg 13:54:29.209279 W data 116:5011812856 2056 65.983 lcl sdqr 13:54:29.209280 W data 98:4860817912 2056 17.892 lcl sddl 13:54:29.209281 W data 159:4999229944 2056 21.324 lcl sdjg 13:54:29.209282 W data 84:5049561592 2056 31.932 lcl sdqz 13:54:29.209283 W data 8:5003424248 2056 30.912 lcl sdcw 13:54:29.209284 W data 23:4965675512 2056 27.366 lcl sdpt 13:54:29.297715 W vdiskMDLog 2:144008496 1 0.236 lcl sdkr 13:54:29.297717 W vdiskMDLog 0:331703600 1 0.230 lcl sdcm 13:54:29.297718 W vdiskMDLog 1:273769776 1 0.241 lcl sdbp 13:54:29.244902 W data 51:3857589752 2056 35.566 lcl sdyi 13:54:29.244904 W data 10:3773703672 2056 28.512 lcl sdma 13:54:29.244905 W data 48:3639485944 2056 24.124 lcl sdel 13:54:29.244906 W data 25:3777897976 2056 18.691 lcl sdgt 13:54:29.244908 W data 91:3832423928 2056 20.699 lcl sdlc 13:54:29.244909 W data 115:3723372024 2056 30.783 lcl sdho 13:54:29.244910 W data 173:3882755576 2056 53.241 lcl sdti 13:54:29.244911 W data 42:3782092280 2056 22.785 lcl sddz 13:54:29.244912 W data 45:3647874552 2056 24.289 lcl sdei 13:54:29.244913 W data 32:3652068856 2056 17.220 lcl sdbn 13:54:29.244914 W data 39:3677234680 2056 26.017 lcl sddw 13:54:29.298273 W vdiskMDLog 2:144008497 1 2.522 lcl sduf 13:54:29.298274 W vdiskMDLog 0:331703601 1 1.025 lcl sdlo 13:54:29.298275 W vdiskMDLog 1:273769777 1 2.586 lcl sdtt 13:54:29.288275 W data 27:2249588200 2056 20.071 lcl sdhb 13:54:29.288279 W data 33:2224422376 2056 19.682 lcl sdts 13:54:29.288281 W data 47:2115370472 2056 21.667 lcl sdwo 13:54:29.288282 W data 82:2316697064 2056 21.524 lcl sdxy 13:54:29.288283 W data 85:2232810984 2056 17.467 lcl sdra 13:54:29.288285 W data 30:2127953384 2056 18.475 lcl sdqg 13:54:29.288286 W data 67:1876295144 2056 16.383 lcl sdmx 13:54:29.288287 W data 64:2127953384 2056 21.908 lcl sduh 13:54:29.288288 W data 38:2253782504 2056 19.775 lcl sddv 13:54:29.288290 W data 15:2207645160 2056 20.599 lcl sdet 13:54:29.288291 W data 157:2283142632 2056 21.198 lcl sdiy Bonding problem on the interfaces? Mellanox ( interface card prodicer) drivers and firmware updated, and we even tested the system with a single link ( without bonding). Could someone help me with this? 
in particular: * What exactly are client are looking to decide that another node is unresponsive? Ping? i dont think so because both NSD servers and clients can be pinged, so what they look? if comeone can also specify what port are they using i can try to tcpdump what exactly is cauding this expell. * How can i monitor metadata operations to understand where EXACTLY is the bottleneck that causes this: [sdinardo at ebi5-001 ~]$ time ls /gpfs/nobackup/sdinardo 1 ebi3-054.ebi.ac.uk ebi3-154 ebi5-019.ebi.ac.uk ebi5-052 ebi5-101 ebi5-156 ebi5-197 ebi5-228 ebi5-262.ebi.ac.uk 10 ebi3-055 ebi3-155 ebi5-021.ebi.ac.uk ebi5-053 ebi5-104.ebi.ac.uk ebi5-160.ebi.ac.uk ebi5-198 ebi5-229 ebi5-263 2 ebi3-056.ebi.ac.uk ebi3-156 ebi5-022 ebi5-054.ebi.ac.uk ebi5-106 ebi5-161 ebi5-200 ebi5-230.ebi.ac.uk ebi5-264 3 ebi3-057 ebi3-157 ebi5-023 ebi5-056 ebi5-109 ebi5-162.ebi.ac.uk ebi5-201 ebi5-231.ebi.ac.uk ebi5-265 4 ebi3-058 ebi3-158.ebi.ac.uk ebi5-024.ebi.ac.uk ebi5-057 ebi5-110.ebi.ac.uk ebi5-163.ebi.ac.uk ebi5-202.ebi.ac.uk ebi5-232 ebi5-266.ebi.ac.uk 5 ebi3-059.ebi.ac.uk ebi3-160 ebi5-025 ebi5-060 ebi5-111.ebi.ac.uk ebi5-164 ebi5-204 ebi5-233 ebi5-267 6 ebi3-132 ebi3-161.ebi.ac.uk ebi5-026 ebi5-061.ebi.ac.uk ebi5-112.ebi.ac.uk ebi5-165 ebi5-205 ebi5-234 ebi5-269.ebi.ac.uk 7 ebi3-133 ebi3-163.ebi.ac.uk ebi5-028 ebi5-062.ebi.ac.uk ebi5-129.ebi.ac.uk ebi5-166 ebi5-206.ebi.ac.uk ebi5-236 ebi5-270 8 ebi3-134 ebi3-165 ebi5-030 ebi5-064 ebi5-131.ebi.ac.uk ebi5-169.ebi.ac.uk ebi5-207 ebi5-237 ebi5-271 9 ebi3-135 ebi3-166.ebi.ac.uk ebi5-031 ebi5-065 ebi5-132 ebi5-170.ebi.ac.uk ebi5-209 ebi5-239.ebi.ac.uk launcher.sh _*real 21m14.948s*_( WTH ?!?!?!) user 0m0.004s sys 0m0.014s I know that the question are not easy to answer, and i need to dig more, but could be very helpful if someone give me some hints about where to look at. My gpfs skills are limited since this is our first system and is in production for just few months, and the things stated to worsen just recenlty. In past we could get over 200Gb/s ( both read and write) without any issue. Now some clients get expelled even when data thoughuput is ad 4-5Gb/s. Thanks in advance for any help. Regards, Salvatore -------------- next part -------------- An HTML attachment was scrubbed... URL: From mail at arif-ali.co.uk Tue Aug 19 11:18:10 2014 From: mail at arif-ali.co.uk (Arif Ali) Date: Tue, 19 Aug 2014 11:18:10 +0100 Subject: [gpfsug-discuss] gpfsug Maintenance Message-ID: Hi all, You may be aware that the website has been down for about a week now. This is due to the amount of traffic to the website and the amount of people on the mailing list, we had seen a few issues on the system. In order to counter the issues, we are moving to a new system to counter any future issues, and ease of management. We are hoping to do this tonight ( between 20:00 - 23:00 BST). If this causes an issue for anyone, then please let me know. I will, as part of the move over, will be sending a few test mails to make sure that mailing list is working correctly. Thanks for your patience -- Arif Ali gpfsug Admin IRC: arif-ali at freenode LinkedIn: http://uk.linkedin.com/in/arifali -------------- next part -------------- An HTML attachment was scrubbed... URL: From sdinardo at ebi.ac.uk Tue Aug 19 12:11:00 2014 From: sdinardo at ebi.ac.uk (Salvatore Di Nardo) Date: Tue, 19 Aug 2014 12:11:00 +0100 Subject: [gpfsug-discuss] gpfs client expels In-Reply-To: <53EE0BB1.8000005@ebi.ac.uk> References: <53EE0BB1.8000005@ebi.ac.uk> Message-ID: <53F330C4.808@ebi.ac.uk> Still problems. 
Here some more detailed examples: *EXAMPLE 1:* *EBI5-220**( CLIENT)** *Tue Aug 19 11:03:04.980 2014: *Timed out waiting for a reply from node gss02b* Tue Aug 19 11:03:04.981 2014: Request sent to (gss02a in GSS.ebi.ac.uk) to expel (gss02b in GSS.ebi.ac.uk) from cluster GSS.ebi.ac.uk Tue Aug 19 11:03:04.982 2014: This node will be expelled from cluster GSS.ebi.ac.uk due to expel msg from (ebi5-220) Tue Aug 19 11:03:09.319 2014: Cluster Manager connection broke. Probing cluster GSS.ebi.ac.uk Tue Aug 19 11:03:10.321 2014: Unable to contact any quorum nodes during cluster probe. Tue Aug 19 11:03:10.322 2014: Lost membership in cluster GSS.ebi.ac.uk. Unmounting file systems. Tue Aug 19 11:03:10 BST 2014: mmcommon preunmount invoked. File system: gpfs1 Reason: SGPanic Tue Aug 19 11:03:12.066 2014: Connecting to gss02a Tue Aug 19 11:03:12.070 2014: Connected to gss02a Tue Aug 19 11:03:17.071 2014: Connecting to gss02b Tue Aug 19 11:03:17.072 2014: Connecting to gss03b Tue Aug 19 11:03:17.079 2014: Connecting to gss03a Tue Aug 19 11:03:17.080 2014: Connecting to gss01b Tue Aug 19 11:03:17.079 2014: Connecting to gss01a Tue Aug 19 11:04:23.105 2014: Connected to gss02b Tue Aug 19 11:04:23.107 2014: Connected to gss03b Tue Aug 19 11:04:23.112 2014: Connected to gss03a Tue Aug 19 11:04:23.115 2014: Connected to gss01b Tue Aug 19 11:04:23.121 2014: Connected to gss01a Tue Aug 19 11:12:28.992 2014: Node (gss02a in GSS.ebi.ac.uk) is now the Group Leader. *GSS02B ( NSD SERVER)* ... Tue Aug 19 11:03:17.070 2014: Killing connection from ** because the group is not ready for it to rejoin, err 46 Tue Aug 19 11:03:25.016 2014: Killing connection from because the group is not ready for it to rejoin, err 46 Tue Aug 19 11:03:28.080 2014: Killing connection from ** because the group is not ready for it to rejoin, err 46 Tue Aug 19 11:03:36.019 2014: Killing connection from because the group is not ready for it to rejoin, err 46 Tue Aug 19 11:03:39.083 2014: Killing connection from ** because the group is not ready for it to rejoin, err 46 Tue Aug 19 11:03:47.023 2014: Killing connection from because the group is not ready for it to rejoin, err 46 Tue Aug 19 11:03:50.088 2014: Killing connection from ** because the group is not ready for it to rejoin, err 46 Tue Aug 19 11:03:52.218 2014: Killing connection from because the group is not ready for it to rejoin, err 46 Tue Aug 19 11:03:58.030 2014: Killing connection from because the group is not ready for it to rejoin, err 46 Tue Aug 19 11:04:01.092 2014: Killing connection from ** because the group is not ready for it to rejoin, err 46 Tue Aug 19 11:04:03.220 2014: Killing connection from because the group is not ready for it to rejoin, err 46 Tue Aug 19 11:04:09.034 2014: Killing connection from because the group is not ready for it to rejoin, err 46 Tue Aug 19 11:04:12.096 2014: Killing connection from ** because the group is not ready for it to rejoin, err 46 Tue Aug 19 11:04:14.224 2014: Killing connection from because the group is not ready for it to rejoin, err 46 Tue Aug 19 11:04:20.037 2014: Killing connection from because the group is not ready for it to rejoin, err 46 Tue Aug 19 11:04:23.103 2014: Accepted and connected to ** ebi5-220 ... *GSS02a ( NSD SERVER)* Tue Aug 19 11:03:04.980 2014: Expel (gss02b) request from (ebi5-220 in ebi-cluster.ebi.ac.uk). 
Expelling: (ebi5-220 in ebi-cluster.ebi.ac.uk) Tue Aug 19 11:03:12.069 2014: Accepted and connected to ebi5-220 =============================================== *EXAMPLE 2*: *EBI5-038* Tue Aug 19 11:32:34.227 2014: *Disk lease period expired in cluster GSS.ebi.ac.uk. Attempting to reacquire lease.* Tue Aug 19 11:33:34.258 2014: *Lease is overdue. Probing cluster GSS.ebi.ac.uk* Tue Aug 19 11:35:24.265 2014: Close connection to gss02a (Connection reset by peer). Attempting reconnect. Tue Aug 19 11:35:24.865 2014: Close connection to ebi5-014 (Connection reset by peer). Attempting reconnect. ... LOT MORE RESETS BY PEER ... Tue Aug 19 11:35:25.096 2014: Close connection to ebi5-167 (Connection reset by peer). Attempting reconnect. Tue Aug 19 11:35:25.267 2014: Connecting to gss02a Tue Aug 19 11:35:25.268 2014: Close connection to gss02a (Connection failed because destination is still processing previous node failure) Tue Aug 19 11:35:26.267 2014: Retry connection to gss02a Tue Aug 19 11:35:26.268 2014: Close connection to gss02a (Connection failed because destination is still processing previous node failure) Tue Aug 19 11:36:24.276 2014: Unable to contact any quorum nodes during cluster probe. Tue Aug 19 11:36:24.277 2014: *Lost membership in cluster GSS.ebi.ac.uk. Unmounting file systems.* *GSS02a* Tue Aug 19 11:35:24.263 2014: Node (ebi5-038 in ebi-cluster.ebi.ac.uk) *is being expelled because of an expired lease.* Pings sent: 60. Replies received: 60. In example 1 seems that an NSD was not repliyng to the client, but the servers seems working fine.. how can i trace better ( to solve) the problem? In example 2 it seems to me that for some reason the manager are not renewing the lease in time. when this happens , its not a single client. Loads of them fail to get the lease renewed. Why this is happening? how can i trace to the source of the problem? Thanks in advance for any tips. Regards, Salvatore -------------- next part -------------- An HTML attachment was scrubbed... URL: From mail at arif-ali.co.uk Tue Aug 19 20:59:47 2014 From: mail at arif-ali.co.uk (Arif Ali) Date: Tue, 19 Aug 2014 20:59:47 +0100 Subject: [gpfsug-discuss] gpfsug Maintenance In-Reply-To: References: Message-ID: This is a test mail to the mailing list please do not reply -- Arif Ali IRC: arif-ali at freenode LinkedIn: http://uk.linkedin.com/in/arifali On 19 August 2014 11:18, Arif Ali wrote: > Hi all, > > You may be aware that the website has been down for about a week now. This > is due to the amount of traffic to the website and the amount of people on > the mailing list, we had seen a few issues on the system. > > In order to counter the issues, we are moving to a new system to counter > any future issues, and ease of management. We are hoping to do this tonight > ( between 20:00 - 23:00 BST). If this causes an issue for anyone, then > please let me know. > > I will, as part of the move over, will be sending a few test mails to make > sure that mailing list is working correctly. > > Thanks for your patience > > -- > Arif Ali > gpfsug Admin > > IRC: arif-ali at freenode > LinkedIn: http://uk.linkedin.com/in/arifali > -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From mail at arif-ali.co.uk Tue Aug 19 23:41:48 2014 From: mail at arif-ali.co.uk (Arif Ali) Date: Tue, 19 Aug 2014 23:41:48 +0100 Subject: [gpfsug-discuss] gpfsug Maintenance In-Reply-To: References: Message-ID: Thanks for all your patience, The service should all be back up again -- Arif Ali IRC: arif-ali at freenode LinkedIn: http://uk.linkedin.com/in/arifali On 19 August 2014 20:59, Arif Ali wrote: > This is a test mail to the mailing list > > please do not reply > > -- > Arif Ali > > IRC: arif-ali at freenode > LinkedIn: http://uk.linkedin.com/in/arifali > > > On 19 August 2014 11:18, Arif Ali wrote: > >> Hi all, >> >> You may be aware that the website has been down for about a week now. >> This is due to the amount of traffic to the website and the amount of >> people on the mailing list, we had seen a few issues on the system. >> >> In order to counter the issues, we are moving to a new system to counter >> any future issues, and ease of management. We are hoping to do this tonight >> ( between 20:00 - 23:00 BST). If this causes an issue for anyone, then >> please let me know. >> >> I will, as part of the move over, will be sending a few test mails to >> make sure that mailing list is working correctly. >> >> Thanks for your patience >> >> -- >> Arif Ali >> gpfsug Admin >> >> IRC: arif-ali at freenode >> LinkedIn: http://uk.linkedin.com/in/arifali >> > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From sdinardo at ebi.ac.uk Wed Aug 20 08:57:23 2014 From: sdinardo at ebi.ac.uk (Salvatore Di Nardo) Date: Wed, 20 Aug 2014 08:57:23 +0100 Subject: [gpfsug-discuss] gpfs client expels In-Reply-To: <53EE0BB1.8000005@ebi.ac.uk> References: <53EE0BB1.8000005@ebi.ac.uk> Message-ID: <53F454E3.40803@ebi.ac.uk> Still problems. Here some more detailed examples: *EXAMPLE 1:* *EBI5-220**( CLIENT)** *Tue Aug 19 11:03:04.980 2014: *Timed out waiting for a reply from node gss02b* Tue Aug 19 11:03:04.981 2014: Request sent to (gss02a in GSS.ebi.ac.uk) to expel (gss02b in GSS.ebi.ac.uk) from cluster GSS.ebi.ac.uk Tue Aug 19 11:03:04.982 2014: This node will be expelled from cluster GSS.ebi.ac.uk due to expel msg from (ebi5-220) Tue Aug 19 11:03:09.319 2014: Cluster Manager connection broke. Probing cluster GSS.ebi.ac.uk Tue Aug 19 11:03:10.321 2014: Unable to contact any quorum nodes during cluster probe. Tue Aug 19 11:03:10.322 2014: Lost membership in cluster GSS.ebi.ac.uk. Unmounting file systems. Tue Aug 19 11:03:10 BST 2014: mmcommon preunmount invoked. File system: gpfs1 Reason: SGPanic Tue Aug 19 11:03:12.066 2014: Connecting to gss02a Tue Aug 19 11:03:12.070 2014: Connected to gss02a Tue Aug 19 11:03:17.071 2014: Connecting to gss02b Tue Aug 19 11:03:17.072 2014: Connecting to gss03b Tue Aug 19 11:03:17.079 2014: Connecting to gss03a Tue Aug 19 11:03:17.080 2014: Connecting to gss01b Tue Aug 19 11:03:17.079 2014: Connecting to gss01a Tue Aug 19 11:04:23.105 2014: Connected to gss02b Tue Aug 19 11:04:23.107 2014: Connected to gss03b Tue Aug 19 11:04:23.112 2014: Connected to gss03a Tue Aug 19 11:04:23.115 2014: Connected to gss01b Tue Aug 19 11:04:23.121 2014: Connected to gss01a Tue Aug 19 11:12:28.992 2014: Node (gss02a in GSS.ebi.ac.uk) is now the Group Leader. *GSS02B ( NSD SERVER)* ... 
Tue Aug 19 11:03:17.070 2014: Killing connection from ** because the group is not ready for it to rejoin, err 46 Tue Aug 19 11:03:25.016 2014: Killing connection from because the group is not ready for it to rejoin, err 46 Tue Aug 19 11:03:28.080 2014: Killing connection from ** because the group is not ready for it to rejoin, err 46 Tue Aug 19 11:03:36.019 2014: Killing connection from because the group is not ready for it to rejoin, err 46 Tue Aug 19 11:03:39.083 2014: Killing connection from ** because the group is not ready for it to rejoin, err 46 Tue Aug 19 11:03:47.023 2014: Killing connection from because the group is not ready for it to rejoin, err 46 Tue Aug 19 11:03:50.088 2014: Killing connection from ** because the group is not ready for it to rejoin, err 46 Tue Aug 19 11:03:52.218 2014: Killing connection from because the group is not ready for it to rejoin, err 46 Tue Aug 19 11:03:58.030 2014: Killing connection from because the group is not ready for it to rejoin, err 46 Tue Aug 19 11:04:01.092 2014: Killing connection from ** because the group is not ready for it to rejoin, err 46 Tue Aug 19 11:04:03.220 2014: Killing connection from because the group is not ready for it to rejoin, err 46 Tue Aug 19 11:04:09.034 2014: Killing connection from because the group is not ready for it to rejoin, err 46 Tue Aug 19 11:04:12.096 2014: Killing connection from ** because the group is not ready for it to rejoin, err 46 Tue Aug 19 11:04:14.224 2014: Killing connection from because the group is not ready for it to rejoin, err 46 Tue Aug 19 11:04:20.037 2014: Killing connection from because the group is not ready for it to rejoin, err 46 Tue Aug 19 11:04:23.103 2014: Accepted and connected to ** ebi5-220 ... *GSS02a ( NSD SERVER)* Tue Aug 19 11:03:04.980 2014: Expel (gss02b) request from (ebi5-220 in ebi-cluster.ebi.ac.uk). Expelling: (ebi5-220 in ebi-cluster.ebi.ac.uk) Tue Aug 19 11:03:12.069 2014: Accepted and connected to ebi5-220 =============================================== *EXAMPLE 2*: *EBI5-038* Tue Aug 19 11:32:34.227 2014: *Disk lease period expired in cluster GSS.ebi.ac.uk. Attempting to reacquire lease.* Tue Aug 19 11:33:34.258 2014: *Lease is overdue. Probing cluster GSS.ebi.ac.uk* Tue Aug 19 11:35:24.265 2014: Close connection to gss02a (Connection reset by peer). Attempting reconnect. Tue Aug 19 11:35:24.865 2014: Close connection to ebi5-014 (Connection reset by peer). Attempting reconnect. ... LOT MORE RESETS BY PEER ... Tue Aug 19 11:35:25.096 2014: Close connection to ebi5-167 (Connection reset by peer). Attempting reconnect. Tue Aug 19 11:35:25.267 2014: Connecting to gss02a Tue Aug 19 11:35:25.268 2014: Close connection to gss02a (Connection failed because destination is still processing previous node failure) Tue Aug 19 11:35:26.267 2014: Retry connection to gss02a Tue Aug 19 11:35:26.268 2014: Close connection to gss02a (Connection failed because destination is still processing previous node failure) Tue Aug 19 11:36:24.276 2014: Unable to contact any quorum nodes during cluster probe. Tue Aug 19 11:36:24.277 2014: *Lost membership in cluster GSS.ebi.ac.uk. Unmounting file systems.* *GSS02a* Tue Aug 19 11:35:24.263 2014: Node (ebi5-038 in ebi-cluster.ebi.ac.uk) *is being expelled because of an expired lease.* Pings sent: 60. Replies received: 60. In example 1 seems that an NSD was not repliyng to the client, but the servers seems working fine.. how can i trace better ( to solve) the problem? 
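For what it is worth, the daemon-to-daemon traffic that the lease pings and expel requests ride on uses TCP port 1191 by default (the tscTcpPort setting), so that is the port to capture. A rough sketch of how the next occurrence could be caught follows; the interface name bond0 is only an assumption based on the bonding mentioned earlier, the node names are just the ones from example 1, and the filter would need swapping for whichever peer is being watched. Run something like this on the client and on the NSD server it times out against, then line the timestamps up with /var/adm/ras/mmfs.log.latest on both sides afterwards:

    # 1) capture only GPFS daemon traffic between the two nodes
    #    (adjust the interface and the host filter for the real setup)
    tcpdump -i bond0 -s 0 -w /tmp/gpfs-1191-$(hostname -s).pcap \
        'tcp port 1191 and host gss02b'

    # 2) once a second, timestamp the local waiters so the capture,
    #    the waiters and mmfs.log.latest can be correlated later
    while true; do
        echo "=== $(date '+%H:%M:%S') $(hostname -s)"
        /usr/lpp/mmfs/bin/mmdiag --waiters
        sleep 1
    done >> /tmp/waiters-$(hostname -s).log 2>&1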
In example 2 it seems to me that for some reason the manager are not renewing the lease in time. when this happens , its not a single client. Loads of them fail to get the lease renewed. Why this is happening? how can i trace to the source of the problem? Thanks in advance for any tips. Regards, Salvatore -------------- next part -------------- An HTML attachment was scrubbed... URL: From sdinardo at ebi.ac.uk Wed Aug 20 09:03:03 2014 From: sdinardo at ebi.ac.uk (Salvatore Di Nardo) Date: Wed, 20 Aug 2014 09:03:03 +0100 Subject: [gpfsug-discuss] gpfs client expels In-Reply-To: <53F454E3.40803@ebi.ac.uk> References: <53EE0BB1.8000005@ebi.ac.uk> <53F454E3.40803@ebi.ac.uk> Message-ID: <53F45637.8080000@ebi.ac.uk> Another interesting case about a specific waiter: was looking the waiters on GSS until i found those( i got those info collecting from all the servers with a script i did, so i was able to trace hanging connection while they was happening): gss03b.ebi.ac.uk:*235.373993397*(MsgRecordCondvar), reason 'RPC wait' for getData on node 10.7.37.109 gss03b.ebi.ac.uk:*235.152271998*(MsgRecordCondvar), reason 'RPC wait' for getData on node 10.7.37.109 gss02a.ebi.ac.uk:*214.079093620 *(MsgRecordCondvar), reason 'RPC wait' for tmMsgRevoke on node 10.7.34.109 gss02a.ebi.ac.uk:*213.580199240 *(MsgRecordCondvar), reason 'RPC wait' for tmMsgRevoke on node 10.7.37.109 gss03b.ebi.ac.uk:*132.375138082*(MsgRecordCondvar), reason 'RPC wait' for getData on node 10.7.37.109 gss03b.ebi.ac.uk:*132.374973884 *(MsgRecordCondvar), reason 'RPC wait' for commMsgCheckMessages on node 10.7.37.109 the bolted number are seconds. looking at this page: https://www.ibm.com/developerworks/community/wikis/home?lang=en#!/wiki/General+Parallel+File+System+%28GPFS%29/page/Interpreting+GPFS+Waiter+Information The web page claim that's, probably a network congestion, but i managed to login quick enough to the client and there the waiters was: [root at ebi5-236 ~]# mmdiag --waiters === mmdiag: waiters === 0x7F6690073460 waiting 147.973009173 seconds, RangeRevokeWorkerThread: on ThCond 0x1801E43F6A0 (0xFFFFC9001E43F6A0) (LkObjCondvar), reason 'waiting for LX lock' 0x7F65100036D0 waiting 140.458589856 seconds, WritebehindWorkerThread: on ThCond 0x7F6500000F98 (0x7F6500000F98) (MsgRecordCondvar), reason 'RPC wait' for NSD I/O completion on node 10.7.28.35 0x7F63A0001080 waiting 245.153055801 seconds, WritebehindWorkerThread: on ThCond 0x7F65D8002CF8 (0x7F65D8002CF8) (MsgRecordCondvar), reason 'RPC wait' for NSD I/O completion on node 10.7.28.35 0x7F674C03D3D0 waiting 245.750977203 seconds, CleanBufferThread: on ThCond 0x7F64880079E8 (0x7F64880079E8) (LogFileBufferDescriptorCondvar), reason 'force wait for buffer write to complete' 0x7F674802E360 waiting 244.159861966 seconds, WritebehindWorkerThread: on ThCond 0x7F65E0002358 (0x7F65E0002358) (MsgRecordCondvar), reason 'RPC wait' for NSD I/O completion on node 10.7.28.35 0x7F674C038810 waiting 251.086748430 seconds, SGExceptionLogBufferFullThread: on ThCond 0x7F64EC001398 (0x7F64EC001398) (MsgRecordCondvar), reason 'RPC wait' for I/O completion on node 10.7.28.35 0x7F674C036230 waiting 139.556735095 seconds, CleanBufferThread: on ThCond 0x7F65CC004C78 (0x7F65CC004C78) (MsgRecordCondvar), reason 'RPC wait' for NSD I/O completion on node 10.7.28.35 0x7F674C031670 waiting 144.327593052 seconds, WritebehindWorkerThread: on ThCond 0x7F672402D1A8 (0x7F672402D1A8) (MsgRecordCondvar), reason 'RPC wait' for NSD I/O completion on node 10.7.28.35 0x7F674C02A4D0 waiting 145.202712821 seconds, 
WritebehindWorkerThread: on ThCond 0x7F65440018F8 (0x7F65440018F8) (MsgRecordCondvar), reason 'RPC wait' for NSD I/O completion on node 10.7.28.35 0x7F674C0291E0 waiting 247.131569232 seconds, PrefetchWorkerThread: on ThCond 0x7F65740016C8 (0x7F65740016C8) (MsgRecordCondvar), reason 'RPC wait' for NSD I/O completion on node 10.7.28.35 0x7F6748025BD0 waiting 11.631381523 seconds, replyCleanupThread: on ThCond 0x7F65E000A1F8 (0x7F65E000A1F8) (MsgRecordCondvar), reason 'RPC wait' 0x7F6748022300 waiting 245.616267612 seconds, WritebehindWorkerThread: on ThCond 0x7F6470001468 (0x7F6470001468) (MsgRecordCondvar), reason 'RPC wait' for NSD I/O completion on node 10.7.28.35 0x7F6748021010 waiting 230.769670930 seconds, InodeAllocRevokeWorkerThread: on ThCond 0x7F64880079E8 (0x7F64880079E8) (LogFileBufferDescriptorCondvar), reason 'force wait for buffer write to complete' 0x7F674801B160 waiting 245.830554594 seconds, UnusedInodePrefetchThread: on ThCond 0x7F65B8004438 (0x7F65B8004438) (MsgRecordCondvar), reason 'RPC wait' for NSD I/O completion on node 10.7.28.35 0x7F674800A820 waiting 252.332932000 seconds, Msg handler getData: for poll on sock 109 0x7F63F4023090 waiting 253.073535042 seconds, WritebehindWorkerThread: on ThCond 0x7F65C4000CC8 (0x7F65C4000CC8) (MsgRecordCondvar), reason 'RPC wait' for NSD I/O completion on node 10.7.28.35 0x7F64A4000CE0 waiting 145.049659249 seconds, WritebehindWorkerThread: on ThCond 0x7F6560000A98 (0x7F6560000A98) (MsgRecordCondvar), reason 'RPC wait' for NSD I/O completion on node 10.7.28.35 0x7F6778006D00 waiting 142.124664264 seconds, WritebehindWorkerThread: on ThCond 0x7F63DC000C08 (0x7F63DC000C08) (MsgRecordCondvar), reason 'RPC wait' for NSD I/O completion on node 10.7.28.35 0x7F67780046D0 waiting 251.751439453 seconds, WritebehindWorkerThread: on ThCond 0x7F6454000A98 (0x7F6454000A98) (MsgRecordCondvar), reason 'RPC wait' for NSD I/O completion on node 10.7.28.35 0x7F67780E4B70 waiting 142.431051232 seconds, WritebehindWorkerThread: on ThCond 0x7F63C80010D8 (0x7F63C80010D8) (MsgRecordCondvar), reason 'RPC wait' for NSD I/O completion on node 10.7.28.35 0x7F67780E50D0 waiting 244.339624817 seconds, WritebehindWorkerThread: on ThCond 0x7F65BC001B98 (0x7F65BC001B98) (MsgRecordCondvar), reason 'RPC wait' for NSD I/O completion on node 10.7.28.35 0x7F6434000B40 waiting 145.343700410 seconds, WritebehindWorkerThread: on ThCond 0x7F63B00036E8 (0x7F63B00036E8) (MsgRecordCondvar), reason 'RPC wait' for NSD I/O completion on node 10.7.28.35 0x7F670C0187A0 waiting 244.903963969 seconds, WritebehindWorkerThread: on ThCond 0x7F65F0000FB8 (0x7F65F0000FB8) (MsgRecordCondvar), reason 'RPC wait' for NSD I/O completion on node 10.7.28.35 0x7F671C04E2F0 waiting 245.837137631 seconds, PrefetchWorkerThread: on ThCond 0x7F65A4000A98 (0x7F65A4000A98) (MsgRecordCondvar), reason 'RPC wait' for NSD I/O completion on node 10.7.28.35 0x7F671C04AA20 waiting 139.713993908 seconds, WritebehindWorkerThread: on ThCond 0x7F6454002478 (0x7F6454002478) (MsgRecordCondvar), reason 'RPC wait' for NSD I/O completion on node 10.7.28.35 0x7F671C049730 waiting 252.434187472 seconds, WritebehindWorkerThread: on ThCond 0x7F65F4003708 (0x7F65F4003708) (MsgRecordCondvar), reason 'RPC wait' for NSD I/O completion on node 10.7.28.35 0x7F671C044B70 waiting 131.515829048 seconds, Msg handler ccMsgPing: on ThCond 0x7F64DC1D4888 (0x7F64DC1D4888) (InuseCondvar), reason 'waiting for exclusive use of connection for sending msg' 0x7F6758008DE0 waiting 149.548547226 seconds, Msg handler getData: on ThCond 
0x7F645C002458 (0x7F645C002458) (InuseCondvar), reason 'waiting for exclusive use of connection for sending msg' 0x7F67580071D0 waiting 149.548543118 seconds, Msg handler commMsgCheckMessages: on ThCond 0x7F6450001C48 (0x7F6450001C48) (InuseCondvar), reason 'waiting for exclusive use of connection for sending msg' 0x7F65A40052B0 waiting 11.498507001 seconds, Msg handler ccMsgPing: on ThCond 0x7F644C103F88 (0x7F644C103F88) (InuseCondvar), reason 'waiting for exclusive use of connection for sending msg' 0x7F6448001620 waiting 139.844870446 seconds, WritebehindWorkerThread: on ThCond 0x7F65F0003098 (0x7F65F0003098) (MsgRecordCondvar), reason 'RPC wait' for NSD I/O completion on node 10.7.28.35 0x7F63F4000F80 waiting 245.044791905 seconds, WritebehindWorkerThread: on ThCond 0x7F6450001188 (0x7F6450001188) (MsgRecordCondvar), reason 'RPC wait' for NSD I/O completion on node 10.7.28.35 0x7F659C0033A0 waiting 243.464399305 seconds, PrefetchWorkerThread: on ThCond 0x7F6554002598 (0x7F6554002598) (MsgRecordCondvar), reason 'RPC wait' for NSD I/O completion on node 10.7.28.35 0x7F6514001690 waiting 245.826160463 seconds, PrefetchWorkerThread: on ThCond 0x7F65A4004558 (0x7F65A4004558) (MsgRecordCondvar), reason 'RPC wait' for NSD I/O completion on node 10.7.28.35 0x7F64800012B0 waiting 253.174835511 seconds, WritebehindWorkerThread: on ThCond 0x7F65E0000FB8 (0x7F65E0000FB8) (MsgRecordCondvar), reason 'RPC wait' for NSD I/O completion on node 10.7.28.35 0x7F6510000EE0 waiting 140.746696039 seconds, WritebehindWorkerThread: on ThCond 0x7F647C000CC8 (0x7F647C000CC8) (MsgRecordCondvar), reason 'RPC wait' for NSD I/O completion on node 10.7.28.35 0x7F6754001BB0 waiting 246.336055629 seconds, PrefetchWorkerThread: on ThCond 0x7F6594002498 (0x7F6594002498) (MsgRecordCondvar), reason 'RPC wait' for NSD I/O completion on node 10.7.28.35 0x7F6420000930 waiting 140.606777450 seconds, WritebehindWorkerThread: on ThCond 0x7F6578002498 (0x7F6578002498) (MsgRecordCondvar), reason 'RPC wait' for NSD I/O completion on node 10.7.28.35 0x7F6744009110 waiting 137.466372831 seconds, FileBlockReadFetchHandlerThread: on ThCond 0x7F65F4007158 (0x7F65F4007158) (MsgRecordCondvar), reason 'RPC wait' for NSD I/O completion on node 10.7.28.35 0x7F67280119F0 waiting 144.173427360 seconds, WritebehindWorkerThread: on ThCond 0x7F6504000AE8 (0x7F6504000AE8) (MsgRecordCondvar), reason 'RPC wait' for NSD I/O completion on node 10.7.28.35 0x7F672800BB40 waiting 145.804301887 seconds, WritebehindWorkerThread: on ThCond 0x7F6550001038 (0x7F6550001038) (MsgRecordCondvar), reason 'RPC wait' for NSD I/O completion on node 10.7.28.35 0x7F6728000910 waiting 252.601993452 seconds, WritebehindWorkerThread: on ThCond 0x7F6450000A98 (0x7F6450000A98) (MsgRecordCondvar), reason 'RPC wait' for NSD I/O completion on node 10.7.28.35 0x7F6744007E20 waiting 251.603329204 seconds, WritebehindWorkerThread: on ThCond 0x7F6570004C18 (0x7F6570004C18) (MsgRecordCondvar), reason 'RPC wait' for NSD I/O completion on node 10.7.28.35 0x7F64AC002EF0 waiting 139.205774422 seconds, FileBlockWriteFetchHandlerThread: on ThCond 0x18020AF0260 (0xFFFFC90020AF0260) (FetchFlowControlCondvar), reason 'wait for buffer for fetch' 0x7F6724013050 waiting 71.501580932 seconds, Msg handler ccMsgPing: on ThCond 0x7F6580006608 (0x7F6580006608) (InuseCondvar), reason 'waiting for exclusive use of connection for sending msg' 0x7F661C000DA0 waiting 245.654985276 seconds, PrefetchWorkerThread: on ThCond 0x7F6570005288 (0x7F6570005288) (MsgRecordCondvar), reason 'RPC wait' for NSD I/O 
completion on node 10.7.28.35 0x7F671C00F440 waiting 251.096002003 seconds, FileBlockReadFetchHandlerThread: on ThCond 0x7F65BC002878 (0x7F65BC002878) (MsgRecordCondvar), reason 'RPC wait' for NSD I/O completion on node 10.7.28.35 0x7F671C00E150 waiting 144.034006970 seconds, WritebehindWorkerThread: on ThCond 0x7F6528001548 (0x7F6528001548) (MsgRecordCondvar), reason 'RPC wait' for NSD I/O completion on node 10.7.28.35 0x7F67A02FCD20 waiting 142.324070945 seconds, WritebehindWorkerThread: on ThCond 0x7F6580002A98 (0x7F6580002A98) (MsgRecordCondvar), reason 'RPC wait' for NSD I/O completion on node 10.7.28.35 0x7F67A02FA330 waiting 200.670114385 seconds, EEWatchDogThread: on ThCond 0x7F65B0000A98 (0x7F65B0000A98) (MsgRecordCondvar), reason 'RPC wait' 0x7F67A02BF050 waiting 252.276161189 seconds, WritebehindWorkerThread: on ThCond 0x7F6584003998 (0x7F6584003998) (MsgRecordCondvar), reason 'RPC wait' for NSD I/O completion on node 10.7.28.35 0x7F67A0004160 waiting 251.173651822 seconds, SyncHandlerThread: on ThCond 0x7F64880079E8 (0x7F64880079E8) (LogFileBufferDescriptorCondvar), reason 'force wait on force active buffer write' So from the client side its the client that's waiting the server. I managed also to ping, ssh, and tcpdump each other before the node got expelled and discovered that ping works fine, ssh work fine , beside my tests there are 0 packet passing between them, LITERALLY. So there is no congestion, no network issues, but the server waits for the client and the client waits the server. This happens until we reach 350 secs ( 10 times the lease time) , then client get expelled. There are no local io waiters that indicates that gss is struggling, there is plenty of bandwith and CPU resources and no network congestion. Seems some sort of deadlock to me, but how can this be explained and hopefully fixed? Regards, Salvatore -------------- next part -------------- An HTML attachment was scrubbed... URL: From chair at gpfsug.org Thu Aug 21 09:20:39 2014 From: chair at gpfsug.org (Jez Tucker (Chair)) Date: Thu, 21 Aug 2014 09:20:39 +0100 Subject: [gpfsug-discuss] gpfs client expels In-Reply-To: <53F454E3.40803@ebi.ac.uk> References: <53EE0BB1.8000005@ebi.ac.uk> <53F454E3.40803@ebi.ac.uk> Message-ID: <53F5ABD7.80107@gpfsug.org> Hi there, I've seen the on several 'stock'? 'core'? GPFS system (we need a better term now GSS is out) and seen ping 'working', but alongside ejections from the cluster. The GPFS internode 'ping' is somewhat more circumspect than unix ping - and rightly so. In my experience this has _always_ been a network issue of one sort of another. If the network is experiencing issues, nodes will be ejected. Of course it could be unresponsive mmfsd or high loadavg, but I've seen that only twice in 10 years over many versions of GPFS. You need to follow the logs through from each machine in time order to determine who could not see who and in what order. Your best way forward is to log a SEV2 case with IBM support, directly or via your OEM and collect and supply a snap and traces as required by support. Without knowing your full setup, it's hard to help further. Jez On 20/08/14 08:57, Salvatore Di Nardo wrote: > Still problems. 
Here some more detailed examples: > > *EXAMPLE 1:* > > *EBI5-220**( CLIENT)** > *Tue Aug 19 11:03:04.980 2014: *Timed out waiting for a > reply from node gss02b* > Tue Aug 19 11:03:04.981 2014: Request sent to > (gss02a in GSS.ebi.ac.uk) to expel (gss02b in > GSS.ebi.ac.uk) from cluster GSS.ebi.ac.uk > Tue Aug 19 11:03:04.982 2014: This node will be expelled > from cluster GSS.ebi.ac.uk due to expel msg from IP> (ebi5-220) > Tue Aug 19 11:03:09.319 2014: Cluster Manager connection > broke. Probing cluster GSS.ebi.ac.uk > Tue Aug 19 11:03:10.321 2014: Unable to contact any quorum > nodes during cluster probe. > Tue Aug 19 11:03:10.322 2014: Lost membership in cluster > GSS.ebi.ac.uk. Unmounting file systems. > Tue Aug 19 11:03:10 BST 2014: mmcommon preunmount > invoked. File system: gpfs1 Reason: SGPanic > Tue Aug 19 11:03:12.066 2014: Connecting to > gss02a > Tue Aug 19 11:03:12.070 2014: Connected to > gss02a > Tue Aug 19 11:03:17.071 2014: Connecting to > gss02b > Tue Aug 19 11:03:17.072 2014: Connecting to > gss03b > Tue Aug 19 11:03:17.079 2014: Connecting to > gss03a > Tue Aug 19 11:03:17.080 2014: Connecting to > gss01b > Tue Aug 19 11:03:17.079 2014: Connecting to > gss01a > Tue Aug 19 11:04:23.105 2014: Connected to > gss02b > Tue Aug 19 11:04:23.107 2014: Connected to > gss03b > Tue Aug 19 11:04:23.112 2014: Connected to > gss03a > Tue Aug 19 11:04:23.115 2014: Connected to > gss01b > Tue Aug 19 11:04:23.121 2014: Connected to > gss01a > Tue Aug 19 11:12:28.992 2014: Node (gss02a in > GSS.ebi.ac.uk) is now the Group Leader. > > *GSS02B ( NSD SERVER)* > ... > Tue Aug 19 11:03:17.070 2014: Killing connection from > ** because the group is not ready for it to > rejoin, err 46 > Tue Aug 19 11:03:25.016 2014: Killing connection from > because the group is not ready for it to > rejoin, err 46 > Tue Aug 19 11:03:28.080 2014: Killing connection from > ** because the group is not ready for it to > rejoin, err 46 > Tue Aug 19 11:03:36.019 2014: Killing connection from > because the group is not ready for it to > rejoin, err 46 > Tue Aug 19 11:03:39.083 2014: Killing connection from > ** because the group is not ready for it to > rejoin, err 46 > Tue Aug 19 11:03:47.023 2014: Killing connection from > because the group is not ready for it to > rejoin, err 46 > Tue Aug 19 11:03:50.088 2014: Killing connection from > ** because the group is not ready for it to > rejoin, err 46 > Tue Aug 19 11:03:52.218 2014: Killing connection from > because the group is not ready for it to > rejoin, err 46 > Tue Aug 19 11:03:58.030 2014: Killing connection from > because the group is not ready for it to > rejoin, err 46 > Tue Aug 19 11:04:01.092 2014: Killing connection from > ** because the group is not ready for it to > rejoin, err 46 > Tue Aug 19 11:04:03.220 2014: Killing connection from > because the group is not ready for it to > rejoin, err 46 > Tue Aug 19 11:04:09.034 2014: Killing connection from > because the group is not ready for it to > rejoin, err 46 > Tue Aug 19 11:04:12.096 2014: Killing connection from > ** because the group is not ready for it to > rejoin, err 46 > Tue Aug 19 11:04:14.224 2014: Killing connection from > because the group is not ready for it to > rejoin, err 46 > Tue Aug 19 11:04:20.037 2014: Killing connection from > because the group is not ready for it to > rejoin, err 46 > Tue Aug 19 11:04:23.103 2014: Accepted and connected to > ** ebi5-220 > ... 
> > *GSS02a ( NSD SERVER)* > Tue Aug 19 11:03:04.980 2014: Expel (gss02b) > request from (ebi5-220 in > ebi-cluster.ebi.ac.uk). Expelling: (ebi5-220 > in ebi-cluster.ebi.ac.uk) > Tue Aug 19 11:03:12.069 2014: Accepted and connected to > ebi5-220 > > > =============================================== > *EXAMPLE 2*: > > *EBI5-038* > Tue Aug 19 11:32:34.227 2014: *Disk lease period expired > in cluster GSS.ebi.ac.uk. Attempting to reacquire lease.* > Tue Aug 19 11:33:34.258 2014: *Lease is overdue. Probing > cluster GSS.ebi.ac.uk* > Tue Aug 19 11:35:24.265 2014: Close connection to IP> gss02a (Connection reset by peer). Attempting > reconnect. > Tue Aug 19 11:35:24.865 2014: Close connection to > ebi5-014 (Connection reset by > peer). Attempting reconnect. > ... > LOT MORE RESETS BY PEER > ... > Tue Aug 19 11:35:25.096 2014: Close connection to > ebi5-167 (Connection reset by > peer). Attempting reconnect. > Tue Aug 19 11:35:25.267 2014: Connecting to > gss02a > Tue Aug 19 11:35:25.268 2014: Close connection to IP> gss02a (Connection failed because destination > is still processing previous node failure) > Tue Aug 19 11:35:26.267 2014: Retry connection to IP> gss02a > Tue Aug 19 11:35:26.268 2014: Close connection to IP> gss02a (Connection failed because destination > is still processing previous node failure) > Tue Aug 19 11:36:24.276 2014: Unable to contact any quorum > nodes during cluster probe. > Tue Aug 19 11:36:24.277 2014: *Lost membership in cluster > GSS.ebi.ac.uk. Unmounting file systems.* > > *GSS02a* > Tue Aug 19 11:35:24.263 2014: Node (ebi5-038 > in ebi-cluster.ebi.ac.uk) *is being expelled because of an > expired lease.* Pings sent: 60. Replies received: 60. > > > > > In example 1 seems that an NSD was not repliyng to the client, but the > servers seems working fine.. how can i trace better ( to solve) the > problem? > > In example 2 it seems to me that for some reason the manager are not > renewing the lease in time. when this happens , its not a single client. > Loads of them fail to get the lease renewed. Why this is happening? > how can i trace to the source of the problem? > > > > Thanks in advance for any tips. > > Regards, > Salvatore > > > > > > > > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at gpfsug.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss -------------- next part -------------- An HTML attachment was scrubbed... URL: From sdinardo at ebi.ac.uk Thu Aug 21 10:04:47 2014 From: sdinardo at ebi.ac.uk (Salvatore Di Nardo) Date: Thu, 21 Aug 2014 10:04:47 +0100 Subject: [gpfsug-discuss] gpfs client expels In-Reply-To: <53F5ABD7.80107@gpfsug.org> References: <53EE0BB1.8000005@ebi.ac.uk> <53F454E3.40803@ebi.ac.uk> <53F5ABD7.80107@gpfsug.org> Message-ID: <53F5B62F.1060305@ebi.ac.uk> Thanks for the feedback, but we managed to find a scenario that excludes network problems. we have a file called */input_file/* of nearly 100GB: if from *client A* we do: cat input_file >> output_file it start copying.. and we see waiter goeg a bit up,secs but then they flushes back to 0, so we xcan say that the copy proceed well... if now we do the same from another client ( or just another shell on the same client) *client B* : cat input_file >> output_file ( in other words we are trying to write to the same destination) all the waiters gets up until one node get expelled. 
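A minimal sketch of driving that test while recording enough to see which side stalls first (input_file and output_file are the same files as above; the five second interval and the sorting are only illustration):

    # note which node is the file system manager and which is the
    # cluster manager before starting
    /usr/lpp/mmfs/bin/mmlsmgr
    /usr/lpp/mmfs/bin/mmlsmgr -c

    # client A
    cat input_file >> output_file &

    # client B (or a second shell on the same client), a few seconds later
    cat input_file >> output_file &

    # on both clients and on the manager nodes, keep an eye on the
    # longest waiters until the expel fires
    watch -n 5 '/usr/lpp/mmfs/bin/mmdiag --waiters | sort -k3 -rn | head -20'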
Now, while its understandable that the destination file is locked for one of the "cat", so have to wait ( and since the file is BIG , have to wait for a while), its not understandable why it stop the renewal lease. Why its doen't return just a timeout error on the copy instead to expel the node? We can reproduce this every time, and since our users to operations like this on files over 100GB each you can imagine the result. As you can imagine even if its a bit silly to write at the same time to the same destination, its also quite common if we want to dump to a log file logs and for some reason one of the writers, write for a lot of time keeping the file locked. Our expels are not due to network congestion, but because a write attempts have to wait another one. What i really dont understand is why to take a so expreme mesure to expell jest because a process is waiteing "to too much time". I have ticket opened to IBM for this and the issue is under investigation, but no luck so far.. Regards, Salvatore On 21/08/14 09:20, Jez Tucker (Chair) wrote: > Hi there, > > I've seen the on several 'stock'? 'core'? GPFS system (we need a > better term now GSS is out) and seen ping 'working', but alongside > ejections from the cluster. > The GPFS internode 'ping' is somewhat more circumspect than unix ping > - and rightly so. > > In my experience this has _always_ been a network issue of one sort of > another. If the network is experiencing issues, nodes will be ejected. > Of course it could be unresponsive mmfsd or high loadavg, but I've > seen that only twice in 10 years over many versions of GPFS. > > You need to follow the logs through from each machine in time order to > determine who could not see who and in what order. > Your best way forward is to log a SEV2 case with IBM support, directly > or via your OEM and collect and supply a snap and traces as required > by support. > > Without knowing your full setup, it's hard to help further. > > Jez > > On 20/08/14 08:57, Salvatore Di Nardo wrote: >> Still problems. Here some more detailed examples: >> >> *EXAMPLE 1:* >> >> *EBI5-220**( CLIENT)** >> *Tue Aug 19 11:03:04.980 2014: *Timed out waiting for a >> reply from node gss02b* >> Tue Aug 19 11:03:04.981 2014: Request sent to >> (gss02a in GSS.ebi.ac.uk) to expel (gss02b in >> GSS.ebi.ac.uk) from cluster GSS.ebi.ac.uk >> Tue Aug 19 11:03:04.982 2014: This node will be expelled >> from cluster GSS.ebi.ac.uk due to expel msg from >> (ebi5-220) >> Tue Aug 19 11:03:09.319 2014: Cluster Manager connection >> broke. Probing cluster GSS.ebi.ac.uk >> Tue Aug 19 11:03:10.321 2014: Unable to contact any >> quorum nodes during cluster probe. >> Tue Aug 19 11:03:10.322 2014: Lost membership in cluster >> GSS.ebi.ac.uk. Unmounting file systems. >> Tue Aug 19 11:03:10 BST 2014: mmcommon preunmount >> invoked. 
File system: gpfs1 Reason: SGPanic >> Tue Aug 19 11:03:12.066 2014: Connecting to >> gss02a >> Tue Aug 19 11:03:12.070 2014: Connected to >> gss02a >> Tue Aug 19 11:03:17.071 2014: Connecting to >> gss02b >> Tue Aug 19 11:03:17.072 2014: Connecting to >> gss03b >> Tue Aug 19 11:03:17.079 2014: Connecting to >> gss03a >> Tue Aug 19 11:03:17.080 2014: Connecting to >> gss01b >> Tue Aug 19 11:03:17.079 2014: Connecting to >> gss01a >> Tue Aug 19 11:04:23.105 2014: Connected to >> gss02b >> Tue Aug 19 11:04:23.107 2014: Connected to >> gss03b >> Tue Aug 19 11:04:23.112 2014: Connected to >> gss03a >> Tue Aug 19 11:04:23.115 2014: Connected to >> gss01b >> Tue Aug 19 11:04:23.121 2014: Connected to >> gss01a >> Tue Aug 19 11:12:28.992 2014: Node (gss02a in >> GSS.ebi.ac.uk) is now the Group Leader. >> >> *GSS02B ( NSD SERVER)* >> ... >> Tue Aug 19 11:03:17.070 2014: Killing connection from >> ** because the group is not ready for it to >> rejoin, err 46 >> Tue Aug 19 11:03:25.016 2014: Killing connection from >> because the group is not ready for it to >> rejoin, err 46 >> Tue Aug 19 11:03:28.080 2014: Killing connection from >> ** because the group is not ready for it to >> rejoin, err 46 >> Tue Aug 19 11:03:36.019 2014: Killing connection from >> because the group is not ready for it to >> rejoin, err 46 >> Tue Aug 19 11:03:39.083 2014: Killing connection from >> ** because the group is not ready for it to >> rejoin, err 46 >> Tue Aug 19 11:03:47.023 2014: Killing connection from >> because the group is not ready for it to >> rejoin, err 46 >> Tue Aug 19 11:03:50.088 2014: Killing connection from >> ** because the group is not ready for it to >> rejoin, err 46 >> Tue Aug 19 11:03:52.218 2014: Killing connection from >> because the group is not ready for it to >> rejoin, err 46 >> Tue Aug 19 11:03:58.030 2014: Killing connection from >> because the group is not ready for it to >> rejoin, err 46 >> Tue Aug 19 11:04:01.092 2014: Killing connection from >> ** because the group is not ready for it to >> rejoin, err 46 >> Tue Aug 19 11:04:03.220 2014: Killing connection from >> because the group is not ready for it to >> rejoin, err 46 >> Tue Aug 19 11:04:09.034 2014: Killing connection from >> because the group is not ready for it to >> rejoin, err 46 >> Tue Aug 19 11:04:12.096 2014: Killing connection from >> ** because the group is not ready for it to >> rejoin, err 46 >> Tue Aug 19 11:04:14.224 2014: Killing connection from >> because the group is not ready for it to >> rejoin, err 46 >> Tue Aug 19 11:04:20.037 2014: Killing connection from >> because the group is not ready for it to >> rejoin, err 46 >> Tue Aug 19 11:04:23.103 2014: Accepted and connected to >> ** ebi5-220 >> ... >> >> *GSS02a ( NSD SERVER)* >> Tue Aug 19 11:03:04.980 2014: Expel (gss02b) >> request from (ebi5-220 in >> ebi-cluster.ebi.ac.uk). Expelling: >> (ebi5-220 in ebi-cluster.ebi.ac.uk) >> Tue Aug 19 11:03:12.069 2014: Accepted and connected to >> ebi5-220 >> >> >> =============================================== >> *EXAMPLE 2*: >> >> *EBI5-038* >> Tue Aug 19 11:32:34.227 2014: *Disk lease period expired >> in cluster GSS.ebi.ac.uk. Attempting to reacquire lease.* >> Tue Aug 19 11:33:34.258 2014: *Lease is overdue. Probing >> cluster GSS.ebi.ac.uk* >> Tue Aug 19 11:35:24.265 2014: Close connection to > IP> gss02a (Connection reset by peer). Attempting >> reconnect. >> Tue Aug 19 11:35:24.865 2014: Close connection to >> ebi5-014 (Connection reset by >> peer). Attempting reconnect. >> ... >> LOT MORE RESETS BY PEER >> ... 
>> Tue Aug 19 11:35:25.096 2014: Close connection to >> ebi5-167 (Connection reset by >> peer). Attempting reconnect. >> Tue Aug 19 11:35:25.267 2014: Connecting to >> gss02a >> Tue Aug 19 11:35:25.268 2014: Close connection to > IP> gss02a (Connection failed because destination >> is still processing previous node failure) >> Tue Aug 19 11:35:26.267 2014: Retry connection to > IP> gss02a >> Tue Aug 19 11:35:26.268 2014: Close connection to > IP> gss02a (Connection failed because destination >> is still processing previous node failure) >> Tue Aug 19 11:36:24.276 2014: Unable to contact any >> quorum nodes during cluster probe. >> Tue Aug 19 11:36:24.277 2014: *Lost membership in cluster >> GSS.ebi.ac.uk. Unmounting file systems.* >> >> *GSS02a* >> Tue Aug 19 11:35:24.263 2014: Node >> (ebi5-038 in ebi-cluster.ebi.ac.uk) *is being expelled >> because of an expired lease.* Pings sent: 60. Replies >> received: 60. >> >> >> >> >> In example 1 seems that an NSD was not repliyng to the client, but >> the servers seems working fine.. how can i trace better ( to solve) >> the problem? >> >> In example 2 it seems to me that for some reason the manager are not >> renewing the lease in time. when this happens , its not a single client. >> Loads of them fail to get the lease renewed. Why this is happening? >> how can i trace to the source of the problem? >> >> >> >> Thanks in advance for any tips. >> >> Regards, >> Salvatore >> >> >> >> >> >> >> >> >> _______________________________________________ >> gpfsug-discuss mailing list >> gpfsug-discuss at gpfsug.org >> http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at gpfsug.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss -------------- next part -------------- An HTML attachment was scrubbed... URL: From bbanister at jumptrading.com Thu Aug 21 13:48:38 2014 From: bbanister at jumptrading.com (Bryan Banister) Date: Thu, 21 Aug 2014 12:48:38 +0000 Subject: [gpfsug-discuss] gpfs client expels In-Reply-To: <53F5B62F.1060305@ebi.ac.uk> References: <53EE0BB1.8000005@ebi.ac.uk> <53F454E3.40803@ebi.ac.uk> <53F5ABD7.80107@gpfsug.org>,<53F5B62F.1060305@ebi.ac.uk> Message-ID: <21BC488F0AEA2245B2C3E83FC0B33DBB8263D9@CHI-EXCHANGEW2.w2k.jumptrading.com> As I understand GPFS distributed locking semantics, GPFS will not allow one node to hold a write lock for a file indefinitely. Once Client B opens the file for writing it would have contacted the File System Manager to obtain the lock. The FS manager would have told Client B that Client A has the lock and that Client B would have to contact Client A and revoke the write lock token. If Client A does not respond to Client B's request to revoke the write token, then Client B will ask that Client A be expelled from the cluster for NOT adhering to the proper protocol for write lock contention. [cid:2fb2253c-3ffb-4ac6-88a8-d019b1a24f66] Have you checked the communication path between the two clients at this point? I could not follow the logs that you provided. You should definitely look at the exact sequence of log events on the two clients and the file system manager (as reported by mmlsmgr). 
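If that is the path being taken here, the stuck revoke should also be visible from the command line. A small sketch (gpfs1 is the file system name from the earlier logs; the --tokenmgr option is hedged because whether it is available depends on the code level):

    # which node is the file system manager for the file system in
    # question (the node that handed out the write token)
    /usr/lpp/mmfs/bin/mmlsmgr gpfs1

    # on the client holding the write token: is it sitting on a revoke
    # or on a lock wait?
    /usr/lpp/mmfs/bin/mmdiag --waiters | grep -i -e revoke -e lock

    # on the file system manager, dump the token manager state if the
    # option exists at this code level
    /usr/lpp/mmfs/bin/mmdiag --tokenmgr

Comparing those three views while the two writers run should show whether the revoke request ever reaches Client A and how long it sits there before the expel is requested.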
Hope that helps, -Bryan ________________________________ From: gpfsug-discuss-bounces at gpfsug.org [gpfsug-discuss-bounces at gpfsug.org] on behalf of Salvatore Di Nardo [sdinardo at ebi.ac.uk] Sent: Thursday, August 21, 2014 4:04 AM To: chair at gpfsug.org; gpfsug main discussion list Subject: Re: [gpfsug-discuss] gpfs client expels Thanks for the feedback, but we managed to find a scenario that excludes network problems. we have a file called input_file of nearly 100GB: if from client A we do: cat input_file >> output_file it start copying.. and we see waiter goeg a bit up,secs but then they flushes back to 0, so we xcan say that the copy proceed well... if now we do the same from another client ( or just another shell on the same client) client B : cat input_file >> output_file ( in other words we are trying to write to the same destination) all the waiters gets up until one node get expelled. Now, while its understandable that the destination file is locked for one of the "cat", so have to wait ( and since the file is BIG , have to wait for a while), its not understandable why it stop the renewal lease. Why its doen't return just a timeout error on the copy instead to expel the node? We can reproduce this every time, and since our users to operations like this on files over 100GB each you can imagine the result. As you can imagine even if its a bit silly to write at the same time to the same destination, its also quite common if we want to dump to a log file logs and for some reason one of the writers, write for a lot of time keeping the file locked. Our expels are not due to network congestion, but because a write attempts have to wait another one. What i really dont understand is why to take a so expreme mesure to expell jest because a process is waiteing "to too much time". I have ticket opened to IBM for this and the issue is under investigation, but no luck so far.. Regards, Salvatore On 21/08/14 09:20, Jez Tucker (Chair) wrote: Hi there, I've seen the on several 'stock'? 'core'? GPFS system (we need a better term now GSS is out) and seen ping 'working', but alongside ejections from the cluster. The GPFS internode 'ping' is somewhat more circumspect than unix ping - and rightly so. In my experience this has _always_ been a network issue of one sort of another. If the network is experiencing issues, nodes will be ejected. Of course it could be unresponsive mmfsd or high loadavg, but I've seen that only twice in 10 years over many versions of GPFS. You need to follow the logs through from each machine in time order to determine who could not see who and in what order. Your best way forward is to log a SEV2 case with IBM support, directly or via your OEM and collect and supply a snap and traces as required by support. Without knowing your full setup, it's hard to help further. Jez On 20/08/14 08:57, Salvatore Di Nardo wrote: Still problems. Here some more detailed examples: EXAMPLE 1: EBI5-220 ( CLIENT) Tue Aug 19 11:03:04.980 2014: Timed out waiting for a reply from node gss02b Tue Aug 19 11:03:04.981 2014: Request sent to (gss02a in GSS.ebi.ac.uk) to expel (gss02b in GSS.ebi.ac.uk) from cluster GSS.ebi.ac.uk Tue Aug 19 11:03:04.982 2014: This node will be expelled from cluster GSS.ebi.ac.uk due to expel msg from (ebi5-220) Tue Aug 19 11:03:09.319 2014: Cluster Manager connection broke. Probing cluster GSS.ebi.ac.uk Tue Aug 19 11:03:10.321 2014: Unable to contact any quorum nodes during cluster probe. Tue Aug 19 11:03:10.322 2014: Lost membership in cluster GSS.ebi.ac.uk. 
Unmounting file systems. Tue Aug 19 11:03:10 BST 2014: mmcommon preunmount invoked. File system: gpfs1 Reason: SGPanic Tue Aug 19 11:03:12.066 2014: Connecting to gss02a Tue Aug 19 11:03:12.070 2014: Connected to gss02a Tue Aug 19 11:03:17.071 2014: Connecting to gss02b Tue Aug 19 11:03:17.072 2014: Connecting to gss03b Tue Aug 19 11:03:17.079 2014: Connecting to gss03a Tue Aug 19 11:03:17.080 2014: Connecting to gss01b Tue Aug 19 11:03:17.079 2014: Connecting to gss01a Tue Aug 19 11:04:23.105 2014: Connected to gss02b Tue Aug 19 11:04:23.107 2014: Connected to gss03b Tue Aug 19 11:04:23.112 2014: Connected to gss03a Tue Aug 19 11:04:23.115 2014: Connected to gss01b Tue Aug 19 11:04:23.121 2014: Connected to gss01a Tue Aug 19 11:12:28.992 2014: Node (gss02a in GSS.ebi.ac.uk) is now the Group Leader. GSS02B ( NSD SERVER) ... Tue Aug 19 11:03:17.070 2014: Killing connection from because the group is not ready for it to rejoin, err 46 Tue Aug 19 11:03:25.016 2014: Killing connection from because the group is not ready for it to rejoin, err 46 Tue Aug 19 11:03:28.080 2014: Killing connection from because the group is not ready for it to rejoin, err 46 Tue Aug 19 11:03:36.019 2014: Killing connection from because the group is not ready for it to rejoin, err 46 Tue Aug 19 11:03:39.083 2014: Killing connection from because the group is not ready for it to rejoin, err 46 Tue Aug 19 11:03:47.023 2014: Killing connection from because the group is not ready for it to rejoin, err 46 Tue Aug 19 11:03:50.088 2014: Killing connection from because the group is not ready for it to rejoin, err 46 Tue Aug 19 11:03:52.218 2014: Killing connection from because the group is not ready for it to rejoin, err 46 Tue Aug 19 11:03:58.030 2014: Killing connection from because the group is not ready for it to rejoin, err 46 Tue Aug 19 11:04:01.092 2014: Killing connection from because the group is not ready for it to rejoin, err 46 Tue Aug 19 11:04:03.220 2014: Killing connection from because the group is not ready for it to rejoin, err 46 Tue Aug 19 11:04:09.034 2014: Killing connection from because the group is not ready for it to rejoin, err 46 Tue Aug 19 11:04:12.096 2014: Killing connection from because the group is not ready for it to rejoin, err 46 Tue Aug 19 11:04:14.224 2014: Killing connection from because the group is not ready for it to rejoin, err 46 Tue Aug 19 11:04:20.037 2014: Killing connection from because the group is not ready for it to rejoin, err 46 Tue Aug 19 11:04:23.103 2014: Accepted and connected to ebi5-220 ... GSS02a ( NSD SERVER) Tue Aug 19 11:03:04.980 2014: Expel (gss02b) request from (ebi5-220 in ebi-cluster.ebi.ac.uk). Expelling: (ebi5-220 in ebi-cluster.ebi.ac.uk) Tue Aug 19 11:03:12.069 2014: Accepted and connected to ebi5-220 =============================================== EXAMPLE 2: EBI5-038 Tue Aug 19 11:32:34.227 2014: Disk lease period expired in cluster GSS.ebi.ac.uk. Attempting to reacquire lease. Tue Aug 19 11:33:34.258 2014: Lease is overdue. Probing cluster GSS.ebi.ac.uk Tue Aug 19 11:35:24.265 2014: Close connection to gss02a (Connection reset by peer). Attempting reconnect. Tue Aug 19 11:35:24.865 2014: Close connection to ebi5-014 (Connection reset by peer). Attempting reconnect. ... LOT MORE RESETS BY PEER ... Tue Aug 19 11:35:25.096 2014: Close connection to ebi5-167 (Connection reset by peer). Attempting reconnect. 
Tue Aug 19 11:35:25.267 2014: Connecting to gss02a Tue Aug 19 11:35:25.268 2014: Close connection to gss02a (Connection failed because destination is still processing previous node failure) Tue Aug 19 11:35:26.267 2014: Retry connection to gss02a Tue Aug 19 11:35:26.268 2014: Close connection to gss02a (Connection failed because destination is still processing previous node failure) Tue Aug 19 11:36:24.276 2014: Unable to contact any quorum nodes during cluster probe. Tue Aug 19 11:36:24.277 2014: Lost membership in cluster GSS.ebi.ac.uk. Unmounting file systems. GSS02a Tue Aug 19 11:35:24.263 2014: Node (ebi5-038 in ebi-cluster.ebi.ac.uk) is being expelled because of an expired lease. Pings sent: 60. Replies received: 60. In example 1 seems that an NSD was not repliyng to the client, but the servers seems working fine.. how can i trace better ( to solve) the problem? In example 2 it seems to me that for some reason the manager are not renewing the lease in time. when this happens , its not a single client. Loads of them fail to get the lease renewed. Why this is happening? how can i trace to the source of the problem? Thanks in advance for any tips. Regards, Salvatore _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at gpfsug.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at gpfsug.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss ________________________________ Note: This email is for the confidential use of the named addressee(s) only and may contain proprietary, confidential or privileged information. If you are not the intended recipient, you are hereby notified that any review, dissemination or copying of this email is strictly prohibited, and to please notify the sender immediately and destroy this email and any attachments. Email transmission cannot be guaranteed to be secure or error-free. The Company, therefore, does not make any guarantees as to the completeness or accuracy of this email or any attachments. This email is for informational purposes only and does not constitute a recommendation, offer, request or solicitation of any kind to buy, sell, subscribe, redeem or perform any type of transaction of a financial product. -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: GPFS_Token_Protocol.png Type: image/png Size: 249179 bytes Desc: GPFS_Token_Protocol.png URL: From jbernard at jumptrading.com Thu Aug 21 13:52:05 2014 From: jbernard at jumptrading.com (Jon Bernard) Date: Thu, 21 Aug 2014 12:52:05 +0000 Subject: [gpfsug-discuss] gpfs client expels In-Reply-To: <21BC488F0AEA2245B2C3E83FC0B33DBB8263D9@CHI-EXCHANGEW2.w2k.jumptrading.com> References: <53EE0BB1.8000005@ebi.ac.uk> <53F454E3.40803@ebi.ac.uk> <53F5ABD7.80107@gpfsug.org>, <53F5B62F.1060305@ebi.ac.uk>, <21BC488F0AEA2245B2C3E83FC0B33DBB8263D9@CHI-EXCHANGEW2.w2k.jumptrading.com> Message-ID: Where is that from? On Aug 21, 2014, at 7:49, "Bryan Banister" > wrote: As I understand GPFS distributed locking semantics, GPFS will not allow one node to hold a write lock for a file indefinitely. Once Client B opens the file for writing it would have contacted the File System Manager to obtain the lock. The FS manager would have told Client B that Client A has the lock and that Client B would have to contact Client A and revoke the write lock token. 
If Client A does not respond to Client B's request to revoke the write token, then Client B will ask that Client A be expelled from the cluster for NOT adhering to the proper protocol for write lock contention. Have you checked the communication path between the two clients at this point? I could not follow the logs that you provided. You should definitely look at the exact sequence of log events on the two clients and the file system manager (as reported by mmlsmgr). Hope that helps, -Bryan ________________________________ From: gpfsug-discuss-bounces at gpfsug.org [gpfsug-discuss-bounces at gpfsug.org] on behalf of Salvatore Di Nardo [sdinardo at ebi.ac.uk] Sent: Thursday, August 21, 2014 4:04 AM To: chair at gpfsug.org; gpfsug main discussion list Subject: Re: [gpfsug-discuss] gpfs client expels Thanks for the feedback, but we managed to find a scenario that excludes network problems. we have a file called input_file of nearly 100GB: if from client A we do: cat input_file >> output_file it start copying.. and we see waiter goeg a bit up,secs but then they flushes back to 0, so we xcan say that the copy proceed well... if now we do the same from another client ( or just another shell on the same client) client B : cat input_file >> output_file ( in other words we are trying to write to the same destination) all the waiters gets up until one node get expelled. Now, while its understandable that the destination file is locked for one of the "cat", so have to wait ( and since the file is BIG , have to wait for a while), its not understandable why it stop the renewal lease. Why its doen't return just a timeout error on the copy instead to expel the node? We can reproduce this every time, and since our users to operations like this on files over 100GB each you can imagine the result. As you can imagine even if its a bit silly to write at the same time to the same destination, its also quite common if we want to dump to a log file logs and for some reason one of the writers, write for a lot of time keeping the file locked. Our expels are not due to network congestion, but because a write attempts have to wait another one. What i really dont understand is why to take a so expreme mesure to expell jest because a process is waiteing "to too much time". I have ticket opened to IBM for this and the issue is under investigation, but no luck so far.. Regards, Salvatore On 21/08/14 09:20, Jez Tucker (Chair) wrote: Hi there, I've seen the on several 'stock'? 'core'? GPFS system (we need a better term now GSS is out) and seen ping 'working', but alongside ejections from the cluster. The GPFS internode 'ping' is somewhat more circumspect than unix ping - and rightly so. In my experience this has _always_ been a network issue of one sort of another. If the network is experiencing issues, nodes will be ejected. Of course it could be unresponsive mmfsd or high loadavg, but I've seen that only twice in 10 years over many versions of GPFS. You need to follow the logs through from each machine in time order to determine who could not see who and in what order. Your best way forward is to log a SEV2 case with IBM support, directly or via your OEM and collect and supply a snap and traces as required by support. Without knowing your full setup, it's hard to help further. Jez On 20/08/14 08:57, Salvatore Di Nardo wrote: Still problems. 
Here some more detailed examples: EXAMPLE 1: EBI5-220 ( CLIENT) Tue Aug 19 11:03:04.980 2014: Timed out waiting for a reply from node gss02b Tue Aug 19 11:03:04.981 2014: Request sent to (gss02a in GSS.ebi.ac.uk) to expel (gss02b in GSS.ebi.ac.uk) from cluster GSS.ebi.ac.uk Tue Aug 19 11:03:04.982 2014: This node will be expelled from cluster GSS.ebi.ac.uk due to expel msg from (ebi5-220) Tue Aug 19 11:03:09.319 2014: Cluster Manager connection broke. Probing cluster GSS.ebi.ac.uk Tue Aug 19 11:03:10.321 2014: Unable to contact any quorum nodes during cluster probe. Tue Aug 19 11:03:10.322 2014: Lost membership in cluster GSS.ebi.ac.uk. Unmounting file systems. Tue Aug 19 11:03:10 BST 2014: mmcommon preunmount invoked. File system: gpfs1 Reason: SGPanic Tue Aug 19 11:03:12.066 2014: Connecting to gss02a Tue Aug 19 11:03:12.070 2014: Connected to gss02a Tue Aug 19 11:03:17.071 2014: Connecting to gss02b Tue Aug 19 11:03:17.072 2014: Connecting to gss03b Tue Aug 19 11:03:17.079 2014: Connecting to gss03a Tue Aug 19 11:03:17.080 2014: Connecting to gss01b Tue Aug 19 11:03:17.079 2014: Connecting to gss01a Tue Aug 19 11:04:23.105 2014: Connected to gss02b Tue Aug 19 11:04:23.107 2014: Connected to gss03b Tue Aug 19 11:04:23.112 2014: Connected to gss03a Tue Aug 19 11:04:23.115 2014: Connected to gss01b Tue Aug 19 11:04:23.121 2014: Connected to gss01a Tue Aug 19 11:12:28.992 2014: Node (gss02a in GSS.ebi.ac.uk) is now the Group Leader. GSS02B ( NSD SERVER) ... Tue Aug 19 11:03:17.070 2014: Killing connection from because the group is not ready for it to rejoin, err 46 Tue Aug 19 11:03:25.016 2014: Killing connection from because the group is not ready for it to rejoin, err 46 Tue Aug 19 11:03:28.080 2014: Killing connection from because the group is not ready for it to rejoin, err 46 Tue Aug 19 11:03:36.019 2014: Killing connection from because the group is not ready for it to rejoin, err 46 Tue Aug 19 11:03:39.083 2014: Killing connection from because the group is not ready for it to rejoin, err 46 Tue Aug 19 11:03:47.023 2014: Killing connection from because the group is not ready for it to rejoin, err 46 Tue Aug 19 11:03:50.088 2014: Killing connection from because the group is not ready for it to rejoin, err 46 Tue Aug 19 11:03:52.218 2014: Killing connection from because the group is not ready for it to rejoin, err 46 Tue Aug 19 11:03:58.030 2014: Killing connection from because the group is not ready for it to rejoin, err 46 Tue Aug 19 11:04:01.092 2014: Killing connection from because the group is not ready for it to rejoin, err 46 Tue Aug 19 11:04:03.220 2014: Killing connection from because the group is not ready for it to rejoin, err 46 Tue Aug 19 11:04:09.034 2014: Killing connection from because the group is not ready for it to rejoin, err 46 Tue Aug 19 11:04:12.096 2014: Killing connection from because the group is not ready for it to rejoin, err 46 Tue Aug 19 11:04:14.224 2014: Killing connection from because the group is not ready for it to rejoin, err 46 Tue Aug 19 11:04:20.037 2014: Killing connection from because the group is not ready for it to rejoin, err 46 Tue Aug 19 11:04:23.103 2014: Accepted and connected to ebi5-220 ... GSS02a ( NSD SERVER) Tue Aug 19 11:03:04.980 2014: Expel (gss02b) request from (ebi5-220 in ebi-cluster.ebi.ac.uk). 
Expelling: (ebi5-220 in ebi-cluster.ebi.ac.uk) Tue Aug 19 11:03:12.069 2014: Accepted and connected to ebi5-220 =============================================== EXAMPLE 2: EBI5-038 Tue Aug 19 11:32:34.227 2014: Disk lease period expired in cluster GSS.ebi.ac.uk. Attempting to reacquire lease. Tue Aug 19 11:33:34.258 2014: Lease is overdue. Probing cluster GSS.ebi.ac.uk Tue Aug 19 11:35:24.265 2014: Close connection to gss02a (Connection reset by peer). Attempting reconnect. Tue Aug 19 11:35:24.865 2014: Close connection to ebi5-014 (Connection reset by peer). Attempting reconnect. ... LOT MORE RESETS BY PEER ... Tue Aug 19 11:35:25.096 2014: Close connection to ebi5-167 (Connection reset by peer). Attempting reconnect. Tue Aug 19 11:35:25.267 2014: Connecting to gss02a Tue Aug 19 11:35:25.268 2014: Close connection to gss02a (Connection failed because destination is still processing previous node failure) Tue Aug 19 11:35:26.267 2014: Retry connection to gss02a Tue Aug 19 11:35:26.268 2014: Close connection to gss02a (Connection failed because destination is still processing previous node failure) Tue Aug 19 11:36:24.276 2014: Unable to contact any quorum nodes during cluster probe. Tue Aug 19 11:36:24.277 2014: Lost membership in cluster GSS.ebi.ac.uk. Unmounting file systems. GSS02a Tue Aug 19 11:35:24.263 2014: Node (ebi5-038 in ebi-cluster.ebi.ac.uk) is being expelled because of an expired lease. Pings sent: 60. Replies received: 60. In example 1 seems that an NSD was not repliyng to the client, but the servers seems working fine.. how can i trace better ( to solve) the problem? In example 2 it seems to me that for some reason the manager are not renewing the lease in time. when this happens , its not a single client. Loads of them fail to get the lease renewed. Why this is happening? how can i trace to the source of the problem? Thanks in advance for any tips. Regards, Salvatore _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at gpfsug.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at gpfsug.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss ________________________________ Note: This email is for the confidential use of the named addressee(s) only and may contain proprietary, confidential or privileged information. If you are not the intended recipient, you are hereby notified that any review, dissemination or copying of this email is strictly prohibited, and to please notify the sender immediately and destroy this email and any attachments. Email transmission cannot be guaranteed to be secure or error-free. The Company, therefore, does not make any guarantees as to the completeness or accuracy of this email or any attachments. This email is for informational purposes only and does not constitute a recommendation, offer, request or solicitation of any kind to buy, sell, subscribe, redeem or perform any type of transaction of a financial product. _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at gpfsug.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss ________________________________ Note: This email is for the confidential use of the named addressee(s) only and may contain proprietary, confidential or privileged information. 
If you are not the intended recipient, you are hereby notified that any review, dissemination or copying of this email is strictly prohibited, and to please notify the sender immediately and destroy this email and any attachments. Email transmission cannot be guaranteed to be secure or error-free. The Company, therefore, does not make any guarantees as to the completeness or accuracy of this email or any attachments. This email is for informational purposes only and does not constitute a recommendation, offer, request or solicitation of any kind to buy, sell, subscribe, redeem or perform any type of transaction of a financial product. -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: GPFS_Token_Protocol.png Type: image/png Size: 249179 bytes Desc: GPFS_Token_Protocol.png URL: From viccornell at gmail.com Thu Aug 21 14:03:14 2014 From: viccornell at gmail.com (Vic Cornell) Date: Thu, 21 Aug 2014 14:03:14 +0100 Subject: [gpfsug-discuss] gpfs client expels In-Reply-To: <53F5B62F.1060305@ebi.ac.uk> References: <53EE0BB1.8000005@ebi.ac.uk> <53F454E3.40803@ebi.ac.uk> <53F5ABD7.80107@gpfsug.org> <53F5B62F.1060305@ebi.ac.uk> Message-ID: <9B247872-CD75-4F86-A10E-33AAB6BD414A@gmail.com> Hi Salvatore, Are you using ethernet or infiniband as the GPFS interconnect to your clients? If 10/40GbE - do you have a separate admin network? I have seen behaviour similar to this where the storage traffic causes congestion and the "admin" traffic gets lost or delayed causing expels. Vic On 21 Aug 2014, at 10:04, Salvatore Di Nardo wrote: > Thanks for the feedback, but we managed to find a scenario that excludes network problems. > > we have a file called input_file of nearly 100GB: > > if from client A we do: > > cat input_file >> output_file > > it start copying.. and we see waiter goeg a bit up,secs but then they flushes back to 0, so we xcan say that the copy proceed well... > > > if now we do the same from another client ( or just another shell on the same client) client B : > > cat input_file >> output_file > > > ( in other words we are trying to write to the same destination) all the waiters gets up until one node get expelled. > > > Now, while its understandable that the destination file is locked for one of the "cat", so have to wait ( and since the file is BIG , have to wait for a while), its not understandable why it stop the renewal lease. > Why its doen't return just a timeout error on the copy instead to expel the node? We can reproduce this every time, and since our users to operations like this on files over 100GB each you can imagine the result. > > > > As you can imagine even if its a bit silly to write at the same time to the same destination, its also quite common if we want to dump to a log file logs and for some reason one of the writers, write for a lot of time keeping the file locked. > Our expels are not due to network congestion, but because a write attempts have to wait another one. What i really dont understand is why to take a so expreme mesure to expell jest because a process is waiteing "to too much time". > > > I have ticket opened to IBM for this and the issue is under investigation, but no luck so far.. > > Regards, > Salvatore > > > > On 21/08/14 09:20, Jez Tucker (Chair) wrote: >> Hi there, >> >> I've seen the on several 'stock'? 'core'? GPFS system (we need a better term now GSS is out) and seen ping 'working', but alongside ejections from the cluster. 
>> The GPFS internode 'ping' is somewhat more circumspect than unix ping - and rightly so. >> >> In my experience this has _always_ been a network issue of one sort of another. If the network is experiencing issues, nodes will be ejected. >> Of course it could be unresponsive mmfsd or high loadavg, but I've seen that only twice in 10 years over many versions of GPFS. >> >> You need to follow the logs through from each machine in time order to determine who could not see who and in what order. >> Your best way forward is to log a SEV2 case with IBM support, directly or via your OEM and collect and supply a snap and traces as required by support. >> >> Without knowing your full setup, it's hard to help further. >> >> Jez >> >> On 20/08/14 08:57, Salvatore Di Nardo wrote: >>> Still problems. Here some more detailed examples: >>> >>> EXAMPLE 1: >>> EBI5-220 ( CLIENT) >>> Tue Aug 19 11:03:04.980 2014: Timed out waiting for a reply from node gss02b >>> Tue Aug 19 11:03:04.981 2014: Request sent to (gss02a in GSS.ebi.ac.uk) to expel (gss02b in GSS.ebi.ac.uk) from cluster GSS.ebi.ac.uk >>> Tue Aug 19 11:03:04.982 2014: This node will be expelled from cluster GSS.ebi.ac.uk due to expel msg from (ebi5-220) >>> Tue Aug 19 11:03:09.319 2014: Cluster Manager connection broke. Probing cluster GSS.ebi.ac.uk >>> Tue Aug 19 11:03:10.321 2014: Unable to contact any quorum nodes during cluster probe. >>> Tue Aug 19 11:03:10.322 2014: Lost membership in cluster GSS.ebi.ac.uk. Unmounting file systems. >>> Tue Aug 19 11:03:10 BST 2014: mmcommon preunmount invoked. File system: gpfs1 Reason: SGPanic >>> Tue Aug 19 11:03:12.066 2014: Connecting to gss02a >>> Tue Aug 19 11:03:12.070 2014: Connected to gss02a >>> Tue Aug 19 11:03:17.071 2014: Connecting to gss02b >>> Tue Aug 19 11:03:17.072 2014: Connecting to gss03b >>> Tue Aug 19 11:03:17.079 2014: Connecting to gss03a >>> Tue Aug 19 11:03:17.080 2014: Connecting to gss01b >>> Tue Aug 19 11:03:17.079 2014: Connecting to gss01a >>> Tue Aug 19 11:04:23.105 2014: Connected to gss02b >>> Tue Aug 19 11:04:23.107 2014: Connected to gss03b >>> Tue Aug 19 11:04:23.112 2014: Connected to gss03a >>> Tue Aug 19 11:04:23.115 2014: Connected to gss01b >>> Tue Aug 19 11:04:23.121 2014: Connected to gss01a >>> Tue Aug 19 11:12:28.992 2014: Node (gss02a in GSS.ebi.ac.uk) is now the Group Leader. >>> >>> GSS02B ( NSD SERVER) >>> ... 
>>> Tue Aug 19 11:03:17.070 2014: Killing connection from because the group is not ready for it to rejoin, err 46 >>> Tue Aug 19 11:03:25.016 2014: Killing connection from because the group is not ready for it to rejoin, err 46 >>> Tue Aug 19 11:03:28.080 2014: Killing connection from because the group is not ready for it to rejoin, err 46 >>> Tue Aug 19 11:03:36.019 2014: Killing connection from because the group is not ready for it to rejoin, err 46 >>> Tue Aug 19 11:03:39.083 2014: Killing connection from because the group is not ready for it to rejoin, err 46 >>> Tue Aug 19 11:03:47.023 2014: Killing connection from because the group is not ready for it to rejoin, err 46 >>> Tue Aug 19 11:03:50.088 2014: Killing connection from because the group is not ready for it to rejoin, err 46 >>> Tue Aug 19 11:03:52.218 2014: Killing connection from because the group is not ready for it to rejoin, err 46 >>> Tue Aug 19 11:03:58.030 2014: Killing connection from because the group is not ready for it to rejoin, err 46 >>> Tue Aug 19 11:04:01.092 2014: Killing connection from because the group is not ready for it to rejoin, err 46 >>> Tue Aug 19 11:04:03.220 2014: Killing connection from because the group is not ready for it to rejoin, err 46 >>> Tue Aug 19 11:04:09.034 2014: Killing connection from because the group is not ready for it to rejoin, err 46 >>> Tue Aug 19 11:04:12.096 2014: Killing connection from because the group is not ready for it to rejoin, err 46 >>> Tue Aug 19 11:04:14.224 2014: Killing connection from because the group is not ready for it to rejoin, err 46 >>> Tue Aug 19 11:04:20.037 2014: Killing connection from because the group is not ready for it to rejoin, err 46 >>> Tue Aug 19 11:04:23.103 2014: Accepted and connected to ebi5-220 >>> ... >>> >>> GSS02a ( NSD SERVER) >>> Tue Aug 19 11:03:04.980 2014: Expel (gss02b) request from (ebi5-220 in ebi-cluster.ebi.ac.uk). Expelling: (ebi5-220 in ebi-cluster.ebi.ac.uk) >>> Tue Aug 19 11:03:12.069 2014: Accepted and connected to ebi5-220 >>> >>> >>> =============================================== >>> EXAMPLE 2: >>> >>> EBI5-038 >>> Tue Aug 19 11:32:34.227 2014: Disk lease period expired in cluster GSS.ebi.ac.uk. Attempting to reacquire lease. >>> Tue Aug 19 11:33:34.258 2014: Lease is overdue. Probing cluster GSS.ebi.ac.uk >>> Tue Aug 19 11:35:24.265 2014: Close connection to gss02a (Connection reset by peer). Attempting reconnect. >>> Tue Aug 19 11:35:24.865 2014: Close connection to ebi5-014 (Connection reset by peer). Attempting reconnect. >>> ... >>> LOT MORE RESETS BY PEER >>> ... >>> Tue Aug 19 11:35:25.096 2014: Close connection to ebi5-167 (Connection reset by peer). Attempting reconnect. >>> Tue Aug 19 11:35:25.267 2014: Connecting to gss02a >>> Tue Aug 19 11:35:25.268 2014: Close connection to gss02a (Connection failed because destination is still processing previous node failure) >>> Tue Aug 19 11:35:26.267 2014: Retry connection to gss02a >>> Tue Aug 19 11:35:26.268 2014: Close connection to gss02a (Connection failed because destination is still processing previous node failure) >>> Tue Aug 19 11:36:24.276 2014: Unable to contact any quorum nodes during cluster probe. >>> Tue Aug 19 11:36:24.277 2014: Lost membership in cluster GSS.ebi.ac.uk. Unmounting file systems. >>> >>> GSS02a >>> Tue Aug 19 11:35:24.263 2014: Node (ebi5-038 in ebi-cluster.ebi.ac.uk) is being expelled because of an expired lease. Pings sent: 60. Replies received: 60. 
>>> >>> >>> >>> In example 1 seems that an NSD was not repliyng to the client, but the servers seems working fine.. how can i trace better ( to solve) the problem? >>> >>> In example 2 it seems to me that for some reason the manager are not renewing the lease in time. when this happens , its not a single client. >>> Loads of them fail to get the lease renewed. Why this is happening? how can i trace to the source of the problem? >>> >>> >>> >>> Thanks in advance for any tips. >>> >>> Regards, >>> Salvatore >>> >>> >>> >>> >>> >>> >>> >>> >>> _______________________________________________ >>> gpfsug-discuss mailing list >>> gpfsug-discuss at gpfsug.org >>> http://gpfsug.org/mailman/listinfo/gpfsug-discuss >> >> >> >> _______________________________________________ >> gpfsug-discuss mailing list >> gpfsug-discuss at gpfsug.org >> http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at gpfsug.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss -------------- next part -------------- An HTML attachment was scrubbed... URL: From sdinardo at ebi.ac.uk Thu Aug 21 14:04:59 2014 From: sdinardo at ebi.ac.uk (Salvatore Di Nardo) Date: Thu, 21 Aug 2014 14:04:59 +0100 Subject: [gpfsug-discuss] gpfs client expels In-Reply-To: <21BC488F0AEA2245B2C3E83FC0B33DBB8263D9@CHI-EXCHANGEW2.w2k.jumptrading.com> References: <53EE0BB1.8000005@ebi.ac.uk> <53F454E3.40803@ebi.ac.uk> <53F5ABD7.80107@gpfsug.org>, <53F5B62F.1060305@ebi.ac.uk> <21BC488F0AEA2245B2C3E83FC0B33DBB8263D9@CHI-EXCHANGEW2.w2k.jumptrading.com> Message-ID: <53F5EE7B.2080306@ebi.ac.uk> Thanks for the info... it helps a bit understanding whats going on, but i think you missed the part that Node A and Node B could also be the same machine. If for instance i ran 2 cp on the same machine, hence Client B cannot have problems contacting Client A since they are the same machine..... BTW i did the same also using 2 clients and the result its the same. Nonetheless your description is made me understand a bit better what's going on Regards, Salvatore On 21/08/14 13:48, Bryan Banister wrote: > As I understand GPFS distributed locking semantics, GPFS will not > allow one node to hold a write lock for a file indefinitely. Once > Client B opens the file for writing it would have contacted the File > System Manager to obtain the lock. The FS manager would have told > Client B that Client A has the lock and that Client B would have to > contact Client A and revoke the write lock token. If Client A does > not respond to Client B's request to revoke the write token, then > Client B will ask that Client A be expelled from the cluster for NOT > adhering to the proper protocol for write lock contention. > > > > Have you checked the communication path between the two clients at > this point? > > I could not follow the logs that you provided. You should definitely > look at the exact sequence of log events on the two clients and the > file system manager (as reported by mmlsmgr). > > Hope that helps, > -Bryan > > ------------------------------------------------------------------------ > *From:* gpfsug-discuss-bounces at gpfsug.org > [gpfsug-discuss-bounces at gpfsug.org] on behalf of Salvatore Di Nardo > [sdinardo at ebi.ac.uk] > *Sent:* Thursday, August 21, 2014 4:04 AM > *To:* chair at gpfsug.org; gpfsug main discussion list > *Subject:* Re: [gpfsug-discuss] gpfs client expels > > Thanks for the feedback, but we managed to find a scenario that > excludes network problems. 
> > we have a file called */input_file/* of nearly 100GB: > > if from *client A* we do: > > cat input_file >> output_file > > it start copying.. and we see waiter goeg a bit up,secs but then they > flushes back to 0, so we xcan say that the copy proceed well... > > > if now we do the same from another client ( or just another shell on > the same client) *client B* : > > cat input_file >> output_file > > > ( in other words we are trying to write to the same destination) all > the waiters gets up until one node get expelled. > > > Now, while its understandable that the destination file is locked for > one of the "cat", so have to wait ( and since the file is BIG , have > to wait for a while), its not understandable why it stop the renewal > lease. > Why its doen't return just a timeout error on the copy instead to > expel the node? We can reproduce this every time, and since our users > to operations like this on files over 100GB each you can imagine the > result. > > > > As you can imagine even if its a bit silly to write at the same time > to the same destination, its also quite common if we want to dump to a > log file logs and for some reason one of the writers, write for a lot > of time keeping the file locked. > Our expels are not due to network congestion, but because a write > attempts have to wait another one. What i really dont understand is > why to take a so expreme mesure to expell jest because a process is > waiteing "to too much time". > > > I have ticket opened to IBM for this and the issue is under > investigation, but no luck so far.. > > Regards, > Salvatore > > > > On 21/08/14 09:20, Jez Tucker (Chair) wrote: >> Hi there, >> >> I've seen the on several 'stock'? 'core'? GPFS system (we need a >> better term now GSS is out) and seen ping 'working', but alongside >> ejections from the cluster. >> The GPFS internode 'ping' is somewhat more circumspect than unix ping >> - and rightly so. >> >> In my experience this has _always_ been a network issue of one sort >> of another. If the network is experiencing issues, nodes will be >> ejected. >> Of course it could be unresponsive mmfsd or high loadavg, but I've >> seen that only twice in 10 years over many versions of GPFS. >> >> You need to follow the logs through from each machine in time order >> to determine who could not see who and in what order. >> Your best way forward is to log a SEV2 case with IBM support, >> directly or via your OEM and collect and supply a snap and traces as >> required by support. >> >> Without knowing your full setup, it's hard to help further. >> >> Jez >> >> On 20/08/14 08:57, Salvatore Di Nardo wrote: >>> Still problems. Here some more detailed examples: >>> >>> *EXAMPLE 1:* >>> >>> *EBI5-220**( CLIENT)** >>> *Tue Aug 19 11:03:04.980 2014: *Timed out waiting for a >>> reply from node gss02b* >>> Tue Aug 19 11:03:04.981 2014: Request sent to >> IP> (gss02a in GSS.ebi.ac.uk) to expel >>> (gss02b in GSS.ebi.ac.uk) from cluster GSS.ebi.ac.uk >>> Tue Aug 19 11:03:04.982 2014: This node will be expelled >>> from cluster GSS.ebi.ac.uk due to expel msg from >>> (ebi5-220) >>> Tue Aug 19 11:03:09.319 2014: Cluster Manager connection >>> broke. Probing cluster GSS.ebi.ac.uk >>> Tue Aug 19 11:03:10.321 2014: Unable to contact any >>> quorum nodes during cluster probe. >>> Tue Aug 19 11:03:10.322 2014: Lost membership in cluster >>> GSS.ebi.ac.uk. Unmounting file systems. >>> Tue Aug 19 11:03:10 BST 2014: mmcommon preunmount >>> invoked. 
File system: gpfs1 Reason: SGPanic >>> Tue Aug 19 11:03:12.066 2014: Connecting to >>> gss02a >>> Tue Aug 19 11:03:12.070 2014: Connected to >>> gss02a >>> Tue Aug 19 11:03:17.071 2014: Connecting to >>> gss02b >>> Tue Aug 19 11:03:17.072 2014: Connecting to >>> gss03b >>> Tue Aug 19 11:03:17.079 2014: Connecting to >>> gss03a >>> Tue Aug 19 11:03:17.080 2014: Connecting to >>> gss01b >>> Tue Aug 19 11:03:17.079 2014: Connecting to >>> gss01a >>> Tue Aug 19 11:04:23.105 2014: Connected to >>> gss02b >>> Tue Aug 19 11:04:23.107 2014: Connected to >>> gss03b >>> Tue Aug 19 11:04:23.112 2014: Connected to >>> gss03a >>> Tue Aug 19 11:04:23.115 2014: Connected to >>> gss01b >>> Tue Aug 19 11:04:23.121 2014: Connected to >>> gss01a >>> Tue Aug 19 11:12:28.992 2014: Node (gss02a >>> in GSS.ebi.ac.uk) is now the Group Leader. >>> >>> *GSS02B ( NSD SERVER)* >>> ... >>> Tue Aug 19 11:03:17.070 2014: Killing connection from >>> ** because the group is not ready for it to >>> rejoin, err 46 >>> Tue Aug 19 11:03:25.016 2014: Killing connection from >>> because the group is not ready for it to >>> rejoin, err 46 >>> Tue Aug 19 11:03:28.080 2014: Killing connection from >>> ** because the group is not ready for it to >>> rejoin, err 46 >>> Tue Aug 19 11:03:36.019 2014: Killing connection from >>> because the group is not ready for it to >>> rejoin, err 46 >>> Tue Aug 19 11:03:39.083 2014: Killing connection from >>> ** because the group is not ready for it to >>> rejoin, err 46 >>> Tue Aug 19 11:03:47.023 2014: Killing connection from >>> because the group is not ready for it to >>> rejoin, err 46 >>> Tue Aug 19 11:03:50.088 2014: Killing connection from >>> ** because the group is not ready for it to >>> rejoin, err 46 >>> Tue Aug 19 11:03:52.218 2014: Killing connection from >>> because the group is not ready for it to >>> rejoin, err 46 >>> Tue Aug 19 11:03:58.030 2014: Killing connection from >>> because the group is not ready for it to >>> rejoin, err 46 >>> Tue Aug 19 11:04:01.092 2014: Killing connection from >>> ** because the group is not ready for it to >>> rejoin, err 46 >>> Tue Aug 19 11:04:03.220 2014: Killing connection from >>> because the group is not ready for it to >>> rejoin, err 46 >>> Tue Aug 19 11:04:09.034 2014: Killing connection from >>> because the group is not ready for it to >>> rejoin, err 46 >>> Tue Aug 19 11:04:12.096 2014: Killing connection from >>> ** because the group is not ready for it to >>> rejoin, err 46 >>> Tue Aug 19 11:04:14.224 2014: Killing connection from >>> because the group is not ready for it to >>> rejoin, err 46 >>> Tue Aug 19 11:04:20.037 2014: Killing connection from >>> because the group is not ready for it to >>> rejoin, err 46 >>> Tue Aug 19 11:04:23.103 2014: Accepted and connected to >>> ** ebi5-220 >>> ... >>> >>> *GSS02a ( NSD SERVER)* >>> Tue Aug 19 11:03:04.980 2014: Expel (gss02b) >>> request from (ebi5-220 in >>> ebi-cluster.ebi.ac.uk). Expelling: >>> (ebi5-220 in ebi-cluster.ebi.ac.uk) >>> Tue Aug 19 11:03:12.069 2014: Accepted and connected to >>> ebi5-220 >>> >>> >>> =============================================== >>> *EXAMPLE 2*: >>> >>> *EBI5-038* >>> Tue Aug 19 11:32:34.227 2014: *Disk lease period expired >>> in cluster GSS.ebi.ac.uk. Attempting to reacquire lease.* >>> Tue Aug 19 11:33:34.258 2014: *Lease is overdue. Probing >>> cluster GSS.ebi.ac.uk* >>> Tue Aug 19 11:35:24.265 2014: Close connection to >>> gss02a (Connection reset by peer). >>> Attempting reconnect. 
>>> Tue Aug 19 11:35:24.865 2014: Close connection to >>> ebi5-014 (Connection reset by >>> peer). Attempting reconnect. >>> ... >>> LOT MORE RESETS BY PEER >>> ... >>> Tue Aug 19 11:35:25.096 2014: Close connection to >>> ebi5-167 (Connection reset by >>> peer). Attempting reconnect. >>> Tue Aug 19 11:35:25.267 2014: Connecting to >>> gss02a >>> Tue Aug 19 11:35:25.268 2014: Close connection to >>> gss02a (Connection failed because >>> destination is still processing previous node failure) >>> Tue Aug 19 11:35:26.267 2014: Retry connection to >>> gss02a >>> Tue Aug 19 11:35:26.268 2014: Close connection to >>> gss02a (Connection failed because >>> destination is still processing previous node failure) >>> Tue Aug 19 11:36:24.276 2014: Unable to contact any >>> quorum nodes during cluster probe. >>> Tue Aug 19 11:36:24.277 2014: *Lost membership in >>> cluster GSS.ebi.ac.uk. Unmounting file systems.* >>> >>> *GSS02a* >>> Tue Aug 19 11:35:24.263 2014: Node >>> (ebi5-038 in ebi-cluster.ebi.ac.uk) *is being expelled >>> because of an expired lease.* Pings sent: 60. Replies >>> received: 60. >>> >>> >>> >>> >>> In example 1 seems that an NSD was not repliyng to the client, but >>> the servers seems working fine.. how can i trace better ( to solve) >>> the problem? >>> >>> In example 2 it seems to me that for some reason the manager are not >>> renewing the lease in time. when this happens , its not a single >>> client. >>> Loads of them fail to get the lease renewed. Why this is happening? >>> how can i trace to the source of the problem? >>> >>> >>> >>> Thanks in advance for any tips. >>> >>> Regards, >>> Salvatore >>> >>> >>> >>> >>> >>> >>> >>> >>> _______________________________________________ >>> gpfsug-discuss mailing list >>> gpfsug-discuss at gpfsug.org >>> http://gpfsug.org/mailman/listinfo/gpfsug-discuss >> >> >> >> _______________________________________________ >> gpfsug-discuss mailing list >> gpfsug-discuss at gpfsug.org >> http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > > ------------------------------------------------------------------------ > > Note: This email is for the confidential use of the named addressee(s) > only and may contain proprietary, confidential or privileged > information. If you are not the intended recipient, you are hereby > notified that any review, dissemination or copying of this email is > strictly prohibited, and to please notify the sender immediately and > destroy this email and any attachments. Email transmission cannot be > guaranteed to be secure or error-free. The Company, therefore, does > not make any guarantees as to the completeness or accuracy of this > email or any attachments. This email is for informational purposes > only and does not constitute a recommendation, offer, request or > solicitation of any kind to buy, sell, subscribe, redeem or perform > any type of transaction of a financial product. > > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at gpfsug.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... 
Name: not available Type: image/png Size: 249179 bytes Desc: not available URL: From sdinardo at ebi.ac.uk Thu Aug 21 14:18:19 2014 From: sdinardo at ebi.ac.uk (Salvatore Di Nardo) Date: Thu, 21 Aug 2014 14:18:19 +0100 Subject: [gpfsug-discuss] gpfs client expels In-Reply-To: <9B247872-CD75-4F86-A10E-33AAB6BD414A@gmail.com> References: <53EE0BB1.8000005@ebi.ac.uk> <53F454E3.40803@ebi.ac.uk> <53F5ABD7.80107@gpfsug.org> <53F5B62F.1060305@ebi.ac.uk> <9B247872-CD75-4F86-A10E-33AAB6BD414A@gmail.com> Message-ID: <53F5F19B.1010603@ebi.ac.uk> This is an interesting point! We use ethernet (10g links on the clients), but we don't have a separate network for the admin traffic. Could you explain this a bit further? The clients and the servers are on different subnets, so the packets are routed, and I don't see a practical way to separate them. The clients are blades in a chassis, so even if I create two interfaces they will physically use the same "cable" to reach the first switch. The clients (around 600 of them) are also spread across different subnets. I will forward this consideration to our network admins to see if we can work on a dedicated network. Thanks for your tip. Regards, Salvatore On 21/08/14 14:03, Vic Cornell wrote: > Hi Salvatore, > > Are you using ethernet or infiniband as the GPFS interconnect to your > clients? > > If 10/40GbE - do you have a separate admin network? > > I have seen behaviour similar to this where the storage traffic causes > congestion and the "admin" traffic gets lost or delayed causing expels. > > Vic > > > > On 21 Aug 2014, at 10:04, Salvatore Di Nardo > wrote: > >> Thanks for the feedback, but we managed to find a scenario that >> excludes network problems. >> >> we have a file called */input_file/* of nearly 100GB: >> >> if from *client A* we do: >> >> cat input_file >> output_file >> >> it start copying.. and we see waiter goeg a bit up,secs but then they >> flushes back to 0, so we xcan say that the copy proceed well... >> >> >> if now we do the same from another client ( or just another shell on >> the same client) *client B* : >> >> cat input_file >> output_file >> >> >> ( in other words we are trying to write to the same destination) all >> the waiters gets up until one node get expelled. >> >> >> Now, while its understandable that the destination file is locked for >> one of the "cat", so have to wait ( and since the file is BIG , have >> to wait for a while), its not understandable why it stop the renewal >> lease. >> Why its doen't return just a timeout error on the copy instead to >> expel the node? We can reproduce this every time, and since our users >> to operations like this on files over 100GB each you can imagine the >> result. >> >> >> >> As you can imagine even if its a bit silly to write at the same time >> to the same destination, its also quite common if we want to dump to >> a log file logs and for some reason one of the writers, write for a >> lot of time keeping the file locked. >> Our expels are not due to network congestion, but because a write >> attempts have to wait another one. What i really dont understand is >> why to take a so expreme mesure to expell jest because a process is >> waiteing "to too much time". >> >> >> I have ticket opened to IBM for this and the issue is under >> investigation, but no luck so far.. >> >> Regards, >> Salvatore >> >> >> >> On 21/08/14 09:20, Jez Tucker (Chair) wrote: >>> Hi there, >>> >>> I've seen the on several 'stock'? 'core'?
GPFS system (we need a >>> better term now GSS is out) and seen ping 'working', but alongside >>> ejections from the cluster. >>> The GPFS internode 'ping' is somewhat more circumspect than unix >>> ping - and rightly so. >>> >>> In my experience this has _always_ been a network issue of one sort >>> of another. If the network is experiencing issues, nodes will be >>> ejected. >>> Of course it could be unresponsive mmfsd or high loadavg, but I've >>> seen that only twice in 10 years over many versions of GPFS. >>> >>> You need to follow the logs through from each machine in time order >>> to determine who could not see who and in what order. >>> Your best way forward is to log a SEV2 case with IBM support, >>> directly or via your OEM and collect and supply a snap and traces as >>> required by support. >>> >>> Without knowing your full setup, it's hard to help further. >>> >>> Jez >>> >>> On 20/08/14 08:57, Salvatore Di Nardo wrote: >>>> Still problems. Here some more detailed examples: >>>> >>>> *EXAMPLE 1:* >>>> >>>> *EBI5-220**( CLIENT)** >>>> *Tue Aug 19 11:03:04.980 2014: *Timed out waiting for a >>>> reply from node gss02b* >>>> Tue Aug 19 11:03:04.981 2014: Request sent to >>> IP> (gss02a in GSS.ebi.ac.uk ) to >>>> expel (gss02b in GSS.ebi.ac.uk >>>> ) from cluster GSS.ebi.ac.uk >>>> >>>> Tue Aug 19 11:03:04.982 2014: This node will be >>>> expelled from cluster GSS.ebi.ac.uk >>>> due to expel msg from >>> IP> (ebi5-220) >>>> Tue Aug 19 11:03:09.319 2014: Cluster Manager >>>> connection broke. Probing cluster GSS.ebi.ac.uk >>>> >>>> Tue Aug 19 11:03:10.321 2014: Unable to contact any >>>> quorum nodes during cluster probe. >>>> Tue Aug 19 11:03:10.322 2014: Lost membership in >>>> cluster GSS.ebi.ac.uk . >>>> Unmounting file systems. >>>> Tue Aug 19 11:03:10 BST 2014: mmcommon preunmount >>>> invoked. File system: gpfs1 Reason: SGPanic >>>> Tue Aug 19 11:03:12.066 2014: Connecting to >>>> gss02a >>>> Tue Aug 19 11:03:12.070 2014: Connected to >>>> gss02a >>>> Tue Aug 19 11:03:17.071 2014: Connecting to >>>> gss02b >>>> Tue Aug 19 11:03:17.072 2014: Connecting to >>>> gss03b >>>> Tue Aug 19 11:03:17.079 2014: Connecting to >>>> gss03a >>>> Tue Aug 19 11:03:17.080 2014: Connecting to >>>> gss01b >>>> Tue Aug 19 11:03:17.079 2014: Connecting to >>>> gss01a >>>> Tue Aug 19 11:04:23.105 2014: Connected to >>>> gss02b >>>> Tue Aug 19 11:04:23.107 2014: Connected to >>>> gss03b >>>> Tue Aug 19 11:04:23.112 2014: Connected to >>>> gss03a >>>> Tue Aug 19 11:04:23.115 2014: Connected to >>>> gss01b >>>> Tue Aug 19 11:04:23.121 2014: Connected to >>>> gss01a >>>> Tue Aug 19 11:12:28.992 2014: Node (gss02a >>>> in GSS.ebi.ac.uk ) is now the >>>> Group Leader. >>>> >>>> *GSS02B ( NSD SERVER)* >>>> ... 
>>>> Tue Aug 19 11:03:17.070 2014: Killing connection from >>>> ** because the group is not ready for it >>>> to rejoin, err 46 >>>> Tue Aug 19 11:03:25.016 2014: Killing connection from >>>> because the group is not ready for it to >>>> rejoin, err 46 >>>> Tue Aug 19 11:03:28.080 2014: Killing connection from >>>> ** because the group is not ready for it >>>> to rejoin, err 46 >>>> Tue Aug 19 11:03:36.019 2014: Killing connection from >>>> because the group is not ready for it to >>>> rejoin, err 46 >>>> Tue Aug 19 11:03:39.083 2014: Killing connection from >>>> ** because the group is not ready for it >>>> to rejoin, err 46 >>>> Tue Aug 19 11:03:47.023 2014: Killing connection from >>>> because the group is not ready for it to >>>> rejoin, err 46 >>>> Tue Aug 19 11:03:50.088 2014: Killing connection from >>>> ** because the group is not ready for it >>>> to rejoin, err 46 >>>> Tue Aug 19 11:03:52.218 2014: Killing connection from >>>> because the group is not ready for it to >>>> rejoin, err 46 >>>> Tue Aug 19 11:03:58.030 2014: Killing connection from >>>> because the group is not ready for it to >>>> rejoin, err 46 >>>> Tue Aug 19 11:04:01.092 2014: Killing connection from >>>> ** because the group is not ready for it >>>> to rejoin, err 46 >>>> Tue Aug 19 11:04:03.220 2014: Killing connection from >>>> because the group is not ready for it to >>>> rejoin, err 46 >>>> Tue Aug 19 11:04:09.034 2014: Killing connection from >>>> because the group is not ready for it to >>>> rejoin, err 46 >>>> Tue Aug 19 11:04:12.096 2014: Killing connection from >>>> ** because the group is not ready for it >>>> to rejoin, err 46 >>>> Tue Aug 19 11:04:14.224 2014: Killing connection from >>>> because the group is not ready for it to >>>> rejoin, err 46 >>>> Tue Aug 19 11:04:20.037 2014: Killing connection from >>>> because the group is not ready for it to >>>> rejoin, err 46 >>>> Tue Aug 19 11:04:23.103 2014: Accepted and connected to >>>> ** ebi5-220 >>>> ... >>>> >>>> *GSS02a ( NSD SERVER)* >>>> Tue Aug 19 11:03:04.980 2014: Expel >>>> (gss02b) request from (ebi5-220 in >>>> ebi-cluster.ebi.ac.uk ). >>>> Expelling: (ebi5-220 in >>>> ebi-cluster.ebi.ac.uk ) >>>> Tue Aug 19 11:03:12.069 2014: Accepted and connected to >>>> ebi5-220 >>>> >>>> >>>> =============================================== >>>> *EXAMPLE 2*: >>>> >>>> *EBI5-038* >>>> Tue Aug 19 11:32:34.227 2014: *Disk lease period >>>> expired in cluster GSS.ebi.ac.uk >>>> . Attempting to reacquire lease.* >>>> Tue Aug 19 11:33:34.258 2014: *Lease is overdue. >>>> Probing cluster GSS.ebi.ac.uk * >>>> Tue Aug 19 11:35:24.265 2014: Close connection to >>>> gss02a (Connection reset by peer). >>>> Attempting reconnect. >>>> Tue Aug 19 11:35:24.865 2014: Close connection to >>>> ebi5-014 (Connection reset by >>>> peer). Attempting reconnect. >>>> ... >>>> LOT MORE RESETS BY PEER >>>> ... >>>> Tue Aug 19 11:35:25.096 2014: Close connection to >>>> ebi5-167 (Connection reset by >>>> peer). Attempting reconnect. >>>> Tue Aug 19 11:35:25.267 2014: Connecting to >>>> gss02a >>>> Tue Aug 19 11:35:25.268 2014: Close connection to >>>> gss02a (Connection failed because >>>> destination is still processing previous node failure) >>>> Tue Aug 19 11:35:26.267 2014: Retry connection to >>>> gss02a >>>> Tue Aug 19 11:35:26.268 2014: Close connection to >>>> gss02a (Connection failed because >>>> destination is still processing previous node failure) >>>> Tue Aug 19 11:36:24.276 2014: Unable to contact any >>>> quorum nodes during cluster probe. 
>>>> Tue Aug 19 11:36:24.277 2014: *Lost membership in >>>> cluster GSS.ebi.ac.uk . >>>> Unmounting file systems.* >>>> >>>> *GSS02a* >>>> Tue Aug 19 11:35:24.263 2014: Node >>>> (ebi5-038 in ebi-cluster.ebi.ac.uk >>>> ) *is being expelled >>>> because of an expired lease.* Pings sent: 60. Replies >>>> received: 60. >>>> >>>> >>>> >>>> >>>> In example 1 seems that an NSD was not repliyng to the client, but >>>> the servers seems working fine.. how can i trace better ( to solve) >>>> the problem? >>>> >>>> In example 2 it seems to me that for some reason the manager are >>>> not renewing the lease in time. when this happens , its not a >>>> single client. >>>> Loads of them fail to get the lease renewed. Why this is happening? >>>> how can i trace to the source of the problem? >>>> >>>> >>>> >>>> Thanks in advance for any tips. >>>> >>>> Regards, >>>> Salvatore >>>> >>>> >>>> >>>> >>>> >>>> >>>> >>>> >>>> _______________________________________________ >>>> gpfsug-discuss mailing list >>>> gpfsug-discuss atgpfsug.org >>>> http://gpfsug.org/mailman/listinfo/gpfsug-discuss >>> >>> >>> >>> _______________________________________________ >>> gpfsug-discuss mailing list >>> gpfsug-discuss atgpfsug.org >>> http://gpfsug.org/mailman/listinfo/gpfsug-discuss >> >> _______________________________________________ >> gpfsug-discuss mailing list >> gpfsug-discuss at gpfsug.org >> http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at gpfsug.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss -------------- next part -------------- An HTML attachment was scrubbed... URL: From service at metamodul.com Thu Aug 21 14:19:33 2014 From: service at metamodul.com (service at metamodul.com) Date: Thu, 21 Aug 2014 15:19:33 +0200 (CEST) Subject: [gpfsug-discuss] gpfs client expels In-Reply-To: <53F5B62F.1060305@ebi.ac.uk> References: <53EE0BB1.8000005@ebi.ac.uk> <53F454E3.40803@ebi.ac.uk> <53F5ABD7.80107@gpfsug.org> <53F5B62F.1060305@ebi.ac.uk> Message-ID: <1481989063.92260.1408627173332.open-xchange@oxbaltgw09.schlund.de> > Now, while its understandable that the destination file is locked for one of > the "cat", so have to wait If GPFS is POSIX compatible I do not understand why one cat should block the other completely; on a standard FS you can "cat" from many sources to the same target, even if the result is not predictable. From this point of view I would expect both "cat" processes to start writing immediately, so I would suspect a GPFS bug. All imho. Hajo Note: You might test with the input_file in a different directory, and I would also test the behaviour when the output_file is on a local FS like /tmp. -------------- next part -------------- An HTML attachment was scrubbed... URL: From viccornell at gmail.com Thu Aug 21 14:22:22 2014 From: viccornell at gmail.com (Vic Cornell) Date: Thu, 21 Aug 2014 14:22:22 +0100 Subject: [gpfsug-discuss] gpfs client expels In-Reply-To: <53F5F19B.1010603@ebi.ac.uk> References: <53EE0BB1.8000005@ebi.ac.uk> <53F454E3.40803@ebi.ac.uk> <53F5ABD7.80107@gpfsug.org> <53F5B62F.1060305@ebi.ac.uk> <9B247872-CD75-4F86-A10E-33AAB6BD414A@gmail.com> <53F5F19B.1010603@ebi.ac.uk> Message-ID: <0F03996A-2008-4076-9A2B-B4B2BB89E959@gmail.com> For my system I always use a dedicated admin network - as described in the gpfs manuals - for a gpfs cluster on 10/40GbE where the system will be heavily loaded. The difference in the stability of the system is very noticeable.
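The per-node split itself is only a couple of commands - roughly the sketch below, assuming each node has (or can be given) a second interface; the "-adm" hostname is made up for illustration and ebi5-220 is just one of your client names:

    # current admin and daemon node names
    mmlscluster

    # move administrative (ssh / mm command) traffic to a dedicated interface
    mmchnode --admin-interface=ebi5-220-adm -N ebi5-220

    # daemon/data traffic can be steered separately if needed, e.g. with
    # mmchnode --daemon-interface=... or the 'subnets' configuration option

The idea is simply to keep the administrative and control traffic off the links that carry the bulk NSD data.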
Not sure how/if this would work on GSS - IBM ought to know :-) Vic On 21 Aug 2014, at 14:18, Salvatore Di Nardo wrote: > This is an interesting point! > > We use ethernet ( 10g links on the clients) but we dont have a separate network for the admin network. > > Could you explain this a bit further, because the clients and the servers we have are on different subnet so the packet are routed.. I don't see a practical way to separate them. The clients are blades in a chassis so even if i create 2 interfaces, they will physically use the came "cable" to go to the first switch. even the clients ( 600 clients) have different subsets. > > I will forward this consideration to our network admin , so see if we can work on a dedicated network. > > thanks for your tip. > > Regards, > Salvatore > > > > > On 21/08/14 14:03, Vic Cornell wrote: >> Hi Salvatore, >> >> Are you using ethernet or infiniband as the GPFS interconnect to your clients? >> >> If 10/40GbE - do you have a separate admin network? >> >> I have seen behaviour similar to this where the storage traffic causes congestion and the "admin" traffic gets lost or delayed causing expels. >> >> Vic >> >> >> >> On 21 Aug 2014, at 10:04, Salvatore Di Nardo wrote: >> >>> Thanks for the feedback, but we managed to find a scenario that excludes network problems. >>> >>> we have a file called input_file of nearly 100GB: >>> >>> if from client A we do: >>> >>> cat input_file >> output_file >>> >>> it start copying.. and we see waiter goeg a bit up,secs but then they flushes back to 0, so we xcan say that the copy proceed well... >>> >>> >>> if now we do the same from another client ( or just another shell on the same client) client B : >>> >>> cat input_file >> output_file >>> >>> >>> ( in other words we are trying to write to the same destination) all the waiters gets up until one node get expelled. >>> >>> >>> Now, while its understandable that the destination file is locked for one of the "cat", so have to wait ( and since the file is BIG , have to wait for a while), its not understandable why it stop the renewal lease. >>> Why its doen't return just a timeout error on the copy instead to expel the node? We can reproduce this every time, and since our users to operations like this on files over 100GB each you can imagine the result. >>> >>> >>> >>> As you can imagine even if its a bit silly to write at the same time to the same destination, its also quite common if we want to dump to a log file logs and for some reason one of the writers, write for a lot of time keeping the file locked. >>> Our expels are not due to network congestion, but because a write attempts have to wait another one. What i really dont understand is why to take a so expreme mesure to expell jest because a process is waiteing "to too much time". >>> >>> >>> I have ticket opened to IBM for this and the issue is under investigation, but no luck so far.. >>> >>> Regards, >>> Salvatore >>> >>> >>> >>> On 21/08/14 09:20, Jez Tucker (Chair) wrote: >>>> Hi there, >>>> >>>> I've seen the on several 'stock'? 'core'? GPFS system (we need a better term now GSS is out) and seen ping 'working', but alongside ejections from the cluster. >>>> The GPFS internode 'ping' is somewhat more circumspect than unix ping - and rightly so. >>>> >>>> In my experience this has _always_ been a network issue of one sort of another. If the network is experiencing issues, nodes will be ejected. 
>>>> Of course it could be unresponsive mmfsd or high loadavg, but I've seen that only twice in 10 years over many versions of GPFS. >>>> >>>> You need to follow the logs through from each machine in time order to determine who could not see who and in what order. >>>> Your best way forward is to log a SEV2 case with IBM support, directly or via your OEM and collect and supply a snap and traces as required by support. >>>> >>>> Without knowing your full setup, it's hard to help further. >>>> >>>> Jez >>>> >>>> On 20/08/14 08:57, Salvatore Di Nardo wrote: >>>>> Still problems. Here some more detailed examples: >>>>> >>>>> EXAMPLE 1: >>>>> EBI5-220 ( CLIENT) >>>>> Tue Aug 19 11:03:04.980 2014: Timed out waiting for a reply from node gss02b >>>>> Tue Aug 19 11:03:04.981 2014: Request sent to (gss02a in GSS.ebi.ac.uk) to expel (gss02b in GSS.ebi.ac.uk) from cluster GSS.ebi.ac.uk >>>>> Tue Aug 19 11:03:04.982 2014: This node will be expelled from cluster GSS.ebi.ac.uk due to expel msg from (ebi5-220) >>>>> Tue Aug 19 11:03:09.319 2014: Cluster Manager connection broke. Probing cluster GSS.ebi.ac.uk >>>>> Tue Aug 19 11:03:10.321 2014: Unable to contact any quorum nodes during cluster probe. >>>>> Tue Aug 19 11:03:10.322 2014: Lost membership in cluster GSS.ebi.ac.uk. Unmounting file systems. >>>>> Tue Aug 19 11:03:10 BST 2014: mmcommon preunmount invoked. File system: gpfs1 Reason: SGPanic >>>>> Tue Aug 19 11:03:12.066 2014: Connecting to gss02a >>>>> Tue Aug 19 11:03:12.070 2014: Connected to gss02a >>>>> Tue Aug 19 11:03:17.071 2014: Connecting to gss02b >>>>> Tue Aug 19 11:03:17.072 2014: Connecting to gss03b >>>>> Tue Aug 19 11:03:17.079 2014: Connecting to gss03a >>>>> Tue Aug 19 11:03:17.080 2014: Connecting to gss01b >>>>> Tue Aug 19 11:03:17.079 2014: Connecting to gss01a >>>>> Tue Aug 19 11:04:23.105 2014: Connected to gss02b >>>>> Tue Aug 19 11:04:23.107 2014: Connected to gss03b >>>>> Tue Aug 19 11:04:23.112 2014: Connected to gss03a >>>>> Tue Aug 19 11:04:23.115 2014: Connected to gss01b >>>>> Tue Aug 19 11:04:23.121 2014: Connected to gss01a >>>>> Tue Aug 19 11:12:28.992 2014: Node (gss02a in GSS.ebi.ac.uk) is now the Group Leader. >>>>> >>>>> GSS02B ( NSD SERVER) >>>>> ... 
>>>>> Tue Aug 19 11:03:17.070 2014: Killing connection from because the group is not ready for it to rejoin, err 46 >>>>> Tue Aug 19 11:03:25.016 2014: Killing connection from because the group is not ready for it to rejoin, err 46 >>>>> Tue Aug 19 11:03:28.080 2014: Killing connection from because the group is not ready for it to rejoin, err 46 >>>>> Tue Aug 19 11:03:36.019 2014: Killing connection from because the group is not ready for it to rejoin, err 46 >>>>> Tue Aug 19 11:03:39.083 2014: Killing connection from because the group is not ready for it to rejoin, err 46 >>>>> Tue Aug 19 11:03:47.023 2014: Killing connection from because the group is not ready for it to rejoin, err 46 >>>>> Tue Aug 19 11:03:50.088 2014: Killing connection from because the group is not ready for it to rejoin, err 46 >>>>> Tue Aug 19 11:03:52.218 2014: Killing connection from because the group is not ready for it to rejoin, err 46 >>>>> Tue Aug 19 11:03:58.030 2014: Killing connection from because the group is not ready for it to rejoin, err 46 >>>>> Tue Aug 19 11:04:01.092 2014: Killing connection from because the group is not ready for it to rejoin, err 46 >>>>> Tue Aug 19 11:04:03.220 2014: Killing connection from because the group is not ready for it to rejoin, err 46 >>>>> Tue Aug 19 11:04:09.034 2014: Killing connection from because the group is not ready for it to rejoin, err 46 >>>>> Tue Aug 19 11:04:12.096 2014: Killing connection from because the group is not ready for it to rejoin, err 46 >>>>> Tue Aug 19 11:04:14.224 2014: Killing connection from because the group is not ready for it to rejoin, err 46 >>>>> Tue Aug 19 11:04:20.037 2014: Killing connection from because the group is not ready for it to rejoin, err 46 >>>>> Tue Aug 19 11:04:23.103 2014: Accepted and connected to ebi5-220 >>>>> ... >>>>> >>>>> GSS02a ( NSD SERVER) >>>>> Tue Aug 19 11:03:04.980 2014: Expel (gss02b) request from (ebi5-220 in ebi-cluster.ebi.ac.uk). Expelling: (ebi5-220 in ebi-cluster.ebi.ac.uk) >>>>> Tue Aug 19 11:03:12.069 2014: Accepted and connected to ebi5-220 >>>>> >>>>> >>>>> =============================================== >>>>> EXAMPLE 2: >>>>> >>>>> EBI5-038 >>>>> Tue Aug 19 11:32:34.227 2014: Disk lease period expired in cluster GSS.ebi.ac.uk. Attempting to reacquire lease. >>>>> Tue Aug 19 11:33:34.258 2014: Lease is overdue. Probing cluster GSS.ebi.ac.uk >>>>> Tue Aug 19 11:35:24.265 2014: Close connection to gss02a (Connection reset by peer). Attempting reconnect. >>>>> Tue Aug 19 11:35:24.865 2014: Close connection to ebi5-014 (Connection reset by peer). Attempting reconnect. >>>>> ... >>>>> LOT MORE RESETS BY PEER >>>>> ... >>>>> Tue Aug 19 11:35:25.096 2014: Close connection to ebi5-167 (Connection reset by peer). Attempting reconnect. >>>>> Tue Aug 19 11:35:25.267 2014: Connecting to gss02a >>>>> Tue Aug 19 11:35:25.268 2014: Close connection to gss02a (Connection failed because destination is still processing previous node failure) >>>>> Tue Aug 19 11:35:26.267 2014: Retry connection to gss02a >>>>> Tue Aug 19 11:35:26.268 2014: Close connection to gss02a (Connection failed because destination is still processing previous node failure) >>>>> Tue Aug 19 11:36:24.276 2014: Unable to contact any quorum nodes during cluster probe. >>>>> Tue Aug 19 11:36:24.277 2014: Lost membership in cluster GSS.ebi.ac.uk. Unmounting file systems. >>>>> >>>>> GSS02a >>>>> Tue Aug 19 11:35:24.263 2014: Node (ebi5-038 in ebi-cluster.ebi.ac.uk) is being expelled because of an expired lease. Pings sent: 60. 
Replies received: 60. >>>>> >>>>> >>>>> >>>>> In example 1 seems that an NSD was not repliyng to the client, but the servers seems working fine.. how can i trace better ( to solve) the problem? >>>>> >>>>> In example 2 it seems to me that for some reason the manager are not renewing the lease in time. when this happens , its not a single client. >>>>> Loads of them fail to get the lease renewed. Why this is happening? how can i trace to the source of the problem? >>>>> >>>>> >>>>> >>>>> Thanks in advance for any tips. >>>>> >>>>> Regards, >>>>> Salvatore >>>>> >>>>> >>>>> >>>>> >>>>> >>>>> >>>>> >>>>> >>>>> _______________________________________________ >>>>> gpfsug-discuss mailing list >>>>> gpfsug-discuss at gpfsug.org >>>>> http://gpfsug.org/mailman/listinfo/gpfsug-discuss >>>> >>>> >>>> >>>> _______________________________________________ >>>> gpfsug-discuss mailing list >>>> gpfsug-discuss at gpfsug.org >>>> http://gpfsug.org/mailman/listinfo/gpfsug-discuss >>> >>> _______________________________________________ >>> gpfsug-discuss mailing list >>> gpfsug-discuss at gpfsug.org >>> http://gpfsug.org/mailman/listinfo/gpfsug-discuss >> >> >> >> _______________________________________________ >> gpfsug-discuss mailing list >> gpfsug-discuss at gpfsug.org >> http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at gpfsug.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss -------------- next part -------------- An HTML attachment was scrubbed... URL: From sdinardo at ebi.ac.uk Fri Aug 22 10:37:42 2014 From: sdinardo at ebi.ac.uk (Salvatore Di Nardo) Date: Fri, 22 Aug 2014 10:37:42 +0100 Subject: [gpfsug-discuss] gpfs client expels, fs hangind and waiters In-Reply-To: <53EE0BB1.8000005@ebi.ac.uk> References: <53EE0BB1.8000005@ebi.ac.uk> Message-ID: <53F70F66.2010405@ebi.ac.uk> Hello everyone, Just to let you know, we found the cause of our problems. We discovered that not all of the recommend kernel setting was configured on the clients ( on server was everything ok, but the clients had some setting missing ), and IBM support pointed to this document that describes perfectly our issues and the fix wich suggest to raise some parameters even higher than the standard "best practice" : http://www-947.ibm.com/support/entry/portal/docdisplay?lndocid=migr-5091222 Thanks to everyone for the replies. Regards, Salvatore From ewahl at osc.edu Mon Aug 25 19:55:08 2014 From: ewahl at osc.edu (Ed Wahl) Date: Mon, 25 Aug 2014 18:55:08 +0000 Subject: [gpfsug-discuss] CNFS using NFS over RDMA? Message-ID: Anyone out there doing CNFS with NFS over RDMA? Is this even possible? We currently have been delivering some CNFS services using TCP over IB, but that layer tends to have a large number of bugs all the time. Like to take a look at moving back down to verbs... Ed Wahl OSC -------------- next part -------------- An HTML attachment was scrubbed... URL: From zander at ebi.ac.uk Fri Aug 1 14:44:49 2014 From: zander at ebi.ac.uk (Zander Mears) Date: Fri, 01 Aug 2014 14:44:49 +0100 Subject: [gpfsug-discuss] Hello! In-Reply-To: <53D981EF.3020000@gpfsug.org> References: <53D8C897.9000902@ebi.ac.uk> <53D981EF.3020000@gpfsug.org> Message-ID: <53DB99D1.8050304@ebi.ac.uk> Hi Jez We're just monitoring the standard OS stuff, some interface errors, throughput, number of network and gpfs connections due to previous issues. We don't really know as yet what is good to monitor GPFS wise. 
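A minimal sketch of the sort of agent-side helper that could feed Zabbix (or any poller) with a few GPFS health numbers; given how much of this thread is about expels and long waiters, those are the obvious things to trend. The metric keys, the 10-second threshold and the script path are assumptions, not GPFS or Zabbix defaults:

    #!/bin/bash
    # Hypothetical helper returning single GPFS health values for a monitoring
    # agent. Run on each NSD server / client you want to watch.
    MMDIAG=/usr/lpp/mmfs/bin/mmdiag
    LOG=/var/adm/ras/mmfs.log.latest

    case "$1" in
      waiters.total)
        # threads currently waiting inside mmfsd on this node
        $MMDIAG --waiters 2>/dev/null | grep -c 'waiting '
        ;;
      waiters.long)
        # waiters older than 10 seconds (illustrative threshold) - these tend
        # to show up well before lease timeouts and expels
        $MMDIAG --waiters 2>/dev/null | awk '$2 == "waiting" && $3 > 10 {n++} END {print n+0}'
        ;;
      expels.logged)
        # expel messages in the current GPFS log (since the last daemon restart)
        c=$(grep -ci 'expel' "$LOG" 2>/dev/null)
        echo "${c:-0}"
        ;;
      *)
        echo "usage: $0 {waiters.total|waiters.long|expels.logged}" >&2
        exit 1
        ;;
    esac

Wired in with a line such as UserParameter=gpfs.waiters.long,/usr/local/bin/gpfs_metrics.sh waiters.long in zabbix_agentd.conf (path and key name are made up for the example), a rising long-waiter count is usually visible well before nodes start being expelled.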
cheers Zander On 31/07/2014 00:38, Jez Tucker (Chair) wrote: > Hi Zander, > > We have a git repository. Would you be interested in adding any > Zabbix custom metrics gathering to GPFS to it? > > https://github.com/gpfsug/gpfsug-tools > > Best, > > Jez From sfadden at us.ibm.com Tue Aug 5 18:55:20 2014 From: sfadden at us.ibm.com (Scott Fadden) Date: Tue, 5 Aug 2014 10:55:20 -0700 Subject: [gpfsug-discuss] GPFS and Lustre on same node Message-ID: Is anyone running GPFS and Lustre on the same nodes. I have seen it work, I have heard people are doing it, I am looking for some confirmation. Thanks Scott Fadden GPFS Technical Marketing Phone: (503) 880-5833 sfadden at us.ibm.com http://www.ibm.com/systems/gpfs -------------- next part -------------- An HTML attachment was scrubbed... URL: From u.sibiller at science-computing.de Wed Aug 6 08:46:31 2014 From: u.sibiller at science-computing.de (Ulrich Sibiller) Date: Wed, 06 Aug 2014 09:46:31 +0200 Subject: [gpfsug-discuss] GPFS and Lustre on same node In-Reply-To: References: Message-ID: <53E1DD57.90103@science-computing.de> Am 05.08.2014 19:55, schrieb Scott Fadden: > Is anyone running GPFS and Lustre on the same nodes. I have seen it work, I have heard people are > doing it, I am looking for some confirmation. I have some nodes running lustre 2.1.6 or 2.5.58 and gpfs 3.5.0.17 on RHEL5.8 and RHEL6.5. None of them are servers. Kind regards, Ulrich Sibiller -- ______________________________________creating IT solutions Dipl.-Inf. Ulrich Sibiller science + computing ag System Administration Hagellocher Weg 73 mail nfz at science-computing.de 72070 Tuebingen, Germany hotline +49 7071 9457 674 http://www.science-computing.de -- Vorstandsvorsitzender/Chairman of the board of management: Gerd-Lothar Leonhart Vorstand/Board of Management: Dr. Bernd Finkbeiner, Michael Heinrichs, Dr. Arno Steitz Vorsitzender des Aufsichtsrats/ Chairman of the Supervisory Board: Philippe Miltin Sitz/Registered Office: Tuebingen Registergericht/Registration Court: Stuttgart Registernummer/Commercial Register No.: HRB 382196 From frederik.ferner at diamond.ac.uk Wed Aug 6 10:19:35 2014 From: frederik.ferner at diamond.ac.uk (Frederik Ferner) Date: Wed, 6 Aug 2014 10:19:35 +0100 Subject: [gpfsug-discuss] GPFS and Lustre on same node In-Reply-To: References: Message-ID: <53E1F327.1000605@diamond.ac.uk> On 05/08/14 18:55, Scott Fadden wrote: > Is anyone running GPFS and Lustre on the same nodes. I have seen it > work, I have heard people are doing it, I am looking for some confirmation. Most of our compute cluster nodes are clients for Lustre and GPFS at the same time. Lustre 1.8.9-wc1 and GPFS 3.5.0.11. Nothing shared on servers (GPFS NSD server or Lustre OSS/MDS servers). HTH, Frederik -- Frederik Ferner Senior Computer Systems Administrator phone: +44 1235 77 8624 Diamond Light Source Ltd. mob: +44 7917 08 5110 (Apologies in advance for the lines below. Some bits are a legal requirement and I have no control over them.) -- This e-mail and any attachments may contain confidential, copyright and or privileged material, and are for the use of the intended addressee only. If you are not the intended addressee or an authorised recipient of the addressee please notify us of receipt by returning the e-mail and do not use, copy, retain, distribute or disclose the information in or attached to the e-mail. Any opinions expressed within this e-mail are those of the individual and not necessarily of Diamond Light Source Ltd. Diamond Light Source Ltd. 
cannot guarantee that this e-mail or any attachments are free from viruses and we cannot accept liability for any damage which you may sustain as a result of software viruses which may be transmitted in or with the message. Diamond Light Source Limited (company no. 4375679). Registered in England and Wales with its registered office at Diamond House, Harwell Science and Innovation Campus, Didcot, Oxfordshire, OX11 0DE, United Kingdom From sdinardo at ebi.ac.uk Wed Aug 6 10:57:44 2014 From: sdinardo at ebi.ac.uk (Salvatore Di Nardo) Date: Wed, 06 Aug 2014 10:57:44 +0100 Subject: [gpfsug-discuss] GPFS and Lustre on same node In-Reply-To: <53E1F327.1000605@diamond.ac.uk> References: <53E1F327.1000605@diamond.ac.uk> Message-ID: <53E1FC18.6080707@ebi.ac.uk> Sorry for this little ot, but recetly i'm looking to Lustre to understand how it is comparable to GPFS in terms of performance, reliability and easy to use. Could anyone share their experience ? My company just recently got a first GPFS system , based on IBM GSS, but while its good performance wise, there are few unresolved problems and the IBM support is almost unexistent, so I'm starting to wonder if its work to look somewhere else eventual future purchases. Salvatore On 06/08/14 10:19, Frederik Ferner wrote: > On 05/08/14 18:55, Scott Fadden wrote: >> Is anyone running GPFS and Lustre on the same nodes. I have seen it >> work, I have heard people are doing it, I am looking for some >> confirmation. > > Most of our compute cluster nodes are clients for Lustre and GPFS at > the same time. Lustre 1.8.9-wc1 and GPFS 3.5.0.11. Nothing shared on > servers (GPFS NSD server or Lustre OSS/MDS servers). > > HTH, > Frederik > From chair at gpfsug.org Wed Aug 6 11:19:24 2014 From: chair at gpfsug.org (Jez Tucker (Chair)) Date: Wed, 06 Aug 2014 11:19:24 +0100 Subject: [gpfsug-discuss] GPFS and Lustre on same node In-Reply-To: <53E1FC18.6080707@ebi.ac.uk> References: <53E1F327.1000605@diamond.ac.uk> <53E1FC18.6080707@ebi.ac.uk> Message-ID: <53E2012C.9040402@gpfsug.org> "IBM support is almost unexistent" I don't find that at all. Do you log directly via ESC or via your OEM/integrator or are you only referring to GSS support rather than pure GPFS? If you are having response issues, your IBM rep (or a few folks on here) can accelerate issues for you. Jez On 06/08/14 10:57, Salvatore Di Nardo wrote: > Sorry for this little ot, but recetly i'm looking to Lustre to > understand how it is comparable to GPFS in terms of performance, > reliability and easy to use. > Could anyone share their experience ? > > My company just recently got a first GPFS system , based on IBM GSS, > but while its good performance wise, there are few unresolved problems > and the IBM support is almost unexistent, so I'm starting to wonder if > its work to look somewhere else eventual future purchases. > > > Salvatore > > On 06/08/14 10:19, Frederik Ferner wrote: >> On 05/08/14 18:55, Scott Fadden wrote: >>> Is anyone running GPFS and Lustre on the same nodes. I have seen it >>> work, I have heard people are doing it, I am looking for some >>> confirmation. >> >> Most of our compute cluster nodes are clients for Lustre and GPFS at >> the same time. Lustre 1.8.9-wc1 and GPFS 3.5.0.11. Nothing shared on >> servers (GPFS NSD server or Lustre OSS/MDS servers). 
>> >> HTH, >> Frederik >> > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at gpfsug.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss From service at metamodul.com Wed Aug 6 14:26:47 2014 From: service at metamodul.com (service at metamodul.com) Date: Wed, 6 Aug 2014 15:26:47 +0200 (CEST) Subject: [gpfsug-discuss] Hi , i am new to this list Message-ID: <1366482624.222989.1407331607965.open-xchange@oxbaltgw55.schlund.de> Hi @ALL i am Hajo Ehlers , an AIX and GPFS specialist ( Unix System Engineer ). You find me at the IBM GPFS Forum and sometimes at news:c.u.a and I am addicted to cluster filesystems My latest idee is an SAP-HANA light system ( DBMS on an in-memory cluster posix FS ) which could be extended to a "reinvented" Cluster based AS/400 ^_^ I wrote also a small script to do a sequential backup of GPFS filesystems since i got never used to mmbackup - i named it "pdsmc" for parallel dsmc". Cheers Hajo BTW: Please let me know - service (at) metamodul (dot) com - In case somebody is looking for a GPFS specialist. -------------- next part -------------- An HTML attachment was scrubbed... URL: From sdinardo at ebi.ac.uk Fri Aug 8 10:53:36 2014 From: sdinardo at ebi.ac.uk (Salvatore Di Nardo) Date: Fri, 08 Aug 2014 10:53:36 +0100 Subject: [gpfsug-discuss] GPFS and Lustre on same node In-Reply-To: <53E2012C.9040402@gpfsug.org> References: <53E1F327.1000605@diamond.ac.uk> <53E1FC18.6080707@ebi.ac.uk> <53E2012C.9040402@gpfsug.org> Message-ID: <53E49E20.1090905@ebi.ac.uk> Well, i didn't wanted to start a rant against IBM, and I'm referring specifically to GSS. Since GSS its an appliance, we have to refer to GSS support for both hardware and software issues. Hardware support in total crap. It took 1 mounth of chasing and shouting to get a drawer replacement that was causing some issues. Meanwhile 10 disks in that drawer got faulty. Finally we got the drawer replace but the disks are still faulty. Now its 3 days i'm triing to get them fixed or replaced ( its not clear if they disks are broken of they was just marked to be replaced because of the drawer). Right now i dont have any answer about how to put them online ( mmchcarrier don't work because it recognize that the disk where not replaced) There are also few other cases ( gpfs related) open that are still not answered. I have no experience with direct GPFS support, but if i open a case to GSS for a GPFS problem, the cases seems never get an answer. The only reason that GSS is working its because _*I*_**installed it spending few months studying gpfs. So now I'm wondering if its worth at all rely in future on the whole appliance concept. I'm wondering if in future its better just purchase the hardware and install GPFS by our own, or in alternatively even try Lustre. Now, skipping all this GSS rant, which have nothing to do with the file system anyway and going back to my question: Could someone point the main differences between GPFS and Lustre? I found some documentation about Lustre and i'm going to have a look, but oddly enough have not found any practical comparison between them. On 06/08/14 11:19, Jez Tucker (Chair) wrote: > "IBM support is almost unexistent" > > I don't find that at all. > Do you log directly via ESC or via your OEM/integrator or are you only > referring to GSS support rather than pure GPFS? > > If you are having response issues, your IBM rep (or a few folks on > here) can accelerate issues for you. 
> > Jez > > > On 06/08/14 10:57, Salvatore Di Nardo wrote: >> Sorry for this little ot, but recetly i'm looking to Lustre to >> understand how it is comparable to GPFS in terms of performance, >> reliability and easy to use. >> Could anyone share their experience ? >> >> My company just recently got a first GPFS system , based on IBM GSS, >> but while its good performance wise, there are few unresolved >> problems and the IBM support is almost unexistent, so I'm starting to >> wonder if its work to look somewhere else eventual future purchases. >> >> >> Salvatore >> >> On 06/08/14 10:19, Frederik Ferner wrote: >>> On 05/08/14 18:55, Scott Fadden wrote: >>>> Is anyone running GPFS and Lustre on the same nodes. I have seen it >>>> work, I have heard people are doing it, I am looking for some >>>> confirmation. >>> >>> Most of our compute cluster nodes are clients for Lustre and GPFS at >>> the same time. Lustre 1.8.9-wc1 and GPFS 3.5.0.11. Nothing shared on >>> servers (GPFS NSD server or Lustre OSS/MDS servers). >>> >>> HTH, >>> Frederik >>> >> >> _______________________________________________ >> gpfsug-discuss mailing list >> gpfsug-discuss at gpfsug.org >> http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at gpfsug.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss -------------- next part -------------- An HTML attachment was scrubbed... URL: From jpro at bas.ac.uk Fri Aug 8 12:40:00 2014 From: jpro at bas.ac.uk (Jeremy Robst) Date: Fri, 8 Aug 2014 12:40:00 +0100 (BST) Subject: [gpfsug-discuss] GPFS and Lustre on same node In-Reply-To: <53E49E20.1090905@ebi.ac.uk> References: <53E1F327.1000605@diamond.ac.uk> <53E1FC18.6080707@ebi.ac.uk> <53E2012C.9040402@gpfsug.org> <53E49E20.1090905@ebi.ac.uk> Message-ID: On Fri, 8 Aug 2014, Salvatore Di Nardo wrote: > Now, skipping all this GSS rant, which have nothing to do with the file > system anyway? and? going back to my question: > > Could someone point the main differences between GPFS and Lustre? I'm looking at making the same decision here - to buy GPFS or to roll our own Lustre configuration. I'm in the process of setting up test systems, and so far the main difference seems to be in the that in GPFS each server sees the full filesystem, and so you can run other applications (e.g backup) on a GPFS server whereas the Luste OSS (object storage servers) see only a portion of the storage (the filesystem is striped across the OSSes), so you need a Lustre client to mount the full filesystem for things like backup. However I have very little practical experience of either and would also be interested in any comments. Thanks Jeremy -- jpro at bas.ac.uk | (work) 01223 221402 (fax) 01223 362616 Unix System Administrator - British Antarctic Survey #include From keith at ocf.co.uk Fri Aug 8 14:12:39 2014 From: keith at ocf.co.uk (Keith Vickers) Date: Fri, 8 Aug 2014 14:12:39 +0100 Subject: [gpfsug-discuss] GPFS and Lustre on same node Message-ID: http://www.pdsw.org/pdsw10/resources/posters/parallelNASFSs.pdf Has a good direct apples to apples comparison between Lustre and GPFS. It's pretty much abstractable from the hardware used. 
Keith Vickers Business Development Manager OCF plc Mobile: 07974 397863 From sergi.more at bsc.es Fri Aug 8 14:14:33 2014 From: sergi.more at bsc.es (=?ISO-8859-1?Q?Sergi_Mor=E9_Codina?=) Date: Fri, 08 Aug 2014 15:14:33 +0200 Subject: [gpfsug-discuss] GPFS and Lustre on same node In-Reply-To: References: <53E1F327.1000605@diamond.ac.uk> <53E1FC18.6080707@ebi.ac.uk> <53E2012C.9040402@gpfsug.org> <53E49E20.1090905@ebi.ac.uk> Message-ID: <53E4CD39.7080808@bsc.es> Hi all, About main differences between GPFS and Lustre, here you have some bits from our experience: -Reliability: GPFS its been proved to be more stable and reliable. Also offers more flexibility in terms of fail-over. It have no restriction in number of servers. As far as I know, an NSD can have as many secondary servers as you want (we are using 8). -Metadata: In Lustre each file system is restricted to two servers. No restriction in GPFS. -Updates: In GPFS you can update the whole storage cluster without stopping production, one server at a time. -Server/Client role: As Jeremy said, in GPFS every server act as a client as well. Useful for administrative tasks. -Troubleshooting: Problems with GPFS are easier to track down. Logs are more clear, and offers better tools than Lustre. -Support: No problems at all with GPFS support. It is true that it could take time to go up within all support levels, but we always got a good solution. Quite different in terms of hardware. IBM support quality has drop a lot since about last year an a half. Really slow and tedious process to get replacements. Moreover, we keep receiving bad "certified reutilitzed parts" hardware, which slow the whole process even more. These are the main differences I would stand out after some years of experience with both file systems, but do not take it as a fact. PD: Salvatore, I would suggest you to contact Jordi Valls. He joined EBI a couple of months ago, and has experience working with both file systems here at BSC. Best Regards, Sergi. On 08/08/2014 01:40 PM, Jeremy Robst wrote: > On Fri, 8 Aug 2014, Salvatore Di Nardo wrote: > >> Now, skipping all this GSS rant, which have nothing to do with the file >> system anyway and going back to my question: >> >> Could someone point the main differences between GPFS and Lustre? > > I'm looking at making the same decision here - to buy GPFS or to roll > our own Lustre configuration. I'm in the process of setting up test > systems, and so far the main difference seems to be in the that in GPFS > each server sees the full filesystem, and so you can run other > applications (e.g backup) on a GPFS server whereas the Luste OSS (object > storage servers) see only a portion of the storage (the filesystem is > striped across the OSSes), so you need a Lustre client to mount the full > filesystem for things like backup. > > However I have very little practical experience of either and would also > be interested in any comments. 
> > Thanks > > Jeremy > > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at gpfsug.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > -- ------------------------------------------------------------------------ Sergi More Codina Barcelona Supercomputing Center Centro Nacional de Supercomputacion WWW: http://www.bsc.es Tel: +34-93-405 42 27 e-mail: sergi.more at bsc.es Fax: +34-93-413 77 21 ------------------------------------------------------------------------ WARNING / LEGAL TEXT: This message is intended only for the use of the individual or entity to which it is addressed and may contain information which is privileged, confidential, proprietary, or exempt from disclosure under applicable law. If you are not the intended recipient or the person responsible for delivering the message to the intended recipient, you are strictly prohibited from disclosing, distributing, copying, or in any way using this message. If you have received this communication in error, please notify the sender and destroy and delete any copies you may have received. http://www.bsc.es/disclaimer.htm -------------- next part -------------- A non-text attachment was scrubbed... Name: smime.p7s Type: application/pkcs7-signature Size: 3242 bytes Desc: S/MIME Cryptographic Signature URL: From viccornell at gmail.com Fri Aug 8 18:15:30 2014 From: viccornell at gmail.com (Vic Cornell) Date: Fri, 8 Aug 2014 18:15:30 +0100 Subject: [gpfsug-discuss] GPFS and Lustre on same node In-Reply-To: <53E4CD39.7080808@bsc.es> References: <53E1F327.1000605@diamond.ac.uk> <53E1FC18.6080707@ebi.ac.uk> <53E2012C.9040402@gpfsug.org> <53E49E20.1090905@ebi.ac.uk> <53E4CD39.7080808@bsc.es> Message-ID: <4001D2D9-5E74-4EF9-908F-5B0E3443EA5B@gmail.com> Disclaimers - I work for DDN - we sell lustre and GPFS. I know GPFS much better than I know Lustre. The biggest difference we find between GPFS and Lustre is that GPFS - can usually achieve 90% of the bandwidth available to a single client with a single thread. Lustre needs multiple parallel streams to saturate - say an Infiniband connection. Lustre is often faster than GPFS and often has superior metadata performance - particularly where lots of files are created in a single directory. GPFS can support Windows - Lustre cannot. I think GPFS is better integrated and easier to deploy than Lustre - some people disagree with me. Regards, Vic On 8 Aug 2014, at 14:14, Sergi Mor? Codina wrote: > Hi all, > > About main differences between GPFS and Lustre, here you have some bits from our experience: > > -Reliability: GPFS its been proved to be more stable and reliable. Also offers more flexibility in terms of fail-over. It have no restriction in number of servers. As far as I know, an NSD can have as many secondary servers as you want (we are using 8). > > -Metadata: In Lustre each file system is restricted to two servers. No restriction in GPFS. > > -Updates: In GPFS you can update the whole storage cluster without stopping production, one server at a time. > > -Server/Client role: As Jeremy said, in GPFS every server act as a client as well. Useful for administrative tasks. > > -Troubleshooting: Problems with GPFS are easier to track down. Logs are more clear, and offers better tools than Lustre. > > -Support: No problems at all with GPFS support. It is true that it could take time to go up within all support levels, but we always got a good solution. Quite different in terms of hardware. 
IBM support quality has drop a lot since about last year an a half. Really slow and tedious process to get replacements. Moreover, we keep receiving bad "certified reutilitzed parts" hardware, which slow the whole process even more. > > > These are the main differences I would stand out after some years of experience with both file systems, but do not take it as a fact. > > PD: Salvatore, I would suggest you to contact Jordi Valls. He joined EBI a couple of months ago, and has experience working with both file systems here at BSC. > > Best Regards, > Sergi. > > > On 08/08/2014 01:40 PM, Jeremy Robst wrote: >> On Fri, 8 Aug 2014, Salvatore Di Nardo wrote: >> >>> Now, skipping all this GSS rant, which have nothing to do with the file >>> system anyway and going back to my question: >>> >>> Could someone point the main differences between GPFS and Lustre? >> >> I'm looking at making the same decision here - to buy GPFS or to roll >> our own Lustre configuration. I'm in the process of setting up test >> systems, and so far the main difference seems to be in the that in GPFS >> each server sees the full filesystem, and so you can run other >> applications (e.g backup) on a GPFS server whereas the Luste OSS (object >> storage servers) see only a portion of the storage (the filesystem is >> striped across the OSSes), so you need a Lustre client to mount the full >> filesystem for things like backup. >> >> However I have very little practical experience of either and would also >> be interested in any comments. >> >> Thanks >> >> Jeremy >> >> >> _______________________________________________ >> gpfsug-discuss mailing list >> gpfsug-discuss at gpfsug.org >> http://gpfsug.org/mailman/listinfo/gpfsug-discuss >> > > > -- > > ------------------------------------------------------------------------ > > Sergi More Codina > Barcelona Supercomputing Center > Centro Nacional de Supercomputacion > WWW: http://www.bsc.es Tel: +34-93-405 42 27 > e-mail: sergi.more at bsc.es Fax: +34-93-413 77 21 > > ------------------------------------------------------------------------ > > WARNING / LEGAL TEXT: This message is intended only for the use of the > individual or entity to which it is addressed and may contain > information which is privileged, confidential, proprietary, or exempt > from disclosure under applicable law. If you are not the intended > recipient or the person responsible for delivering the message to the > intended recipient, you are strictly prohibited from disclosing, > distributing, copying, or in any way using this message. If you have > received this communication in error, please notify the sender and > destroy and delete any copies you may have received. > > http://www.bsc.es/disclaimer.htm > > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at gpfsug.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss From oehmes at us.ibm.com Fri Aug 8 20:09:44 2014 From: oehmes at us.ibm.com (Sven Oehme) Date: Fri, 8 Aug 2014 12:09:44 -0700 Subject: [gpfsug-discuss] GPFS and Lustre on same node In-Reply-To: <4001D2D9-5E74-4EF9-908F-5B0E3443EA5B@gmail.com> References: <53E1F327.1000605@diamond.ac.uk> <53E1FC18.6080707@ebi.ac.uk> <53E2012C.9040402@gpfsug.org> <53E49E20.1090905@ebi.ac.uk> <53E4CD39.7080808@bsc.es> <4001D2D9-5E74-4EF9-908F-5B0E3443EA5B@gmail.com> Message-ID: Vic, Sergi, you can not compare Lustre and GPFS without providing a clear usecase as otherwise you compare apple with oranges. 
the reason for this is quite simple, Lustre plays well in pretty much one usecase - HPC, GPFS on the other hand is used in many forms of deployments from Storage for Virtual Machines, HPC, Scale-Out NAS, Solutions in digital media, to hosting some of the biggest, most business critical Transactional database installations in the world. you look at 2 products with completely different usability spectrum, functions and features unless as said above you narrow it down to a very specific usecase with a lot of details. even just HPC has a very large spectrum and not everybody is working in a single directory, which is the main scale point for Lustre compared to GPFS and the reason is obvious, if you have only 1 active metadata server (which is what 99% of all lustre systems run) some operations like single directory contention is simpler to make fast, but only up to the limit of your one node, but what happens when you need to go beyond that and only a real distributed architecture can support your workload ? for example look at most chip design workloads, which is a form of HPC, it is something thats extremely metadata and small file dominated, you talk about 100's of millions (in some cases even billions) of files, majority of them <4k, the rest larger files , majority of it with random access patterns that benefit from massive client side caching and distributed data coherency models supported by GPFS token manager infrastructure across 10's or 100's of metadata server and 1000's of compute nodes. you also need to look at the rich feature set GPFS provides, which not all may be important for some environments but are for others like Snapshot, Clones, Hierarchical Storage Management (ILM) , Local Cache acceleration (LROC), Global Namespace Wan Integration (AFM), Encryption, etc just to name a few. Sven ------------------------------------------ Sven Oehme Scalable Storage Research email: oehmes at us.ibm.com Phone: +1 (408) 824-8904 IBM Almaden Research Lab ------------------------------------------ From: Vic Cornell To: gpfsug main discussion list Date: 08/08/2014 10:16 AM Subject: Re: [gpfsug-discuss] GPFS and Lustre on same node Sent by: gpfsug-discuss-bounces at gpfsug.org Disclaimers - I work for DDN - we sell lustre and GPFS. I know GPFS much better than I know Lustre. The biggest difference we find between GPFS and Lustre is that GPFS - can usually achieve 90% of the bandwidth available to a single client with a single thread. Lustre needs multiple parallel streams to saturate - say an Infiniband connection. Lustre is often faster than GPFS and often has superior metadata performance - particularly where lots of files are created in a single directory. GPFS can support Windows - Lustre cannot. I think GPFS is better integrated and easier to deploy than Lustre - some people disagree with me. Regards, Vic On 8 Aug 2014, at 14:14, Sergi Mor? Codina wrote: > Hi all, > > About main differences between GPFS and Lustre, here you have some bits from our experience: > > -Reliability: GPFS its been proved to be more stable and reliable. Also offers more flexibility in terms of fail-over. It have no restriction in number of servers. As far as I know, an NSD can have as many secondary servers as you want (we are using 8). > > -Metadata: In Lustre each file system is restricted to two servers. No restriction in GPFS. > > -Updates: In GPFS you can update the whole storage cluster without stopping production, one server at a time. 
> > -Server/Client role: As Jeremy said, in GPFS every server act as a client as well. Useful for administrative tasks. > > -Troubleshooting: Problems with GPFS are easier to track down. Logs are more clear, and offers better tools than Lustre. > > -Support: No problems at all with GPFS support. It is true that it could take time to go up within all support levels, but we always got a good solution. Quite different in terms of hardware. IBM support quality has drop a lot since about last year an a half. Really slow and tedious process to get replacements. Moreover, we keep receiving bad "certified reutilitzed parts" hardware, which slow the whole process even more. > > > These are the main differences I would stand out after some years of experience with both file systems, but do not take it as a fact. > > PD: Salvatore, I would suggest you to contact Jordi Valls. He joined EBI a couple of months ago, and has experience working with both file systems here at BSC. > > Best Regards, > Sergi. > > > On 08/08/2014 01:40 PM, Jeremy Robst wrote: >> On Fri, 8 Aug 2014, Salvatore Di Nardo wrote: >> >>> Now, skipping all this GSS rant, which have nothing to do with the file >>> system anyway and going back to my question: >>> >>> Could someone point the main differences between GPFS and Lustre? >> >> I'm looking at making the same decision here - to buy GPFS or to roll >> our own Lustre configuration. I'm in the process of setting up test >> systems, and so far the main difference seems to be in the that in GPFS >> each server sees the full filesystem, and so you can run other >> applications (e.g backup) on a GPFS server whereas the Luste OSS (object >> storage servers) see only a portion of the storage (the filesystem is >> striped across the OSSes), so you need a Lustre client to mount the full >> filesystem for things like backup. >> >> However I have very little practical experience of either and would also >> be interested in any comments. >> >> Thanks >> >> Jeremy >> >> >> _______________________________________________ >> gpfsug-discuss mailing list >> gpfsug-discuss at gpfsug.org >> http://gpfsug.org/mailman/listinfo/gpfsug-discuss >> > > > -- > > ------------------------------------------------------------------------ > > Sergi More Codina > Barcelona Supercomputing Center > Centro Nacional de Supercomputacion > WWW: http://www.bsc.es Tel: +34-93-405 42 27 > e-mail: sergi.more at bsc.es Fax: +34-93-413 77 21 > > ------------------------------------------------------------------------ > > WARNING / LEGAL TEXT: This message is intended only for the use of the > individual or entity to which it is addressed and may contain > information which is privileged, confidential, proprietary, or exempt > from disclosure under applicable law. If you are not the intended > recipient or the person responsible for delivering the message to the > intended recipient, you are strictly prohibited from disclosing, > distributing, copying, or in any way using this message. If you have > received this communication in error, please notify the sender and > destroy and delete any copies you may have received. 
> > http://www.bsc.es/disclaimer.htm > > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at gpfsug.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at gpfsug.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss -------------- next part -------------- An HTML attachment was scrubbed... URL: From kraemerf at de.ibm.com Sat Aug 9 15:03:02 2014 From: kraemerf at de.ibm.com (Frank Kraemer) Date: Sat, 9 Aug 2014 16:03:02 +0200 Subject: [gpfsug-discuss] GPFS and Lustre In-Reply-To: References: Message-ID: Vic, Sergi, from my point of view for real High-End workloads the complete I/O stack needs to be fine tuned and well understood in order to provide a good system to the users. - Application(s) + I/O Lib(s) + MPI + Parallel Filesystem (e.g. GPFS) + Hardware (Networks, Servers, Disks, etc.) One of the best solutions to bring your application very efficently to work with a Parallel FS is Sionlib from FZ Juelich: Sionlib is a scalable I/O library for the parallel access to task-local files. The library not only supports writing and reading binary data to or from from several thousands of processors into a single or a small number of physical files but also provides for global open and close functions to access SIONlib file in parallel. SIONlib provides different interfaces: parallel access using MPI, OpenMp, or their combination and sequential access for post-processing utilities. http://www.fz-juelich.de/ias/jsc/EN/Expertise/Support/Software/SIONlib/_node.html http://apps.fz-juelich.de/jsc/sionlib/html/sionlib_tutorial_2013.pdf -frank- P.S. Nice blog from Nils https://www.ibm.com/developerworks/community/blogs/storageneers/entry/scale_out_backup_with_tsm_and_gss_performance_test_results?lang=en Frank Kraemer IBM Consulting IT Specialist / Client Technical Architect Hechtsheimer Str. 2, 55131 Mainz mailto:kraemerf at de.ibm.com voice: +49171-3043699 IBM Germany From ewahl at osc.edu Mon Aug 11 14:55:48 2014 From: ewahl at osc.edu (Ed Wahl) Date: Mon, 11 Aug 2014 13:55:48 +0000 Subject: [gpfsug-discuss] GPFS and Lustre In-Reply-To: References: , Message-ID: In a similar vein, IBM has an application transparent "File Cache Library" as well. I believe it IS licensed and the only requirement is that it is for use on IBM hardware only. Saw some presentations that mention it in some BioSci talks @SC13 and the numbers for a couple of selected small read applications were awesome. I probably have the contact info for it around here somewhere. In addition to the pdf/user manual. Ed Wahl Ohio Supercomputer Center ________________________________________ From: gpfsug-discuss-bounces at gpfsug.org [gpfsug-discuss-bounces at gpfsug.org] on behalf of Frank Kraemer [kraemerf at de.ibm.com] Sent: Saturday, August 09, 2014 10:03 AM To: gpfsug-discuss at gpfsug.org Subject: Re: [gpfsug-discuss] GPFS and Lustre Vic, Sergi, from my point of view for real High-End workloads the complete I/O stack needs to be fine tuned and well understood in order to provide a good system to the users. - Application(s) + I/O Lib(s) + MPI + Parallel Filesystem (e.g. GPFS) + Hardware (Networks, Servers, Disks, etc.) One of the best solutions to bring your application very efficently to work with a Parallel FS is Sionlib from FZ Juelich: Sionlib is a scalable I/O library for the parallel access to task-local files. 
The library not only supports writing and reading binary data to or from from several thousands of processors into a single or a small number of physical files but also provides for global open and close functions to access SIONlib file in parallel. SIONlib provides different interfaces: parallel access using MPI, OpenMp, or their combination and sequential access for post-processing utilities. http://www.fz-juelich.de/ias/jsc/EN/Expertise/Support/Software/SIONlib/_node.html http://apps.fz-juelich.de/jsc/sionlib/html/sionlib_tutorial_2013.pdf -frank- P.S. Nice blog from Nils https://www.ibm.com/developerworks/community/blogs/storageneers/entry/scale_out_backup_with_tsm_and_gss_performance_test_results?lang=en Frank Kraemer IBM Consulting IT Specialist / Client Technical Architect Hechtsheimer Str. 2, 55131 Mainz mailto:kraemerf at de.ibm.com voice: +49171-3043699 IBM Germany _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at gpfsug.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss From sabujp at gmail.com Tue Aug 12 23:16:22 2014 From: sabujp at gmail.com (Sabuj Pattanayek) Date: Tue, 12 Aug 2014 17:16:22 -0500 Subject: [gpfsug-discuss] reduce cnfs failover time to a few seconds Message-ID: Hi all, Is there anyway to reduce CNFS failover time to just a few seconds? Currently it seems like it's taking 5 - 10 minutes. We're using virtual ip's, i.e. interface bond1.1550:0 has one of the cnfs vips, so it should be fast, but it takes a long time and sometimes causes processes to crash due to NFS timeouts (some have 600 second soft mount timeouts). We've also noticed that it sometimes takes even longer unless the cnfs system on which we're calling mmshutdown is completely shutdown and isn't returning pings. Even 1 min seems too long. For comparison, I'm running ctdb + samba on the other NSDs and it's able to failover in a few seconds after mmshutdown completes. Thanks, Sabuj -------------- next part -------------- An HTML attachment was scrubbed... URL: From sdinardo at ebi.ac.uk Fri Aug 15 14:31:29 2014 From: sdinardo at ebi.ac.uk (Salvatore Di Nardo) Date: Fri, 15 Aug 2014 14:31:29 +0100 Subject: [gpfsug-discuss] gpfs client expels, fs hangind and waiters Message-ID: <53EE0BB1.8000005@ebi.ac.uk> Hello people, Its quite a bit of time that i'm triing to solve a problem to our GPFS system, without much luck so i think its time to ask some help. *First of a bit of introduction:** * Our GPFS system is made by 3xgss-26, In other words its made with 6x servers ( 4x10g links each) and several disk enclosures SAS attacked. The todal amount of spare its roughly 2PB, and the disks are SATA ( except few SSD dedicated to logtip ). My metadata and on dedicated vdisks, but both data and metadata vdiosks are in the same declustered arrays and recovery groups, so in the end they share the same spindles. The clients its a LSF farm configured as another cluster ( standard multiclustering configuration) of roughly 600 nodes . *The issue:** * Recently we became aware that when some massive io request has been done we experience a lot of client expells. Heres an example of our logs: Fri Aug 15 12:40:24.680 2014: Expel 10.7.28.34 (gss03a) request from 172.16.4.138 (ebi3-138 in ebi-cluster.ebi.ac.uk). Expelling: 172.16.4.138 (ebi3-138 in ebi-cluster.ebi.ac.uk) Fri Aug 15 12:40:41.652 2014: Expel 10.7.28.66 (gss02b) request from 10.7.34.38 (ebi5-037 in ebi-cluster.ebi.ac.uk). 
Expelling: 10.7.34.38 (ebi5-037 in ebi-cluster.ebi.ac.uk) Fri Aug 15 12:40:45.754 2014: Expel 10.7.28.3 (gss01b) request from 172.16.4.58 (ebi3-058 in ebi-cluster.ebi.ac.uk). Expelling: 172.16.4.58 (ebi3-058 in ebi-cluster.ebi.ac.uk) Fri Aug 15 12:40:52.305 2014: Expel 10.7.28.66 (gss02b) request from 10.7.34.68 (ebi5-067 in ebi-cluster.ebi.ac.uk). Expelling: 10.7.34.68 (ebi5-067 in ebi-cluster.ebi.ac.uk) Fri Aug 15 12:41:17.069 2014: Expel 10.7.28.35 (gss03b) request from 172.16.4.161 (ebi3-161 in ebi-cluster.ebi.ac.uk). Expelling: 172.16.4.161 (ebi3-161 in ebi-cluster.ebi.ac.uk) Fri Aug 15 12:41:23.555 2014: Expel 10.7.28.67 (gss02a) request from 172.16.4.136 (ebi3-136 in ebi-cluster.ebi.ac.uk). Expelling: 172.16.4.136 (ebi3-136 in ebi-cluster.ebi.ac.uk) Fri Aug 15 12:41:54.258 2014: Expel 10.7.28.34 (gss03a) request from 10.7.34.22 (ebi5-021 in ebi-cluster.ebi.ac.uk). Expelling: 10.7.34.22 (ebi5-021 in ebi-cluster.ebi.ac.uk) Fri Aug 15 12:41:54.540 2014: Expel 10.7.28.66 (gss02b) request from 10.7.34.57 (ebi5-056 in ebi-cluster.ebi.ac.uk). Expelling: 10.7.34.57 (ebi5-056 in ebi-cluster.ebi.ac.uk) Fri Aug 15 12:42:57.288 2014: Expel 10.7.35.5 (ebi5-132 in ebi-cluster.ebi.ac.uk) request from 10.7.28.34 (gss03a). Expelling: 10.7.35.5 (ebi5-132 in ebi-cluster.ebi.ac.uk) Fri Aug 15 12:43:24.327 2014: Expel 10.7.28.34 (gss03a) request from 10.7.37.99 (ebi5-226 in ebi-cluster.ebi.ac.uk). Expelling: 10.7.37.99 (ebi5-226 in ebi-cluster.ebi.ac.uk) Fri Aug 15 12:44:54.202 2014: Expel 10.7.28.67 (gss02a) request from 172.16.4.165 (ebi3-165 in ebi-cluster.ebi.ac.uk). Expelling: 172.16.4.165 (ebi3-165 in ebi-cluster.ebi.ac.uk) Fri Aug 15 13:15:54.450 2014: Expel 10.7.28.34 (gss03a) request from 10.7.37.89 (ebi5-216 in ebi-cluster.ebi.ac.uk). Expelling: 10.7.37.89 (ebi5-216 in ebi-cluster.ebi.ac.uk) Fri Aug 15 13:20:16.524 2014: Expel 10.7.28.3 (gss01b) request from 172.16.4.55 (ebi3-055 in ebi-cluster.ebi.ac.uk). Expelling: 172.16.4.55 (ebi3-055 in ebi-cluster.ebi.ac.uk) Fri Aug 15 13:26:54.177 2014: Expel 10.7.28.34 (gss03a) request from 10.7.34.64 (ebi5-063 in ebi-cluster.ebi.ac.uk). Expelling: 10.7.34.64 (ebi5-063 in ebi-cluster.ebi.ac.uk) Fri Aug 15 13:27:53.900 2014: Expel 10.7.28.3 (gss01b) request from 10.7.35.15 (ebi5-142 in ebi-cluster.ebi.ac.uk). Expelling: 10.7.35.15 (ebi5-142 in ebi-cluster.ebi.ac.uk) Fri Aug 15 13:28:24.297 2014: Expel 10.7.28.67 (gss02a) request from 172.16.4.50 (ebi3-050 in ebi-cluster.ebi.ac.uk). Expelling: 172.16.4.50 (ebi3-050 in ebi-cluster.ebi.ac.uk) Fri Aug 15 13:29:23.913 2014: Expel 10.7.28.3 (gss01b) request from 172.16.4.156 (ebi3-156 in ebi-cluster.ebi.ac.uk). Expelling: 172.16.4.156 (ebi3-156 in ebi-cluster.ebi.ac.uk) at the same time we experience also long waiters queue (1000+ lines). 
An example in case of massive writes ( dd ) : 0x7F522E1EEF90 waiting 1.861233182 seconds, NSDThread: on ThCond 0x7F5158019B08 (0x7F5158019B08) (MsgRecordCondvar), reason 'RPC wait' for getData on node 10.7.37.101 0x7F522E1EC9B0 waiting 1.490567470 seconds, NSDThread: on ThCond 0x7F50F4038BA8 (0x7F50F4038BA8) (MsgRecordCondvar), reason 'RPC wait' for getData on node 10.7.34.45 0x7F522E1EB6C0 waiting 1.077098046 seconds, NSDThread: on ThCond 0x7F50B40011F8 (0x7F50B40011F8) (MsgRecordCondvar), reason 'RPC wait' for getData on node 172.16.4.156 0x7F522E1EA3D0 waiting 7.714968554 seconds, NSDThread: on ThCond 0x7F50BC0078B8 (0x7F50BC0078B8) (MsgRecordCondvar), reason 'RPC wait' for getData on node 10.7.34.107 0x7F522E1E90E0 waiting 4.774379417 seconds, NSDThread: on ThCond 0x7F506801B1F8 (0x7F506801B1F8) (MsgRecordCondvar), reason 'RPC wait' for getData on node 10.7.34.23 0x7F522E1E7DF0 waiting 0.746172444 seconds, NSDThread: on ThCond 0x7F5094007D78 (0x7F5094007D78) (MsgRecordCondvar), reason 'RPC wait' for getData on node 10.7.34.84 0x7F522E1E6B00 waiting 1.553030487 seconds, NSDThread: on ThCond 0x7F51C0004C78 (0x7F51C0004C78) (MsgRecordCondvar), reason 'RPC wait' for getData on node 10.7.34.63 0x7F522E1E5810 waiting 2.165307633 seconds, NSDThread: on ThCond 0x7F5178016A08 (0x7F5178016A08) (MsgRecordCondvar), reason 'RPC wait' for getData on node 10.7.34.29 0x7F522E1E4520 waiting 1.128089273 seconds, NSDThread: on ThCond 0x7F5074004D98 (0x7F5074004D98) (MsgRecordCondvar), reason 'RPC wait' for getData on node 10.7.34.61 0x7F522E1E3230 waiting 2.515214328 seconds, NSDThread: on ThCond 0x7F51F400EF08 (0x7F51F400EF08) (MsgRecordCondvar), reason 'RPC wait' for getData on node 10.7.37.90 0x7F522E1E1F40 waiting*162.966840834* seconds, NSDThread: on ThCond 0x7F51840207A8 (0x7F51840207A8) (MsgRecordCondvar), reason 'RPC wait' for getData on node 10.7.34.97 0x7F522E1E0C50 waiting 1.140787288 seconds, NSDThread: on ThCond 0x7F51AC005C08 (0x7F51AC005C08) (MsgRecordCondvar), reason 'RPC wait' for getData on node 10.7.37.94 0x7F522E1DF960 waiting 41.907415248 seconds, NSDThread: on ThCond 0x7F5160019038 (0x7F5160019038) (MsgRecordCondvar), reason 'RPC wait' for getData on node 172.16.4.143 0x7F522E1DE670 waiting 0.466560418 seconds, NSDThread: on ThCond 0x7F513802B258 (0x7F513802B258) (MsgRecordCondvar), reason 'RPC wait' for getData on node 172.16.4.168 0x7F522E1DD380 waiting 3.102803621 seconds, NSDThread: on ThCond 0x7F516C0106C8 (0x7F516C0106C8) (MsgRecordCondvar), reason 'RPC wait' for getData on node 10.7.37.91 0x7F522E1DC090 waiting 2.751614295 seconds, NSDThread: on ThCond 0x7F504C0011F8 (0x7F504C0011F8) (MsgRecordCondvar), reason 'RPC wait' for getData on node 10.7.35.25 0x7F522E1DADA0 waiting 5.083691891 seconds, NSDThread: on ThCond 0x7F507401BE88 (0x7F507401BE88) (MsgRecordCondvar), reason 'RPC wait' for getData on node 10.7.34.61 0x7F522E1D9AB0 waiting 2.263374184 seconds, NSDThread: on ThCond 0x7F5080003B98 (0x7F5080003B98) (MsgRecordCondvar), reason 'RPC wait' for getData on node 10.7.35.36 0x7F522E1D87C0 waiting 0.206989639 seconds, NSDThread: on ThCond 0x7F505801F0D8 (0x7F505801F0D8) (MsgRecordCondvar), reason 'RPC wait' for getData on node 10.7.34.55 0x7F522E1D74D0 waiting *41.841279897* seconds, NSDThread: on ThCond 0x7F5194008B88 (0x7F5194008B88) (MsgRecordCondvar), reason 'RPC wait' for getData on node 172.16.4.143 0x7F522E1D61E0 waiting 5.618652361 seconds, NSDThread: on ThCond 0x1BAB868 (0x1BAB868) (MsgRecordCondvar), reason 'RPC wait' for getData on node 10.7.35.59 0x7F522E1D4EF0 
waiting 6.185658427 seconds, NSDThread: on ThCond 0x7F513802AAE8 (0x7F513802AAE8) (MsgRecordCondvar), reason 'RPC wait' for getData on node 10.7.35.6 0x7F522E1D3C00 waiting 2.652370892 seconds, NSDThread: on ThCond 0x7F5130004C78 (0x7F5130004C78) (MsgRecordCondvar), reason 'RPC wait' for getData on node 10.7.34.45 0x7F522E1D2910 waiting 11.396142225 seconds, NSDThread: on ThCond 0x7F51A401C0C8 (0x7F51A401C0C8) (MsgRecordCondvar), reason 'RPC wait' for getData on node 172.16.4.169 0x7F522E1D1620 waiting 63.710723043 seconds, NSDThread: on ThCond 0x7F5038004D08 (0x7F5038004D08) (MsgRecordCondvar), reason 'RPC wait' for getData on node 10.7.37.120 or for massive reads: 0x7FBCE69A8C20 waiting 29.262629530 seconds, NSDThread: on ThCond 0x7FBBF0045D40 (0x7FBBF0045D40) (VdiskLogAppendCondvar), reason 'wait for permission to append to log' 0x7FBCE699CEC0 waiting 29.260869141 seconds, NSDThread: on ThCond 0x7FBBF0045D40 (0x7FBBF0045D40) (VdiskLogAppendCondvar), reason 'wait for permission to append to log' 0x7FBCE698C5A0 waiting 29.124824888 seconds, NSDThread: on ThCond 0x7FBBF0045D40 (0x7FBBF0045D40) (VdiskLogAppendCondvar), reason 'wait for permission to append to log' 0x7FBCE6984110 waiting 22.729479654 seconds, NSDThread: on ThCond 0x7FBBF0045D40 (0x7FBBF0045D40) (VdiskLogAppendCondvar), reason 'wait for permission to append to log' 0x7FBCE69512C0 waiting 29.272805926 seconds, NSDThread: on ThCond 0x7FBBF0045D40 (0x7FBBF0045D40) (VdiskLogAppendCondvar), reason 'wait for permission to append to log' 0x7FBCE69409A0 waiting 28.833650198 seconds, NSDThread: on ThCond 0x18033B74D48 (0xFFFFC90033B74D48) (LeaseWaitCondvar), reason 'Waiting to acquire disklease' 0x7FBCE6924320 waiting 29.237067128 seconds, NSDThread: on ThCond 0x7FBBF0045D40 (0x7FBBF0045D40) (VdiskLogAppendCondvar), reason 'wait for permission to append to log' 0x7FBCE6921D40 waiting 29.237953228 seconds, NSDThread: on ThCond 0x7FBBF0045D40 (0x7FBBF0045D40) (VdiskLogAppendCondvar), reason 'wait for permission to append to log' 0x7FBCE6915FE0 waiting 29.046721161 seconds, NSDThread: on ThCond 0x18033B74D48 (0xFFFFC90033B74D48) (LeaseWaitCondvar), reason 'Waiting to acquire disklease' 0x7FBCE6913A00 waiting 29.264534710 seconds, NSDThread: on ThCond 0x7FBBF0045D40 (0x7FBBF0045D40) (VdiskLogAppendCondvar), reason 'wait for permission to append to log' 0x7FBCE6900B00 waiting 29.267691105 seconds, NSDThread: on ThCond 0x7FBBF0045D40 (0x7FBBF0045D40) (VdiskLogAppendCondvar), reason 'wait for permission to append to log' 0x7FBCE68F7380 waiting 29.266402464 seconds, NSDThread: on ThCond 0x7FBBF0045D40 (0x7FBBF0045D40) (VdiskLogAppendCondvar), reason 'wait for permission to append to log' 0x7FBCE68D2870 waiting 29.276298231 seconds, NSDThread: on ThCond 0x7FBBF0045D40 (0x7FBBF0045D40) (VdiskLogAppendCondvar), reason 'wait for permission to append to log' 0x7FBCE68BADB0 waiting 28.665700576 seconds, NSDThread: on ThCond 0x7FBBF0045D40 (0x7FBBF0045D40) (VdiskLogAppendCondvar), reason 'wait for permission to append to log' 0x7FBCE68B61F0 waiting 29.236878611 seconds, NSDThread: on ThCond 0x7FBBF0045D40 (0x7FBBF0045D40) (VdiskLogAppendCondvar), reason 'wait for permission to append to log' 0x7FBCE6885980 waiting *144*.530487248 seconds, NSDThread: on ThMutex 0x1803396A670 (0xFFFFC9003396A670) (DiskSchedulingMutex) 0x7FBCE68833A0 waiting 29.231066610 seconds, NSDThread: on ThCond 0x7FBBF0045D40 (0x7FBBF0045D40) (VdiskLogAppendCondvar), reason 'wait for permission to append to log' 0x7FBCE68820B0 waiting 29.269954514 seconds, NSDThread: on ThCond 
0x7FBBF0045D40 (0x7FBBF0045D40) (VdiskLogAppendCondvar), reason 'wait for permission to append to log' 0x7FBCE686A5F0 waiting *140*.662994256 seconds, NSDThread: on ThMutex 0x180339A3140 (0xFFFFC900339A3140) (DiskSchedulingMutex) 0x7FBCE6864740 waiting 29.254180742 seconds, NSDThread: on ThCond 0x7FBBF0045D40 (0x7FBBF0045D40) (VdiskLogAppendCondvar), reason 'wait for permission to append to log' 0x7FBCE683FC30 waiting 29.271840565 seconds, NSDThread: on ThCond 0x7FBBF0045D40 (0x7FBBF0045D40) (VdiskLogAppendCondvar), reason 'wait for permission to append to log' 0x7FBCE682E020 waiting 29.200969209 seconds, NSDThread: on ThCond 0x7FBBF0045D40 (0x7FBBF0045D40) (VdiskLogAppendCondvar), reason 'wait for permission to append to log' 0x7FBCE6825B90 waiting 19.136732919 seconds, NSDThread: on ThCond 0x7FBBF0045D40 (0x7FBBF0045D40) (VdiskLogAppendCondvar), reason 'wait for permission to append to log' 0x7FBCE6805C40 waiting 29.236055550 seconds, NSDThread: on ThCond 0x7FBBF0045D40 (0x7FBBF0045D40) (VdiskLogAppendCondvar), reason 'wait for permission to append to log' 0x7FBCE67FEAA0 waiting 29.283264161 seconds, NSDThread: on ThCond 0x7FBBF0045D40 (0x7FBBF0045D40) (VdiskLogAppendCondvar), reason 'wait for permission to append to log' 0x7FBCE67FC4C0 waiting 29.268992663 seconds, NSDThread: on ThCond 0x7FBBF0045D40 (0x7FBBF0045D40) (VdiskLogAppendCondvar), reason 'wait for permission to append to log' 0x7FBCE67DFE40 waiting 29.150900786 seconds, NSDThread: on ThCond 0x7FBBF0045D40 (0x7FBBF0045D40) (VdiskLogAppendCondvar), reason 'wait for permission to append to log' 0x7FBCE67D2DF0 waiting 29.199058463 seconds, NSDThread: on ThCond 0x7FBBF0045D40 (0x7FBBF0045D40) (VdiskLogAppendCondvar), reason 'wait for permission to append to log' 0x7FBCE67D1B00 waiting 29.203199738 seconds, NSDThread: on ThCond 0x7FBBF0045D40 (0x7FBBF0045D40) (VdiskLogAppendCondvar), reason 'wait for permission to append to log' 0x7FBCE67768D0 waiting 29.208231742 seconds, NSDThread: on ThCond 0x7FBBF0045D40 (0x7FBBF0045D40) (VdiskLogAppendCondvar), reason 'wait for permission to append to log' 0x7FBCE6768590 waiting 5.228192589 seconds, NSDThread: on ThCond 0x7FBBF0045D40 (0x7FBBF0045D40) (VdiskLogAppendCondvar), reason 'wait for permission to append to log' 0x7FBCE67672A0 waiting 29.252839376 seconds, NSDThread: on ThCond 0x7FBBF0045D40 (0x7FBBF0045D40) (VdiskLogAppendCondvar), reason 'wait for permission to append to log' 0x7FBCE6757C70 waiting 28.869359044 seconds, NSDThread: on ThCond 0x7FBBF0045D40 (0x7FBBF0045D40) (VdiskLogAppendCondvar), reason 'wait for permission to append to log' 0x7FBCE6748640 waiting 29.289284179 seconds, NSDThread: on ThCond 0x7FBBF0045D40 (0x7FBBF0045D40) (VdiskLogAppendCondvar), reason 'wait for permission to append to log' 0x7FBCE6734450 waiting 29.253591817 seconds, NSDThread: on ThCond 0x7FBBF0045D40 (0x7FBBF0045D40) (VdiskLogAppendCondvar), reason 'wait for permission to append to log' 0x7FBCE6730B80 waiting 29.289987273 seconds, NSDThread: on ThCond 0x7FBBF0045D40 (0x7FBBF0045D40) (VdiskLogAppendCondvar), reason 'wait for permission to append to log' 0x7FBCE6720260 waiting 26.597589551 seconds, NSDThread: on ThCond 0x7FBBF0045D40 (0x7FBBF0045D40) (VdiskLogAppendCondvar), reason 'wait for permission to append to log' 0x7FBCE66F32C0 waiting 29.177692849 seconds, NSDThread: on ThCond 0x7FBBF0045D40 (0x7FBBF0045D40) (VdiskLogAppendCondvar), reason 'wait for permission to append to log' 0x7FBCE66E3C90 waiting 29.160268518 seconds, NSDThread: on ThCond 0x7FBBF0045D40 (0x7FBBF0045D40) 
(VdiskLogAppendCondvar), reason 'wait for permission to append to log' 0x7FBCE66CC1D0 waiting 5.334330188 seconds, NSDThread: on ThCond 0x7FBBF0045D40 (0x7FBBF0045D40) (VdiskLogAppendCondvar), reason 'wait for permission to append to log' 0x7FBCE66B3420 waiting 34.274433161 seconds, NSDThread: on ThMutex 0x180339A3140 (0xFFFFC900339A3140) (DiskSchedulingMutex) 0x7FBCE668E910 waiting 27.699999488 seconds, NSDThread: on ThCond 0x7FBBF0045D40 (0x7FBBF0045D40) (VdiskLogAppendCondvar), reason 'wait for permission to append to log' 0x7FBCE6689D50 waiting 34.279090465 seconds, NSDThread: on ThMutex 0x180339A3140 (0xFFFFC900339A3140) (DiskSchedulingMutex) 0x7FBCE66805D0 waiting 24.688626241 seconds, NSDThread: on ThCond 0x7FBBF0045D40 (0x7FBBF0045D40) (VdiskLogAppendCondvar), reason 'wait for permission to append to log' 0x7FBCE6675B60 waiting 35.367745840 seconds, NSDThread: on ThCond 0x18033B74D48 (0xFFFFC90033B74D48) (LeaseWaitCondvar), reason 'Waiting to acquire disklease' 0x7FBCE665E0A0 waiting 29.235994598 seconds, NSDThread: on ThCond 0x7FBBF0045D40 (0x7FBBF0045D40) (VdiskLogAppendCondvar), reason 'wait for permission to append to log' 0x7FBCE663CE60 waiting 29.162911979 seconds, NSDThread: on ThCond 0x7FBBF0045D40 (0x7FBBF0045D40) (VdiskLogAppendCondvar), reason 'wait for permission to append to log' Another example with mmfsadm in case of massive reads: [root at gss02b ~]# mmfsadm dump waiters 0x7F519000AEA0 waiting 28.915010347 seconds, replyCleanupThread: on ThCond 0x7F51101B27B8 (0x7F51101B27B8) (MsgRecordCondvar), reason 'RPC wait' 0x7F511C012A10 waiting 279.522206863 seconds, Msg handler commMsgCheckMessages: on ThCond 0x7F52000095F8 (0x7F52000095F8) (InuseCondvar), reason 'waiting for exclusive use of connection for sending msg' 0x7F5120000B80 waiting 279.524782437 seconds, Msg handler commMsgCheckMessages: on ThCond 0x7F5214000EE8 (0x7F5214000EE8) (InuseCondvar), reason 'waiting for exclusive use of connection for sending msg' 0x7F5154006310 waiting 138.164386224 seconds, Msg handler commMsgCheckMessages: on ThCond 0x7F5174003F08 (0x7F5174003F08) (InuseCondvar), reason 'waiting for exclusive use of connection for sending msg' 0x7F522E1EB6C0 waiting 23.060703000 seconds, NSDThread: for poll on sock 85 0x7F522E1E6B00 waiting 0.068456104 seconds, NSDThread: on ThCond 0x7F50CC00E478 (0x7F50CC00E478) (InuseCondvar), reason 'waiting for exclusive use of connection for sending msg' 0x7F522E1D0330 waiting 17.207907857 seconds, NSDThread: on ThCond 0x7F5078001688 (0x7F5078001688) (InuseCondvar), reason 'waiting for exclusive use of connection for sending msg' 0x7F522E1BFA10 waiting 0.181011711 seconds, NSDThread: on ThCond 0x7F504000E558 (0x7F504000E558) (InuseCondvar), reason 'waiting for exclusive use of connection for sending msg' 0x7F522E1B4FA0 waiting 0.021780338 seconds, NSDThread: on ThCond 0x7F522000E488 (0x7F522000E488) (InuseCondvar), reason 'waiting for exclusive use of connection for sending msg' 0x7F522E1B3CB0 waiting 0.794718000 seconds, NSDThread: for poll on sock 799 0x7F522E186D10 waiting 0.191606803 seconds, NSDThread: on ThCond 0x7F5184015D58 (0x7F5184015D58) (InuseCondvar), reason 'waiting for exclusive use of connection for sending msg' 0x7F522E184730 waiting 0.025562000 seconds, NSDThread: for poll on sock 867 0x7F522E12CDD0 waiting 0.008921000 seconds, NSDThread: for poll on sock 543 0x7F522E126F20 waiting 1.459531000 seconds, NSDThread: for poll on sock 983 0x7F522E10F460 waiting 17.177936972 seconds, NSDThread: on ThCond 0x7F51EC002CE8 (0x7F51EC002CE8) 
(InuseCondvar), reason 'waiting for exclusive use of connection for sending msg' 0x7F522E101120 waiting 17.232580316 seconds, NSDThread: on ThCond 0x7F51BC005BB8 (0x7F51BC005BB8) (InuseCondvar), reason 'waiting for exclusive use of connection for sending msg' 0x7F522E0F1AF0 waiting 438.556030000 seconds, NSDThread: for poll on sock 496 0x7F522E0E7080 waiting 393.702839774 seconds, NSDThread: on ThCond 0x7F5164013668 (0x7F5164013668) (InuseCondvar), reason 'waiting for exclusive use of connection for sending msg' 0x7F522E09DA60 waiting 52.746984660 seconds, NSDThread: on ThCond 0x7F506C008858 (0x7F506C008858) (InuseCondvar), reason 'waiting for exclusive use of connection for sending msg' 0x7F522E084CB0 waiting 23.096688206 seconds, NSDThread: on ThCond 0x7F521C008E18 (0x7F521C008E18) (InuseCondvar), reason 'waiting for exclusive use of connection for sending msg' 0x7F522E0839C0 waiting 0.093456000 seconds, NSDThread: for poll on sock 962 0x7F522E076970 waiting 2.236659731 seconds, NSDThread: on ThCond 0x7F51E0027538 (0x7F51E0027538) (InuseCondvar), reason 'waiting for exclusive use of connection for sending msg' 0x7F522E044E10 waiting 52.752497765 seconds, NSDThread: on ThCond 0x7F513802BDD8 (0x7F513802BDD8) (InuseCondvar), reason 'waiting for exclusive use of connection for sending msg' 0x7F522E033200 waiting 16.157355796 seconds, NSDThread: on ThCond 0x7F5104240D58 (0x7F5104240D58) (InuseCondvar), reason 'waiting for exclusive use of connection for sending msg' 0x7F522E02AD70 waiting 436.025203220 seconds, NSDThread: on ThCond 0x7F50E0016C28 (0x7F50E0016C28) (InuseCondvar), reason 'waiting for exclusive use of connection for sending msg' 0x7F522E01A450 waiting 393.673252777 seconds, NSDThread: on ThCond 0x7F50A8009C18 (0x7F50A8009C18) (InuseCondvar), reason 'waiting for exclusive use of connection for sending msg' 0x7F522DFE0460 waiting 1.781358358 seconds, NSDThread: on ThCond 0x7F51E0027638 (0x7F51E0027638) (InuseCondvar), reason 'waiting for exclusive use of connection for sending msg' 0x7F522DF99420 waiting 0.038405427 seconds, NSDThread: on ThCond 0x7F50F0172B18 (0x7F50F0172B18) (InuseCondvar), reason 'waiting for exclusive use of connection for sending msg' 0x7F522DF7CDA0 waiting 438.204625355 seconds, NSDThread: on ThCond 0x7F50900023D8 (0x7F50900023D8) (InuseCondvar), reason 'waiting for exclusive use of connection for sending msg' 0x7F522DF76EF0 waiting 435.903645734 seconds, NSDThread: on ThCond 0x7F5084004BC8 (0x7F5084004BC8) (InuseCondvar), reason 'waiting for exclusive use of connection for sending msg' 0x7F522DF74910 waiting 21.749325022 seconds, NSDThread: on ThCond 0x7F507C011F48 (0x7F507C011F48) (InuseCondvar), reason 'waiting for exclusive use of connection for sending msg' 0x7F522DF71040 waiting 1.027274000 seconds, NSDThread: for poll on sock 866 0x7F522DF536D0 waiting 52.953847324 seconds, NSDThread: on ThCond 0x7F5200006FF8 (0x7F5200006FF8) (InuseCondvar), reason 'waiting for exclusive use of connection for sending msg' 0x7F522DF510F0 waiting 0.039278000 seconds, NSDThread: for poll on sock 837 0x7F522DF4EB10 waiting 0.085745937 seconds, NSDThread: on ThCond 0x7F51F0006828 (0x7F51F0006828) (InuseCondvar), reason 'waiting for exclusive use of connection for sending msg' 0x7F522DF4C530 waiting 21.850733000 seconds, NSDThread: for poll on sock 986 0x7F522DF4B240 waiting 0.054739884 seconds, NSDThread: on ThCond 0x7F51EC0168D8 (0x7F51EC0168D8) (InuseCondvar), reason 'waiting for exclusive use of connection for sending msg' 0x7F522DF48C60 waiting 0.186409714 seconds, 
NSDThread: on ThCond 0x7F51E4000908 (0x7F51E4000908) (InuseCondvar), reason 'waiting for exclusive use of connection for sending msg' 0x7F522DF41AC0 waiting 438.942861290 seconds, NSDThread: on ThCond 0x7F51CC010168 (0x7F51CC010168) (InuseCondvar), reason 'waiting for exclusive use of connection for sending msg' 0x7F522DF3F4E0 waiting 0.060235106 seconds, NSDThread: on ThCond 0x7F51C400A438 (0x7F51C400A438) (InuseCondvar), reason 'waiting for exclusive use of connection for sending msg' 0x7F522DF22E60 waiting 0.361288000 seconds, NSDThread: for poll on sock 518 0x7F522DF21B70 waiting 0.060722464 seconds, NSDThread: on ThCond 0x7F51580162D8 (0x7F51580162D8) (InuseCondvar), reason 'waiting for exclusive use of connection for sending msg' 0x7F522DF12540 waiting 23.077564448 seconds, NSDThread: on ThCond 0x7F512C13E1E8 (0x7F512C13E1E8) (InuseCondvar), reason 'waiting for exclusive use of connection for sending msg' 0x7F522DEFD060 waiting 0.723370000 seconds, NSDThread: for poll on sock 503 0x7F522DEE09E0 waiting 1.565799175 seconds, NSDThread: on ThCond 0x7F5084004D58 (0x7F5084004D58) (InuseCondvar), reason 'waiting for exclusive use of connection for sending msg' 0x7F522DEDF6F0 waiting 22.063017342 seconds, NSDThread: on ThCond 0x7F5078003E08 (0x7F5078003E08) (InuseCondvar), reason 'waiting for exclusive use of connection for sending msg' 0x7F522DEDD110 waiting 0.049108780 seconds, NSDThread: on ThCond 0x7F5070001D78 (0x7F5070001D78) (InuseCondvar), reason 'waiting for exclusive use of connection for sending msg' 0x7F522DEDAB30 waiting 229.603224376 seconds, NSDThread: on ThCond 0x7F50680221B8 (0x7F50680221B8) (InuseCondvar), reason 'waiting for exclusive use of connection for sending msg' 0x7F522DED7260 waiting 0.071855457 seconds, NSDThread: on ThCond 0x7F506400A5A8 (0x7F506400A5A8) (InuseCondvar), reason 'waiting for exclusive use of connection for sending msg' 0x7F522DED5F70 waiting 0.648324000 seconds, NSDThread: for poll on sock 766 0x7F522DEC3070 waiting 1.809205756 seconds, NSDThread: on ThCond 0x7F522000E518 (0x7F522000E518) (InuseCondvar), reason 'waiting for exclusive use of connection for sending msg' 0x7F522DEB1460 waiting 436.017396645 seconds, NSDThread: on ThCond 0x7F51E4000978 (0x7F51E4000978) (InuseCondvar), reason 'waiting for exclusive use of connection for sending msg' 0x7F522DEAC8A0 waiting 393.734102000 seconds, NSDThread: for poll on sock 609 0x7F522DEA3120 waiting 17.960778837 seconds, NSDThread: on ThCond 0x7F51B4001708 (0x7F51B4001708) (InuseCondvar), reason 'waiting for exclusive use of connection for sending msg' 0x7F522DE86AA0 waiting 23.112060045 seconds, NSDThread: on ThCond 0x7F5154096118 (0x7F5154096118) (InuseCondvar), reason 'waiting for exclusive use of connection for sending msg' 0x7F522DE64570 waiting 0.076167410 seconds, NSDThread: on ThCond 0x7F50D8005EF8 (0x7F50D8005EF8) (InuseCondvar), reason 'waiting for exclusive use of connection for sending msg' 0x7F522DE1AF50 waiting 17.460836000 seconds, NSDThread: for poll on sock 737 0x7F522DE104E0 waiting 0.205037000 seconds, NSDThread: for poll on sock 865 0x7F522DDB8B80 waiting 0.106192000 seconds, NSDThread: for poll on sock 78 0x7F522DDA36A0 waiting 0.738921180 seconds, NSDThread: on ThCond 0x7F505400E048 (0x7F505400E048) (InuseCondvar), reason 'waiting for exclusive use of connection for sending msg' 0x7F522DD9C500 waiting 0.731118367 seconds, NSDThread: on ThCond 0x7F503C00B518 (0x7F503C00B518) (InuseCondvar), reason 'waiting for exclusive use of connection for sending msg' 0x7F522DD89600 waiting 
229.609363000 seconds, NSDThread: for poll on sock 515 0x7F522DD567B0 waiting 1.508489195 seconds, NSDThread: on ThCond 0x7F514C021F88 (0x7F514C021F88) (InuseCondvar), reason 'waiting for exclusive use of connection for sending msg'

Another thing worth mentioning is that the filesystem is totally unresponsive. Even a simple "cd" into a directory or an "ls" of a directory hangs for several minutes (literally). This also happens if I try it from the NSD servers themselves.

*A few things I have looked into:*

* Our network seems fine. There might be a bottleneck on parts of it, and that could explain the waiters, but it doesn't explain why at some point the clients ask to expel the NSD servers, nor why the FS is slow even on the NSD servers themselves.

* Disk bottleneck? I don't think so. The NSD servers' CPU usage (and I/O wait) is very low, and mmdiag --iohist also seems to confirm that the operations on the disks are reasonably fast:

=== mmdiag: iohist ===

I/O history:

I/O start time RW Buf type disk:sectorNum nSec time ms Type Device/NSD ID NSD server
--------------- -- ----------- ----------------- ----- ------- ---- ------------------ ---------------
13:54:29.209276 W data 34:5066338808 2056 88.307 lcl sdtu
13:54:29.209277 W data 55:5095698936 2056 27.592 lcl sdaab
13:54:29.209278 W data 171:5104087544 2056 22.801 lcl sdtg
13:54:29.209279 W data 116:5011812856 2056 65.983 lcl sdqr
13:54:29.209280 W data 98:4860817912 2056 17.892 lcl sddl
13:54:29.209281 W data 159:4999229944 2056 21.324 lcl sdjg
13:54:29.209282 W data 84:5049561592 2056 31.932 lcl sdqz
13:54:29.209283 W data 8:5003424248 2056 30.912 lcl sdcw
13:54:29.209284 W data 23:4965675512 2056 27.366 lcl sdpt
13:54:29.297715 W vdiskMDLog 2:144008496 1 0.236 lcl sdkr
13:54:29.297717 W vdiskMDLog 0:331703600 1 0.230 lcl sdcm
13:54:29.297718 W vdiskMDLog 1:273769776 1 0.241 lcl sdbp
13:54:29.244902 W data 51:3857589752 2056 35.566 lcl sdyi
13:54:29.244904 W data 10:3773703672 2056 28.512 lcl sdma
13:54:29.244905 W data 48:3639485944 2056 24.124 lcl sdel
13:54:29.244906 W data 25:3777897976 2056 18.691 lcl sdgt
13:54:29.244908 W data 91:3832423928 2056 20.699 lcl sdlc
13:54:29.244909 W data 115:3723372024 2056 30.783 lcl sdho
13:54:29.244910 W data 173:3882755576 2056 53.241 lcl sdti
13:54:29.244911 W data 42:3782092280 2056 22.785 lcl sddz
13:54:29.244912 W data 45:3647874552 2056 24.289 lcl sdei
13:54:29.244913 W data 32:3652068856 2056 17.220 lcl sdbn
13:54:29.244914 W data 39:3677234680 2056 26.017 lcl sddw
13:54:29.298273 W vdiskMDLog 2:144008497 1 2.522 lcl sduf
13:54:29.298274 W vdiskMDLog 0:331703601 1 1.025 lcl sdlo
13:54:29.298275 W vdiskMDLog 1:273769777 1 2.586 lcl sdtt
13:54:29.288275 W data 27:2249588200 2056 20.071 lcl sdhb
13:54:29.288279 W data 33:2224422376 2056 19.682 lcl sdts
13:54:29.288281 W data 47:2115370472 2056 21.667 lcl sdwo
13:54:29.288282 W data 82:2316697064 2056 21.524 lcl sdxy
13:54:29.288283 W data 85:2232810984 2056 17.467 lcl sdra
13:54:29.288285 W data 30:2127953384 2056 18.475 lcl sdqg
13:54:29.288286 W data 67:1876295144 2056 16.383 lcl sdmx
13:54:29.288287 W data 64:2127953384 2056 21.908 lcl sduh
13:54:29.288288 W data 38:2253782504 2056 19.775 lcl sddv
13:54:29.288290 W data 15:2207645160 2056 20.599 lcl sdet
13:54:29.288291 W data 157:2283142632 2056 21.198 lcl sdiy

* Bonding problem on the interfaces? The Mellanox (interface card vendor) drivers and firmware have been updated, and we even tested the system with a single link (without bonding).
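(A minimal sketch of the kind of periodic waiter sampling described in this thread, so that the samples can later be lined up with mmfs.log timestamps, might look like the following. It is only an illustration: the node names, the output directory and the 5-second interval are assumptions, not details taken from this setup; mmdiag --waiters and the /usr/lpp/mmfs/bin path are the standard GPFS ones.)

#!/bin/bash
# Illustrative only: poll GPFS waiters on a set of nodes and timestamp
# each sample so the buildup before an expel can be reconstructed.
NODES="gss01a gss01b gss02a gss02b gss03a gss03b"   # assumed NSD server names
OUTDIR=/tmp/gpfs-waiters                            # assumed output directory
mkdir -p "$OUTDIR"
while true; do
    ts=$(date +%Y%m%d-%H%M%S)
    for n in $NODES; do
        # mmdiag --waiters prints the current waiters on that node
        ssh -o ConnectTimeout=5 "$n" /usr/lpp/mmfs/bin/mmdiag --waiters \
            > "$OUTDIR/$n.$ts.waiters" 2>&1 &
    done
    wait
    sleep 5
done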
Could someone help me with this? In particular:

* What exactly do clients check to decide that another node is unresponsive? Ping? I don't think so, because both the NSD servers and the clients can be pinged, so what do they look at? If someone can also tell me which port they use, I can try to tcpdump exactly what is causing these expels.

* How can I monitor metadata operations to understand where EXACTLY the bottleneck is that causes this:

[sdinardo at ebi5-001 ~]$ time ls /gpfs/nobackup/sdinardo 1 ebi3-054.ebi.ac.uk ebi3-154 ebi5-019.ebi.ac.uk ebi5-052 ebi5-101 ebi5-156 ebi5-197 ebi5-228 ebi5-262.ebi.ac.uk 10 ebi3-055 ebi3-155 ebi5-021.ebi.ac.uk ebi5-053 ebi5-104.ebi.ac.uk ebi5-160.ebi.ac.uk ebi5-198 ebi5-229 ebi5-263 2 ebi3-056.ebi.ac.uk ebi3-156 ebi5-022 ebi5-054.ebi.ac.uk ebi5-106 ebi5-161 ebi5-200 ebi5-230.ebi.ac.uk ebi5-264 3 ebi3-057 ebi3-157 ebi5-023 ebi5-056 ebi5-109 ebi5-162.ebi.ac.uk ebi5-201 ebi5-231.ebi.ac.uk ebi5-265 4 ebi3-058 ebi3-158.ebi.ac.uk ebi5-024.ebi.ac.uk ebi5-057 ebi5-110.ebi.ac.uk ebi5-163.ebi.ac.uk ebi5-202.ebi.ac.uk ebi5-232 ebi5-266.ebi.ac.uk 5 ebi3-059.ebi.ac.uk ebi3-160 ebi5-025 ebi5-060 ebi5-111.ebi.ac.uk ebi5-164 ebi5-204 ebi5-233 ebi5-267 6 ebi3-132 ebi3-161.ebi.ac.uk ebi5-026 ebi5-061.ebi.ac.uk ebi5-112.ebi.ac.uk ebi5-165 ebi5-205 ebi5-234 ebi5-269.ebi.ac.uk 7 ebi3-133 ebi3-163.ebi.ac.uk ebi5-028 ebi5-062.ebi.ac.uk ebi5-129.ebi.ac.uk ebi5-166 ebi5-206.ebi.ac.uk ebi5-236 ebi5-270 8 ebi3-134 ebi3-165 ebi5-030 ebi5-064 ebi5-131.ebi.ac.uk ebi5-169.ebi.ac.uk ebi5-207 ebi5-237 ebi5-271 9 ebi3-135 ebi3-166.ebi.ac.uk ebi5-031 ebi5-065 ebi5-132 ebi5-170.ebi.ac.uk ebi5-209 ebi5-239.ebi.ac.uk launcher.sh

_*real 21m14.948s*_ ( WTH ?!?!?!)
user 0m0.004s
sys 0m0.014s

I know these questions are not easy to answer, and I need to dig more, but it would be very helpful if someone could give me some hints about where to look. My GPFS skills are limited, since this is our first system and it has only been in production for a few months, and things started to worsen only recently. In the past we could get over 200Gb/s (both read and write) without any issue. Now some clients get expelled even when data throughput is at 4-5Gb/s.

Thanks in advance for any help.

Regards,
Salvatore

-------------- next part --------------
An HTML attachment was scrubbed...
URL:
From mail at arif-ali.co.uk Tue Aug 19 11:18:10 2014
From: mail at arif-ali.co.uk (Arif Ali)
Date: Tue, 19 Aug 2014 11:18:10 +0100
Subject: [gpfsug-discuss] gpfsug Maintenance
Message-ID:

Hi all,

You may be aware that the website has been down for about a week now. Due to the amount of traffic to the website and the number of people on the mailing list, we had seen a few issues on the system.

In order to counter those issues, and for ease of management, we are moving to a new system. We are hoping to do this tonight (between 20:00 and 23:00 BST). If this causes an issue for anyone, then please let me know.

I will, as part of the move over, be sending a few test mails to make sure that the mailing list is working correctly.

Thanks for your patience

--
Arif Ali
gpfsug Admin

IRC: arif-ali at freenode
LinkedIn: http://uk.linkedin.com/in/arifali

-------------- next part --------------
An HTML attachment was scrubbed...
URL:
From sdinardo at ebi.ac.uk Tue Aug 19 12:11:00 2014
From: sdinardo at ebi.ac.uk (Salvatore Di Nardo)
Date: Tue, 19 Aug 2014 12:11:00 +0100
Subject: [gpfsug-discuss] gpfs client expels
In-Reply-To: <53EE0BB1.8000005@ebi.ac.uk>
References: <53EE0BB1.8000005@ebi.ac.uk>
Message-ID: <53F330C4.808@ebi.ac.uk>

Still problems.
Here some more detailed examples: *EXAMPLE 1:* *EBI5-220**( CLIENT)** *Tue Aug 19 11:03:04.980 2014: *Timed out waiting for a reply from node gss02b* Tue Aug 19 11:03:04.981 2014: Request sent to (gss02a in GSS.ebi.ac.uk) to expel (gss02b in GSS.ebi.ac.uk) from cluster GSS.ebi.ac.uk Tue Aug 19 11:03:04.982 2014: This node will be expelled from cluster GSS.ebi.ac.uk due to expel msg from (ebi5-220) Tue Aug 19 11:03:09.319 2014: Cluster Manager connection broke. Probing cluster GSS.ebi.ac.uk Tue Aug 19 11:03:10.321 2014: Unable to contact any quorum nodes during cluster probe. Tue Aug 19 11:03:10.322 2014: Lost membership in cluster GSS.ebi.ac.uk. Unmounting file systems. Tue Aug 19 11:03:10 BST 2014: mmcommon preunmount invoked. File system: gpfs1 Reason: SGPanic Tue Aug 19 11:03:12.066 2014: Connecting to gss02a Tue Aug 19 11:03:12.070 2014: Connected to gss02a Tue Aug 19 11:03:17.071 2014: Connecting to gss02b Tue Aug 19 11:03:17.072 2014: Connecting to gss03b Tue Aug 19 11:03:17.079 2014: Connecting to gss03a Tue Aug 19 11:03:17.080 2014: Connecting to gss01b Tue Aug 19 11:03:17.079 2014: Connecting to gss01a Tue Aug 19 11:04:23.105 2014: Connected to gss02b Tue Aug 19 11:04:23.107 2014: Connected to gss03b Tue Aug 19 11:04:23.112 2014: Connected to gss03a Tue Aug 19 11:04:23.115 2014: Connected to gss01b Tue Aug 19 11:04:23.121 2014: Connected to gss01a Tue Aug 19 11:12:28.992 2014: Node (gss02a in GSS.ebi.ac.uk) is now the Group Leader. *GSS02B ( NSD SERVER)* ... Tue Aug 19 11:03:17.070 2014: Killing connection from ** because the group is not ready for it to rejoin, err 46 Tue Aug 19 11:03:25.016 2014: Killing connection from because the group is not ready for it to rejoin, err 46 Tue Aug 19 11:03:28.080 2014: Killing connection from ** because the group is not ready for it to rejoin, err 46 Tue Aug 19 11:03:36.019 2014: Killing connection from because the group is not ready for it to rejoin, err 46 Tue Aug 19 11:03:39.083 2014: Killing connection from ** because the group is not ready for it to rejoin, err 46 Tue Aug 19 11:03:47.023 2014: Killing connection from because the group is not ready for it to rejoin, err 46 Tue Aug 19 11:03:50.088 2014: Killing connection from ** because the group is not ready for it to rejoin, err 46 Tue Aug 19 11:03:52.218 2014: Killing connection from because the group is not ready for it to rejoin, err 46 Tue Aug 19 11:03:58.030 2014: Killing connection from because the group is not ready for it to rejoin, err 46 Tue Aug 19 11:04:01.092 2014: Killing connection from ** because the group is not ready for it to rejoin, err 46 Tue Aug 19 11:04:03.220 2014: Killing connection from because the group is not ready for it to rejoin, err 46 Tue Aug 19 11:04:09.034 2014: Killing connection from because the group is not ready for it to rejoin, err 46 Tue Aug 19 11:04:12.096 2014: Killing connection from ** because the group is not ready for it to rejoin, err 46 Tue Aug 19 11:04:14.224 2014: Killing connection from because the group is not ready for it to rejoin, err 46 Tue Aug 19 11:04:20.037 2014: Killing connection from because the group is not ready for it to rejoin, err 46 Tue Aug 19 11:04:23.103 2014: Accepted and connected to ** ebi5-220 ... *GSS02a ( NSD SERVER)* Tue Aug 19 11:03:04.980 2014: Expel (gss02b) request from (ebi5-220 in ebi-cluster.ebi.ac.uk). 
Expelling: (ebi5-220 in ebi-cluster.ebi.ac.uk) Tue Aug 19 11:03:12.069 2014: Accepted and connected to ebi5-220 =============================================== *EXAMPLE 2*: *EBI5-038* Tue Aug 19 11:32:34.227 2014: *Disk lease period expired in cluster GSS.ebi.ac.uk. Attempting to reacquire lease.* Tue Aug 19 11:33:34.258 2014: *Lease is overdue. Probing cluster GSS.ebi.ac.uk* Tue Aug 19 11:35:24.265 2014: Close connection to gss02a (Connection reset by peer). Attempting reconnect. Tue Aug 19 11:35:24.865 2014: Close connection to ebi5-014 (Connection reset by peer). Attempting reconnect. ... LOT MORE RESETS BY PEER ... Tue Aug 19 11:35:25.096 2014: Close connection to ebi5-167 (Connection reset by peer). Attempting reconnect. Tue Aug 19 11:35:25.267 2014: Connecting to gss02a Tue Aug 19 11:35:25.268 2014: Close connection to gss02a (Connection failed because destination is still processing previous node failure) Tue Aug 19 11:35:26.267 2014: Retry connection to gss02a Tue Aug 19 11:35:26.268 2014: Close connection to gss02a (Connection failed because destination is still processing previous node failure) Tue Aug 19 11:36:24.276 2014: Unable to contact any quorum nodes during cluster probe. Tue Aug 19 11:36:24.277 2014: *Lost membership in cluster GSS.ebi.ac.uk. Unmounting file systems.* *GSS02a* Tue Aug 19 11:35:24.263 2014: Node (ebi5-038 in ebi-cluster.ebi.ac.uk) *is being expelled because of an expired lease.* Pings sent: 60. Replies received: 60.

In example 1 it seems that an NSD server was not replying to the client, but the servers seem to be working fine. How can I trace this better (to solve the problem)?

In example 2 it seems that, for some reason, the managers are not renewing the lease in time. When this happens it is not a single client: loads of them fail to get their lease renewed. Why is this happening, and how can I trace it back to the source of the problem?

Thanks in advance for any tips.

Regards,
Salvatore

-------------- next part --------------
An HTML attachment was scrubbed...
URL: From mail at arif-ali.co.uk Tue Aug 19 23:41:48 2014 From: mail at arif-ali.co.uk (Arif Ali) Date: Tue, 19 Aug 2014 23:41:48 +0100 Subject: [gpfsug-discuss] gpfsug Maintenance In-Reply-To: References: Message-ID: Thanks for all your patience, The service should all be back up again -- Arif Ali IRC: arif-ali at freenode LinkedIn: http://uk.linkedin.com/in/arifali On 19 August 2014 20:59, Arif Ali wrote: > This is a test mail to the mailing list > > please do not reply > > -- > Arif Ali > > IRC: arif-ali at freenode > LinkedIn: http://uk.linkedin.com/in/arifali > > > On 19 August 2014 11:18, Arif Ali wrote: > >> Hi all, >> >> You may be aware that the website has been down for about a week now. >> This is due to the amount of traffic to the website and the amount of >> people on the mailing list, we had seen a few issues on the system. >> >> In order to counter the issues, we are moving to a new system to counter >> any future issues, and ease of management. We are hoping to do this tonight >> ( between 20:00 - 23:00 BST). If this causes an issue for anyone, then >> please let me know. >> >> I will, as part of the move over, will be sending a few test mails to >> make sure that mailing list is working correctly. >> >> Thanks for your patience >> >> -- >> Arif Ali >> gpfsug Admin >> >> IRC: arif-ali at freenode >> LinkedIn: http://uk.linkedin.com/in/arifali >> > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From sdinardo at ebi.ac.uk Wed Aug 20 08:57:23 2014 From: sdinardo at ebi.ac.uk (Salvatore Di Nardo) Date: Wed, 20 Aug 2014 08:57:23 +0100 Subject: [gpfsug-discuss] gpfs client expels In-Reply-To: <53EE0BB1.8000005@ebi.ac.uk> References: <53EE0BB1.8000005@ebi.ac.uk> Message-ID: <53F454E3.40803@ebi.ac.uk> Still problems. Here some more detailed examples: *EXAMPLE 1:* *EBI5-220**( CLIENT)** *Tue Aug 19 11:03:04.980 2014: *Timed out waiting for a reply from node gss02b* Tue Aug 19 11:03:04.981 2014: Request sent to (gss02a in GSS.ebi.ac.uk) to expel (gss02b in GSS.ebi.ac.uk) from cluster GSS.ebi.ac.uk Tue Aug 19 11:03:04.982 2014: This node will be expelled from cluster GSS.ebi.ac.uk due to expel msg from (ebi5-220) Tue Aug 19 11:03:09.319 2014: Cluster Manager connection broke. Probing cluster GSS.ebi.ac.uk Tue Aug 19 11:03:10.321 2014: Unable to contact any quorum nodes during cluster probe. Tue Aug 19 11:03:10.322 2014: Lost membership in cluster GSS.ebi.ac.uk. Unmounting file systems. Tue Aug 19 11:03:10 BST 2014: mmcommon preunmount invoked. File system: gpfs1 Reason: SGPanic Tue Aug 19 11:03:12.066 2014: Connecting to gss02a Tue Aug 19 11:03:12.070 2014: Connected to gss02a Tue Aug 19 11:03:17.071 2014: Connecting to gss02b Tue Aug 19 11:03:17.072 2014: Connecting to gss03b Tue Aug 19 11:03:17.079 2014: Connecting to gss03a Tue Aug 19 11:03:17.080 2014: Connecting to gss01b Tue Aug 19 11:03:17.079 2014: Connecting to gss01a Tue Aug 19 11:04:23.105 2014: Connected to gss02b Tue Aug 19 11:04:23.107 2014: Connected to gss03b Tue Aug 19 11:04:23.112 2014: Connected to gss03a Tue Aug 19 11:04:23.115 2014: Connected to gss01b Tue Aug 19 11:04:23.121 2014: Connected to gss01a Tue Aug 19 11:12:28.992 2014: Node (gss02a in GSS.ebi.ac.uk) is now the Group Leader. *GSS02B ( NSD SERVER)* ... 
Tue Aug 19 11:03:17.070 2014: Killing connection from ** because the group is not ready for it to rejoin, err 46 Tue Aug 19 11:03:25.016 2014: Killing connection from because the group is not ready for it to rejoin, err 46 Tue Aug 19 11:03:28.080 2014: Killing connection from ** because the group is not ready for it to rejoin, err 46 Tue Aug 19 11:03:36.019 2014: Killing connection from because the group is not ready for it to rejoin, err 46 Tue Aug 19 11:03:39.083 2014: Killing connection from ** because the group is not ready for it to rejoin, err 46 Tue Aug 19 11:03:47.023 2014: Killing connection from because the group is not ready for it to rejoin, err 46 Tue Aug 19 11:03:50.088 2014: Killing connection from ** because the group is not ready for it to rejoin, err 46 Tue Aug 19 11:03:52.218 2014: Killing connection from because the group is not ready for it to rejoin, err 46 Tue Aug 19 11:03:58.030 2014: Killing connection from because the group is not ready for it to rejoin, err 46 Tue Aug 19 11:04:01.092 2014: Killing connection from ** because the group is not ready for it to rejoin, err 46 Tue Aug 19 11:04:03.220 2014: Killing connection from because the group is not ready for it to rejoin, err 46 Tue Aug 19 11:04:09.034 2014: Killing connection from because the group is not ready for it to rejoin, err 46 Tue Aug 19 11:04:12.096 2014: Killing connection from ** because the group is not ready for it to rejoin, err 46 Tue Aug 19 11:04:14.224 2014: Killing connection from because the group is not ready for it to rejoin, err 46 Tue Aug 19 11:04:20.037 2014: Killing connection from because the group is not ready for it to rejoin, err 46 Tue Aug 19 11:04:23.103 2014: Accepted and connected to ** ebi5-220 ... *GSS02a ( NSD SERVER)* Tue Aug 19 11:03:04.980 2014: Expel (gss02b) request from (ebi5-220 in ebi-cluster.ebi.ac.uk). Expelling: (ebi5-220 in ebi-cluster.ebi.ac.uk) Tue Aug 19 11:03:12.069 2014: Accepted and connected to ebi5-220 =============================================== *EXAMPLE 2*: *EBI5-038* Tue Aug 19 11:32:34.227 2014: *Disk lease period expired in cluster GSS.ebi.ac.uk. Attempting to reacquire lease.* Tue Aug 19 11:33:34.258 2014: *Lease is overdue. Probing cluster GSS.ebi.ac.uk* Tue Aug 19 11:35:24.265 2014: Close connection to gss02a (Connection reset by peer). Attempting reconnect. Tue Aug 19 11:35:24.865 2014: Close connection to ebi5-014 (Connection reset by peer). Attempting reconnect. ... LOT MORE RESETS BY PEER ... Tue Aug 19 11:35:25.096 2014: Close connection to ebi5-167 (Connection reset by peer). Attempting reconnect. Tue Aug 19 11:35:25.267 2014: Connecting to gss02a Tue Aug 19 11:35:25.268 2014: Close connection to gss02a (Connection failed because destination is still processing previous node failure) Tue Aug 19 11:35:26.267 2014: Retry connection to gss02a Tue Aug 19 11:35:26.268 2014: Close connection to gss02a (Connection failed because destination is still processing previous node failure) Tue Aug 19 11:36:24.276 2014: Unable to contact any quorum nodes during cluster probe. Tue Aug 19 11:36:24.277 2014: *Lost membership in cluster GSS.ebi.ac.uk. Unmounting file systems.* *GSS02a* Tue Aug 19 11:35:24.263 2014: Node (ebi5-038 in ebi-cluster.ebi.ac.uk) *is being expelled because of an expired lease.* Pings sent: 60. Replies received: 60. In example 1 seems that an NSD was not repliyng to the client, but the servers seems working fine.. how can i trace better ( to solve) the problem? 
In example 2 it seems that, for some reason, the managers are not renewing the lease in time. When this happens it is not a single client: loads of them fail to get their lease renewed. Why is this happening, and how can I trace it back to the source of the problem?

Thanks in advance for any tips.

Regards,
Salvatore

-------------- next part --------------
An HTML attachment was scrubbed...
URL:
From sdinardo at ebi.ac.uk Wed Aug 20 09:03:03 2014
From: sdinardo at ebi.ac.uk (Salvatore Di Nardo)
Date: Wed, 20 Aug 2014 09:03:03 +0100
Subject: [gpfsug-discuss] gpfs client expels
In-Reply-To: <53F454E3.40803@ebi.ac.uk>
References: <53EE0BB1.8000005@ebi.ac.uk> <53F454E3.40803@ebi.ac.uk>
Message-ID: <53F45637.8080000@ebi.ac.uk>

Another interesting case, about one specific waiter: I was looking at the waiters on GSS until I found these (I got this information by collecting the waiters from all the servers with a script I wrote, so I was able to trace the hanging connections while they were happening):

gss03b.ebi.ac.uk:*235.373993397*(MsgRecordCondvar), reason 'RPC wait' for getData on node 10.7.37.109 gss03b.ebi.ac.uk:*235.152271998*(MsgRecordCondvar), reason 'RPC wait' for getData on node 10.7.37.109 gss02a.ebi.ac.uk:*214.079093620 *(MsgRecordCondvar), reason 'RPC wait' for tmMsgRevoke on node 10.7.34.109 gss02a.ebi.ac.uk:*213.580199240 *(MsgRecordCondvar), reason 'RPC wait' for tmMsgRevoke on node 10.7.37.109 gss03b.ebi.ac.uk:*132.375138082*(MsgRecordCondvar), reason 'RPC wait' for getData on node 10.7.37.109 gss03b.ebi.ac.uk:*132.374973884 *(MsgRecordCondvar), reason 'RPC wait' for commMsgCheckMessages on node 10.7.37.109

The bolded numbers are seconds. Looking at this page: https://www.ibm.com/developerworks/community/wikis/home?lang=en#!/wiki/General+Parallel+File+System+%28GPFS%29/page/Interpreting+GPFS+Waiter+Information the page claims that this probably indicates network congestion, but I managed to log in to the client quickly enough, and there the waiters were:

[root at ebi5-236 ~]# mmdiag --waiters === mmdiag: waiters === 0x7F6690073460 waiting 147.973009173 seconds, RangeRevokeWorkerThread: on ThCond 0x1801E43F6A0 (0xFFFFC9001E43F6A0) (LkObjCondvar), reason 'waiting for LX lock' 0x7F65100036D0 waiting 140.458589856 seconds, WritebehindWorkerThread: on ThCond 0x7F6500000F98 (0x7F6500000F98) (MsgRecordCondvar), reason 'RPC wait' for NSD I/O completion on node 10.7.28.35 0x7F63A0001080 waiting 245.153055801 seconds, WritebehindWorkerThread: on ThCond 0x7F65D8002CF8 (0x7F65D8002CF8) (MsgRecordCondvar), reason 'RPC wait' for NSD I/O completion on node 10.7.28.35 0x7F674C03D3D0 waiting 245.750977203 seconds, CleanBufferThread: on ThCond 0x7F64880079E8 (0x7F64880079E8) (LogFileBufferDescriptorCondvar), reason 'force wait for buffer write to complete' 0x7F674802E360 waiting 244.159861966 seconds, WritebehindWorkerThread: on ThCond 0x7F65E0002358 (0x7F65E0002358) (MsgRecordCondvar), reason 'RPC wait' for NSD I/O completion on node 10.7.28.35 0x7F674C038810 waiting 251.086748430 seconds, SGExceptionLogBufferFullThread: on ThCond 0x7F64EC001398 (0x7F64EC001398) (MsgRecordCondvar), reason 'RPC wait' for I/O completion on node 10.7.28.35 0x7F674C036230 waiting 139.556735095 seconds, CleanBufferThread: on ThCond 0x7F65CC004C78 (0x7F65CC004C78) (MsgRecordCondvar), reason 'RPC wait' for NSD I/O completion on node 10.7.28.35 0x7F674C031670 waiting 144.327593052 seconds, WritebehindWorkerThread: on ThCond 0x7F672402D1A8 (0x7F672402D1A8) (MsgRecordCondvar), reason 'RPC wait' for NSD I/O completion on node 10.7.28.35 0x7F674C02A4D0 waiting 145.202712821 seconds,
WritebehindWorkerThread: on ThCond 0x7F65440018F8 (0x7F65440018F8) (MsgRecordCondvar), reason 'RPC wait' for NSD I/O completion on node 10.7.28.35 0x7F674C0291E0 waiting 247.131569232 seconds, PrefetchWorkerThread: on ThCond 0x7F65740016C8 (0x7F65740016C8) (MsgRecordCondvar), reason 'RPC wait' for NSD I/O completion on node 10.7.28.35 0x7F6748025BD0 waiting 11.631381523 seconds, replyCleanupThread: on ThCond 0x7F65E000A1F8 (0x7F65E000A1F8) (MsgRecordCondvar), reason 'RPC wait' 0x7F6748022300 waiting 245.616267612 seconds, WritebehindWorkerThread: on ThCond 0x7F6470001468 (0x7F6470001468) (MsgRecordCondvar), reason 'RPC wait' for NSD I/O completion on node 10.7.28.35 0x7F6748021010 waiting 230.769670930 seconds, InodeAllocRevokeWorkerThread: on ThCond 0x7F64880079E8 (0x7F64880079E8) (LogFileBufferDescriptorCondvar), reason 'force wait for buffer write to complete' 0x7F674801B160 waiting 245.830554594 seconds, UnusedInodePrefetchThread: on ThCond 0x7F65B8004438 (0x7F65B8004438) (MsgRecordCondvar), reason 'RPC wait' for NSD I/O completion on node 10.7.28.35 0x7F674800A820 waiting 252.332932000 seconds, Msg handler getData: for poll on sock 109 0x7F63F4023090 waiting 253.073535042 seconds, WritebehindWorkerThread: on ThCond 0x7F65C4000CC8 (0x7F65C4000CC8) (MsgRecordCondvar), reason 'RPC wait' for NSD I/O completion on node 10.7.28.35 0x7F64A4000CE0 waiting 145.049659249 seconds, WritebehindWorkerThread: on ThCond 0x7F6560000A98 (0x7F6560000A98) (MsgRecordCondvar), reason 'RPC wait' for NSD I/O completion on node 10.7.28.35 0x7F6778006D00 waiting 142.124664264 seconds, WritebehindWorkerThread: on ThCond 0x7F63DC000C08 (0x7F63DC000C08) (MsgRecordCondvar), reason 'RPC wait' for NSD I/O completion on node 10.7.28.35 0x7F67780046D0 waiting 251.751439453 seconds, WritebehindWorkerThread: on ThCond 0x7F6454000A98 (0x7F6454000A98) (MsgRecordCondvar), reason 'RPC wait' for NSD I/O completion on node 10.7.28.35 0x7F67780E4B70 waiting 142.431051232 seconds, WritebehindWorkerThread: on ThCond 0x7F63C80010D8 (0x7F63C80010D8) (MsgRecordCondvar), reason 'RPC wait' for NSD I/O completion on node 10.7.28.35 0x7F67780E50D0 waiting 244.339624817 seconds, WritebehindWorkerThread: on ThCond 0x7F65BC001B98 (0x7F65BC001B98) (MsgRecordCondvar), reason 'RPC wait' for NSD I/O completion on node 10.7.28.35 0x7F6434000B40 waiting 145.343700410 seconds, WritebehindWorkerThread: on ThCond 0x7F63B00036E8 (0x7F63B00036E8) (MsgRecordCondvar), reason 'RPC wait' for NSD I/O completion on node 10.7.28.35 0x7F670C0187A0 waiting 244.903963969 seconds, WritebehindWorkerThread: on ThCond 0x7F65F0000FB8 (0x7F65F0000FB8) (MsgRecordCondvar), reason 'RPC wait' for NSD I/O completion on node 10.7.28.35 0x7F671C04E2F0 waiting 245.837137631 seconds, PrefetchWorkerThread: on ThCond 0x7F65A4000A98 (0x7F65A4000A98) (MsgRecordCondvar), reason 'RPC wait' for NSD I/O completion on node 10.7.28.35 0x7F671C04AA20 waiting 139.713993908 seconds, WritebehindWorkerThread: on ThCond 0x7F6454002478 (0x7F6454002478) (MsgRecordCondvar), reason 'RPC wait' for NSD I/O completion on node 10.7.28.35 0x7F671C049730 waiting 252.434187472 seconds, WritebehindWorkerThread: on ThCond 0x7F65F4003708 (0x7F65F4003708) (MsgRecordCondvar), reason 'RPC wait' for NSD I/O completion on node 10.7.28.35 0x7F671C044B70 waiting 131.515829048 seconds, Msg handler ccMsgPing: on ThCond 0x7F64DC1D4888 (0x7F64DC1D4888) (InuseCondvar), reason 'waiting for exclusive use of connection for sending msg' 0x7F6758008DE0 waiting 149.548547226 seconds, Msg handler getData: on ThCond 
0x7F645C002458 (0x7F645C002458) (InuseCondvar), reason 'waiting for exclusive use of connection for sending msg' 0x7F67580071D0 waiting 149.548543118 seconds, Msg handler commMsgCheckMessages: on ThCond 0x7F6450001C48 (0x7F6450001C48) (InuseCondvar), reason 'waiting for exclusive use of connection for sending msg' 0x7F65A40052B0 waiting 11.498507001 seconds, Msg handler ccMsgPing: on ThCond 0x7F644C103F88 (0x7F644C103F88) (InuseCondvar), reason 'waiting for exclusive use of connection for sending msg' 0x7F6448001620 waiting 139.844870446 seconds, WritebehindWorkerThread: on ThCond 0x7F65F0003098 (0x7F65F0003098) (MsgRecordCondvar), reason 'RPC wait' for NSD I/O completion on node 10.7.28.35 0x7F63F4000F80 waiting 245.044791905 seconds, WritebehindWorkerThread: on ThCond 0x7F6450001188 (0x7F6450001188) (MsgRecordCondvar), reason 'RPC wait' for NSD I/O completion on node 10.7.28.35 0x7F659C0033A0 waiting 243.464399305 seconds, PrefetchWorkerThread: on ThCond 0x7F6554002598 (0x7F6554002598) (MsgRecordCondvar), reason 'RPC wait' for NSD I/O completion on node 10.7.28.35 0x7F6514001690 waiting 245.826160463 seconds, PrefetchWorkerThread: on ThCond 0x7F65A4004558 (0x7F65A4004558) (MsgRecordCondvar), reason 'RPC wait' for NSD I/O completion on node 10.7.28.35 0x7F64800012B0 waiting 253.174835511 seconds, WritebehindWorkerThread: on ThCond 0x7F65E0000FB8 (0x7F65E0000FB8) (MsgRecordCondvar), reason 'RPC wait' for NSD I/O completion on node 10.7.28.35 0x7F6510000EE0 waiting 140.746696039 seconds, WritebehindWorkerThread: on ThCond 0x7F647C000CC8 (0x7F647C000CC8) (MsgRecordCondvar), reason 'RPC wait' for NSD I/O completion on node 10.7.28.35 0x7F6754001BB0 waiting 246.336055629 seconds, PrefetchWorkerThread: on ThCond 0x7F6594002498 (0x7F6594002498) (MsgRecordCondvar), reason 'RPC wait' for NSD I/O completion on node 10.7.28.35 0x7F6420000930 waiting 140.606777450 seconds, WritebehindWorkerThread: on ThCond 0x7F6578002498 (0x7F6578002498) (MsgRecordCondvar), reason 'RPC wait' for NSD I/O completion on node 10.7.28.35 0x7F6744009110 waiting 137.466372831 seconds, FileBlockReadFetchHandlerThread: on ThCond 0x7F65F4007158 (0x7F65F4007158) (MsgRecordCondvar), reason 'RPC wait' for NSD I/O completion on node 10.7.28.35 0x7F67280119F0 waiting 144.173427360 seconds, WritebehindWorkerThread: on ThCond 0x7F6504000AE8 (0x7F6504000AE8) (MsgRecordCondvar), reason 'RPC wait' for NSD I/O completion on node 10.7.28.35 0x7F672800BB40 waiting 145.804301887 seconds, WritebehindWorkerThread: on ThCond 0x7F6550001038 (0x7F6550001038) (MsgRecordCondvar), reason 'RPC wait' for NSD I/O completion on node 10.7.28.35 0x7F6728000910 waiting 252.601993452 seconds, WritebehindWorkerThread: on ThCond 0x7F6450000A98 (0x7F6450000A98) (MsgRecordCondvar), reason 'RPC wait' for NSD I/O completion on node 10.7.28.35 0x7F6744007E20 waiting 251.603329204 seconds, WritebehindWorkerThread: on ThCond 0x7F6570004C18 (0x7F6570004C18) (MsgRecordCondvar), reason 'RPC wait' for NSD I/O completion on node 10.7.28.35 0x7F64AC002EF0 waiting 139.205774422 seconds, FileBlockWriteFetchHandlerThread: on ThCond 0x18020AF0260 (0xFFFFC90020AF0260) (FetchFlowControlCondvar), reason 'wait for buffer for fetch' 0x7F6724013050 waiting 71.501580932 seconds, Msg handler ccMsgPing: on ThCond 0x7F6580006608 (0x7F6580006608) (InuseCondvar), reason 'waiting for exclusive use of connection for sending msg' 0x7F661C000DA0 waiting 245.654985276 seconds, PrefetchWorkerThread: on ThCond 0x7F6570005288 (0x7F6570005288) (MsgRecordCondvar), reason 'RPC wait' for NSD I/O 
completion on node 10.7.28.35 0x7F671C00F440 waiting 251.096002003 seconds, FileBlockReadFetchHandlerThread: on ThCond 0x7F65BC002878 (0x7F65BC002878) (MsgRecordCondvar), reason 'RPC wait' for NSD I/O completion on node 10.7.28.35 0x7F671C00E150 waiting 144.034006970 seconds, WritebehindWorkerThread: on ThCond 0x7F6528001548 (0x7F6528001548) (MsgRecordCondvar), reason 'RPC wait' for NSD I/O completion on node 10.7.28.35 0x7F67A02FCD20 waiting 142.324070945 seconds, WritebehindWorkerThread: on ThCond 0x7F6580002A98 (0x7F6580002A98) (MsgRecordCondvar), reason 'RPC wait' for NSD I/O completion on node 10.7.28.35 0x7F67A02FA330 waiting 200.670114385 seconds, EEWatchDogThread: on ThCond 0x7F65B0000A98 (0x7F65B0000A98) (MsgRecordCondvar), reason 'RPC wait' 0x7F67A02BF050 waiting 252.276161189 seconds, WritebehindWorkerThread: on ThCond 0x7F6584003998 (0x7F6584003998) (MsgRecordCondvar), reason 'RPC wait' for NSD I/O completion on node 10.7.28.35 0x7F67A0004160 waiting 251.173651822 seconds, SyncHandlerThread: on ThCond 0x7F64880079E8 (0x7F64880079E8) (LogFileBufferDescriptorCondvar), reason 'force wait on force active buffer write'

So from the client side it is the client that is waiting for the server. I also managed to ping, ssh, and tcpdump between the nodes before the node got expelled, and found that ping works fine and ssh works fine, yet apart from my own tests there are 0 packets passing between them, LITERALLY. So there is no congestion and no network issue, but the server waits for the client and the client waits for the server. This goes on until we reach 350 seconds (10 times the lease time), and then the client gets expelled. There are no local I/O waiters indicating that GSS is struggling, there is plenty of bandwidth and CPU, and there is no network congestion. It looks like some sort of deadlock to me, but how can this be explained and, hopefully, fixed?

Regards,
Salvatore

-------------- next part --------------
An HTML attachment was scrubbed...
URL:
From chair at gpfsug.org Thu Aug 21 09:20:39 2014
From: chair at gpfsug.org (Jez Tucker (Chair))
Date: Thu, 21 Aug 2014 09:20:39 +0100
Subject: [gpfsug-discuss] gpfs client expels
In-Reply-To: <53F454E3.40803@ebi.ac.uk>
References: <53EE0BB1.8000005@ebi.ac.uk> <53F454E3.40803@ebi.ac.uk>
Message-ID: <53F5ABD7.80107@gpfsug.org>

Hi there,

I've seen this on several 'stock'? 'core'? GPFS systems (we need a better term now GSS is out) and seen ping 'working', but alongside ejections from the cluster. The GPFS internode 'ping' is somewhat more circumspect than unix ping - and rightly so.

In my experience this has _always_ been a network issue of one sort or another. If the network is experiencing issues, nodes will be ejected. Of course it could be an unresponsive mmfsd or high loadavg, but I've seen that only twice in 10 years over many versions of GPFS.

You need to follow the logs through from each machine in time order to determine who could not see who, and in what order. Your best way forward is to log a SEV2 case with IBM support, directly or via your OEM, and collect and supply a snap and traces as required by support.

Without knowing your full setup, it's hard to help further.

Jez

On 20/08/14 08:57, Salvatore Di Nardo wrote: > Still problems.
Here some more detailed examples: > > *EXAMPLE 1:* > > *EBI5-220**( CLIENT)** > *Tue Aug 19 11:03:04.980 2014: *Timed out waiting for a > reply from node gss02b* > Tue Aug 19 11:03:04.981 2014: Request sent to > (gss02a in GSS.ebi.ac.uk) to expel (gss02b in > GSS.ebi.ac.uk) from cluster GSS.ebi.ac.uk > Tue Aug 19 11:03:04.982 2014: This node will be expelled > from cluster GSS.ebi.ac.uk due to expel msg from IP> (ebi5-220) > Tue Aug 19 11:03:09.319 2014: Cluster Manager connection > broke. Probing cluster GSS.ebi.ac.uk > Tue Aug 19 11:03:10.321 2014: Unable to contact any quorum > nodes during cluster probe. > Tue Aug 19 11:03:10.322 2014: Lost membership in cluster > GSS.ebi.ac.uk. Unmounting file systems. > Tue Aug 19 11:03:10 BST 2014: mmcommon preunmount > invoked. File system: gpfs1 Reason: SGPanic > Tue Aug 19 11:03:12.066 2014: Connecting to > gss02a > Tue Aug 19 11:03:12.070 2014: Connected to > gss02a > Tue Aug 19 11:03:17.071 2014: Connecting to > gss02b > Tue Aug 19 11:03:17.072 2014: Connecting to > gss03b > Tue Aug 19 11:03:17.079 2014: Connecting to > gss03a > Tue Aug 19 11:03:17.080 2014: Connecting to > gss01b > Tue Aug 19 11:03:17.079 2014: Connecting to > gss01a > Tue Aug 19 11:04:23.105 2014: Connected to > gss02b > Tue Aug 19 11:04:23.107 2014: Connected to > gss03b > Tue Aug 19 11:04:23.112 2014: Connected to > gss03a > Tue Aug 19 11:04:23.115 2014: Connected to > gss01b > Tue Aug 19 11:04:23.121 2014: Connected to > gss01a > Tue Aug 19 11:12:28.992 2014: Node (gss02a in > GSS.ebi.ac.uk) is now the Group Leader. > > *GSS02B ( NSD SERVER)* > ... > Tue Aug 19 11:03:17.070 2014: Killing connection from > ** because the group is not ready for it to > rejoin, err 46 > Tue Aug 19 11:03:25.016 2014: Killing connection from > because the group is not ready for it to > rejoin, err 46 > Tue Aug 19 11:03:28.080 2014: Killing connection from > ** because the group is not ready for it to > rejoin, err 46 > Tue Aug 19 11:03:36.019 2014: Killing connection from > because the group is not ready for it to > rejoin, err 46 > Tue Aug 19 11:03:39.083 2014: Killing connection from > ** because the group is not ready for it to > rejoin, err 46 > Tue Aug 19 11:03:47.023 2014: Killing connection from > because the group is not ready for it to > rejoin, err 46 > Tue Aug 19 11:03:50.088 2014: Killing connection from > ** because the group is not ready for it to > rejoin, err 46 > Tue Aug 19 11:03:52.218 2014: Killing connection from > because the group is not ready for it to > rejoin, err 46 > Tue Aug 19 11:03:58.030 2014: Killing connection from > because the group is not ready for it to > rejoin, err 46 > Tue Aug 19 11:04:01.092 2014: Killing connection from > ** because the group is not ready for it to > rejoin, err 46 > Tue Aug 19 11:04:03.220 2014: Killing connection from > because the group is not ready for it to > rejoin, err 46 > Tue Aug 19 11:04:09.034 2014: Killing connection from > because the group is not ready for it to > rejoin, err 46 > Tue Aug 19 11:04:12.096 2014: Killing connection from > ** because the group is not ready for it to > rejoin, err 46 > Tue Aug 19 11:04:14.224 2014: Killing connection from > because the group is not ready for it to > rejoin, err 46 > Tue Aug 19 11:04:20.037 2014: Killing connection from > because the group is not ready for it to > rejoin, err 46 > Tue Aug 19 11:04:23.103 2014: Accepted and connected to > ** ebi5-220 > ... 
> > *GSS02a ( NSD SERVER)* > Tue Aug 19 11:03:04.980 2014: Expel (gss02b) > request from (ebi5-220 in > ebi-cluster.ebi.ac.uk). Expelling: (ebi5-220 > in ebi-cluster.ebi.ac.uk) > Tue Aug 19 11:03:12.069 2014: Accepted and connected to > ebi5-220 > > > =============================================== > *EXAMPLE 2*: > > *EBI5-038* > Tue Aug 19 11:32:34.227 2014: *Disk lease period expired > in cluster GSS.ebi.ac.uk. Attempting to reacquire lease.* > Tue Aug 19 11:33:34.258 2014: *Lease is overdue. Probing > cluster GSS.ebi.ac.uk* > Tue Aug 19 11:35:24.265 2014: Close connection to IP> gss02a (Connection reset by peer). Attempting > reconnect. > Tue Aug 19 11:35:24.865 2014: Close connection to > ebi5-014 (Connection reset by > peer). Attempting reconnect. > ... > LOT MORE RESETS BY PEER > ... > Tue Aug 19 11:35:25.096 2014: Close connection to > ebi5-167 (Connection reset by > peer). Attempting reconnect. > Tue Aug 19 11:35:25.267 2014: Connecting to > gss02a > Tue Aug 19 11:35:25.268 2014: Close connection to IP> gss02a (Connection failed because destination > is still processing previous node failure) > Tue Aug 19 11:35:26.267 2014: Retry connection to IP> gss02a > Tue Aug 19 11:35:26.268 2014: Close connection to IP> gss02a (Connection failed because destination > is still processing previous node failure) > Tue Aug 19 11:36:24.276 2014: Unable to contact any quorum > nodes during cluster probe. > Tue Aug 19 11:36:24.277 2014: *Lost membership in cluster > GSS.ebi.ac.uk. Unmounting file systems.* > > *GSS02a* > Tue Aug 19 11:35:24.263 2014: Node (ebi5-038 > in ebi-cluster.ebi.ac.uk) *is being expelled because of an > expired lease.* Pings sent: 60. Replies received: 60. > > > > > In example 1 seems that an NSD was not repliyng to the > client, but the servers seems working fine.. how can i > trace better ( to solve) the problem? > > In example 2 it seems to me that for some reason the > manager are not > renewing the lease in time. when this happens , its not a single client. > Loads of them fail to get the lease renewed. Why this is happening? > how can i trace to the source of the problem? > > > > Thanks in advance for any tips. > > Regards, > Salvatore > > > > > > > > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at gpfsug.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss

-------------- next part --------------
An HTML attachment was scrubbed...
URL:
From sdinardo at ebi.ac.uk Thu Aug 21 10:04:47 2014
From: sdinardo at ebi.ac.uk (Salvatore Di Nardo)
Date: Thu, 21 Aug 2014 10:04:47 +0100
Subject: [gpfsug-discuss] gpfs client expels
In-Reply-To: <53F5ABD7.80107@gpfsug.org>
References: <53EE0BB1.8000005@ebi.ac.uk> <53F454E3.40803@ebi.ac.uk> <53F5ABD7.80107@gpfsug.org>
Message-ID: <53F5B62F.1060305@ebi.ac.uk>

Thanks for the feedback, but we managed to find a scenario that excludes network problems. We have a file called */input_file/* of nearly 100GB. If from *client A* we do:

cat input_file >> output_file

it starts copying; we see the waiters go up a bit for a few seconds, but then they flush back to 0, so we can say the copy proceeds well. If we now do the same from another client (or just another shell on the same client), *client B*:

cat input_file >> output_file

(in other words, we are trying to write to the same destination), all the waiters go up until one node gets expelled.
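(For anyone trying to recreate this two-writer scenario while collecting evidence, a rough sketch is below. The file paths, interface name and host roles are illustrative assumptions rather than details from this thread, and 1191 is only the usual default GPFS daemon port (tscTcpPort), so check the cluster configuration before relying on it.)

# Illustrative sketch, not from the original thread.
# Client A: start the first writer in the background.
cat /gpfs/nobackup/test/input_file >> /gpfs/nobackup/test/output_file &

# Client B (or a second shell on the same client): start the competing writer.
cat /gpfs/nobackup/test/input_file >> /gpfs/nobackup/test/output_file &

# On both clients and on the NSD servers, sample the waiters while the
# two writers run, so the buildup towards the expel can be followed:
while true; do date; /usr/lpp/mmfs/bin/mmdiag --waiters; sleep 5; done \
    > /tmp/waiters.$(hostname).log 2>&1 &

# Optionally capture the GPFS daemon traffic between the client and the
# cluster manager / NSD servers (bond0 and port 1191 are assumptions):
tcpdump -i bond0 -s 0 -w /tmp/gpfs.$(hostname).pcap tcp port 1191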
Now, while it is understandable that the destination file is locked by one of the "cat" processes, so the other has to wait (and since the file is BIG, it has to wait for a while), it is not understandable why this stops the lease renewal. Why doesn't it simply return a timeout error on the copy instead of expelling the node? We can reproduce this every time, and since our users do operations like this on files over 100GB each, you can imagine the result. Even if it is a bit silly to write to the same destination at the same time, it is also quite common, for example when several writers dump to the same log file and one of them keeps the file locked by writing for a long time. Our expels are not due to network congestion, but to one write attempt having to wait for another. What I really don't understand is why such an extreme measure as an expel is taken just because a process is waiting "too much time". I have a ticket open with IBM for this and the issue is under investigation, but no luck so far.

Regards,
Salvatore

On 21/08/14 09:20, Jez Tucker (Chair) wrote: > Hi there, > > I've seen the on several 'stock'? 'core'? GPFS system (we need a > better term now GSS is out) and seen ping 'working', but alongside > ejections from the cluster. > The GPFS internode 'ping' is somewhat more circumspect than unix ping > - and rightly so. > > In my experience this has _always_ been a network issue of one sort of > another. If the network is experiencing issues, nodes will be ejected. > Of course it could be unresponsive mmfsd or high loadavg, but I've > seen that only twice in 10 years over many versions of GPFS. > > You need to follow the logs through from each machine in time order to > determine who could not see who and in what order. > Your best way forward is to log a SEV2 case with IBM support, directly > or via your OEM and collect and supply a snap and traces as required > by support. > > Without knowing your full setup, it's hard to help further. > > Jez > > On 20/08/14 08:57, Salvatore Di Nardo wrote: >> Still problems. Here some more detailed examples: >> >> *EXAMPLE 1:* >> >> *EBI5-220**( CLIENT)** >> *Tue Aug 19 11:03:04.980 2014: *Timed out waiting for a >> reply from node gss02b* >> Tue Aug 19 11:03:04.981 2014: Request sent to >> (gss02a in GSS.ebi.ac.uk) to expel (gss02b in >> GSS.ebi.ac.uk) from cluster GSS.ebi.ac.uk >> Tue Aug 19 11:03:04.982 2014: This node will be expelled >> from cluster GSS.ebi.ac.uk due to expel msg from >> (ebi5-220) >> Tue Aug 19 11:03:09.319 2014: Cluster Manager connection >> broke. Probing cluster GSS.ebi.ac.uk >> Tue Aug 19 11:03:10.321 2014: Unable to contact any >> quorum nodes during cluster probe. >> Tue Aug 19 11:03:10.322 2014: Lost membership in cluster >> GSS.ebi.ac.uk. Unmounting file systems. >> Tue Aug 19 11:03:10 BST 2014: mmcommon preunmount >> invoked.
File system: gpfs1 Reason: SGPanic >> Tue Aug 19 11:03:12.066 2014: Connecting to >> gss02a >> Tue Aug 19 11:03:12.070 2014: Connected to >> gss02a >> Tue Aug 19 11:03:17.071 2014: Connecting to >> gss02b >> Tue Aug 19 11:03:17.072 2014: Connecting to >> gss03b >> Tue Aug 19 11:03:17.079 2014: Connecting to >> gss03a >> Tue Aug 19 11:03:17.080 2014: Connecting to >> gss01b >> Tue Aug 19 11:03:17.079 2014: Connecting to >> gss01a >> Tue Aug 19 11:04:23.105 2014: Connected to >> gss02b >> Tue Aug 19 11:04:23.107 2014: Connected to >> gss03b >> Tue Aug 19 11:04:23.112 2014: Connected to >> gss03a >> Tue Aug 19 11:04:23.115 2014: Connected to >> gss01b >> Tue Aug 19 11:04:23.121 2014: Connected to >> gss01a >> Tue Aug 19 11:12:28.992 2014: Node (gss02a in >> GSS.ebi.ac.uk) is now the Group Leader. >> >> *GSS02B ( NSD SERVER)* >> ... >> Tue Aug 19 11:03:17.070 2014: Killing connection from >> ** because the group is not ready for it to >> rejoin, err 46 >> Tue Aug 19 11:03:25.016 2014: Killing connection from >> because the group is not ready for it to >> rejoin, err 46 >> Tue Aug 19 11:03:28.080 2014: Killing connection from >> ** because the group is not ready for it to >> rejoin, err 46 >> Tue Aug 19 11:03:36.019 2014: Killing connection from >> because the group is not ready for it to >> rejoin, err 46 >> Tue Aug 19 11:03:39.083 2014: Killing connection from >> ** because the group is not ready for it to >> rejoin, err 46 >> Tue Aug 19 11:03:47.023 2014: Killing connection from >> because the group is not ready for it to >> rejoin, err 46 >> Tue Aug 19 11:03:50.088 2014: Killing connection from >> ** because the group is not ready for it to >> rejoin, err 46 >> Tue Aug 19 11:03:52.218 2014: Killing connection from >> because the group is not ready for it to >> rejoin, err 46 >> Tue Aug 19 11:03:58.030 2014: Killing connection from >> because the group is not ready for it to >> rejoin, err 46 >> Tue Aug 19 11:04:01.092 2014: Killing connection from >> ** because the group is not ready for it to >> rejoin, err 46 >> Tue Aug 19 11:04:03.220 2014: Killing connection from >> because the group is not ready for it to >> rejoin, err 46 >> Tue Aug 19 11:04:09.034 2014: Killing connection from >> because the group is not ready for it to >> rejoin, err 46 >> Tue Aug 19 11:04:12.096 2014: Killing connection from >> ** because the group is not ready for it to >> rejoin, err 46 >> Tue Aug 19 11:04:14.224 2014: Killing connection from >> because the group is not ready for it to >> rejoin, err 46 >> Tue Aug 19 11:04:20.037 2014: Killing connection from >> because the group is not ready for it to >> rejoin, err 46 >> Tue Aug 19 11:04:23.103 2014: Accepted and connected to >> ** ebi5-220 >> ... >> >> *GSS02a ( NSD SERVER)* >> Tue Aug 19 11:03:04.980 2014: Expel (gss02b) >> request from (ebi5-220 in >> ebi-cluster.ebi.ac.uk). Expelling: >> (ebi5-220 in ebi-cluster.ebi.ac.uk) >> Tue Aug 19 11:03:12.069 2014: Accepted and connected to >> ebi5-220 >> >> >> =============================================== >> *EXAMPLE 2*: >> >> *EBI5-038* >> Tue Aug 19 11:32:34.227 2014: *Disk lease period expired >> in cluster GSS.ebi.ac.uk. Attempting to reacquire lease.* >> Tue Aug 19 11:33:34.258 2014: *Lease is overdue. Probing >> cluster GSS.ebi.ac.uk* >> Tue Aug 19 11:35:24.265 2014: Close connection to > IP> gss02a (Connection reset by peer). Attempting >> reconnect. >> Tue Aug 19 11:35:24.865 2014: Close connection to >> ebi5-014 (Connection reset by >> peer). Attempting reconnect. >> ... >> LOT MORE RESETS BY PEER >> ... 
>> Tue Aug 19 11:35:25.096 2014: Close connection to >> ebi5-167 (Connection reset by >> peer). Attempting reconnect. >> Tue Aug 19 11:35:25.267 2014: Connecting to >> gss02a >> Tue Aug 19 11:35:25.268 2014: Close connection to > IP> gss02a (Connection failed because destination >> is still processing previous node failure) >> Tue Aug 19 11:35:26.267 2014: Retry connection to > IP> gss02a >> Tue Aug 19 11:35:26.268 2014: Close connection to > IP> gss02a (Connection failed because destination >> is still processing previous node failure) >> Tue Aug 19 11:36:24.276 2014: Unable to contact any >> quorum nodes during cluster probe. >> Tue Aug 19 11:36:24.277 2014: *Lost membership in cluster >> GSS.ebi.ac.uk. Unmounting file systems.* >> >> *GSS02a* >> Tue Aug 19 11:35:24.263 2014: Node >> (ebi5-038 in ebi-cluster.ebi.ac.uk) *is being expelled >> because of an expired lease.* Pings sent: 60. Replies >> received: 60. >> >> >> >> >> In example 1 seems that an NSD was not repliyng to the client, but >> the servers seems working fine.. how can i trace better ( to solve) >> the problem? >> >> In example 2 it seems to me that for some reason the manager are not >> renewing the lease in time. when this happens , its not a single client. >> Loads of them fail to get the lease renewed. Why this is happening? >> how can i trace to the source of the problem? >> >> >> >> Thanks in advance for any tips. >> >> Regards, >> Salvatore >> >> >> >> >> >> >> >> >> _______________________________________________ >> gpfsug-discuss mailing list >> gpfsug-discuss at gpfsug.org >> http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at gpfsug.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss -------------- next part -------------- An HTML attachment was scrubbed... URL: From bbanister at jumptrading.com Thu Aug 21 13:48:38 2014 From: bbanister at jumptrading.com (Bryan Banister) Date: Thu, 21 Aug 2014 12:48:38 +0000 Subject: [gpfsug-discuss] gpfs client expels In-Reply-To: <53F5B62F.1060305@ebi.ac.uk> References: <53EE0BB1.8000005@ebi.ac.uk> <53F454E3.40803@ebi.ac.uk> <53F5ABD7.80107@gpfsug.org>,<53F5B62F.1060305@ebi.ac.uk> Message-ID: <21BC488F0AEA2245B2C3E83FC0B33DBB8263D9@CHI-EXCHANGEW2.w2k.jumptrading.com> As I understand GPFS distributed locking semantics, GPFS will not allow one node to hold a write lock for a file indefinitely. Once Client B opens the file for writing it would have contacted the File System Manager to obtain the lock. The FS manager would have told Client B that Client A has the lock and that Client B would have to contact Client A and revoke the write lock token. If Client A does not respond to Client B's request to revoke the write token, then Client B will ask that Client A be expelled from the cluster for NOT adhering to the proper protocol for write lock contention. [cid:2fb2253c-3ffb-4ac6-88a8-d019b1a24f66] Have you checked the communication path between the two clients at this point? I could not follow the logs that you provided. You should definitely look at the exact sequence of log events on the two clients and the file system manager (as reported by mmlsmgr). 
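For reference, a rough sketch of how that evidence can be collected while the two writers are running - this assumes stock GPFS 3.5 command names and the gpfs1 device from your logs, so treat it as a starting point rather than a recipe:

    # which node is currently the file system manager for gpfs1
    mmlsmgr gpfs1

    # on both clients and on the manager node, watch the RPC waiters
    # while the two writes are in flight
    mmdiag --waiters

    # then line the GPFS logs from all three nodes up by timestamp
    grep "Tue Aug 19 11:" /var/adm/ras/mmfs.log.latest

If the waiters on one client show it stuck waiting on a token revoke from the other, that points at the lock/token path rather than the disk lease itself.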
Hope that helps, -Bryan ________________________________ From: gpfsug-discuss-bounces at gpfsug.org [gpfsug-discuss-bounces at gpfsug.org] on behalf of Salvatore Di Nardo [sdinardo at ebi.ac.uk] Sent: Thursday, August 21, 2014 4:04 AM To: chair at gpfsug.org; gpfsug main discussion list Subject: Re: [gpfsug-discuss] gpfs client expels Thanks for the feedback, but we managed to find a scenario that excludes network problems. we have a file called input_file of nearly 100GB: if from client A we do: cat input_file >> output_file it start copying.. and we see waiter goeg a bit up,secs but then they flushes back to 0, so we xcan say that the copy proceed well... if now we do the same from another client ( or just another shell on the same client) client B : cat input_file >> output_file ( in other words we are trying to write to the same destination) all the waiters gets up until one node get expelled. Now, while its understandable that the destination file is locked for one of the "cat", so have to wait ( and since the file is BIG , have to wait for a while), its not understandable why it stop the renewal lease. Why its doen't return just a timeout error on the copy instead to expel the node? We can reproduce this every time, and since our users to operations like this on files over 100GB each you can imagine the result. As you can imagine even if its a bit silly to write at the same time to the same destination, its also quite common if we want to dump to a log file logs and for some reason one of the writers, write for a lot of time keeping the file locked. Our expels are not due to network congestion, but because a write attempts have to wait another one. What i really dont understand is why to take a so expreme mesure to expell jest because a process is waiteing "to too much time". I have ticket opened to IBM for this and the issue is under investigation, but no luck so far.. Regards, Salvatore On 21/08/14 09:20, Jez Tucker (Chair) wrote: Hi there, I've seen the on several 'stock'? 'core'? GPFS system (we need a better term now GSS is out) and seen ping 'working', but alongside ejections from the cluster. The GPFS internode 'ping' is somewhat more circumspect than unix ping - and rightly so. In my experience this has _always_ been a network issue of one sort of another. If the network is experiencing issues, nodes will be ejected. Of course it could be unresponsive mmfsd or high loadavg, but I've seen that only twice in 10 years over many versions of GPFS. You need to follow the logs through from each machine in time order to determine who could not see who and in what order. Your best way forward is to log a SEV2 case with IBM support, directly or via your OEM and collect and supply a snap and traces as required by support. Without knowing your full setup, it's hard to help further. Jez On 20/08/14 08:57, Salvatore Di Nardo wrote: Still problems. Here some more detailed examples: EXAMPLE 1: EBI5-220 ( CLIENT) Tue Aug 19 11:03:04.980 2014: Timed out waiting for a reply from node gss02b Tue Aug 19 11:03:04.981 2014: Request sent to (gss02a in GSS.ebi.ac.uk) to expel (gss02b in GSS.ebi.ac.uk) from cluster GSS.ebi.ac.uk Tue Aug 19 11:03:04.982 2014: This node will be expelled from cluster GSS.ebi.ac.uk due to expel msg from (ebi5-220) Tue Aug 19 11:03:09.319 2014: Cluster Manager connection broke. Probing cluster GSS.ebi.ac.uk Tue Aug 19 11:03:10.321 2014: Unable to contact any quorum nodes during cluster probe. Tue Aug 19 11:03:10.322 2014: Lost membership in cluster GSS.ebi.ac.uk. 
Unmounting file systems. Tue Aug 19 11:03:10 BST 2014: mmcommon preunmount invoked. File system: gpfs1 Reason: SGPanic Tue Aug 19 11:03:12.066 2014: Connecting to gss02a Tue Aug 19 11:03:12.070 2014: Connected to gss02a Tue Aug 19 11:03:17.071 2014: Connecting to gss02b Tue Aug 19 11:03:17.072 2014: Connecting to gss03b Tue Aug 19 11:03:17.079 2014: Connecting to gss03a Tue Aug 19 11:03:17.080 2014: Connecting to gss01b Tue Aug 19 11:03:17.079 2014: Connecting to gss01a Tue Aug 19 11:04:23.105 2014: Connected to gss02b Tue Aug 19 11:04:23.107 2014: Connected to gss03b Tue Aug 19 11:04:23.112 2014: Connected to gss03a Tue Aug 19 11:04:23.115 2014: Connected to gss01b Tue Aug 19 11:04:23.121 2014: Connected to gss01a Tue Aug 19 11:12:28.992 2014: Node (gss02a in GSS.ebi.ac.uk) is now the Group Leader. GSS02B ( NSD SERVER) ... Tue Aug 19 11:03:17.070 2014: Killing connection from because the group is not ready for it to rejoin, err 46 Tue Aug 19 11:03:25.016 2014: Killing connection from because the group is not ready for it to rejoin, err 46 Tue Aug 19 11:03:28.080 2014: Killing connection from because the group is not ready for it to rejoin, err 46 Tue Aug 19 11:03:36.019 2014: Killing connection from because the group is not ready for it to rejoin, err 46 Tue Aug 19 11:03:39.083 2014: Killing connection from because the group is not ready for it to rejoin, err 46 Tue Aug 19 11:03:47.023 2014: Killing connection from because the group is not ready for it to rejoin, err 46 Tue Aug 19 11:03:50.088 2014: Killing connection from because the group is not ready for it to rejoin, err 46 Tue Aug 19 11:03:52.218 2014: Killing connection from because the group is not ready for it to rejoin, err 46 Tue Aug 19 11:03:58.030 2014: Killing connection from because the group is not ready for it to rejoin, err 46 Tue Aug 19 11:04:01.092 2014: Killing connection from because the group is not ready for it to rejoin, err 46 Tue Aug 19 11:04:03.220 2014: Killing connection from because the group is not ready for it to rejoin, err 46 Tue Aug 19 11:04:09.034 2014: Killing connection from because the group is not ready for it to rejoin, err 46 Tue Aug 19 11:04:12.096 2014: Killing connection from because the group is not ready for it to rejoin, err 46 Tue Aug 19 11:04:14.224 2014: Killing connection from because the group is not ready for it to rejoin, err 46 Tue Aug 19 11:04:20.037 2014: Killing connection from because the group is not ready for it to rejoin, err 46 Tue Aug 19 11:04:23.103 2014: Accepted and connected to ebi5-220 ... GSS02a ( NSD SERVER) Tue Aug 19 11:03:04.980 2014: Expel (gss02b) request from (ebi5-220 in ebi-cluster.ebi.ac.uk). Expelling: (ebi5-220 in ebi-cluster.ebi.ac.uk) Tue Aug 19 11:03:12.069 2014: Accepted and connected to ebi5-220 =============================================== EXAMPLE 2: EBI5-038 Tue Aug 19 11:32:34.227 2014: Disk lease period expired in cluster GSS.ebi.ac.uk. Attempting to reacquire lease. Tue Aug 19 11:33:34.258 2014: Lease is overdue. Probing cluster GSS.ebi.ac.uk Tue Aug 19 11:35:24.265 2014: Close connection to gss02a (Connection reset by peer). Attempting reconnect. Tue Aug 19 11:35:24.865 2014: Close connection to ebi5-014 (Connection reset by peer). Attempting reconnect. ... LOT MORE RESETS BY PEER ... Tue Aug 19 11:35:25.096 2014: Close connection to ebi5-167 (Connection reset by peer). Attempting reconnect. 
Tue Aug 19 11:35:25.267 2014: Connecting to gss02a Tue Aug 19 11:35:25.268 2014: Close connection to gss02a (Connection failed because destination is still processing previous node failure) Tue Aug 19 11:35:26.267 2014: Retry connection to gss02a Tue Aug 19 11:35:26.268 2014: Close connection to gss02a (Connection failed because destination is still processing previous node failure) Tue Aug 19 11:36:24.276 2014: Unable to contact any quorum nodes during cluster probe. Tue Aug 19 11:36:24.277 2014: Lost membership in cluster GSS.ebi.ac.uk. Unmounting file systems. GSS02a Tue Aug 19 11:35:24.263 2014: Node (ebi5-038 in ebi-cluster.ebi.ac.uk) is being expelled because of an expired lease. Pings sent: 60. Replies received: 60. In example 1 seems that an NSD was not repliyng to the client, but the servers seems working fine.. how can i trace better ( to solve) the problem? In example 2 it seems to me that for some reason the manager are not renewing the lease in time. when this happens , its not a single client. Loads of them fail to get the lease renewed. Why this is happening? how can i trace to the source of the problem? Thanks in advance for any tips. Regards, Salvatore _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at gpfsug.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at gpfsug.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss ________________________________ Note: This email is for the confidential use of the named addressee(s) only and may contain proprietary, confidential or privileged information. If you are not the intended recipient, you are hereby notified that any review, dissemination or copying of this email is strictly prohibited, and to please notify the sender immediately and destroy this email and any attachments. Email transmission cannot be guaranteed to be secure or error-free. The Company, therefore, does not make any guarantees as to the completeness or accuracy of this email or any attachments. This email is for informational purposes only and does not constitute a recommendation, offer, request or solicitation of any kind to buy, sell, subscribe, redeem or perform any type of transaction of a financial product. -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: GPFS_Token_Protocol.png Type: image/png Size: 249179 bytes Desc: GPFS_Token_Protocol.png URL: From jbernard at jumptrading.com Thu Aug 21 13:52:05 2014 From: jbernard at jumptrading.com (Jon Bernard) Date: Thu, 21 Aug 2014 12:52:05 +0000 Subject: [gpfsug-discuss] gpfs client expels In-Reply-To: <21BC488F0AEA2245B2C3E83FC0B33DBB8263D9@CHI-EXCHANGEW2.w2k.jumptrading.com> References: <53EE0BB1.8000005@ebi.ac.uk> <53F454E3.40803@ebi.ac.uk> <53F5ABD7.80107@gpfsug.org>, <53F5B62F.1060305@ebi.ac.uk>, <21BC488F0AEA2245B2C3E83FC0B33DBB8263D9@CHI-EXCHANGEW2.w2k.jumptrading.com> Message-ID: Where is that from? On Aug 21, 2014, at 7:49, "Bryan Banister" > wrote: As I understand GPFS distributed locking semantics, GPFS will not allow one node to hold a write lock for a file indefinitely. Once Client B opens the file for writing it would have contacted the File System Manager to obtain the lock. The FS manager would have told Client B that Client A has the lock and that Client B would have to contact Client A and revoke the write lock token. 
If Client A does not respond to Client B's request to revoke the write token, then Client B will ask that Client A be expelled from the cluster for NOT adhering to the proper protocol for write lock contention. Have you checked the communication path between the two clients at this point? I could not follow the logs that you provided. You should definitely look at the exact sequence of log events on the two clients and the file system manager (as reported by mmlsmgr). Hope that helps, -Bryan ________________________________ From: gpfsug-discuss-bounces at gpfsug.org [gpfsug-discuss-bounces at gpfsug.org] on behalf of Salvatore Di Nardo [sdinardo at ebi.ac.uk] Sent: Thursday, August 21, 2014 4:04 AM To: chair at gpfsug.org; gpfsug main discussion list Subject: Re: [gpfsug-discuss] gpfs client expels Thanks for the feedback, but we managed to find a scenario that excludes network problems. we have a file called input_file of nearly 100GB: if from client A we do: cat input_file >> output_file it start copying.. and we see waiter goeg a bit up,secs but then they flushes back to 0, so we xcan say that the copy proceed well... if now we do the same from another client ( or just another shell on the same client) client B : cat input_file >> output_file ( in other words we are trying to write to the same destination) all the waiters gets up until one node get expelled. Now, while its understandable that the destination file is locked for one of the "cat", so have to wait ( and since the file is BIG , have to wait for a while), its not understandable why it stop the renewal lease. Why its doen't return just a timeout error on the copy instead to expel the node? We can reproduce this every time, and since our users to operations like this on files over 100GB each you can imagine the result. As you can imagine even if its a bit silly to write at the same time to the same destination, its also quite common if we want to dump to a log file logs and for some reason one of the writers, write for a lot of time keeping the file locked. Our expels are not due to network congestion, but because a write attempts have to wait another one. What i really dont understand is why to take a so expreme mesure to expell jest because a process is waiteing "to too much time". I have ticket opened to IBM for this and the issue is under investigation, but no luck so far.. Regards, Salvatore On 21/08/14 09:20, Jez Tucker (Chair) wrote: Hi there, I've seen the on several 'stock'? 'core'? GPFS system (we need a better term now GSS is out) and seen ping 'working', but alongside ejections from the cluster. The GPFS internode 'ping' is somewhat more circumspect than unix ping - and rightly so. In my experience this has _always_ been a network issue of one sort of another. If the network is experiencing issues, nodes will be ejected. Of course it could be unresponsive mmfsd or high loadavg, but I've seen that only twice in 10 years over many versions of GPFS. You need to follow the logs through from each machine in time order to determine who could not see who and in what order. Your best way forward is to log a SEV2 case with IBM support, directly or via your OEM and collect and supply a snap and traces as required by support. Without knowing your full setup, it's hard to help further. Jez On 20/08/14 08:57, Salvatore Di Nardo wrote: Still problems. 
Here some more detailed examples: EXAMPLE 1: EBI5-220 ( CLIENT) Tue Aug 19 11:03:04.980 2014: Timed out waiting for a reply from node gss02b Tue Aug 19 11:03:04.981 2014: Request sent to (gss02a in GSS.ebi.ac.uk) to expel (gss02b in GSS.ebi.ac.uk) from cluster GSS.ebi.ac.uk Tue Aug 19 11:03:04.982 2014: This node will be expelled from cluster GSS.ebi.ac.uk due to expel msg from (ebi5-220) Tue Aug 19 11:03:09.319 2014: Cluster Manager connection broke. Probing cluster GSS.ebi.ac.uk Tue Aug 19 11:03:10.321 2014: Unable to contact any quorum nodes during cluster probe. Tue Aug 19 11:03:10.322 2014: Lost membership in cluster GSS.ebi.ac.uk. Unmounting file systems. Tue Aug 19 11:03:10 BST 2014: mmcommon preunmount invoked. File system: gpfs1 Reason: SGPanic Tue Aug 19 11:03:12.066 2014: Connecting to gss02a Tue Aug 19 11:03:12.070 2014: Connected to gss02a Tue Aug 19 11:03:17.071 2014: Connecting to gss02b Tue Aug 19 11:03:17.072 2014: Connecting to gss03b Tue Aug 19 11:03:17.079 2014: Connecting to gss03a Tue Aug 19 11:03:17.080 2014: Connecting to gss01b Tue Aug 19 11:03:17.079 2014: Connecting to gss01a Tue Aug 19 11:04:23.105 2014: Connected to gss02b Tue Aug 19 11:04:23.107 2014: Connected to gss03b Tue Aug 19 11:04:23.112 2014: Connected to gss03a Tue Aug 19 11:04:23.115 2014: Connected to gss01b Tue Aug 19 11:04:23.121 2014: Connected to gss01a Tue Aug 19 11:12:28.992 2014: Node (gss02a in GSS.ebi.ac.uk) is now the Group Leader. GSS02B ( NSD SERVER) ... Tue Aug 19 11:03:17.070 2014: Killing connection from because the group is not ready for it to rejoin, err 46 Tue Aug 19 11:03:25.016 2014: Killing connection from because the group is not ready for it to rejoin, err 46 Tue Aug 19 11:03:28.080 2014: Killing connection from because the group is not ready for it to rejoin, err 46 Tue Aug 19 11:03:36.019 2014: Killing connection from because the group is not ready for it to rejoin, err 46 Tue Aug 19 11:03:39.083 2014: Killing connection from because the group is not ready for it to rejoin, err 46 Tue Aug 19 11:03:47.023 2014: Killing connection from because the group is not ready for it to rejoin, err 46 Tue Aug 19 11:03:50.088 2014: Killing connection from because the group is not ready for it to rejoin, err 46 Tue Aug 19 11:03:52.218 2014: Killing connection from because the group is not ready for it to rejoin, err 46 Tue Aug 19 11:03:58.030 2014: Killing connection from because the group is not ready for it to rejoin, err 46 Tue Aug 19 11:04:01.092 2014: Killing connection from because the group is not ready for it to rejoin, err 46 Tue Aug 19 11:04:03.220 2014: Killing connection from because the group is not ready for it to rejoin, err 46 Tue Aug 19 11:04:09.034 2014: Killing connection from because the group is not ready for it to rejoin, err 46 Tue Aug 19 11:04:12.096 2014: Killing connection from because the group is not ready for it to rejoin, err 46 Tue Aug 19 11:04:14.224 2014: Killing connection from because the group is not ready for it to rejoin, err 46 Tue Aug 19 11:04:20.037 2014: Killing connection from because the group is not ready for it to rejoin, err 46 Tue Aug 19 11:04:23.103 2014: Accepted and connected to ebi5-220 ... GSS02a ( NSD SERVER) Tue Aug 19 11:03:04.980 2014: Expel (gss02b) request from (ebi5-220 in ebi-cluster.ebi.ac.uk). 
Expelling: (ebi5-220 in ebi-cluster.ebi.ac.uk) Tue Aug 19 11:03:12.069 2014: Accepted and connected to ebi5-220 =============================================== EXAMPLE 2: EBI5-038 Tue Aug 19 11:32:34.227 2014: Disk lease period expired in cluster GSS.ebi.ac.uk. Attempting to reacquire lease. Tue Aug 19 11:33:34.258 2014: Lease is overdue. Probing cluster GSS.ebi.ac.uk Tue Aug 19 11:35:24.265 2014: Close connection to gss02a (Connection reset by peer). Attempting reconnect. Tue Aug 19 11:35:24.865 2014: Close connection to ebi5-014 (Connection reset by peer). Attempting reconnect. ... LOT MORE RESETS BY PEER ... Tue Aug 19 11:35:25.096 2014: Close connection to ebi5-167 (Connection reset by peer). Attempting reconnect. Tue Aug 19 11:35:25.267 2014: Connecting to gss02a Tue Aug 19 11:35:25.268 2014: Close connection to gss02a (Connection failed because destination is still processing previous node failure) Tue Aug 19 11:35:26.267 2014: Retry connection to gss02a Tue Aug 19 11:35:26.268 2014: Close connection to gss02a (Connection failed because destination is still processing previous node failure) Tue Aug 19 11:36:24.276 2014: Unable to contact any quorum nodes during cluster probe. Tue Aug 19 11:36:24.277 2014: Lost membership in cluster GSS.ebi.ac.uk. Unmounting file systems. GSS02a Tue Aug 19 11:35:24.263 2014: Node (ebi5-038 in ebi-cluster.ebi.ac.uk) is being expelled because of an expired lease. Pings sent: 60. Replies received: 60. In example 1 seems that an NSD was not repliyng to the client, but the servers seems working fine.. how can i trace better ( to solve) the problem? In example 2 it seems to me that for some reason the manager are not renewing the lease in time. when this happens , its not a single client. Loads of them fail to get the lease renewed. Why this is happening? how can i trace to the source of the problem? Thanks in advance for any tips. Regards, Salvatore _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at gpfsug.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at gpfsug.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss ________________________________ Note: This email is for the confidential use of the named addressee(s) only and may contain proprietary, confidential or privileged information. If you are not the intended recipient, you are hereby notified that any review, dissemination or copying of this email is strictly prohibited, and to please notify the sender immediately and destroy this email and any attachments. Email transmission cannot be guaranteed to be secure or error-free. The Company, therefore, does not make any guarantees as to the completeness or accuracy of this email or any attachments. This email is for informational purposes only and does not constitute a recommendation, offer, request or solicitation of any kind to buy, sell, subscribe, redeem or perform any type of transaction of a financial product. _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at gpfsug.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss ________________________________ Note: This email is for the confidential use of the named addressee(s) only and may contain proprietary, confidential or privileged information. 
If you are not the intended recipient, you are hereby notified that any review, dissemination or copying of this email is strictly prohibited, and to please notify the sender immediately and destroy this email and any attachments. Email transmission cannot be guaranteed to be secure or error-free. The Company, therefore, does not make any guarantees as to the completeness or accuracy of this email or any attachments. This email is for informational purposes only and does not constitute a recommendation, offer, request or solicitation of any kind to buy, sell, subscribe, redeem or perform any type of transaction of a financial product. -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: GPFS_Token_Protocol.png Type: image/png Size: 249179 bytes Desc: GPFS_Token_Protocol.png URL: From viccornell at gmail.com Thu Aug 21 14:03:14 2014 From: viccornell at gmail.com (Vic Cornell) Date: Thu, 21 Aug 2014 14:03:14 +0100 Subject: [gpfsug-discuss] gpfs client expels In-Reply-To: <53F5B62F.1060305@ebi.ac.uk> References: <53EE0BB1.8000005@ebi.ac.uk> <53F454E3.40803@ebi.ac.uk> <53F5ABD7.80107@gpfsug.org> <53F5B62F.1060305@ebi.ac.uk> Message-ID: <9B247872-CD75-4F86-A10E-33AAB6BD414A@gmail.com> Hi Salvatore, Are you using ethernet or infiniband as the GPFS interconnect to your clients? If 10/40GbE - do you have a separate admin network? I have seen behaviour similar to this where the storage traffic causes congestion and the "admin" traffic gets lost or delayed causing expels. Vic On 21 Aug 2014, at 10:04, Salvatore Di Nardo wrote: > Thanks for the feedback, but we managed to find a scenario that excludes network problems. > > we have a file called input_file of nearly 100GB: > > if from client A we do: > > cat input_file >> output_file > > it start copying.. and we see waiter goeg a bit up,secs but then they flushes back to 0, so we xcan say that the copy proceed well... > > > if now we do the same from another client ( or just another shell on the same client) client B : > > cat input_file >> output_file > > > ( in other words we are trying to write to the same destination) all the waiters gets up until one node get expelled. > > > Now, while its understandable that the destination file is locked for one of the "cat", so have to wait ( and since the file is BIG , have to wait for a while), its not understandable why it stop the renewal lease. > Why its doen't return just a timeout error on the copy instead to expel the node? We can reproduce this every time, and since our users to operations like this on files over 100GB each you can imagine the result. > > > > As you can imagine even if its a bit silly to write at the same time to the same destination, its also quite common if we want to dump to a log file logs and for some reason one of the writers, write for a lot of time keeping the file locked. > Our expels are not due to network congestion, but because a write attempts have to wait another one. What i really dont understand is why to take a so expreme mesure to expell jest because a process is waiteing "to too much time". > > > I have ticket opened to IBM for this and the issue is under investigation, but no luck so far.. > > Regards, > Salvatore > > > > On 21/08/14 09:20, Jez Tucker (Chair) wrote: >> Hi there, >> >> I've seen the on several 'stock'? 'core'? GPFS system (we need a better term now GSS is out) and seen ping 'working', but alongside ejections from the cluster. 
>> The GPFS internode 'ping' is somewhat more circumspect than unix ping - and rightly so. >> >> In my experience this has _always_ been a network issue of one sort of another. If the network is experiencing issues, nodes will be ejected. >> Of course it could be unresponsive mmfsd or high loadavg, but I've seen that only twice in 10 years over many versions of GPFS. >> >> You need to follow the logs through from each machine in time order to determine who could not see who and in what order. >> Your best way forward is to log a SEV2 case with IBM support, directly or via your OEM and collect and supply a snap and traces as required by support. >> >> Without knowing your full setup, it's hard to help further. >> >> Jez >> >> On 20/08/14 08:57, Salvatore Di Nardo wrote: >>> Still problems. Here some more detailed examples: >>> >>> EXAMPLE 1: >>> EBI5-220 ( CLIENT) >>> Tue Aug 19 11:03:04.980 2014: Timed out waiting for a reply from node gss02b >>> Tue Aug 19 11:03:04.981 2014: Request sent to (gss02a in GSS.ebi.ac.uk) to expel (gss02b in GSS.ebi.ac.uk) from cluster GSS.ebi.ac.uk >>> Tue Aug 19 11:03:04.982 2014: This node will be expelled from cluster GSS.ebi.ac.uk due to expel msg from (ebi5-220) >>> Tue Aug 19 11:03:09.319 2014: Cluster Manager connection broke. Probing cluster GSS.ebi.ac.uk >>> Tue Aug 19 11:03:10.321 2014: Unable to contact any quorum nodes during cluster probe. >>> Tue Aug 19 11:03:10.322 2014: Lost membership in cluster GSS.ebi.ac.uk. Unmounting file systems. >>> Tue Aug 19 11:03:10 BST 2014: mmcommon preunmount invoked. File system: gpfs1 Reason: SGPanic >>> Tue Aug 19 11:03:12.066 2014: Connecting to gss02a >>> Tue Aug 19 11:03:12.070 2014: Connected to gss02a >>> Tue Aug 19 11:03:17.071 2014: Connecting to gss02b >>> Tue Aug 19 11:03:17.072 2014: Connecting to gss03b >>> Tue Aug 19 11:03:17.079 2014: Connecting to gss03a >>> Tue Aug 19 11:03:17.080 2014: Connecting to gss01b >>> Tue Aug 19 11:03:17.079 2014: Connecting to gss01a >>> Tue Aug 19 11:04:23.105 2014: Connected to gss02b >>> Tue Aug 19 11:04:23.107 2014: Connected to gss03b >>> Tue Aug 19 11:04:23.112 2014: Connected to gss03a >>> Tue Aug 19 11:04:23.115 2014: Connected to gss01b >>> Tue Aug 19 11:04:23.121 2014: Connected to gss01a >>> Tue Aug 19 11:12:28.992 2014: Node (gss02a in GSS.ebi.ac.uk) is now the Group Leader. >>> >>> GSS02B ( NSD SERVER) >>> ... 
>>> Tue Aug 19 11:03:17.070 2014: Killing connection from because the group is not ready for it to rejoin, err 46 >>> Tue Aug 19 11:03:25.016 2014: Killing connection from because the group is not ready for it to rejoin, err 46 >>> Tue Aug 19 11:03:28.080 2014: Killing connection from because the group is not ready for it to rejoin, err 46 >>> Tue Aug 19 11:03:36.019 2014: Killing connection from because the group is not ready for it to rejoin, err 46 >>> Tue Aug 19 11:03:39.083 2014: Killing connection from because the group is not ready for it to rejoin, err 46 >>> Tue Aug 19 11:03:47.023 2014: Killing connection from because the group is not ready for it to rejoin, err 46 >>> Tue Aug 19 11:03:50.088 2014: Killing connection from because the group is not ready for it to rejoin, err 46 >>> Tue Aug 19 11:03:52.218 2014: Killing connection from because the group is not ready for it to rejoin, err 46 >>> Tue Aug 19 11:03:58.030 2014: Killing connection from because the group is not ready for it to rejoin, err 46 >>> Tue Aug 19 11:04:01.092 2014: Killing connection from because the group is not ready for it to rejoin, err 46 >>> Tue Aug 19 11:04:03.220 2014: Killing connection from because the group is not ready for it to rejoin, err 46 >>> Tue Aug 19 11:04:09.034 2014: Killing connection from because the group is not ready for it to rejoin, err 46 >>> Tue Aug 19 11:04:12.096 2014: Killing connection from because the group is not ready for it to rejoin, err 46 >>> Tue Aug 19 11:04:14.224 2014: Killing connection from because the group is not ready for it to rejoin, err 46 >>> Tue Aug 19 11:04:20.037 2014: Killing connection from because the group is not ready for it to rejoin, err 46 >>> Tue Aug 19 11:04:23.103 2014: Accepted and connected to ebi5-220 >>> ... >>> >>> GSS02a ( NSD SERVER) >>> Tue Aug 19 11:03:04.980 2014: Expel (gss02b) request from (ebi5-220 in ebi-cluster.ebi.ac.uk). Expelling: (ebi5-220 in ebi-cluster.ebi.ac.uk) >>> Tue Aug 19 11:03:12.069 2014: Accepted and connected to ebi5-220 >>> >>> >>> =============================================== >>> EXAMPLE 2: >>> >>> EBI5-038 >>> Tue Aug 19 11:32:34.227 2014: Disk lease period expired in cluster GSS.ebi.ac.uk. Attempting to reacquire lease. >>> Tue Aug 19 11:33:34.258 2014: Lease is overdue. Probing cluster GSS.ebi.ac.uk >>> Tue Aug 19 11:35:24.265 2014: Close connection to gss02a (Connection reset by peer). Attempting reconnect. >>> Tue Aug 19 11:35:24.865 2014: Close connection to ebi5-014 (Connection reset by peer). Attempting reconnect. >>> ... >>> LOT MORE RESETS BY PEER >>> ... >>> Tue Aug 19 11:35:25.096 2014: Close connection to ebi5-167 (Connection reset by peer). Attempting reconnect. >>> Tue Aug 19 11:35:25.267 2014: Connecting to gss02a >>> Tue Aug 19 11:35:25.268 2014: Close connection to gss02a (Connection failed because destination is still processing previous node failure) >>> Tue Aug 19 11:35:26.267 2014: Retry connection to gss02a >>> Tue Aug 19 11:35:26.268 2014: Close connection to gss02a (Connection failed because destination is still processing previous node failure) >>> Tue Aug 19 11:36:24.276 2014: Unable to contact any quorum nodes during cluster probe. >>> Tue Aug 19 11:36:24.277 2014: Lost membership in cluster GSS.ebi.ac.uk. Unmounting file systems. >>> >>> GSS02a >>> Tue Aug 19 11:35:24.263 2014: Node (ebi5-038 in ebi-cluster.ebi.ac.uk) is being expelled because of an expired lease. Pings sent: 60. Replies received: 60. 
>>> >>> >>> >>> In example 1 seems that an NSD was not repliyng to the client, but the servers seems working fine.. how can i trace better ( to solve) the problem? >>> >>> In example 2 it seems to me that for some reason the manager are not renewing the lease in time. when this happens , its not a single client. >>> Loads of them fail to get the lease renewed. Why this is happening? how can i trace to the source of the problem? >>> >>> >>> >>> Thanks in advance for any tips. >>> >>> Regards, >>> Salvatore >>> >>> >>> >>> >>> >>> >>> >>> >>> _______________________________________________ >>> gpfsug-discuss mailing list >>> gpfsug-discuss at gpfsug.org >>> http://gpfsug.org/mailman/listinfo/gpfsug-discuss >> >> >> >> _______________________________________________ >> gpfsug-discuss mailing list >> gpfsug-discuss at gpfsug.org >> http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at gpfsug.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss -------------- next part -------------- An HTML attachment was scrubbed... URL: From sdinardo at ebi.ac.uk Thu Aug 21 14:04:59 2014 From: sdinardo at ebi.ac.uk (Salvatore Di Nardo) Date: Thu, 21 Aug 2014 14:04:59 +0100 Subject: [gpfsug-discuss] gpfs client expels In-Reply-To: <21BC488F0AEA2245B2C3E83FC0B33DBB8263D9@CHI-EXCHANGEW2.w2k.jumptrading.com> References: <53EE0BB1.8000005@ebi.ac.uk> <53F454E3.40803@ebi.ac.uk> <53F5ABD7.80107@gpfsug.org>, <53F5B62F.1060305@ebi.ac.uk> <21BC488F0AEA2245B2C3E83FC0B33DBB8263D9@CHI-EXCHANGEW2.w2k.jumptrading.com> Message-ID: <53F5EE7B.2080306@ebi.ac.uk> Thanks for the info... it helps a bit understanding whats going on, but i think you missed the part that Node A and Node B could also be the same machine. If for instance i ran 2 cp on the same machine, hence Client B cannot have problems contacting Client A since they are the same machine..... BTW i did the same also using 2 clients and the result its the same. Nonetheless your description is made me understand a bit better what's going on Regards, Salvatore On 21/08/14 13:48, Bryan Banister wrote: > As I understand GPFS distributed locking semantics, GPFS will not > allow one node to hold a write lock for a file indefinitely. Once > Client B opens the file for writing it would have contacted the File > System Manager to obtain the lock. The FS manager would have told > Client B that Client A has the lock and that Client B would have to > contact Client A and revoke the write lock token. If Client A does > not respond to Client B's request to revoke the write token, then > Client B will ask that Client A be expelled from the cluster for NOT > adhering to the proper protocol for write lock contention. > > > > Have you checked the communication path between the two clients at > this point? > > I could not follow the logs that you provided. You should definitely > look at the exact sequence of log events on the two clients and the > file system manager (as reported by mmlsmgr). > > Hope that helps, > -Bryan > > ------------------------------------------------------------------------ > *From:* gpfsug-discuss-bounces at gpfsug.org > [gpfsug-discuss-bounces at gpfsug.org] on behalf of Salvatore Di Nardo > [sdinardo at ebi.ac.uk] > *Sent:* Thursday, August 21, 2014 4:04 AM > *To:* chair at gpfsug.org; gpfsug main discussion list > *Subject:* Re: [gpfsug-discuss] gpfs client expels > > Thanks for the feedback, but we managed to find a scenario that > excludes network problems. 
> > we have a file called */input_file/* of nearly 100GB: > > if from *client A* we do: > > cat input_file >> output_file > > it start copying.. and we see waiter goeg a bit up,secs but then they > flushes back to 0, so we xcan say that the copy proceed well... > > > if now we do the same from another client ( or just another shell on > the same client) *client B* : > > cat input_file >> output_file > > > ( in other words we are trying to write to the same destination) all > the waiters gets up until one node get expelled. > > > Now, while its understandable that the destination file is locked for > one of the "cat", so have to wait ( and since the file is BIG , have > to wait for a while), its not understandable why it stop the renewal > lease. > Why its doen't return just a timeout error on the copy instead to > expel the node? We can reproduce this every time, and since our users > to operations like this on files over 100GB each you can imagine the > result. > > > > As you can imagine even if its a bit silly to write at the same time > to the same destination, its also quite common if we want to dump to a > log file logs and for some reason one of the writers, write for a lot > of time keeping the file locked. > Our expels are not due to network congestion, but because a write > attempts have to wait another one. What i really dont understand is > why to take a so expreme mesure to expell jest because a process is > waiteing "to too much time". > > > I have ticket opened to IBM for this and the issue is under > investigation, but no luck so far.. > > Regards, > Salvatore > > > > On 21/08/14 09:20, Jez Tucker (Chair) wrote: >> Hi there, >> >> I've seen the on several 'stock'? 'core'? GPFS system (we need a >> better term now GSS is out) and seen ping 'working', but alongside >> ejections from the cluster. >> The GPFS internode 'ping' is somewhat more circumspect than unix ping >> - and rightly so. >> >> In my experience this has _always_ been a network issue of one sort >> of another. If the network is experiencing issues, nodes will be >> ejected. >> Of course it could be unresponsive mmfsd or high loadavg, but I've >> seen that only twice in 10 years over many versions of GPFS. >> >> You need to follow the logs through from each machine in time order >> to determine who could not see who and in what order. >> Your best way forward is to log a SEV2 case with IBM support, >> directly or via your OEM and collect and supply a snap and traces as >> required by support. >> >> Without knowing your full setup, it's hard to help further. >> >> Jez >> >> On 20/08/14 08:57, Salvatore Di Nardo wrote: >>> Still problems. Here some more detailed examples: >>> >>> *EXAMPLE 1:* >>> >>> *EBI5-220**( CLIENT)** >>> *Tue Aug 19 11:03:04.980 2014: *Timed out waiting for a >>> reply from node gss02b* >>> Tue Aug 19 11:03:04.981 2014: Request sent to >> IP> (gss02a in GSS.ebi.ac.uk) to expel >>> (gss02b in GSS.ebi.ac.uk) from cluster GSS.ebi.ac.uk >>> Tue Aug 19 11:03:04.982 2014: This node will be expelled >>> from cluster GSS.ebi.ac.uk due to expel msg from >>> (ebi5-220) >>> Tue Aug 19 11:03:09.319 2014: Cluster Manager connection >>> broke. Probing cluster GSS.ebi.ac.uk >>> Tue Aug 19 11:03:10.321 2014: Unable to contact any >>> quorum nodes during cluster probe. >>> Tue Aug 19 11:03:10.322 2014: Lost membership in cluster >>> GSS.ebi.ac.uk. Unmounting file systems. >>> Tue Aug 19 11:03:10 BST 2014: mmcommon preunmount >>> invoked. 
File system: gpfs1 Reason: SGPanic >>> Tue Aug 19 11:03:12.066 2014: Connecting to >>> gss02a >>> Tue Aug 19 11:03:12.070 2014: Connected to >>> gss02a >>> Tue Aug 19 11:03:17.071 2014: Connecting to >>> gss02b >>> Tue Aug 19 11:03:17.072 2014: Connecting to >>> gss03b >>> Tue Aug 19 11:03:17.079 2014: Connecting to >>> gss03a >>> Tue Aug 19 11:03:17.080 2014: Connecting to >>> gss01b >>> Tue Aug 19 11:03:17.079 2014: Connecting to >>> gss01a >>> Tue Aug 19 11:04:23.105 2014: Connected to >>> gss02b >>> Tue Aug 19 11:04:23.107 2014: Connected to >>> gss03b >>> Tue Aug 19 11:04:23.112 2014: Connected to >>> gss03a >>> Tue Aug 19 11:04:23.115 2014: Connected to >>> gss01b >>> Tue Aug 19 11:04:23.121 2014: Connected to >>> gss01a >>> Tue Aug 19 11:12:28.992 2014: Node (gss02a >>> in GSS.ebi.ac.uk) is now the Group Leader. >>> >>> *GSS02B ( NSD SERVER)* >>> ... >>> Tue Aug 19 11:03:17.070 2014: Killing connection from >>> ** because the group is not ready for it to >>> rejoin, err 46 >>> Tue Aug 19 11:03:25.016 2014: Killing connection from >>> because the group is not ready for it to >>> rejoin, err 46 >>> Tue Aug 19 11:03:28.080 2014: Killing connection from >>> ** because the group is not ready for it to >>> rejoin, err 46 >>> Tue Aug 19 11:03:36.019 2014: Killing connection from >>> because the group is not ready for it to >>> rejoin, err 46 >>> Tue Aug 19 11:03:39.083 2014: Killing connection from >>> ** because the group is not ready for it to >>> rejoin, err 46 >>> Tue Aug 19 11:03:47.023 2014: Killing connection from >>> because the group is not ready for it to >>> rejoin, err 46 >>> Tue Aug 19 11:03:50.088 2014: Killing connection from >>> ** because the group is not ready for it to >>> rejoin, err 46 >>> Tue Aug 19 11:03:52.218 2014: Killing connection from >>> because the group is not ready for it to >>> rejoin, err 46 >>> Tue Aug 19 11:03:58.030 2014: Killing connection from >>> because the group is not ready for it to >>> rejoin, err 46 >>> Tue Aug 19 11:04:01.092 2014: Killing connection from >>> ** because the group is not ready for it to >>> rejoin, err 46 >>> Tue Aug 19 11:04:03.220 2014: Killing connection from >>> because the group is not ready for it to >>> rejoin, err 46 >>> Tue Aug 19 11:04:09.034 2014: Killing connection from >>> because the group is not ready for it to >>> rejoin, err 46 >>> Tue Aug 19 11:04:12.096 2014: Killing connection from >>> ** because the group is not ready for it to >>> rejoin, err 46 >>> Tue Aug 19 11:04:14.224 2014: Killing connection from >>> because the group is not ready for it to >>> rejoin, err 46 >>> Tue Aug 19 11:04:20.037 2014: Killing connection from >>> because the group is not ready for it to >>> rejoin, err 46 >>> Tue Aug 19 11:04:23.103 2014: Accepted and connected to >>> ** ebi5-220 >>> ... >>> >>> *GSS02a ( NSD SERVER)* >>> Tue Aug 19 11:03:04.980 2014: Expel (gss02b) >>> request from (ebi5-220 in >>> ebi-cluster.ebi.ac.uk). Expelling: >>> (ebi5-220 in ebi-cluster.ebi.ac.uk) >>> Tue Aug 19 11:03:12.069 2014: Accepted and connected to >>> ebi5-220 >>> >>> >>> =============================================== >>> *EXAMPLE 2*: >>> >>> *EBI5-038* >>> Tue Aug 19 11:32:34.227 2014: *Disk lease period expired >>> in cluster GSS.ebi.ac.uk. Attempting to reacquire lease.* >>> Tue Aug 19 11:33:34.258 2014: *Lease is overdue. Probing >>> cluster GSS.ebi.ac.uk* >>> Tue Aug 19 11:35:24.265 2014: Close connection to >>> gss02a (Connection reset by peer). >>> Attempting reconnect. 
>>> Tue Aug 19 11:35:24.865 2014: Close connection to >>> ebi5-014 (Connection reset by >>> peer). Attempting reconnect. >>> ... >>> LOT MORE RESETS BY PEER >>> ... >>> Tue Aug 19 11:35:25.096 2014: Close connection to >>> ebi5-167 (Connection reset by >>> peer). Attempting reconnect. >>> Tue Aug 19 11:35:25.267 2014: Connecting to >>> gss02a >>> Tue Aug 19 11:35:25.268 2014: Close connection to >>> gss02a (Connection failed because >>> destination is still processing previous node failure) >>> Tue Aug 19 11:35:26.267 2014: Retry connection to >>> gss02a >>> Tue Aug 19 11:35:26.268 2014: Close connection to >>> gss02a (Connection failed because >>> destination is still processing previous node failure) >>> Tue Aug 19 11:36:24.276 2014: Unable to contact any >>> quorum nodes during cluster probe. >>> Tue Aug 19 11:36:24.277 2014: *Lost membership in >>> cluster GSS.ebi.ac.uk. Unmounting file systems.* >>> >>> *GSS02a* >>> Tue Aug 19 11:35:24.263 2014: Node >>> (ebi5-038 in ebi-cluster.ebi.ac.uk) *is being expelled >>> because of an expired lease.* Pings sent: 60. Replies >>> received: 60. >>> >>> >>> >>> >>> In example 1 seems that an NSD was not repliyng to the client, but >>> the servers seems working fine.. how can i trace better ( to solve) >>> the problem? >>> >>> In example 2 it seems to me that for some reason the manager are not >>> renewing the lease in time. when this happens , its not a single >>> client. >>> Loads of them fail to get the lease renewed. Why this is happening? >>> how can i trace to the source of the problem? >>> >>> >>> >>> Thanks in advance for any tips. >>> >>> Regards, >>> Salvatore >>> >>> >>> >>> >>> >>> >>> >>> >>> _______________________________________________ >>> gpfsug-discuss mailing list >>> gpfsug-discuss at gpfsug.org >>> http://gpfsug.org/mailman/listinfo/gpfsug-discuss >> >> >> >> _______________________________________________ >> gpfsug-discuss mailing list >> gpfsug-discuss at gpfsug.org >> http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > > ------------------------------------------------------------------------ > > Note: This email is for the confidential use of the named addressee(s) > only and may contain proprietary, confidential or privileged > information. If you are not the intended recipient, you are hereby > notified that any review, dissemination or copying of this email is > strictly prohibited, and to please notify the sender immediately and > destroy this email and any attachments. Email transmission cannot be > guaranteed to be secure or error-free. The Company, therefore, does > not make any guarantees as to the completeness or accuracy of this > email or any attachments. This email is for informational purposes > only and does not constitute a recommendation, offer, request or > solicitation of any kind to buy, sell, subscribe, redeem or perform > any type of transaction of a financial product. > > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at gpfsug.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... 
Name: not available Type: image/png Size: 249179 bytes Desc: not available URL: From sdinardo at ebi.ac.uk Thu Aug 21 14:18:19 2014 From: sdinardo at ebi.ac.uk (Salvatore Di Nardo) Date: Thu, 21 Aug 2014 14:18:19 +0100 Subject: [gpfsug-discuss] gpfs client expels In-Reply-To: <9B247872-CD75-4F86-A10E-33AAB6BD414A@gmail.com> References: <53EE0BB1.8000005@ebi.ac.uk> <53F454E3.40803@ebi.ac.uk> <53F5ABD7.80107@gpfsug.org> <53F5B62F.1060305@ebi.ac.uk> <9B247872-CD75-4F86-A10E-33AAB6BD414A@gmail.com> Message-ID: <53F5F19B.1010603@ebi.ac.uk> This is an interesting point! We use ethernet ( 10g links on the clients) but we dont have a separate network for the admin network. Could you explain this a bit further, because the clients and the servers we have are on different subnet so the packet are routed.. I don't see a practical way to separate them. The clients are blades in a chassis so even if i create 2 interfaces, they will physically use the came "cable" to go to the first switch. even the clients ( 600 clients) have different subsets. I will forward this consideration to our network admin , so see if we can work on a dedicated network. thanks for your tip. Regards, Salvatore On 21/08/14 14:03, Vic Cornell wrote: > Hi Salvatore, > > Are you using ethernet or infiniband as the GPFS interconnect to your > clients? > > If 10/40GbE - do you have a separate admin network? > > I have seen behaviour similar to this where the storage traffic causes > congestion and the "admin" traffic gets lost or delayed causing expels. > > Vic > > > > On 21 Aug 2014, at 10:04, Salvatore Di Nardo > wrote: > >> Thanks for the feedback, but we managed to find a scenario that >> excludes network problems. >> >> we have a file called */input_file/* of nearly 100GB: >> >> if from *client A* we do: >> >> cat input_file >> output_file >> >> it start copying.. and we see waiter goeg a bit up,secs but then they >> flushes back to 0, so we xcan say that the copy proceed well... >> >> >> if now we do the same from another client ( or just another shell on >> the same client) *client B* : >> >> cat input_file >> output_file >> >> >> ( in other words we are trying to write to the same destination) all >> the waiters gets up until one node get expelled. >> >> >> Now, while its understandable that the destination file is locked for >> one of the "cat", so have to wait ( and since the file is BIG , have >> to wait for a while), its not understandable why it stop the renewal >> lease. >> Why its doen't return just a timeout error on the copy instead to >> expel the node? We can reproduce this every time, and since our users >> to operations like this on files over 100GB each you can imagine the >> result. >> >> >> >> As you can imagine even if its a bit silly to write at the same time >> to the same destination, its also quite common if we want to dump to >> a log file logs and for some reason one of the writers, write for a >> lot of time keeping the file locked. >> Our expels are not due to network congestion, but because a write >> attempts have to wait another one. What i really dont understand is >> why to take a so expreme mesure to expell jest because a process is >> waiteing "to too much time". >> >> >> I have ticket opened to IBM for this and the issue is under >> investigation, but no luck so far.. >> >> Regards, >> Salvatore >> >> >> >> On 21/08/14 09:20, Jez Tucker (Chair) wrote: >>> Hi there, >>> >>> I've seen the on several 'stock'? 'core'? 
GPFS system (we need a >>> better term now GSS is out) and seen ping 'working', but alongside >>> ejections from the cluster. >>> The GPFS internode 'ping' is somewhat more circumspect than unix >>> ping - and rightly so. >>> >>> In my experience this has _always_ been a network issue of one sort >>> of another. If the network is experiencing issues, nodes will be >>> ejected. >>> Of course it could be unresponsive mmfsd or high loadavg, but I've >>> seen that only twice in 10 years over many versions of GPFS. >>> >>> You need to follow the logs through from each machine in time order >>> to determine who could not see who and in what order. >>> Your best way forward is to log a SEV2 case with IBM support, >>> directly or via your OEM and collect and supply a snap and traces as >>> required by support. >>> >>> Without knowing your full setup, it's hard to help further. >>> >>> Jez >>> >>> On 20/08/14 08:57, Salvatore Di Nardo wrote: >>>> Still problems. Here some more detailed examples: >>>> >>>> *EXAMPLE 1:* >>>> >>>> *EBI5-220**( CLIENT)** >>>> *Tue Aug 19 11:03:04.980 2014: *Timed out waiting for a >>>> reply from node gss02b* >>>> Tue Aug 19 11:03:04.981 2014: Request sent to >>> IP> (gss02a in GSS.ebi.ac.uk ) to >>>> expel (gss02b in GSS.ebi.ac.uk >>>> ) from cluster GSS.ebi.ac.uk >>>> >>>> Tue Aug 19 11:03:04.982 2014: This node will be >>>> expelled from cluster GSS.ebi.ac.uk >>>> due to expel msg from >>> IP> (ebi5-220) >>>> Tue Aug 19 11:03:09.319 2014: Cluster Manager >>>> connection broke. Probing cluster GSS.ebi.ac.uk >>>> >>>> Tue Aug 19 11:03:10.321 2014: Unable to contact any >>>> quorum nodes during cluster probe. >>>> Tue Aug 19 11:03:10.322 2014: Lost membership in >>>> cluster GSS.ebi.ac.uk . >>>> Unmounting file systems. >>>> Tue Aug 19 11:03:10 BST 2014: mmcommon preunmount >>>> invoked. File system: gpfs1 Reason: SGPanic >>>> Tue Aug 19 11:03:12.066 2014: Connecting to >>>> gss02a >>>> Tue Aug 19 11:03:12.070 2014: Connected to >>>> gss02a >>>> Tue Aug 19 11:03:17.071 2014: Connecting to >>>> gss02b >>>> Tue Aug 19 11:03:17.072 2014: Connecting to >>>> gss03b >>>> Tue Aug 19 11:03:17.079 2014: Connecting to >>>> gss03a >>>> Tue Aug 19 11:03:17.080 2014: Connecting to >>>> gss01b >>>> Tue Aug 19 11:03:17.079 2014: Connecting to >>>> gss01a >>>> Tue Aug 19 11:04:23.105 2014: Connected to >>>> gss02b >>>> Tue Aug 19 11:04:23.107 2014: Connected to >>>> gss03b >>>> Tue Aug 19 11:04:23.112 2014: Connected to >>>> gss03a >>>> Tue Aug 19 11:04:23.115 2014: Connected to >>>> gss01b >>>> Tue Aug 19 11:04:23.121 2014: Connected to >>>> gss01a >>>> Tue Aug 19 11:12:28.992 2014: Node (gss02a >>>> in GSS.ebi.ac.uk ) is now the >>>> Group Leader. >>>> >>>> *GSS02B ( NSD SERVER)* >>>> ... 
>>>> Tue Aug 19 11:03:17.070 2014: Killing connection from >>>> ** because the group is not ready for it >>>> to rejoin, err 46 >>>> Tue Aug 19 11:03:25.016 2014: Killing connection from >>>> because the group is not ready for it to >>>> rejoin, err 46 >>>> Tue Aug 19 11:03:28.080 2014: Killing connection from >>>> ** because the group is not ready for it >>>> to rejoin, err 46 >>>> Tue Aug 19 11:03:36.019 2014: Killing connection from >>>> because the group is not ready for it to >>>> rejoin, err 46 >>>> Tue Aug 19 11:03:39.083 2014: Killing connection from >>>> ** because the group is not ready for it >>>> to rejoin, err 46 >>>> Tue Aug 19 11:03:47.023 2014: Killing connection from >>>> because the group is not ready for it to >>>> rejoin, err 46 >>>> Tue Aug 19 11:03:50.088 2014: Killing connection from >>>> ** because the group is not ready for it >>>> to rejoin, err 46 >>>> Tue Aug 19 11:03:52.218 2014: Killing connection from >>>> because the group is not ready for it to >>>> rejoin, err 46 >>>> Tue Aug 19 11:03:58.030 2014: Killing connection from >>>> because the group is not ready for it to >>>> rejoin, err 46 >>>> Tue Aug 19 11:04:01.092 2014: Killing connection from >>>> ** because the group is not ready for it >>>> to rejoin, err 46 >>>> Tue Aug 19 11:04:03.220 2014: Killing connection from >>>> because the group is not ready for it to >>>> rejoin, err 46 >>>> Tue Aug 19 11:04:09.034 2014: Killing connection from >>>> because the group is not ready for it to >>>> rejoin, err 46 >>>> Tue Aug 19 11:04:12.096 2014: Killing connection from >>>> ** because the group is not ready for it >>>> to rejoin, err 46 >>>> Tue Aug 19 11:04:14.224 2014: Killing connection from >>>> because the group is not ready for it to >>>> rejoin, err 46 >>>> Tue Aug 19 11:04:20.037 2014: Killing connection from >>>> because the group is not ready for it to >>>> rejoin, err 46 >>>> Tue Aug 19 11:04:23.103 2014: Accepted and connected to >>>> ** ebi5-220 >>>> ... >>>> >>>> *GSS02a ( NSD SERVER)* >>>> Tue Aug 19 11:03:04.980 2014: Expel >>>> (gss02b) request from (ebi5-220 in >>>> ebi-cluster.ebi.ac.uk ). >>>> Expelling: (ebi5-220 in >>>> ebi-cluster.ebi.ac.uk ) >>>> Tue Aug 19 11:03:12.069 2014: Accepted and connected to >>>> ebi5-220 >>>> >>>> >>>> =============================================== >>>> *EXAMPLE 2*: >>>> >>>> *EBI5-038* >>>> Tue Aug 19 11:32:34.227 2014: *Disk lease period >>>> expired in cluster GSS.ebi.ac.uk >>>> . Attempting to reacquire lease.* >>>> Tue Aug 19 11:33:34.258 2014: *Lease is overdue. >>>> Probing cluster GSS.ebi.ac.uk * >>>> Tue Aug 19 11:35:24.265 2014: Close connection to >>>> gss02a (Connection reset by peer). >>>> Attempting reconnect. >>>> Tue Aug 19 11:35:24.865 2014: Close connection to >>>> ebi5-014 (Connection reset by >>>> peer). Attempting reconnect. >>>> ... >>>> LOT MORE RESETS BY PEER >>>> ... >>>> Tue Aug 19 11:35:25.096 2014: Close connection to >>>> ebi5-167 (Connection reset by >>>> peer). Attempting reconnect. >>>> Tue Aug 19 11:35:25.267 2014: Connecting to >>>> gss02a >>>> Tue Aug 19 11:35:25.268 2014: Close connection to >>>> gss02a (Connection failed because >>>> destination is still processing previous node failure) >>>> Tue Aug 19 11:35:26.267 2014: Retry connection to >>>> gss02a >>>> Tue Aug 19 11:35:26.268 2014: Close connection to >>>> gss02a (Connection failed because >>>> destination is still processing previous node failure) >>>> Tue Aug 19 11:36:24.276 2014: Unable to contact any >>>> quorum nodes during cluster probe. 
>>>> Tue Aug 19 11:36:24.277 2014: *Lost membership in >>>> cluster GSS.ebi.ac.uk . >>>> Unmounting file systems.* >>>> >>>> *GSS02a* >>>> Tue Aug 19 11:35:24.263 2014: Node >>>> (ebi5-038 in ebi-cluster.ebi.ac.uk >>>> ) *is being expelled >>>> because of an expired lease.* Pings sent: 60. Replies >>>> received: 60. >>>> >>>> >>>> >>>> >>>> In example 1 seems that an NSD was not repliyng to the client, but >>>> the servers seems working fine.. how can i trace better ( to solve) >>>> the problem? >>>> >>>> In example 2 it seems to me that for some reason the manager are >>>> not renewing the lease in time. when this happens , its not a >>>> single client. >>>> Loads of them fail to get the lease renewed. Why this is happening? >>>> how can i trace to the source of the problem? >>>> >>>> >>>> >>>> Thanks in advance for any tips. >>>> >>>> Regards, >>>> Salvatore >>>> >>>> >>>> >>>> >>>> >>>> >>>> >>>> >>>> _______________________________________________ >>>> gpfsug-discuss mailing list >>>> gpfsug-discuss atgpfsug.org >>>> http://gpfsug.org/mailman/listinfo/gpfsug-discuss >>> >>> >>> >>> _______________________________________________ >>> gpfsug-discuss mailing list >>> gpfsug-discuss atgpfsug.org >>> http://gpfsug.org/mailman/listinfo/gpfsug-discuss >> >> _______________________________________________ >> gpfsug-discuss mailing list >> gpfsug-discuss at gpfsug.org >> http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at gpfsug.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss -------------- next part -------------- An HTML attachment was scrubbed... URL: From service at metamodul.com Thu Aug 21 14:19:33 2014 From: service at metamodul.com (service at metamodul.com) Date: Thu, 21 Aug 2014 15:19:33 +0200 (CEST) Subject: [gpfsug-discuss] gpfs client expels In-Reply-To: <53F5B62F.1060305@ebi.ac.uk> References: <53EE0BB1.8000005@ebi.ac.uk> <53F454E3.40803@ebi.ac.uk> <53F5ABD7.80107@gpfsug.org> <53F5B62F.1060305@ebi.ac.uk> Message-ID: <1481989063.92260.1408627173332.open-xchange@oxbaltgw09.schlund.de> > Now, while its understandable that the destination file is locked for one of > the "cat", so have to wait If GPFS is posix compatible i do not understand why a cat should block the other cat completly meanings on a standard FS you can "cat" from many source to the same target. Of course the result is not predictable. >From this point of view i would expect that both "cat" would start writing immediately thus i would expect a GPFS bug. All imho. Hajo Note: You might test which the input_file in a different directory and i would test the behaviour if the output_file is on a local FS like /tmp. -------------- next part -------------- An HTML attachment was scrubbed... URL: From viccornell at gmail.com Thu Aug 21 14:22:22 2014 From: viccornell at gmail.com (Vic Cornell) Date: Thu, 21 Aug 2014 14:22:22 +0100 Subject: [gpfsug-discuss] gpfs client expels In-Reply-To: <53F5F19B.1010603@ebi.ac.uk> References: <53EE0BB1.8000005@ebi.ac.uk> <53F454E3.40803@ebi.ac.uk> <53F5ABD7.80107@gpfsug.org> <53F5B62F.1060305@ebi.ac.uk> <9B247872-CD75-4F86-A10E-33AAB6BD414A@gmail.com> <53F5F19B.1010603@ebi.ac.uk> Message-ID: <0F03996A-2008-4076-9A2B-B4B2BB89E959@gmail.com> For my system I always use a dedicated admin network - as described in the gpfs manuals - for a gpfs cluster on 10/40GbE where the system will be heavily loaded. The difference in the stability of the system is very noticeable. 
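In case it is useful, the knobs involved are roughly these - a sketch only, with invented interface and subnet names, and the exact options vary by GPFS level, so check the Administration guide for your release before changing anything:

    # see which interface each node currently uses for admin vs daemon traffic
    mmlscluster

    # move admin (command/ssh) traffic onto a separate interface per node
    # (node01-adm is a made-up hostname on the admin network)
    mmchnode --admin-interface=node01-adm -N node01

    # and/or have the daemons prefer a dedicated subnet for cluster traffic
    # (10.20.0.0 is a made-up subnet; only takes effect after a daemon restart)
    mmchconfig subnets="10.20.0.0"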
Not sure how/if this would work on GSS - IBM ought to know :-)

Vic

On 21 Aug 2014, at 14:18, Salvatore Di Nardo wrote:

> This is an interesting point!
>
> We use ethernet (10G links on the clients) but we don't have a separate network for the admin traffic.
>
> Could you explain this a bit further, because the clients and the servers we have are on different subnets, so the packets are routed. I don't see a practical way to separate them. The clients are blades in a chassis, so even if I create 2 interfaces they will physically use the same "cable" to go to the first switch. Even the clients (600 of them) are spread over different subnets.
>
> I will forward this consideration to our network admins to see if we can work on a dedicated network.
>
> thanks for your tip.
>
> Regards,
> Salvatore
>
> On 21/08/14 14:03, Vic Cornell wrote:
>> Hi Salvatore,
>>
>> Are you using ethernet or infiniband as the GPFS interconnect to your clients?
>>
>> If 10/40GbE - do you have a separate admin network?
>>
>> I have seen behaviour similar to this where the storage traffic causes congestion and the "admin" traffic gets lost or delayed, causing expels.
>>
>> Vic
>>
>> On 21 Aug 2014, at 10:04, Salvatore Di Nardo wrote:
>>
>>> Thanks for the feedback, but we managed to find a scenario that excludes network problems.
>>>
>>> We have a file called input_file of nearly 100GB.
>>>
>>> If from client A we do:
>>>
>>> cat input_file >> output_file
>>>
>>> it starts copying, and we see the waiters go up a bit for a few seconds but then flush back to 0, so we can say the copy proceeds well.
>>>
>>> If we now do the same from another client (or just another shell on the same client), client B:
>>>
>>> cat input_file >> output_file
>>>
>>> (in other words we are trying to write to the same destination), all the waiters go up until one node gets expelled.
>>>
>>> Now, while it's understandable that the destination file is locked for one of the "cat"s, which therefore has to wait (and since the file is BIG, it has to wait for a while), it's not understandable why it stops renewing the lease.
>>> Why doesn't it just return a timeout error on the copy instead of expelling the node? We can reproduce this every time, and since our users do operations like this on files over 100GB each, you can imagine the result.
>>>
>>> As you can imagine, even if it's a bit silly to write to the same destination at the same time, it's also quite common if we want to dump logs to a log file and for some reason one of the writers keeps writing for a long time, keeping the file locked.
>>> Our expels are not due to network congestion, but because one write attempt has to wait for another one. What I really don't understand is why such an extreme measure as an expel is taken just because a process is waiting "too much time".
>>>
>>> I have a ticket open with IBM for this and the issue is under investigation, but no luck so far.
>>>
>>> Regards,
>>> Salvatore
>>>
>>> On 21/08/14 09:20, Jez Tucker (Chair) wrote:
>>>> Hi there,
>>>>
>>>> I've seen this on several 'stock'? 'core'? GPFS systems (we need a better term now GSS is out) and seen ping 'working', but alongside ejections from the cluster.
>>>> The GPFS internode 'ping' is somewhat more circumspect than unix ping - and rightly so.
>>>>
>>>> In my experience this has _always_ been a network issue of one sort or another. If the network is experiencing issues, nodes will be ejected.
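(To recap the reproduction described above in runnable form - the file names are the ones used in the report; the watch loop is only a suggested way to observe the effect, not part of the original test:

    # client A
    cat input_file >> output_file

    # client B, or a second shell on the same client, while A is still running
    cat input_file >> output_file

    # on an NSD server, watch the waiters climb while both appends run
    while true; do mmfsadm dump waiters | head -n 20; sleep 5; done
)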
>>>> Of course it could be unresponsive mmfsd or high loadavg, but I've seen that only twice in 10 years over many versions of GPFS. >>>> >>>> You need to follow the logs through from each machine in time order to determine who could not see who and in what order. >>>> Your best way forward is to log a SEV2 case with IBM support, directly or via your OEM and collect and supply a snap and traces as required by support. >>>> >>>> Without knowing your full setup, it's hard to help further. >>>> >>>> Jez >>>> >>>> On 20/08/14 08:57, Salvatore Di Nardo wrote: >>>>> Still problems. Here some more detailed examples: >>>>> >>>>> EXAMPLE 1: >>>>> EBI5-220 ( CLIENT) >>>>> Tue Aug 19 11:03:04.980 2014: Timed out waiting for a reply from node gss02b >>>>> Tue Aug 19 11:03:04.981 2014: Request sent to (gss02a in GSS.ebi.ac.uk) to expel (gss02b in GSS.ebi.ac.uk) from cluster GSS.ebi.ac.uk >>>>> Tue Aug 19 11:03:04.982 2014: This node will be expelled from cluster GSS.ebi.ac.uk due to expel msg from (ebi5-220) >>>>> Tue Aug 19 11:03:09.319 2014: Cluster Manager connection broke. Probing cluster GSS.ebi.ac.uk >>>>> Tue Aug 19 11:03:10.321 2014: Unable to contact any quorum nodes during cluster probe. >>>>> Tue Aug 19 11:03:10.322 2014: Lost membership in cluster GSS.ebi.ac.uk. Unmounting file systems. >>>>> Tue Aug 19 11:03:10 BST 2014: mmcommon preunmount invoked. File system: gpfs1 Reason: SGPanic >>>>> Tue Aug 19 11:03:12.066 2014: Connecting to gss02a >>>>> Tue Aug 19 11:03:12.070 2014: Connected to gss02a >>>>> Tue Aug 19 11:03:17.071 2014: Connecting to gss02b >>>>> Tue Aug 19 11:03:17.072 2014: Connecting to gss03b >>>>> Tue Aug 19 11:03:17.079 2014: Connecting to gss03a >>>>> Tue Aug 19 11:03:17.080 2014: Connecting to gss01b >>>>> Tue Aug 19 11:03:17.079 2014: Connecting to gss01a >>>>> Tue Aug 19 11:04:23.105 2014: Connected to gss02b >>>>> Tue Aug 19 11:04:23.107 2014: Connected to gss03b >>>>> Tue Aug 19 11:04:23.112 2014: Connected to gss03a >>>>> Tue Aug 19 11:04:23.115 2014: Connected to gss01b >>>>> Tue Aug 19 11:04:23.121 2014: Connected to gss01a >>>>> Tue Aug 19 11:12:28.992 2014: Node (gss02a in GSS.ebi.ac.uk) is now the Group Leader. >>>>> >>>>> GSS02B ( NSD SERVER) >>>>> ... 
>>>>> Tue Aug 19 11:03:17.070 2014: Killing connection from because the group is not ready for it to rejoin, err 46 >>>>> Tue Aug 19 11:03:25.016 2014: Killing connection from because the group is not ready for it to rejoin, err 46 >>>>> Tue Aug 19 11:03:28.080 2014: Killing connection from because the group is not ready for it to rejoin, err 46 >>>>> Tue Aug 19 11:03:36.019 2014: Killing connection from because the group is not ready for it to rejoin, err 46 >>>>> Tue Aug 19 11:03:39.083 2014: Killing connection from because the group is not ready for it to rejoin, err 46 >>>>> Tue Aug 19 11:03:47.023 2014: Killing connection from because the group is not ready for it to rejoin, err 46 >>>>> Tue Aug 19 11:03:50.088 2014: Killing connection from because the group is not ready for it to rejoin, err 46 >>>>> Tue Aug 19 11:03:52.218 2014: Killing connection from because the group is not ready for it to rejoin, err 46 >>>>> Tue Aug 19 11:03:58.030 2014: Killing connection from because the group is not ready for it to rejoin, err 46 >>>>> Tue Aug 19 11:04:01.092 2014: Killing connection from because the group is not ready for it to rejoin, err 46 >>>>> Tue Aug 19 11:04:03.220 2014: Killing connection from because the group is not ready for it to rejoin, err 46 >>>>> Tue Aug 19 11:04:09.034 2014: Killing connection from because the group is not ready for it to rejoin, err 46 >>>>> Tue Aug 19 11:04:12.096 2014: Killing connection from because the group is not ready for it to rejoin, err 46 >>>>> Tue Aug 19 11:04:14.224 2014: Killing connection from because the group is not ready for it to rejoin, err 46 >>>>> Tue Aug 19 11:04:20.037 2014: Killing connection from because the group is not ready for it to rejoin, err 46 >>>>> Tue Aug 19 11:04:23.103 2014: Accepted and connected to ebi5-220 >>>>> ... >>>>> >>>>> GSS02a ( NSD SERVER) >>>>> Tue Aug 19 11:03:04.980 2014: Expel (gss02b) request from (ebi5-220 in ebi-cluster.ebi.ac.uk). Expelling: (ebi5-220 in ebi-cluster.ebi.ac.uk) >>>>> Tue Aug 19 11:03:12.069 2014: Accepted and connected to ebi5-220 >>>>> >>>>> >>>>> =============================================== >>>>> EXAMPLE 2: >>>>> >>>>> EBI5-038 >>>>> Tue Aug 19 11:32:34.227 2014: Disk lease period expired in cluster GSS.ebi.ac.uk. Attempting to reacquire lease. >>>>> Tue Aug 19 11:33:34.258 2014: Lease is overdue. Probing cluster GSS.ebi.ac.uk >>>>> Tue Aug 19 11:35:24.265 2014: Close connection to gss02a (Connection reset by peer). Attempting reconnect. >>>>> Tue Aug 19 11:35:24.865 2014: Close connection to ebi5-014 (Connection reset by peer). Attempting reconnect. >>>>> ... >>>>> LOT MORE RESETS BY PEER >>>>> ... >>>>> Tue Aug 19 11:35:25.096 2014: Close connection to ebi5-167 (Connection reset by peer). Attempting reconnect. >>>>> Tue Aug 19 11:35:25.267 2014: Connecting to gss02a >>>>> Tue Aug 19 11:35:25.268 2014: Close connection to gss02a (Connection failed because destination is still processing previous node failure) >>>>> Tue Aug 19 11:35:26.267 2014: Retry connection to gss02a >>>>> Tue Aug 19 11:35:26.268 2014: Close connection to gss02a (Connection failed because destination is still processing previous node failure) >>>>> Tue Aug 19 11:36:24.276 2014: Unable to contact any quorum nodes during cluster probe. >>>>> Tue Aug 19 11:36:24.277 2014: Lost membership in cluster GSS.ebi.ac.uk. Unmounting file systems. >>>>> >>>>> GSS02a >>>>> Tue Aug 19 11:35:24.263 2014: Node (ebi5-038 in ebi-cluster.ebi.ac.uk) is being expelled because of an expired lease. Pings sent: 60. 
Replies received: 60. >>>>> >>>>> >>>>> >>>>> In example 1 seems that an NSD was not repliyng to the client, but the servers seems working fine.. how can i trace better ( to solve) the problem? >>>>> >>>>> In example 2 it seems to me that for some reason the manager are not renewing the lease in time. when this happens , its not a single client. >>>>> Loads of them fail to get the lease renewed. Why this is happening? how can i trace to the source of the problem? >>>>> >>>>> >>>>> >>>>> Thanks in advance for any tips. >>>>> >>>>> Regards, >>>>> Salvatore >>>>> >>>>> >>>>> >>>>> >>>>> >>>>> >>>>> >>>>> >>>>> _______________________________________________ >>>>> gpfsug-discuss mailing list >>>>> gpfsug-discuss at gpfsug.org >>>>> http://gpfsug.org/mailman/listinfo/gpfsug-discuss >>>> >>>> >>>> >>>> _______________________________________________ >>>> gpfsug-discuss mailing list >>>> gpfsug-discuss at gpfsug.org >>>> http://gpfsug.org/mailman/listinfo/gpfsug-discuss >>> >>> _______________________________________________ >>> gpfsug-discuss mailing list >>> gpfsug-discuss at gpfsug.org >>> http://gpfsug.org/mailman/listinfo/gpfsug-discuss >> >> >> >> _______________________________________________ >> gpfsug-discuss mailing list >> gpfsug-discuss at gpfsug.org >> http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at gpfsug.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss -------------- next part -------------- An HTML attachment was scrubbed... URL: From sdinardo at ebi.ac.uk Fri Aug 22 10:37:42 2014 From: sdinardo at ebi.ac.uk (Salvatore Di Nardo) Date: Fri, 22 Aug 2014 10:37:42 +0100 Subject: [gpfsug-discuss] gpfs client expels, fs hangind and waiters In-Reply-To: <53EE0BB1.8000005@ebi.ac.uk> References: <53EE0BB1.8000005@ebi.ac.uk> Message-ID: <53F70F66.2010405@ebi.ac.uk> Hello everyone, Just to let you know, we found the cause of our problems. We discovered that not all of the recommend kernel setting was configured on the clients ( on server was everything ok, but the clients had some setting missing ), and IBM support pointed to this document that describes perfectly our issues and the fix wich suggest to raise some parameters even higher than the standard "best practice" : http://www-947.ibm.com/support/entry/portal/docdisplay?lndocid=migr-5091222 Thanks to everyone for the replies. Regards, Salvatore From ewahl at osc.edu Mon Aug 25 19:55:08 2014 From: ewahl at osc.edu (Ed Wahl) Date: Mon, 25 Aug 2014 18:55:08 +0000 Subject: [gpfsug-discuss] CNFS using NFS over RDMA? Message-ID: Anyone out there doing CNFS with NFS over RDMA? Is this even possible? We currently have been delivering some CNFS services using TCP over IB, but that layer tends to have a large number of bugs all the time. Like to take a look at moving back down to verbs... Ed Wahl OSC -------------- next part -------------- An HTML attachment was scrubbed... URL: From zander at ebi.ac.uk Fri Aug 1 14:44:49 2014 From: zander at ebi.ac.uk (Zander Mears) Date: Fri, 01 Aug 2014 14:44:49 +0100 Subject: [gpfsug-discuss] Hello! In-Reply-To: <53D981EF.3020000@gpfsug.org> References: <53D8C897.9000902@ebi.ac.uk> <53D981EF.3020000@gpfsug.org> Message-ID: <53DB99D1.8050304@ebi.ac.uk> Hi Jez We're just monitoring the standard OS stuff, some interface errors, throughput, number of network and gpfs connections due to previous issues. We don't really know as yet what is good to monitor GPFS wise. 
cheers Zander On 31/07/2014 00:38, Jez Tucker (Chair) wrote: > Hi Zander, > > We have a git repository. Would you be interested in adding any > Zabbix custom metrics gathering to GPFS to it? > > https://github.com/gpfsug/gpfsug-tools > > Best, > > Jez From sfadden at us.ibm.com Tue Aug 5 18:55:20 2014 From: sfadden at us.ibm.com (Scott Fadden) Date: Tue, 5 Aug 2014 10:55:20 -0700 Subject: [gpfsug-discuss] GPFS and Lustre on same node Message-ID: Is anyone running GPFS and Lustre on the same nodes. I have seen it work, I have heard people are doing it, I am looking for some confirmation. Thanks Scott Fadden GPFS Technical Marketing Phone: (503) 880-5833 sfadden at us.ibm.com http://www.ibm.com/systems/gpfs -------------- next part -------------- An HTML attachment was scrubbed... URL: From u.sibiller at science-computing.de Wed Aug 6 08:46:31 2014 From: u.sibiller at science-computing.de (Ulrich Sibiller) Date: Wed, 06 Aug 2014 09:46:31 +0200 Subject: [gpfsug-discuss] GPFS and Lustre on same node In-Reply-To: References: Message-ID: <53E1DD57.90103@science-computing.de> Am 05.08.2014 19:55, schrieb Scott Fadden: > Is anyone running GPFS and Lustre on the same nodes. I have seen it work, I have heard people are > doing it, I am looking for some confirmation. I have some nodes running lustre 2.1.6 or 2.5.58 and gpfs 3.5.0.17 on RHEL5.8 and RHEL6.5. None of them are servers. Kind regards, Ulrich Sibiller -- ______________________________________creating IT solutions Dipl.-Inf. Ulrich Sibiller science + computing ag System Administration Hagellocher Weg 73 mail nfz at science-computing.de 72070 Tuebingen, Germany hotline +49 7071 9457 674 http://www.science-computing.de -- Vorstandsvorsitzender/Chairman of the board of management: Gerd-Lothar Leonhart Vorstand/Board of Management: Dr. Bernd Finkbeiner, Michael Heinrichs, Dr. Arno Steitz Vorsitzender des Aufsichtsrats/ Chairman of the Supervisory Board: Philippe Miltin Sitz/Registered Office: Tuebingen Registergericht/Registration Court: Stuttgart Registernummer/Commercial Register No.: HRB 382196 From frederik.ferner at diamond.ac.uk Wed Aug 6 10:19:35 2014 From: frederik.ferner at diamond.ac.uk (Frederik Ferner) Date: Wed, 6 Aug 2014 10:19:35 +0100 Subject: [gpfsug-discuss] GPFS and Lustre on same node In-Reply-To: References: Message-ID: <53E1F327.1000605@diamond.ac.uk> On 05/08/14 18:55, Scott Fadden wrote: > Is anyone running GPFS and Lustre on the same nodes. I have seen it > work, I have heard people are doing it, I am looking for some confirmation. Most of our compute cluster nodes are clients for Lustre and GPFS at the same time. Lustre 1.8.9-wc1 and GPFS 3.5.0.11. Nothing shared on servers (GPFS NSD server or Lustre OSS/MDS servers). HTH, Frederik -- Frederik Ferner Senior Computer Systems Administrator phone: +44 1235 77 8624 Diamond Light Source Ltd. mob: +44 7917 08 5110 (Apologies in advance for the lines below. Some bits are a legal requirement and I have no control over them.) -- This e-mail and any attachments may contain confidential, copyright and or privileged material, and are for the use of the intended addressee only. If you are not the intended addressee or an authorised recipient of the addressee please notify us of receipt by returning the e-mail and do not use, copy, retain, distribute or disclose the information in or attached to the e-mail. Any opinions expressed within this e-mail are those of the individual and not necessarily of Diamond Light Source Ltd. Diamond Light Source Ltd. 
cannot guarantee that this e-mail or any attachments are free from viruses and we cannot accept liability for any damage which you may sustain as a result of software viruses which may be transmitted in or with the message. Diamond Light Source Limited (company no. 4375679). Registered in England and Wales with its registered office at Diamond House, Harwell Science and Innovation Campus, Didcot, Oxfordshire, OX11 0DE, United Kingdom From sdinardo at ebi.ac.uk Wed Aug 6 10:57:44 2014 From: sdinardo at ebi.ac.uk (Salvatore Di Nardo) Date: Wed, 06 Aug 2014 10:57:44 +0100 Subject: [gpfsug-discuss] GPFS and Lustre on same node In-Reply-To: <53E1F327.1000605@diamond.ac.uk> References: <53E1F327.1000605@diamond.ac.uk> Message-ID: <53E1FC18.6080707@ebi.ac.uk> Sorry for this little ot, but recetly i'm looking to Lustre to understand how it is comparable to GPFS in terms of performance, reliability and easy to use. Could anyone share their experience ? My company just recently got a first GPFS system , based on IBM GSS, but while its good performance wise, there are few unresolved problems and the IBM support is almost unexistent, so I'm starting to wonder if its work to look somewhere else eventual future purchases. Salvatore On 06/08/14 10:19, Frederik Ferner wrote: > On 05/08/14 18:55, Scott Fadden wrote: >> Is anyone running GPFS and Lustre on the same nodes. I have seen it >> work, I have heard people are doing it, I am looking for some >> confirmation. > > Most of our compute cluster nodes are clients for Lustre and GPFS at > the same time. Lustre 1.8.9-wc1 and GPFS 3.5.0.11. Nothing shared on > servers (GPFS NSD server or Lustre OSS/MDS servers). > > HTH, > Frederik > From chair at gpfsug.org Wed Aug 6 11:19:24 2014 From: chair at gpfsug.org (Jez Tucker (Chair)) Date: Wed, 06 Aug 2014 11:19:24 +0100 Subject: [gpfsug-discuss] GPFS and Lustre on same node In-Reply-To: <53E1FC18.6080707@ebi.ac.uk> References: <53E1F327.1000605@diamond.ac.uk> <53E1FC18.6080707@ebi.ac.uk> Message-ID: <53E2012C.9040402@gpfsug.org> "IBM support is almost unexistent" I don't find that at all. Do you log directly via ESC or via your OEM/integrator or are you only referring to GSS support rather than pure GPFS? If you are having response issues, your IBM rep (or a few folks on here) can accelerate issues for you. Jez On 06/08/14 10:57, Salvatore Di Nardo wrote: > Sorry for this little ot, but recetly i'm looking to Lustre to > understand how it is comparable to GPFS in terms of performance, > reliability and easy to use. > Could anyone share their experience ? > > My company just recently got a first GPFS system , based on IBM GSS, > but while its good performance wise, there are few unresolved problems > and the IBM support is almost unexistent, so I'm starting to wonder if > its work to look somewhere else eventual future purchases. > > > Salvatore > > On 06/08/14 10:19, Frederik Ferner wrote: >> On 05/08/14 18:55, Scott Fadden wrote: >>> Is anyone running GPFS and Lustre on the same nodes. I have seen it >>> work, I have heard people are doing it, I am looking for some >>> confirmation. >> >> Most of our compute cluster nodes are clients for Lustre and GPFS at >> the same time. Lustre 1.8.9-wc1 and GPFS 3.5.0.11. Nothing shared on >> servers (GPFS NSD server or Lustre OSS/MDS servers). 
>> >> HTH, >> Frederik >> > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at gpfsug.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss From service at metamodul.com Wed Aug 6 14:26:47 2014 From: service at metamodul.com (service at metamodul.com) Date: Wed, 6 Aug 2014 15:26:47 +0200 (CEST) Subject: [gpfsug-discuss] Hi , i am new to this list Message-ID: <1366482624.222989.1407331607965.open-xchange@oxbaltgw55.schlund.de> Hi @ALL i am Hajo Ehlers , an AIX and GPFS specialist ( Unix System Engineer ). You find me at the IBM GPFS Forum and sometimes at news:c.u.a and I am addicted to cluster filesystems My latest idee is an SAP-HANA light system ( DBMS on an in-memory cluster posix FS ) which could be extended to a "reinvented" Cluster based AS/400 ^_^ I wrote also a small script to do a sequential backup of GPFS filesystems since i got never used to mmbackup - i named it "pdsmc" for parallel dsmc". Cheers Hajo BTW: Please let me know - service (at) metamodul (dot) com - In case somebody is looking for a GPFS specialist. -------------- next part -------------- An HTML attachment was scrubbed... URL: From sdinardo at ebi.ac.uk Fri Aug 8 10:53:36 2014 From: sdinardo at ebi.ac.uk (Salvatore Di Nardo) Date: Fri, 08 Aug 2014 10:53:36 +0100 Subject: [gpfsug-discuss] GPFS and Lustre on same node In-Reply-To: <53E2012C.9040402@gpfsug.org> References: <53E1F327.1000605@diamond.ac.uk> <53E1FC18.6080707@ebi.ac.uk> <53E2012C.9040402@gpfsug.org> Message-ID: <53E49E20.1090905@ebi.ac.uk> Well, i didn't wanted to start a rant against IBM, and I'm referring specifically to GSS. Since GSS its an appliance, we have to refer to GSS support for both hardware and software issues. Hardware support in total crap. It took 1 mounth of chasing and shouting to get a drawer replacement that was causing some issues. Meanwhile 10 disks in that drawer got faulty. Finally we got the drawer replace but the disks are still faulty. Now its 3 days i'm triing to get them fixed or replaced ( its not clear if they disks are broken of they was just marked to be replaced because of the drawer). Right now i dont have any answer about how to put them online ( mmchcarrier don't work because it recognize that the disk where not replaced) There are also few other cases ( gpfs related) open that are still not answered. I have no experience with direct GPFS support, but if i open a case to GSS for a GPFS problem, the cases seems never get an answer. The only reason that GSS is working its because _*I*_**installed it spending few months studying gpfs. So now I'm wondering if its worth at all rely in future on the whole appliance concept. I'm wondering if in future its better just purchase the hardware and install GPFS by our own, or in alternatively even try Lustre. Now, skipping all this GSS rant, which have nothing to do with the file system anyway and going back to my question: Could someone point the main differences between GPFS and Lustre? I found some documentation about Lustre and i'm going to have a look, but oddly enough have not found any practical comparison between them. On 06/08/14 11:19, Jez Tucker (Chair) wrote: > "IBM support is almost unexistent" > > I don't find that at all. > Do you log directly via ESC or via your OEM/integrator or are you only > referring to GSS support rather than pure GPFS? > > If you are having response issues, your IBM rep (or a few folks on > here) can accelerate issues for you. 
> > Jez > > > On 06/08/14 10:57, Salvatore Di Nardo wrote: >> Sorry for this little ot, but recetly i'm looking to Lustre to >> understand how it is comparable to GPFS in terms of performance, >> reliability and easy to use. >> Could anyone share their experience ? >> >> My company just recently got a first GPFS system , based on IBM GSS, >> but while its good performance wise, there are few unresolved >> problems and the IBM support is almost unexistent, so I'm starting to >> wonder if its work to look somewhere else eventual future purchases. >> >> >> Salvatore >> >> On 06/08/14 10:19, Frederik Ferner wrote: >>> On 05/08/14 18:55, Scott Fadden wrote: >>>> Is anyone running GPFS and Lustre on the same nodes. I have seen it >>>> work, I have heard people are doing it, I am looking for some >>>> confirmation. >>> >>> Most of our compute cluster nodes are clients for Lustre and GPFS at >>> the same time. Lustre 1.8.9-wc1 and GPFS 3.5.0.11. Nothing shared on >>> servers (GPFS NSD server or Lustre OSS/MDS servers). >>> >>> HTH, >>> Frederik >>> >> >> _______________________________________________ >> gpfsug-discuss mailing list >> gpfsug-discuss at gpfsug.org >> http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at gpfsug.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss -------------- next part -------------- An HTML attachment was scrubbed... URL: From jpro at bas.ac.uk Fri Aug 8 12:40:00 2014 From: jpro at bas.ac.uk (Jeremy Robst) Date: Fri, 8 Aug 2014 12:40:00 +0100 (BST) Subject: [gpfsug-discuss] GPFS and Lustre on same node In-Reply-To: <53E49E20.1090905@ebi.ac.uk> References: <53E1F327.1000605@diamond.ac.uk> <53E1FC18.6080707@ebi.ac.uk> <53E2012C.9040402@gpfsug.org> <53E49E20.1090905@ebi.ac.uk> Message-ID: On Fri, 8 Aug 2014, Salvatore Di Nardo wrote: > Now, skipping all this GSS rant, which have nothing to do with the file > system anyway? and? going back to my question: > > Could someone point the main differences between GPFS and Lustre? I'm looking at making the same decision here - to buy GPFS or to roll our own Lustre configuration. I'm in the process of setting up test systems, and so far the main difference seems to be in the that in GPFS each server sees the full filesystem, and so you can run other applications (e.g backup) on a GPFS server whereas the Luste OSS (object storage servers) see only a portion of the storage (the filesystem is striped across the OSSes), so you need a Lustre client to mount the full filesystem for things like backup. However I have very little practical experience of either and would also be interested in any comments. Thanks Jeremy -- jpro at bas.ac.uk | (work) 01223 221402 (fax) 01223 362616 Unix System Administrator - British Antarctic Survey #include From keith at ocf.co.uk Fri Aug 8 14:12:39 2014 From: keith at ocf.co.uk (Keith Vickers) Date: Fri, 8 Aug 2014 14:12:39 +0100 Subject: [gpfsug-discuss] GPFS and Lustre on same node Message-ID: http://www.pdsw.org/pdsw10/resources/posters/parallelNASFSs.pdf Has a good direct apples to apples comparison between Lustre and GPFS. It's pretty much abstractable from the hardware used. 
Keith Vickers Business Development Manager OCF plc Mobile: 07974 397863 From sergi.more at bsc.es Fri Aug 8 14:14:33 2014 From: sergi.more at bsc.es (=?ISO-8859-1?Q?Sergi_Mor=E9_Codina?=) Date: Fri, 08 Aug 2014 15:14:33 +0200 Subject: [gpfsug-discuss] GPFS and Lustre on same node In-Reply-To: References: <53E1F327.1000605@diamond.ac.uk> <53E1FC18.6080707@ebi.ac.uk> <53E2012C.9040402@gpfsug.org> <53E49E20.1090905@ebi.ac.uk> Message-ID: <53E4CD39.7080808@bsc.es> Hi all, About main differences between GPFS and Lustre, here you have some bits from our experience: -Reliability: GPFS its been proved to be more stable and reliable. Also offers more flexibility in terms of fail-over. It have no restriction in number of servers. As far as I know, an NSD can have as many secondary servers as you want (we are using 8). -Metadata: In Lustre each file system is restricted to two servers. No restriction in GPFS. -Updates: In GPFS you can update the whole storage cluster without stopping production, one server at a time. -Server/Client role: As Jeremy said, in GPFS every server act as a client as well. Useful for administrative tasks. -Troubleshooting: Problems with GPFS are easier to track down. Logs are more clear, and offers better tools than Lustre. -Support: No problems at all with GPFS support. It is true that it could take time to go up within all support levels, but we always got a good solution. Quite different in terms of hardware. IBM support quality has drop a lot since about last year an a half. Really slow and tedious process to get replacements. Moreover, we keep receiving bad "certified reutilitzed parts" hardware, which slow the whole process even more. These are the main differences I would stand out after some years of experience with both file systems, but do not take it as a fact. PD: Salvatore, I would suggest you to contact Jordi Valls. He joined EBI a couple of months ago, and has experience working with both file systems here at BSC. Best Regards, Sergi. On 08/08/2014 01:40 PM, Jeremy Robst wrote: > On Fri, 8 Aug 2014, Salvatore Di Nardo wrote: > >> Now, skipping all this GSS rant, which have nothing to do with the file >> system anyway and going back to my question: >> >> Could someone point the main differences between GPFS and Lustre? > > I'm looking at making the same decision here - to buy GPFS or to roll > our own Lustre configuration. I'm in the process of setting up test > systems, and so far the main difference seems to be in the that in GPFS > each server sees the full filesystem, and so you can run other > applications (e.g backup) on a GPFS server whereas the Luste OSS (object > storage servers) see only a portion of the storage (the filesystem is > striped across the OSSes), so you need a Lustre client to mount the full > filesystem for things like backup. > > However I have very little practical experience of either and would also > be interested in any comments. 
> > Thanks > > Jeremy > > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at gpfsug.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > -- ------------------------------------------------------------------------ Sergi More Codina Barcelona Supercomputing Center Centro Nacional de Supercomputacion WWW: http://www.bsc.es Tel: +34-93-405 42 27 e-mail: sergi.more at bsc.es Fax: +34-93-413 77 21 ------------------------------------------------------------------------ WARNING / LEGAL TEXT: This message is intended only for the use of the individual or entity to which it is addressed and may contain information which is privileged, confidential, proprietary, or exempt from disclosure under applicable law. If you are not the intended recipient or the person responsible for delivering the message to the intended recipient, you are strictly prohibited from disclosing, distributing, copying, or in any way using this message. If you have received this communication in error, please notify the sender and destroy and delete any copies you may have received. http://www.bsc.es/disclaimer.htm -------------- next part -------------- A non-text attachment was scrubbed... Name: smime.p7s Type: application/pkcs7-signature Size: 3242 bytes Desc: S/MIME Cryptographic Signature URL: From viccornell at gmail.com Fri Aug 8 18:15:30 2014 From: viccornell at gmail.com (Vic Cornell) Date: Fri, 8 Aug 2014 18:15:30 +0100 Subject: [gpfsug-discuss] GPFS and Lustre on same node In-Reply-To: <53E4CD39.7080808@bsc.es> References: <53E1F327.1000605@diamond.ac.uk> <53E1FC18.6080707@ebi.ac.uk> <53E2012C.9040402@gpfsug.org> <53E49E20.1090905@ebi.ac.uk> <53E4CD39.7080808@bsc.es> Message-ID: <4001D2D9-5E74-4EF9-908F-5B0E3443EA5B@gmail.com> Disclaimers - I work for DDN - we sell lustre and GPFS. I know GPFS much better than I know Lustre. The biggest difference we find between GPFS and Lustre is that GPFS - can usually achieve 90% of the bandwidth available to a single client with a single thread. Lustre needs multiple parallel streams to saturate - say an Infiniband connection. Lustre is often faster than GPFS and often has superior metadata performance - particularly where lots of files are created in a single directory. GPFS can support Windows - Lustre cannot. I think GPFS is better integrated and easier to deploy than Lustre - some people disagree with me. Regards, Vic On 8 Aug 2014, at 14:14, Sergi Mor? Codina wrote: > Hi all, > > About main differences between GPFS and Lustre, here you have some bits from our experience: > > -Reliability: GPFS its been proved to be more stable and reliable. Also offers more flexibility in terms of fail-over. It have no restriction in number of servers. As far as I know, an NSD can have as many secondary servers as you want (we are using 8). > > -Metadata: In Lustre each file system is restricted to two servers. No restriction in GPFS. > > -Updates: In GPFS you can update the whole storage cluster without stopping production, one server at a time. > > -Server/Client role: As Jeremy said, in GPFS every server act as a client as well. Useful for administrative tasks. > > -Troubleshooting: Problems with GPFS are easier to track down. Logs are more clear, and offers better tools than Lustre. > > -Support: No problems at all with GPFS support. It is true that it could take time to go up within all support levels, but we always got a good solution. Quite different in terms of hardware. 
IBM support quality has drop a lot since about last year an a half. Really slow and tedious process to get replacements. Moreover, we keep receiving bad "certified reutilitzed parts" hardware, which slow the whole process even more. > > > These are the main differences I would stand out after some years of experience with both file systems, but do not take it as a fact. > > PD: Salvatore, I would suggest you to contact Jordi Valls. He joined EBI a couple of months ago, and has experience working with both file systems here at BSC. > > Best Regards, > Sergi. > > > On 08/08/2014 01:40 PM, Jeremy Robst wrote: >> On Fri, 8 Aug 2014, Salvatore Di Nardo wrote: >> >>> Now, skipping all this GSS rant, which have nothing to do with the file >>> system anyway and going back to my question: >>> >>> Could someone point the main differences between GPFS and Lustre? >> >> I'm looking at making the same decision here - to buy GPFS or to roll >> our own Lustre configuration. I'm in the process of setting up test >> systems, and so far the main difference seems to be in the that in GPFS >> each server sees the full filesystem, and so you can run other >> applications (e.g backup) on a GPFS server whereas the Luste OSS (object >> storage servers) see only a portion of the storage (the filesystem is >> striped across the OSSes), so you need a Lustre client to mount the full >> filesystem for things like backup. >> >> However I have very little practical experience of either and would also >> be interested in any comments. >> >> Thanks >> >> Jeremy >> >> >> _______________________________________________ >> gpfsug-discuss mailing list >> gpfsug-discuss at gpfsug.org >> http://gpfsug.org/mailman/listinfo/gpfsug-discuss >> > > > -- > > ------------------------------------------------------------------------ > > Sergi More Codina > Barcelona Supercomputing Center > Centro Nacional de Supercomputacion > WWW: http://www.bsc.es Tel: +34-93-405 42 27 > e-mail: sergi.more at bsc.es Fax: +34-93-413 77 21 > > ------------------------------------------------------------------------ > > WARNING / LEGAL TEXT: This message is intended only for the use of the > individual or entity to which it is addressed and may contain > information which is privileged, confidential, proprietary, or exempt > from disclosure under applicable law. If you are not the intended > recipient or the person responsible for delivering the message to the > intended recipient, you are strictly prohibited from disclosing, > distributing, copying, or in any way using this message. If you have > received this communication in error, please notify the sender and > destroy and delete any copies you may have received. > > http://www.bsc.es/disclaimer.htm > > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at gpfsug.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss From oehmes at us.ibm.com Fri Aug 8 20:09:44 2014 From: oehmes at us.ibm.com (Sven Oehme) Date: Fri, 8 Aug 2014 12:09:44 -0700 Subject: [gpfsug-discuss] GPFS and Lustre on same node In-Reply-To: <4001D2D9-5E74-4EF9-908F-5B0E3443EA5B@gmail.com> References: <53E1F327.1000605@diamond.ac.uk> <53E1FC18.6080707@ebi.ac.uk> <53E2012C.9040402@gpfsug.org> <53E49E20.1090905@ebi.ac.uk> <53E4CD39.7080808@bsc.es> <4001D2D9-5E74-4EF9-908F-5B0E3443EA5B@gmail.com> Message-ID: Vic, Sergi, you can not compare Lustre and GPFS without providing a clear usecase as otherwise you compare apple with oranges. 
the reason for this is quite simple, Lustre plays well in pretty much one usecase - HPC, GPFS on the other hand is used in many forms of deployments from Storage for Virtual Machines, HPC, Scale-Out NAS, Solutions in digital media, to hosting some of the biggest, most business critical Transactional database installations in the world. you look at 2 products with completely different usability spectrum, functions and features unless as said above you narrow it down to a very specific usecase with a lot of details. even just HPC has a very large spectrum and not everybody is working in a single directory, which is the main scale point for Lustre compared to GPFS and the reason is obvious, if you have only 1 active metadata server (which is what 99% of all lustre systems run) some operations like single directory contention is simpler to make fast, but only up to the limit of your one node, but what happens when you need to go beyond that and only a real distributed architecture can support your workload ? for example look at most chip design workloads, which is a form of HPC, it is something thats extremely metadata and small file dominated, you talk about 100's of millions (in some cases even billions) of files, majority of them <4k, the rest larger files , majority of it with random access patterns that benefit from massive client side caching and distributed data coherency models supported by GPFS token manager infrastructure across 10's or 100's of metadata server and 1000's of compute nodes. you also need to look at the rich feature set GPFS provides, which not all may be important for some environments but are for others like Snapshot, Clones, Hierarchical Storage Management (ILM) , Local Cache acceleration (LROC), Global Namespace Wan Integration (AFM), Encryption, etc just to name a few. Sven ------------------------------------------ Sven Oehme Scalable Storage Research email: oehmes at us.ibm.com Phone: +1 (408) 824-8904 IBM Almaden Research Lab ------------------------------------------ From: Vic Cornell To: gpfsug main discussion list Date: 08/08/2014 10:16 AM Subject: Re: [gpfsug-discuss] GPFS and Lustre on same node Sent by: gpfsug-discuss-bounces at gpfsug.org Disclaimers - I work for DDN - we sell lustre and GPFS. I know GPFS much better than I know Lustre. The biggest difference we find between GPFS and Lustre is that GPFS - can usually achieve 90% of the bandwidth available to a single client with a single thread. Lustre needs multiple parallel streams to saturate - say an Infiniband connection. Lustre is often faster than GPFS and often has superior metadata performance - particularly where lots of files are created in a single directory. GPFS can support Windows - Lustre cannot. I think GPFS is better integrated and easier to deploy than Lustre - some people disagree with me. Regards, Vic On 8 Aug 2014, at 14:14, Sergi Mor? Codina wrote: > Hi all, > > About main differences between GPFS and Lustre, here you have some bits from our experience: > > -Reliability: GPFS its been proved to be more stable and reliable. Also offers more flexibility in terms of fail-over. It have no restriction in number of servers. As far as I know, an NSD can have as many secondary servers as you want (we are using 8). > > -Metadata: In Lustre each file system is restricted to two servers. No restriction in GPFS. > > -Updates: In GPFS you can update the whole storage cluster without stopping production, one server at a time. 
> > -Server/Client role: As Jeremy said, in GPFS every server act as a client as well. Useful for administrative tasks. > > -Troubleshooting: Problems with GPFS are easier to track down. Logs are more clear, and offers better tools than Lustre. > > -Support: No problems at all with GPFS support. It is true that it could take time to go up within all support levels, but we always got a good solution. Quite different in terms of hardware. IBM support quality has drop a lot since about last year an a half. Really slow and tedious process to get replacements. Moreover, we keep receiving bad "certified reutilitzed parts" hardware, which slow the whole process even more. > > > These are the main differences I would stand out after some years of experience with both file systems, but do not take it as a fact. > > PD: Salvatore, I would suggest you to contact Jordi Valls. He joined EBI a couple of months ago, and has experience working with both file systems here at BSC. > > Best Regards, > Sergi. > > > On 08/08/2014 01:40 PM, Jeremy Robst wrote: >> On Fri, 8 Aug 2014, Salvatore Di Nardo wrote: >> >>> Now, skipping all this GSS rant, which have nothing to do with the file >>> system anyway and going back to my question: >>> >>> Could someone point the main differences between GPFS and Lustre? >> >> I'm looking at making the same decision here - to buy GPFS or to roll >> our own Lustre configuration. I'm in the process of setting up test >> systems, and so far the main difference seems to be in the that in GPFS >> each server sees the full filesystem, and so you can run other >> applications (e.g backup) on a GPFS server whereas the Luste OSS (object >> storage servers) see only a portion of the storage (the filesystem is >> striped across the OSSes), so you need a Lustre client to mount the full >> filesystem for things like backup. >> >> However I have very little practical experience of either and would also >> be interested in any comments. >> >> Thanks >> >> Jeremy >> >> >> _______________________________________________ >> gpfsug-discuss mailing list >> gpfsug-discuss at gpfsug.org >> http://gpfsug.org/mailman/listinfo/gpfsug-discuss >> > > > -- > > ------------------------------------------------------------------------ > > Sergi More Codina > Barcelona Supercomputing Center > Centro Nacional de Supercomputacion > WWW: http://www.bsc.es Tel: +34-93-405 42 27 > e-mail: sergi.more at bsc.es Fax: +34-93-413 77 21 > > ------------------------------------------------------------------------ > > WARNING / LEGAL TEXT: This message is intended only for the use of the > individual or entity to which it is addressed and may contain > information which is privileged, confidential, proprietary, or exempt > from disclosure under applicable law. If you are not the intended > recipient or the person responsible for delivering the message to the > intended recipient, you are strictly prohibited from disclosing, > distributing, copying, or in any way using this message. If you have > received this communication in error, please notify the sender and > destroy and delete any copies you may have received. 
> > http://www.bsc.es/disclaimer.htm > > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at gpfsug.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at gpfsug.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss -------------- next part -------------- An HTML attachment was scrubbed... URL: From kraemerf at de.ibm.com Sat Aug 9 15:03:02 2014 From: kraemerf at de.ibm.com (Frank Kraemer) Date: Sat, 9 Aug 2014 16:03:02 +0200 Subject: [gpfsug-discuss] GPFS and Lustre In-Reply-To: References: Message-ID: Vic, Sergi, from my point of view for real High-End workloads the complete I/O stack needs to be fine tuned and well understood in order to provide a good system to the users. - Application(s) + I/O Lib(s) + MPI + Parallel Filesystem (e.g. GPFS) + Hardware (Networks, Servers, Disks, etc.) One of the best solutions to bring your application very efficently to work with a Parallel FS is Sionlib from FZ Juelich: Sionlib is a scalable I/O library for the parallel access to task-local files. The library not only supports writing and reading binary data to or from from several thousands of processors into a single or a small number of physical files but also provides for global open and close functions to access SIONlib file in parallel. SIONlib provides different interfaces: parallel access using MPI, OpenMp, or their combination and sequential access for post-processing utilities. http://www.fz-juelich.de/ias/jsc/EN/Expertise/Support/Software/SIONlib/_node.html http://apps.fz-juelich.de/jsc/sionlib/html/sionlib_tutorial_2013.pdf -frank- P.S. Nice blog from Nils https://www.ibm.com/developerworks/community/blogs/storageneers/entry/scale_out_backup_with_tsm_and_gss_performance_test_results?lang=en Frank Kraemer IBM Consulting IT Specialist / Client Technical Architect Hechtsheimer Str. 2, 55131 Mainz mailto:kraemerf at de.ibm.com voice: +49171-3043699 IBM Germany From ewahl at osc.edu Mon Aug 11 14:55:48 2014 From: ewahl at osc.edu (Ed Wahl) Date: Mon, 11 Aug 2014 13:55:48 +0000 Subject: [gpfsug-discuss] GPFS and Lustre In-Reply-To: References: , Message-ID: In a similar vein, IBM has an application transparent "File Cache Library" as well. I believe it IS licensed and the only requirement is that it is for use on IBM hardware only. Saw some presentations that mention it in some BioSci talks @SC13 and the numbers for a couple of selected small read applications were awesome. I probably have the contact info for it around here somewhere. In addition to the pdf/user manual. Ed Wahl Ohio Supercomputer Center ________________________________________ From: gpfsug-discuss-bounces at gpfsug.org [gpfsug-discuss-bounces at gpfsug.org] on behalf of Frank Kraemer [kraemerf at de.ibm.com] Sent: Saturday, August 09, 2014 10:03 AM To: gpfsug-discuss at gpfsug.org Subject: Re: [gpfsug-discuss] GPFS and Lustre Vic, Sergi, from my point of view for real High-End workloads the complete I/O stack needs to be fine tuned and well understood in order to provide a good system to the users. - Application(s) + I/O Lib(s) + MPI + Parallel Filesystem (e.g. GPFS) + Hardware (Networks, Servers, Disks, etc.) One of the best solutions to bring your application very efficently to work with a Parallel FS is Sionlib from FZ Juelich: Sionlib is a scalable I/O library for the parallel access to task-local files. 
The library not only supports writing and reading binary data to or from several thousands of processors into a single or a small number of physical files, but also provides global open and close functions to access a SIONlib file in parallel. SIONlib provides different interfaces: parallel access using MPI, OpenMP, or their combination, and sequential access for post-processing utilities.

http://www.fz-juelich.de/ias/jsc/EN/Expertise/Support/Software/SIONlib/_node.html
http://apps.fz-juelich.de/jsc/sionlib/html/sionlib_tutorial_2013.pdf

-frank-

P.S. Nice blog from Nils
https://www.ibm.com/developerworks/community/blogs/storageneers/entry/scale_out_backup_with_tsm_and_gss_performance_test_results?lang=en

Frank Kraemer
IBM Consulting IT Specialist / Client Technical Architect
Hechtsheimer Str. 2, 55131 Mainz
mailto:kraemerf at de.ibm.com
voice: +49171-3043699
IBM Germany

_______________________________________________
gpfsug-discuss mailing list
gpfsug-discuss at gpfsug.org
http://gpfsug.org/mailman/listinfo/gpfsug-discuss

From sabujp at gmail.com Tue Aug 12 23:16:22 2014
From: sabujp at gmail.com (Sabuj Pattanayek)
Date: Tue, 12 Aug 2014 17:16:22 -0500
Subject: [gpfsug-discuss] reduce cnfs failover time to a few seconds
Message-ID: 

Hi all,

Is there any way to reduce CNFS failover time to just a few seconds? Currently it seems to be taking 5 - 10 minutes. We're using virtual IPs, i.e. interface bond1.1550:0 has one of the CNFS VIPs, so it should be fast, but it takes a long time and sometimes causes processes to crash due to NFS timeouts (some have 600 second soft mount timeouts). We've also noticed that it sometimes takes even longer unless the CNFS system on which we're calling mmshutdown is completely shut down and isn't returning pings. Even 1 min seems too long.

For comparison, I'm running ctdb + samba on the other NSDs and it's able to fail over in a few seconds after mmshutdown completes.

Thanks,
Sabuj
-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From sdinardo at ebi.ac.uk Fri Aug 15 14:31:29 2014
From: sdinardo at ebi.ac.uk (Salvatore Di Nardo)
Date: Fri, 15 Aug 2014 14:31:29 +0100
Subject: [gpfsug-discuss] gpfs client expels, fs hangind and waiters
Message-ID: <53EE0BB1.8000005@ebi.ac.uk>

Hello people,
I have been trying to solve a problem with our GPFS system for quite a while now, without much luck, so I think it's time to ask for some help.

First, a bit of introduction:
Our GPFS system is made of 3x GSS-26 units; in other words it is made of 6 servers (4x 10G links each) and several SAS-attached disk enclosures. The total amount of space is roughly 2PB, and the disks are SATA (except a few SSDs dedicated to the logtip). My metadata are on dedicated vdisks, but both data and metadata vdisks are in the same declustered arrays and recovery groups, so in the end they share the same spindles. The clients are an LSF farm configured as another cluster (standard multiclustering configuration) of roughly 600 nodes.

The issue:
Recently we became aware that when some massive I/O request comes in we experience a lot of client expels. Here's an example from our logs:

Fri Aug 15 12:40:24.680 2014: Expel 10.7.28.34 (gss03a) request from 172.16.4.138 (ebi3-138 in ebi-cluster.ebi.ac.uk). Expelling: 172.16.4.138 (ebi3-138 in ebi-cluster.ebi.ac.uk)
Fri Aug 15 12:40:41.652 2014: Expel 10.7.28.66 (gss02b) request from 10.7.34.38 (ebi5-037 in ebi-cluster.ebi.ac.uk).
Expelling: 10.7.34.38 (ebi5-037 in ebi-cluster.ebi.ac.uk) Fri Aug 15 12:40:45.754 2014: Expel 10.7.28.3 (gss01b) request from 172.16.4.58 (ebi3-058 in ebi-cluster.ebi.ac.uk). Expelling: 172.16.4.58 (ebi3-058 in ebi-cluster.ebi.ac.uk) Fri Aug 15 12:40:52.305 2014: Expel 10.7.28.66 (gss02b) request from 10.7.34.68 (ebi5-067 in ebi-cluster.ebi.ac.uk). Expelling: 10.7.34.68 (ebi5-067 in ebi-cluster.ebi.ac.uk) Fri Aug 15 12:41:17.069 2014: Expel 10.7.28.35 (gss03b) request from 172.16.4.161 (ebi3-161 in ebi-cluster.ebi.ac.uk). Expelling: 172.16.4.161 (ebi3-161 in ebi-cluster.ebi.ac.uk) Fri Aug 15 12:41:23.555 2014: Expel 10.7.28.67 (gss02a) request from 172.16.4.136 (ebi3-136 in ebi-cluster.ebi.ac.uk). Expelling: 172.16.4.136 (ebi3-136 in ebi-cluster.ebi.ac.uk) Fri Aug 15 12:41:54.258 2014: Expel 10.7.28.34 (gss03a) request from 10.7.34.22 (ebi5-021 in ebi-cluster.ebi.ac.uk). Expelling: 10.7.34.22 (ebi5-021 in ebi-cluster.ebi.ac.uk) Fri Aug 15 12:41:54.540 2014: Expel 10.7.28.66 (gss02b) request from 10.7.34.57 (ebi5-056 in ebi-cluster.ebi.ac.uk). Expelling: 10.7.34.57 (ebi5-056 in ebi-cluster.ebi.ac.uk) Fri Aug 15 12:42:57.288 2014: Expel 10.7.35.5 (ebi5-132 in ebi-cluster.ebi.ac.uk) request from 10.7.28.34 (gss03a). Expelling: 10.7.35.5 (ebi5-132 in ebi-cluster.ebi.ac.uk) Fri Aug 15 12:43:24.327 2014: Expel 10.7.28.34 (gss03a) request from 10.7.37.99 (ebi5-226 in ebi-cluster.ebi.ac.uk). Expelling: 10.7.37.99 (ebi5-226 in ebi-cluster.ebi.ac.uk) Fri Aug 15 12:44:54.202 2014: Expel 10.7.28.67 (gss02a) request from 172.16.4.165 (ebi3-165 in ebi-cluster.ebi.ac.uk). Expelling: 172.16.4.165 (ebi3-165 in ebi-cluster.ebi.ac.uk) Fri Aug 15 13:15:54.450 2014: Expel 10.7.28.34 (gss03a) request from 10.7.37.89 (ebi5-216 in ebi-cluster.ebi.ac.uk). Expelling: 10.7.37.89 (ebi5-216 in ebi-cluster.ebi.ac.uk) Fri Aug 15 13:20:16.524 2014: Expel 10.7.28.3 (gss01b) request from 172.16.4.55 (ebi3-055 in ebi-cluster.ebi.ac.uk). Expelling: 172.16.4.55 (ebi3-055 in ebi-cluster.ebi.ac.uk) Fri Aug 15 13:26:54.177 2014: Expel 10.7.28.34 (gss03a) request from 10.7.34.64 (ebi5-063 in ebi-cluster.ebi.ac.uk). Expelling: 10.7.34.64 (ebi5-063 in ebi-cluster.ebi.ac.uk) Fri Aug 15 13:27:53.900 2014: Expel 10.7.28.3 (gss01b) request from 10.7.35.15 (ebi5-142 in ebi-cluster.ebi.ac.uk). Expelling: 10.7.35.15 (ebi5-142 in ebi-cluster.ebi.ac.uk) Fri Aug 15 13:28:24.297 2014: Expel 10.7.28.67 (gss02a) request from 172.16.4.50 (ebi3-050 in ebi-cluster.ebi.ac.uk). Expelling: 172.16.4.50 (ebi3-050 in ebi-cluster.ebi.ac.uk) Fri Aug 15 13:29:23.913 2014: Expel 10.7.28.3 (gss01b) request from 172.16.4.156 (ebi3-156 in ebi-cluster.ebi.ac.uk). Expelling: 172.16.4.156 (ebi3-156 in ebi-cluster.ebi.ac.uk) at the same time we experience also long waiters queue (1000+ lines). 
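(A note on how waiter lists like the ones below are typically gathered: the "mmfsadm dump waiters" command shown later in this message is the usual source of such output, and "mmdiag --waiters" is the supported equivalent on recent releases. A minimal sketch for one node:

    # snapshot the longest waiters on this node, sorted by wait time (3rd field)
    mmfsadm dump waiters | sort -k3 -rn | head -n 30

    # or, on GPFS 3.4 and later
    mmdiag --waiters

Running the same snapshot on every NSD server, e.g. via mmdsh or any parallel shell, shows whether the long waiters are cluster-wide or confined to one server.)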
An example in case of massive writes ( dd ) : 0x7F522E1EEF90 waiting 1.861233182 seconds, NSDThread: on ThCond 0x7F5158019B08 (0x7F5158019B08) (MsgRecordCondvar), reason 'RPC wait' for getData on node 10.7.37.101 0x7F522E1EC9B0 waiting 1.490567470 seconds, NSDThread: on ThCond 0x7F50F4038BA8 (0x7F50F4038BA8) (MsgRecordCondvar), reason 'RPC wait' for getData on node 10.7.34.45 0x7F522E1EB6C0 waiting 1.077098046 seconds, NSDThread: on ThCond 0x7F50B40011F8 (0x7F50B40011F8) (MsgRecordCondvar), reason 'RPC wait' for getData on node 172.16.4.156 0x7F522E1EA3D0 waiting 7.714968554 seconds, NSDThread: on ThCond 0x7F50BC0078B8 (0x7F50BC0078B8) (MsgRecordCondvar), reason 'RPC wait' for getData on node 10.7.34.107 0x7F522E1E90E0 waiting 4.774379417 seconds, NSDThread: on ThCond 0x7F506801B1F8 (0x7F506801B1F8) (MsgRecordCondvar), reason 'RPC wait' for getData on node 10.7.34.23 0x7F522E1E7DF0 waiting 0.746172444 seconds, NSDThread: on ThCond 0x7F5094007D78 (0x7F5094007D78) (MsgRecordCondvar), reason 'RPC wait' for getData on node 10.7.34.84 0x7F522E1E6B00 waiting 1.553030487 seconds, NSDThread: on ThCond 0x7F51C0004C78 (0x7F51C0004C78) (MsgRecordCondvar), reason 'RPC wait' for getData on node 10.7.34.63 0x7F522E1E5810 waiting 2.165307633 seconds, NSDThread: on ThCond 0x7F5178016A08 (0x7F5178016A08) (MsgRecordCondvar), reason 'RPC wait' for getData on node 10.7.34.29 0x7F522E1E4520 waiting 1.128089273 seconds, NSDThread: on ThCond 0x7F5074004D98 (0x7F5074004D98) (MsgRecordCondvar), reason 'RPC wait' for getData on node 10.7.34.61 0x7F522E1E3230 waiting 2.515214328 seconds, NSDThread: on ThCond 0x7F51F400EF08 (0x7F51F400EF08) (MsgRecordCondvar), reason 'RPC wait' for getData on node 10.7.37.90 0x7F522E1E1F40 waiting*162.966840834* seconds, NSDThread: on ThCond 0x7F51840207A8 (0x7F51840207A8) (MsgRecordCondvar), reason 'RPC wait' for getData on node 10.7.34.97 0x7F522E1E0C50 waiting 1.140787288 seconds, NSDThread: on ThCond 0x7F51AC005C08 (0x7F51AC005C08) (MsgRecordCondvar), reason 'RPC wait' for getData on node 10.7.37.94 0x7F522E1DF960 waiting 41.907415248 seconds, NSDThread: on ThCond 0x7F5160019038 (0x7F5160019038) (MsgRecordCondvar), reason 'RPC wait' for getData on node 172.16.4.143 0x7F522E1DE670 waiting 0.466560418 seconds, NSDThread: on ThCond 0x7F513802B258 (0x7F513802B258) (MsgRecordCondvar), reason 'RPC wait' for getData on node 172.16.4.168 0x7F522E1DD380 waiting 3.102803621 seconds, NSDThread: on ThCond 0x7F516C0106C8 (0x7F516C0106C8) (MsgRecordCondvar), reason 'RPC wait' for getData on node 10.7.37.91 0x7F522E1DC090 waiting 2.751614295 seconds, NSDThread: on ThCond 0x7F504C0011F8 (0x7F504C0011F8) (MsgRecordCondvar), reason 'RPC wait' for getData on node 10.7.35.25 0x7F522E1DADA0 waiting 5.083691891 seconds, NSDThread: on ThCond 0x7F507401BE88 (0x7F507401BE88) (MsgRecordCondvar), reason 'RPC wait' for getData on node 10.7.34.61 0x7F522E1D9AB0 waiting 2.263374184 seconds, NSDThread: on ThCond 0x7F5080003B98 (0x7F5080003B98) (MsgRecordCondvar), reason 'RPC wait' for getData on node 10.7.35.36 0x7F522E1D87C0 waiting 0.206989639 seconds, NSDThread: on ThCond 0x7F505801F0D8 (0x7F505801F0D8) (MsgRecordCondvar), reason 'RPC wait' for getData on node 10.7.34.55 0x7F522E1D74D0 waiting *41.841279897* seconds, NSDThread: on ThCond 0x7F5194008B88 (0x7F5194008B88) (MsgRecordCondvar), reason 'RPC wait' for getData on node 172.16.4.143 0x7F522E1D61E0 waiting 5.618652361 seconds, NSDThread: on ThCond 0x1BAB868 (0x1BAB868) (MsgRecordCondvar), reason 'RPC wait' for getData on node 10.7.35.59 0x7F522E1D4EF0 
waiting 6.185658427 seconds, NSDThread: on ThCond 0x7F513802AAE8 (0x7F513802AAE8) (MsgRecordCondvar), reason 'RPC wait' for getData on node 10.7.35.6 0x7F522E1D3C00 waiting 2.652370892 seconds, NSDThread: on ThCond 0x7F5130004C78 (0x7F5130004C78) (MsgRecordCondvar), reason 'RPC wait' for getData on node 10.7.34.45 0x7F522E1D2910 waiting 11.396142225 seconds, NSDThread: on ThCond 0x7F51A401C0C8 (0x7F51A401C0C8) (MsgRecordCondvar), reason 'RPC wait' for getData on node 172.16.4.169 0x7F522E1D1620 waiting 63.710723043 seconds, NSDThread: on ThCond 0x7F5038004D08 (0x7F5038004D08) (MsgRecordCondvar), reason 'RPC wait' for getData on node 10.7.37.120 or for massive reads: 0x7FBCE69A8C20 waiting 29.262629530 seconds, NSDThread: on ThCond 0x7FBBF0045D40 (0x7FBBF0045D40) (VdiskLogAppendCondvar), reason 'wait for permission to append to log' 0x7FBCE699CEC0 waiting 29.260869141 seconds, NSDThread: on ThCond 0x7FBBF0045D40 (0x7FBBF0045D40) (VdiskLogAppendCondvar), reason 'wait for permission to append to log' 0x7FBCE698C5A0 waiting 29.124824888 seconds, NSDThread: on ThCond 0x7FBBF0045D40 (0x7FBBF0045D40) (VdiskLogAppendCondvar), reason 'wait for permission to append to log' 0x7FBCE6984110 waiting 22.729479654 seconds, NSDThread: on ThCond 0x7FBBF0045D40 (0x7FBBF0045D40) (VdiskLogAppendCondvar), reason 'wait for permission to append to log' 0x7FBCE69512C0 waiting 29.272805926 seconds, NSDThread: on ThCond 0x7FBBF0045D40 (0x7FBBF0045D40) (VdiskLogAppendCondvar), reason 'wait for permission to append to log' 0x7FBCE69409A0 waiting 28.833650198 seconds, NSDThread: on ThCond 0x18033B74D48 (0xFFFFC90033B74D48) (LeaseWaitCondvar), reason 'Waiting to acquire disklease' 0x7FBCE6924320 waiting 29.237067128 seconds, NSDThread: on ThCond 0x7FBBF0045D40 (0x7FBBF0045D40) (VdiskLogAppendCondvar), reason 'wait for permission to append to log' 0x7FBCE6921D40 waiting 29.237953228 seconds, NSDThread: on ThCond 0x7FBBF0045D40 (0x7FBBF0045D40) (VdiskLogAppendCondvar), reason 'wait for permission to append to log' 0x7FBCE6915FE0 waiting 29.046721161 seconds, NSDThread: on ThCond 0x18033B74D48 (0xFFFFC90033B74D48) (LeaseWaitCondvar), reason 'Waiting to acquire disklease' 0x7FBCE6913A00 waiting 29.264534710 seconds, NSDThread: on ThCond 0x7FBBF0045D40 (0x7FBBF0045D40) (VdiskLogAppendCondvar), reason 'wait for permission to append to log' 0x7FBCE6900B00 waiting 29.267691105 seconds, NSDThread: on ThCond 0x7FBBF0045D40 (0x7FBBF0045D40) (VdiskLogAppendCondvar), reason 'wait for permission to append to log' 0x7FBCE68F7380 waiting 29.266402464 seconds, NSDThread: on ThCond 0x7FBBF0045D40 (0x7FBBF0045D40) (VdiskLogAppendCondvar), reason 'wait for permission to append to log' 0x7FBCE68D2870 waiting 29.276298231 seconds, NSDThread: on ThCond 0x7FBBF0045D40 (0x7FBBF0045D40) (VdiskLogAppendCondvar), reason 'wait for permission to append to log' 0x7FBCE68BADB0 waiting 28.665700576 seconds, NSDThread: on ThCond 0x7FBBF0045D40 (0x7FBBF0045D40) (VdiskLogAppendCondvar), reason 'wait for permission to append to log' 0x7FBCE68B61F0 waiting 29.236878611 seconds, NSDThread: on ThCond 0x7FBBF0045D40 (0x7FBBF0045D40) (VdiskLogAppendCondvar), reason 'wait for permission to append to log' 0x7FBCE6885980 waiting *144*.530487248 seconds, NSDThread: on ThMutex 0x1803396A670 (0xFFFFC9003396A670) (DiskSchedulingMutex) 0x7FBCE68833A0 waiting 29.231066610 seconds, NSDThread: on ThCond 0x7FBBF0045D40 (0x7FBBF0045D40) (VdiskLogAppendCondvar), reason 'wait for permission to append to log' 0x7FBCE68820B0 waiting 29.269954514 seconds, NSDThread: on ThCond 
0x7FBBF0045D40 (0x7FBBF0045D40) (VdiskLogAppendCondvar), reason 'wait for permission to append to log' 0x7FBCE686A5F0 waiting *140*.662994256 seconds, NSDThread: on ThMutex 0x180339A3140 (0xFFFFC900339A3140) (DiskSchedulingMutex) 0x7FBCE6864740 waiting 29.254180742 seconds, NSDThread: on ThCond 0x7FBBF0045D40 (0x7FBBF0045D40) (VdiskLogAppendCondvar), reason 'wait for permission to append to log' 0x7FBCE683FC30 waiting 29.271840565 seconds, NSDThread: on ThCond 0x7FBBF0045D40 (0x7FBBF0045D40) (VdiskLogAppendCondvar), reason 'wait for permission to append to log' 0x7FBCE682E020 waiting 29.200969209 seconds, NSDThread: on ThCond 0x7FBBF0045D40 (0x7FBBF0045D40) (VdiskLogAppendCondvar), reason 'wait for permission to append to log' 0x7FBCE6825B90 waiting 19.136732919 seconds, NSDThread: on ThCond 0x7FBBF0045D40 (0x7FBBF0045D40) (VdiskLogAppendCondvar), reason 'wait for permission to append to log' 0x7FBCE6805C40 waiting 29.236055550 seconds, NSDThread: on ThCond 0x7FBBF0045D40 (0x7FBBF0045D40) (VdiskLogAppendCondvar), reason 'wait for permission to append to log' 0x7FBCE67FEAA0 waiting 29.283264161 seconds, NSDThread: on ThCond 0x7FBBF0045D40 (0x7FBBF0045D40) (VdiskLogAppendCondvar), reason 'wait for permission to append to log' 0x7FBCE67FC4C0 waiting 29.268992663 seconds, NSDThread: on ThCond 0x7FBBF0045D40 (0x7FBBF0045D40) (VdiskLogAppendCondvar), reason 'wait for permission to append to log' 0x7FBCE67DFE40 waiting 29.150900786 seconds, NSDThread: on ThCond 0x7FBBF0045D40 (0x7FBBF0045D40) (VdiskLogAppendCondvar), reason 'wait for permission to append to log' 0x7FBCE67D2DF0 waiting 29.199058463 seconds, NSDThread: on ThCond 0x7FBBF0045D40 (0x7FBBF0045D40) (VdiskLogAppendCondvar), reason 'wait for permission to append to log' 0x7FBCE67D1B00 waiting 29.203199738 seconds, NSDThread: on ThCond 0x7FBBF0045D40 (0x7FBBF0045D40) (VdiskLogAppendCondvar), reason 'wait for permission to append to log' 0x7FBCE67768D0 waiting 29.208231742 seconds, NSDThread: on ThCond 0x7FBBF0045D40 (0x7FBBF0045D40) (VdiskLogAppendCondvar), reason 'wait for permission to append to log' 0x7FBCE6768590 waiting 5.228192589 seconds, NSDThread: on ThCond 0x7FBBF0045D40 (0x7FBBF0045D40) (VdiskLogAppendCondvar), reason 'wait for permission to append to log' 0x7FBCE67672A0 waiting 29.252839376 seconds, NSDThread: on ThCond 0x7FBBF0045D40 (0x7FBBF0045D40) (VdiskLogAppendCondvar), reason 'wait for permission to append to log' 0x7FBCE6757C70 waiting 28.869359044 seconds, NSDThread: on ThCond 0x7FBBF0045D40 (0x7FBBF0045D40) (VdiskLogAppendCondvar), reason 'wait for permission to append to log' 0x7FBCE6748640 waiting 29.289284179 seconds, NSDThread: on ThCond 0x7FBBF0045D40 (0x7FBBF0045D40) (VdiskLogAppendCondvar), reason 'wait for permission to append to log' 0x7FBCE6734450 waiting 29.253591817 seconds, NSDThread: on ThCond 0x7FBBF0045D40 (0x7FBBF0045D40) (VdiskLogAppendCondvar), reason 'wait for permission to append to log' 0x7FBCE6730B80 waiting 29.289987273 seconds, NSDThread: on ThCond 0x7FBBF0045D40 (0x7FBBF0045D40) (VdiskLogAppendCondvar), reason 'wait for permission to append to log' 0x7FBCE6720260 waiting 26.597589551 seconds, NSDThread: on ThCond 0x7FBBF0045D40 (0x7FBBF0045D40) (VdiskLogAppendCondvar), reason 'wait for permission to append to log' 0x7FBCE66F32C0 waiting 29.177692849 seconds, NSDThread: on ThCond 0x7FBBF0045D40 (0x7FBBF0045D40) (VdiskLogAppendCondvar), reason 'wait for permission to append to log' 0x7FBCE66E3C90 waiting 29.160268518 seconds, NSDThread: on ThCond 0x7FBBF0045D40 (0x7FBBF0045D40) 
(VdiskLogAppendCondvar), reason 'wait for permission to append to log' 0x7FBCE66CC1D0 waiting 5.334330188 seconds, NSDThread: on ThCond 0x7FBBF0045D40 (0x7FBBF0045D40) (VdiskLogAppendCondvar), reason 'wait for permission to append to log' 0x7FBCE66B3420 waiting 34.274433161 seconds, NSDThread: on ThMutex 0x180339A3140 (0xFFFFC900339A3140) (DiskSchedulingMutex) 0x7FBCE668E910 waiting 27.699999488 seconds, NSDThread: on ThCond 0x7FBBF0045D40 (0x7FBBF0045D40) (VdiskLogAppendCondvar), reason 'wait for permission to append to log' 0x7FBCE6689D50 waiting 34.279090465 seconds, NSDThread: on ThMutex 0x180339A3140 (0xFFFFC900339A3140) (DiskSchedulingMutex) 0x7FBCE66805D0 waiting 24.688626241 seconds, NSDThread: on ThCond 0x7FBBF0045D40 (0x7FBBF0045D40) (VdiskLogAppendCondvar), reason 'wait for permission to append to log' 0x7FBCE6675B60 waiting 35.367745840 seconds, NSDThread: on ThCond 0x18033B74D48 (0xFFFFC90033B74D48) (LeaseWaitCondvar), reason 'Waiting to acquire disklease' 0x7FBCE665E0A0 waiting 29.235994598 seconds, NSDThread: on ThCond 0x7FBBF0045D40 (0x7FBBF0045D40) (VdiskLogAppendCondvar), reason 'wait for permission to append to log' 0x7FBCE663CE60 waiting 29.162911979 seconds, NSDThread: on ThCond 0x7FBBF0045D40 (0x7FBBF0045D40) (VdiskLogAppendCondvar), reason 'wait for permission to append to log' Another example with mmfsadm in case of massive reads: [root at gss02b ~]# mmfsadm dump waiters 0x7F519000AEA0 waiting 28.915010347 seconds, replyCleanupThread: on ThCond 0x7F51101B27B8 (0x7F51101B27B8) (MsgRecordCondvar), reason 'RPC wait' 0x7F511C012A10 waiting 279.522206863 seconds, Msg handler commMsgCheckMessages: on ThCond 0x7F52000095F8 (0x7F52000095F8) (InuseCondvar), reason 'waiting for exclusive use of connection for sending msg' 0x7F5120000B80 waiting 279.524782437 seconds, Msg handler commMsgCheckMessages: on ThCond 0x7F5214000EE8 (0x7F5214000EE8) (InuseCondvar), reason 'waiting for exclusive use of connection for sending msg' 0x7F5154006310 waiting 138.164386224 seconds, Msg handler commMsgCheckMessages: on ThCond 0x7F5174003F08 (0x7F5174003F08) (InuseCondvar), reason 'waiting for exclusive use of connection for sending msg' 0x7F522E1EB6C0 waiting 23.060703000 seconds, NSDThread: for poll on sock 85 0x7F522E1E6B00 waiting 0.068456104 seconds, NSDThread: on ThCond 0x7F50CC00E478 (0x7F50CC00E478) (InuseCondvar), reason 'waiting for exclusive use of connection for sending msg' 0x7F522E1D0330 waiting 17.207907857 seconds, NSDThread: on ThCond 0x7F5078001688 (0x7F5078001688) (InuseCondvar), reason 'waiting for exclusive use of connection for sending msg' 0x7F522E1BFA10 waiting 0.181011711 seconds, NSDThread: on ThCond 0x7F504000E558 (0x7F504000E558) (InuseCondvar), reason 'waiting for exclusive use of connection for sending msg' 0x7F522E1B4FA0 waiting 0.021780338 seconds, NSDThread: on ThCond 0x7F522000E488 (0x7F522000E488) (InuseCondvar), reason 'waiting for exclusive use of connection for sending msg' 0x7F522E1B3CB0 waiting 0.794718000 seconds, NSDThread: for poll on sock 799 0x7F522E186D10 waiting 0.191606803 seconds, NSDThread: on ThCond 0x7F5184015D58 (0x7F5184015D58) (InuseCondvar), reason 'waiting for exclusive use of connection for sending msg' 0x7F522E184730 waiting 0.025562000 seconds, NSDThread: for poll on sock 867 0x7F522E12CDD0 waiting 0.008921000 seconds, NSDThread: for poll on sock 543 0x7F522E126F20 waiting 1.459531000 seconds, NSDThread: for poll on sock 983 0x7F522E10F460 waiting 17.177936972 seconds, NSDThread: on ThCond 0x7F51EC002CE8 (0x7F51EC002CE8) 
(InuseCondvar), reason 'waiting for exclusive use of connection for sending msg' 0x7F522E101120 waiting 17.232580316 seconds, NSDThread: on ThCond 0x7F51BC005BB8 (0x7F51BC005BB8) (InuseCondvar), reason 'waiting for exclusive use of connection for sending msg' 0x7F522E0F1AF0 waiting 438.556030000 seconds, NSDThread: for poll on sock 496 0x7F522E0E7080 waiting 393.702839774 seconds, NSDThread: on ThCond 0x7F5164013668 (0x7F5164013668) (InuseCondvar), reason 'waiting for exclusive use of connection for sending msg' 0x7F522E09DA60 waiting 52.746984660 seconds, NSDThread: on ThCond 0x7F506C008858 (0x7F506C008858) (InuseCondvar), reason 'waiting for exclusive use of connection for sending msg' 0x7F522E084CB0 waiting 23.096688206 seconds, NSDThread: on ThCond 0x7F521C008E18 (0x7F521C008E18) (InuseCondvar), reason 'waiting for exclusive use of connection for sending msg' 0x7F522E0839C0 waiting 0.093456000 seconds, NSDThread: for poll on sock 962 0x7F522E076970 waiting 2.236659731 seconds, NSDThread: on ThCond 0x7F51E0027538 (0x7F51E0027538) (InuseCondvar), reason 'waiting for exclusive use of connection for sending msg' 0x7F522E044E10 waiting 52.752497765 seconds, NSDThread: on ThCond 0x7F513802BDD8 (0x7F513802BDD8) (InuseCondvar), reason 'waiting for exclusive use of connection for sending msg' 0x7F522E033200 waiting 16.157355796 seconds, NSDThread: on ThCond 0x7F5104240D58 (0x7F5104240D58) (InuseCondvar), reason 'waiting for exclusive use of connection for sending msg' 0x7F522E02AD70 waiting 436.025203220 seconds, NSDThread: on ThCond 0x7F50E0016C28 (0x7F50E0016C28) (InuseCondvar), reason 'waiting for exclusive use of connection for sending msg' 0x7F522E01A450 waiting 393.673252777 seconds, NSDThread: on ThCond 0x7F50A8009C18 (0x7F50A8009C18) (InuseCondvar), reason 'waiting for exclusive use of connection for sending msg' 0x7F522DFE0460 waiting 1.781358358 seconds, NSDThread: on ThCond 0x7F51E0027638 (0x7F51E0027638) (InuseCondvar), reason 'waiting for exclusive use of connection for sending msg' 0x7F522DF99420 waiting 0.038405427 seconds, NSDThread: on ThCond 0x7F50F0172B18 (0x7F50F0172B18) (InuseCondvar), reason 'waiting for exclusive use of connection for sending msg' 0x7F522DF7CDA0 waiting 438.204625355 seconds, NSDThread: on ThCond 0x7F50900023D8 (0x7F50900023D8) (InuseCondvar), reason 'waiting for exclusive use of connection for sending msg' 0x7F522DF76EF0 waiting 435.903645734 seconds, NSDThread: on ThCond 0x7F5084004BC8 (0x7F5084004BC8) (InuseCondvar), reason 'waiting for exclusive use of connection for sending msg' 0x7F522DF74910 waiting 21.749325022 seconds, NSDThread: on ThCond 0x7F507C011F48 (0x7F507C011F48) (InuseCondvar), reason 'waiting for exclusive use of connection for sending msg' 0x7F522DF71040 waiting 1.027274000 seconds, NSDThread: for poll on sock 866 0x7F522DF536D0 waiting 52.953847324 seconds, NSDThread: on ThCond 0x7F5200006FF8 (0x7F5200006FF8) (InuseCondvar), reason 'waiting for exclusive use of connection for sending msg' 0x7F522DF510F0 waiting 0.039278000 seconds, NSDThread: for poll on sock 837 0x7F522DF4EB10 waiting 0.085745937 seconds, NSDThread: on ThCond 0x7F51F0006828 (0x7F51F0006828) (InuseCondvar), reason 'waiting for exclusive use of connection for sending msg' 0x7F522DF4C530 waiting 21.850733000 seconds, NSDThread: for poll on sock 986 0x7F522DF4B240 waiting 0.054739884 seconds, NSDThread: on ThCond 0x7F51EC0168D8 (0x7F51EC0168D8) (InuseCondvar), reason 'waiting for exclusive use of connection for sending msg' 0x7F522DF48C60 waiting 0.186409714 seconds, 
NSDThread: on ThCond 0x7F51E4000908 (0x7F51E4000908) (InuseCondvar), reason 'waiting for exclusive use of connection for sending msg' 0x7F522DF41AC0 waiting 438.942861290 seconds, NSDThread: on ThCond 0x7F51CC010168 (0x7F51CC010168) (InuseCondvar), reason 'waiting for exclusive use of connection for sending msg' 0x7F522DF3F4E0 waiting 0.060235106 seconds, NSDThread: on ThCond 0x7F51C400A438 (0x7F51C400A438) (InuseCondvar), reason 'waiting for exclusive use of connection for sending msg' 0x7F522DF22E60 waiting 0.361288000 seconds, NSDThread: for poll on sock 518 0x7F522DF21B70 waiting 0.060722464 seconds, NSDThread: on ThCond 0x7F51580162D8 (0x7F51580162D8) (InuseCondvar), reason 'waiting for exclusive use of connection for sending msg' 0x7F522DF12540 waiting 23.077564448 seconds, NSDThread: on ThCond 0x7F512C13E1E8 (0x7F512C13E1E8) (InuseCondvar), reason 'waiting for exclusive use of connection for sending msg' 0x7F522DEFD060 waiting 0.723370000 seconds, NSDThread: for poll on sock 503 0x7F522DEE09E0 waiting 1.565799175 seconds, NSDThread: on ThCond 0x7F5084004D58 (0x7F5084004D58) (InuseCondvar), reason 'waiting for exclusive use of connection for sending msg' 0x7F522DEDF6F0 waiting 22.063017342 seconds, NSDThread: on ThCond 0x7F5078003E08 (0x7F5078003E08) (InuseCondvar), reason 'waiting for exclusive use of connection for sending msg' 0x7F522DEDD110 waiting 0.049108780 seconds, NSDThread: on ThCond 0x7F5070001D78 (0x7F5070001D78) (InuseCondvar), reason 'waiting for exclusive use of connection for sending msg' 0x7F522DEDAB30 waiting 229.603224376 seconds, NSDThread: on ThCond 0x7F50680221B8 (0x7F50680221B8) (InuseCondvar), reason 'waiting for exclusive use of connection for sending msg' 0x7F522DED7260 waiting 0.071855457 seconds, NSDThread: on ThCond 0x7F506400A5A8 (0x7F506400A5A8) (InuseCondvar), reason 'waiting for exclusive use of connection for sending msg' 0x7F522DED5F70 waiting 0.648324000 seconds, NSDThread: for poll on sock 766 0x7F522DEC3070 waiting 1.809205756 seconds, NSDThread: on ThCond 0x7F522000E518 (0x7F522000E518) (InuseCondvar), reason 'waiting for exclusive use of connection for sending msg' 0x7F522DEB1460 waiting 436.017396645 seconds, NSDThread: on ThCond 0x7F51E4000978 (0x7F51E4000978) (InuseCondvar), reason 'waiting for exclusive use of connection for sending msg' 0x7F522DEAC8A0 waiting 393.734102000 seconds, NSDThread: for poll on sock 609 0x7F522DEA3120 waiting 17.960778837 seconds, NSDThread: on ThCond 0x7F51B4001708 (0x7F51B4001708) (InuseCondvar), reason 'waiting for exclusive use of connection for sending msg' 0x7F522DE86AA0 waiting 23.112060045 seconds, NSDThread: on ThCond 0x7F5154096118 (0x7F5154096118) (InuseCondvar), reason 'waiting for exclusive use of connection for sending msg' 0x7F522DE64570 waiting 0.076167410 seconds, NSDThread: on ThCond 0x7F50D8005EF8 (0x7F50D8005EF8) (InuseCondvar), reason 'waiting for exclusive use of connection for sending msg' 0x7F522DE1AF50 waiting 17.460836000 seconds, NSDThread: for poll on sock 737 0x7F522DE104E0 waiting 0.205037000 seconds, NSDThread: for poll on sock 865 0x7F522DDB8B80 waiting 0.106192000 seconds, NSDThread: for poll on sock 78 0x7F522DDA36A0 waiting 0.738921180 seconds, NSDThread: on ThCond 0x7F505400E048 (0x7F505400E048) (InuseCondvar), reason 'waiting for exclusive use of connection for sending msg' 0x7F522DD9C500 waiting 0.731118367 seconds, NSDThread: on ThCond 0x7F503C00B518 (0x7F503C00B518) (InuseCondvar), reason 'waiting for exclusive use of connection for sending msg' 0x7F522DD89600 waiting 
229.609363000 seconds, NSDThread: for poll on sock 515
0x7F522DD567B0 waiting 1.508489195 seconds, NSDThread: on ThCond 0x7F514C021F88 (0x7F514C021F88) (InuseCondvar), reason 'waiting for exclusive use of connection for sending msg'

Another thing worth mentioning is that the filesystem is totally unresponsive. Even a simple "cd" into a directory, or an ls of a directory, hangs for several minutes (literally). This happens even if I try from the NSD servers themselves.

*A few things I have looked into:*

* Network: it seems fine. There might be some bottleneck on parts of it, and that could explain the waiters, but it doesn't explain why at some point the clients ask to expel the NSD servers, and it also doesn't justify why the filesystem is slow even on the NSD servers themselves.

* Disk bottleneck? I don't think so. The NSD servers have very low CPU usage (and iowait), and mmdiag --iohist seems to confirm that the operations on the disks are reasonably fast:

=== mmdiag: iohist ===

I/O history:

 I/O start time RW    Buf type disk:sectorNum     nSec  time ms Type Device/NSD ID      NSD server
--------------- -- ----------- ----------------- ----- ------- ---- ------------------ ---------------
13:54:29.209276  W        data    34:5066338808   2056  88.307  lcl sdtu
13:54:29.209277  W        data    55:5095698936   2056  27.592  lcl sdaab
13:54:29.209278  W        data   171:5104087544   2056  22.801  lcl sdtg
13:54:29.209279  W        data   116:5011812856   2056  65.983  lcl sdqr
13:54:29.209280  W        data    98:4860817912   2056  17.892  lcl sddl
13:54:29.209281  W        data   159:4999229944   2056  21.324  lcl sdjg
13:54:29.209282  W        data    84:5049561592   2056  31.932  lcl sdqz
13:54:29.209283  W        data     8:5003424248   2056  30.912  lcl sdcw
13:54:29.209284  W        data    23:4965675512   2056  27.366  lcl sdpt
13:54:29.297715  W  vdiskMDLog     2:144008496       1   0.236  lcl sdkr
13:54:29.297717  W  vdiskMDLog     0:331703600       1   0.230  lcl sdcm
13:54:29.297718  W  vdiskMDLog     1:273769776       1   0.241  lcl sdbp
13:54:29.244902  W        data    51:3857589752   2056  35.566  lcl sdyi
13:54:29.244904  W        data    10:3773703672   2056  28.512  lcl sdma
13:54:29.244905  W        data    48:3639485944   2056  24.124  lcl sdel
13:54:29.244906  W        data    25:3777897976   2056  18.691  lcl sdgt
13:54:29.244908  W        data    91:3832423928   2056  20.699  lcl sdlc
13:54:29.244909  W        data   115:3723372024   2056  30.783  lcl sdho
13:54:29.244910  W        data   173:3882755576   2056  53.241  lcl sdti
13:54:29.244911  W        data    42:3782092280   2056  22.785  lcl sddz
13:54:29.244912  W        data    45:3647874552   2056  24.289  lcl sdei
13:54:29.244913  W        data    32:3652068856   2056  17.220  lcl sdbn
13:54:29.244914  W        data    39:3677234680   2056  26.017  lcl sddw
13:54:29.298273  W  vdiskMDLog     2:144008497       1   2.522  lcl sduf
13:54:29.298274  W  vdiskMDLog     0:331703601       1   1.025  lcl sdlo
13:54:29.298275  W  vdiskMDLog     1:273769777       1   2.586  lcl sdtt
13:54:29.288275  W        data    27:2249588200   2056  20.071  lcl sdhb
13:54:29.288279  W        data    33:2224422376   2056  19.682  lcl sdts
13:54:29.288281  W        data    47:2115370472   2056  21.667  lcl sdwo
13:54:29.288282  W        data    82:2316697064   2056  21.524  lcl sdxy
13:54:29.288283  W        data    85:2232810984   2056  17.467  lcl sdra
13:54:29.288285  W        data    30:2127953384   2056  18.475  lcl sdqg
13:54:29.288286  W        data    67:1876295144   2056  16.383  lcl sdmx
13:54:29.288287  W        data    64:2127953384   2056  21.908  lcl sduh
13:54:29.288288  W        data    38:2253782504   2056  19.775  lcl sddv
13:54:29.288290  W        data    15:2207645160   2056  20.599  lcl sdet
13:54:29.288291  W        data   157:2283142632   2056  21.198  lcl sdiy

* A bonding problem on the interfaces? The Mellanox (interface card producer) drivers and firmware have been updated, and we even tested the system with a single link (without bonding).

Could someone help me with this?
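For completeness, this is roughly the kind of collection loop I have in mind to gather more evidence (just a sketch: the host names are only examples and it assumes passwordless ssh to the GSS nodes), snapshotting waiters and I/O history so they can later be lined up against the expel messages in /var/adm/ras/mmfs.log.latest:

#!/bin/bash
# Sketch: every 30 seconds, record the GPFS waiters and the recent I/O
# history of each GSS server, with a timestamp, so that long waiters can
# be correlated with expel events afterwards.
SERVERS="gss01a gss01b gss02a gss02b gss03a gss03b"   # example host names
OUTDIR=/var/tmp/gpfs-snapshots
mkdir -p "$OUTDIR"
while true; do
    ts=$(date +%Y%m%d-%H%M%S)
    for s in $SERVERS; do
        (
            echo "=== $s $ts ==="
            ssh "$s" "/usr/lpp/mmfs/bin/mmdiag --waiters; /usr/lpp/mmfs/bin/mmdiag --iohist"
        ) >> "$OUTDIR/$s.log" 2>&1 &
    done
    wait
    sleep 30
done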
In particular:

* What exactly do clients look at to decide that another node is unresponsive? Ping? I don't think so, because both the NSD servers and the clients can be pinged, so what do they check? If someone can also specify which port they use, I can try to tcpdump what exactly is causing this expel.

* How can I monitor metadata operations to understand where EXACTLY the bottleneck is that causes this:

[sdinardo at ebi5-001 ~]$ time ls /gpfs/nobackup/sdinardo
1   ebi3-054.ebi.ac.uk  ebi3-154            ebi5-019.ebi.ac.uk  ebi5-052            ebi5-101            ebi5-156            ebi5-197            ebi5-228            ebi5-262.ebi.ac.uk
10  ebi3-055            ebi3-155            ebi5-021.ebi.ac.uk  ebi5-053            ebi5-104.ebi.ac.uk  ebi5-160.ebi.ac.uk  ebi5-198            ebi5-229            ebi5-263
2   ebi3-056.ebi.ac.uk  ebi3-156            ebi5-022            ebi5-054.ebi.ac.uk  ebi5-106            ebi5-161            ebi5-200            ebi5-230.ebi.ac.uk  ebi5-264
3   ebi3-057            ebi3-157            ebi5-023            ebi5-056            ebi5-109            ebi5-162.ebi.ac.uk  ebi5-201            ebi5-231.ebi.ac.uk  ebi5-265
4   ebi3-058            ebi3-158.ebi.ac.uk  ebi5-024.ebi.ac.uk  ebi5-057            ebi5-110.ebi.ac.uk  ebi5-163.ebi.ac.uk  ebi5-202.ebi.ac.uk  ebi5-232            ebi5-266.ebi.ac.uk
5   ebi3-059.ebi.ac.uk  ebi3-160            ebi5-025            ebi5-060            ebi5-111.ebi.ac.uk  ebi5-164            ebi5-204            ebi5-233            ebi5-267
6   ebi3-132            ebi3-161.ebi.ac.uk  ebi5-026            ebi5-061.ebi.ac.uk  ebi5-112.ebi.ac.uk  ebi5-165            ebi5-205            ebi5-234            ebi5-269.ebi.ac.uk
7   ebi3-133            ebi3-163.ebi.ac.uk  ebi5-028            ebi5-062.ebi.ac.uk  ebi5-129.ebi.ac.uk  ebi5-166            ebi5-206.ebi.ac.uk  ebi5-236            ebi5-270
8   ebi3-134            ebi3-165            ebi5-030            ebi5-064            ebi5-131.ebi.ac.uk  ebi5-169.ebi.ac.uk  ebi5-207            ebi5-237            ebi5-271
9   ebi3-135            ebi3-166.ebi.ac.uk  ebi5-031            ebi5-065            ebi5-132            ebi5-170.ebi.ac.uk  ebi5-209            ebi5-239.ebi.ac.uk  launcher.sh

_*real 21m14.948s*_ ( WTH ?!?!?!)
user 0m0.004s
sys 0m0.014s

I know these questions are not easy to answer, and I need to dig more, but it would be very helpful if someone could give me some hints about where to look. My GPFS skills are limited since this is our first system and it has been in production for just a few months, and things started to worsen only recently. In the past we could get over 200Gb/s (both read and write) without any issue. Now some clients get expelled even when the data throughput is at 4-5Gb/s.

Thanks in advance for any help.

Regards,
Salvatore

-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From mail at arif-ali.co.uk Tue Aug 19 11:18:10 2014
From: mail at arif-ali.co.uk (Arif Ali)
Date: Tue, 19 Aug 2014 11:18:10 +0100
Subject: [gpfsug-discuss] gpfsug Maintenance
Message-ID: 

Hi all,

You may be aware that the website has been down for about a week now. This is due to the amount of traffic to the website and the number of people on the mailing list; we had seen a few issues on the system.

In order to counter this, we are moving to a new system to avoid any future issues and for ease of management. We are hoping to do this tonight (between 20:00 - 23:00 BST). If this causes an issue for anyone, then please let me know.

As part of the move over, I will be sending a few test mails to make sure that the mailing list is working correctly.

Thanks for your patience

-- 
Arif Ali
gpfsug Admin

IRC: arif-ali at freenode
LinkedIn: http://uk.linkedin.com/in/arifali

-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From sdinardo at ebi.ac.uk Tue Aug 19 12:11:00 2014
From: sdinardo at ebi.ac.uk (Salvatore Di Nardo)
Date: Tue, 19 Aug 2014 12:11:00 +0100
Subject: [gpfsug-discuss] gpfs client expels
In-Reply-To: <53EE0BB1.8000005@ebi.ac.uk>
References: <53EE0BB1.8000005@ebi.ac.uk>
Message-ID: <53F330C4.808@ebi.ac.uk>

Still problems.
Here some more detailed examples: *EXAMPLE 1:* *EBI5-220**( CLIENT)** *Tue Aug 19 11:03:04.980 2014: *Timed out waiting for a reply from node gss02b* Tue Aug 19 11:03:04.981 2014: Request sent to (gss02a in GSS.ebi.ac.uk) to expel (gss02b in GSS.ebi.ac.uk) from cluster GSS.ebi.ac.uk Tue Aug 19 11:03:04.982 2014: This node will be expelled from cluster GSS.ebi.ac.uk due to expel msg from (ebi5-220) Tue Aug 19 11:03:09.319 2014: Cluster Manager connection broke. Probing cluster GSS.ebi.ac.uk Tue Aug 19 11:03:10.321 2014: Unable to contact any quorum nodes during cluster probe. Tue Aug 19 11:03:10.322 2014: Lost membership in cluster GSS.ebi.ac.uk. Unmounting file systems. Tue Aug 19 11:03:10 BST 2014: mmcommon preunmount invoked. File system: gpfs1 Reason: SGPanic Tue Aug 19 11:03:12.066 2014: Connecting to gss02a Tue Aug 19 11:03:12.070 2014: Connected to gss02a Tue Aug 19 11:03:17.071 2014: Connecting to gss02b Tue Aug 19 11:03:17.072 2014: Connecting to gss03b Tue Aug 19 11:03:17.079 2014: Connecting to gss03a Tue Aug 19 11:03:17.080 2014: Connecting to gss01b Tue Aug 19 11:03:17.079 2014: Connecting to gss01a Tue Aug 19 11:04:23.105 2014: Connected to gss02b Tue Aug 19 11:04:23.107 2014: Connected to gss03b Tue Aug 19 11:04:23.112 2014: Connected to gss03a Tue Aug 19 11:04:23.115 2014: Connected to gss01b Tue Aug 19 11:04:23.121 2014: Connected to gss01a Tue Aug 19 11:12:28.992 2014: Node (gss02a in GSS.ebi.ac.uk) is now the Group Leader. *GSS02B ( NSD SERVER)* ... Tue Aug 19 11:03:17.070 2014: Killing connection from ** because the group is not ready for it to rejoin, err 46 Tue Aug 19 11:03:25.016 2014: Killing connection from because the group is not ready for it to rejoin, err 46 Tue Aug 19 11:03:28.080 2014: Killing connection from ** because the group is not ready for it to rejoin, err 46 Tue Aug 19 11:03:36.019 2014: Killing connection from because the group is not ready for it to rejoin, err 46 Tue Aug 19 11:03:39.083 2014: Killing connection from ** because the group is not ready for it to rejoin, err 46 Tue Aug 19 11:03:47.023 2014: Killing connection from because the group is not ready for it to rejoin, err 46 Tue Aug 19 11:03:50.088 2014: Killing connection from ** because the group is not ready for it to rejoin, err 46 Tue Aug 19 11:03:52.218 2014: Killing connection from because the group is not ready for it to rejoin, err 46 Tue Aug 19 11:03:58.030 2014: Killing connection from because the group is not ready for it to rejoin, err 46 Tue Aug 19 11:04:01.092 2014: Killing connection from ** because the group is not ready for it to rejoin, err 46 Tue Aug 19 11:04:03.220 2014: Killing connection from because the group is not ready for it to rejoin, err 46 Tue Aug 19 11:04:09.034 2014: Killing connection from because the group is not ready for it to rejoin, err 46 Tue Aug 19 11:04:12.096 2014: Killing connection from ** because the group is not ready for it to rejoin, err 46 Tue Aug 19 11:04:14.224 2014: Killing connection from because the group is not ready for it to rejoin, err 46 Tue Aug 19 11:04:20.037 2014: Killing connection from because the group is not ready for it to rejoin, err 46 Tue Aug 19 11:04:23.103 2014: Accepted and connected to ** ebi5-220 ... *GSS02a ( NSD SERVER)* Tue Aug 19 11:03:04.980 2014: Expel (gss02b) request from (ebi5-220 in ebi-cluster.ebi.ac.uk). 
Expelling: (ebi5-220 in ebi-cluster.ebi.ac.uk) Tue Aug 19 11:03:12.069 2014: Accepted and connected to ebi5-220 =============================================== *EXAMPLE 2*: *EBI5-038* Tue Aug 19 11:32:34.227 2014: *Disk lease period expired in cluster GSS.ebi.ac.uk. Attempting to reacquire lease.* Tue Aug 19 11:33:34.258 2014: *Lease is overdue. Probing cluster GSS.ebi.ac.uk* Tue Aug 19 11:35:24.265 2014: Close connection to gss02a (Connection reset by peer). Attempting reconnect. Tue Aug 19 11:35:24.865 2014: Close connection to ebi5-014 (Connection reset by peer). Attempting reconnect. ... LOT MORE RESETS BY PEER ... Tue Aug 19 11:35:25.096 2014: Close connection to ebi5-167 (Connection reset by peer). Attempting reconnect. Tue Aug 19 11:35:25.267 2014: Connecting to gss02a Tue Aug 19 11:35:25.268 2014: Close connection to gss02a (Connection failed because destination is still processing previous node failure) Tue Aug 19 11:35:26.267 2014: Retry connection to gss02a Tue Aug 19 11:35:26.268 2014: Close connection to gss02a (Connection failed because destination is still processing previous node failure) Tue Aug 19 11:36:24.276 2014: Unable to contact any quorum nodes during cluster probe. Tue Aug 19 11:36:24.277 2014: *Lost membership in cluster GSS.ebi.ac.uk. Unmounting file systems.* *GSS02a* Tue Aug 19 11:35:24.263 2014: Node (ebi5-038 in ebi-cluster.ebi.ac.uk) *is being expelled because of an expired lease.* Pings sent: 60. Replies received: 60. In example 1 seems that an NSD was not repliyng to the client, but the servers seems working fine.. how can i trace better ( to solve) the problem? In example 2 it seems to me that for some reason the manager are not renewing the lease in time. when this happens , its not a single client. Loads of them fail to get the lease renewed. Why this is happening? how can i trace to the source of the problem? Thanks in advance for any tips. Regards, Salvatore -------------- next part -------------- An HTML attachment was scrubbed... URL: From mail at arif-ali.co.uk Tue Aug 19 20:59:47 2014 From: mail at arif-ali.co.uk (Arif Ali) Date: Tue, 19 Aug 2014 20:59:47 +0100 Subject: [gpfsug-discuss] gpfsug Maintenance In-Reply-To: References: Message-ID: This is a test mail to the mailing list please do not reply -- Arif Ali IRC: arif-ali at freenode LinkedIn: http://uk.linkedin.com/in/arifali On 19 August 2014 11:18, Arif Ali wrote: > Hi all, > > You may be aware that the website has been down for about a week now. This > is due to the amount of traffic to the website and the amount of people on > the mailing list, we had seen a few issues on the system. > > In order to counter the issues, we are moving to a new system to counter > any future issues, and ease of management. We are hoping to do this tonight > ( between 20:00 - 23:00 BST). If this causes an issue for anyone, then > please let me know. > > I will, as part of the move over, will be sending a few test mails to make > sure that mailing list is working correctly. > > Thanks for your patience > > -- > Arif Ali > gpfsug Admin > > IRC: arif-ali at freenode > LinkedIn: http://uk.linkedin.com/in/arifali > -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From mail at arif-ali.co.uk Tue Aug 19 23:41:48 2014 From: mail at arif-ali.co.uk (Arif Ali) Date: Tue, 19 Aug 2014 23:41:48 +0100 Subject: [gpfsug-discuss] gpfsug Maintenance In-Reply-To: References: Message-ID: Thanks for all your patience, The service should all be back up again -- Arif Ali IRC: arif-ali at freenode LinkedIn: http://uk.linkedin.com/in/arifali On 19 August 2014 20:59, Arif Ali wrote: > This is a test mail to the mailing list > > please do not reply > > -- > Arif Ali > > IRC: arif-ali at freenode > LinkedIn: http://uk.linkedin.com/in/arifali > > > On 19 August 2014 11:18, Arif Ali wrote: > >> Hi all, >> >> You may be aware that the website has been down for about a week now. >> This is due to the amount of traffic to the website and the amount of >> people on the mailing list, we had seen a few issues on the system. >> >> In order to counter the issues, we are moving to a new system to counter >> any future issues, and ease of management. We are hoping to do this tonight >> ( between 20:00 - 23:00 BST). If this causes an issue for anyone, then >> please let me know. >> >> I will, as part of the move over, will be sending a few test mails to >> make sure that mailing list is working correctly. >> >> Thanks for your patience >> >> -- >> Arif Ali >> gpfsug Admin >> >> IRC: arif-ali at freenode >> LinkedIn: http://uk.linkedin.com/in/arifali >> > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From sdinardo at ebi.ac.uk Wed Aug 20 08:57:23 2014 From: sdinardo at ebi.ac.uk (Salvatore Di Nardo) Date: Wed, 20 Aug 2014 08:57:23 +0100 Subject: [gpfsug-discuss] gpfs client expels In-Reply-To: <53EE0BB1.8000005@ebi.ac.uk> References: <53EE0BB1.8000005@ebi.ac.uk> Message-ID: <53F454E3.40803@ebi.ac.uk> Still problems. Here some more detailed examples: *EXAMPLE 1:* *EBI5-220**( CLIENT)** *Tue Aug 19 11:03:04.980 2014: *Timed out waiting for a reply from node gss02b* Tue Aug 19 11:03:04.981 2014: Request sent to (gss02a in GSS.ebi.ac.uk) to expel (gss02b in GSS.ebi.ac.uk) from cluster GSS.ebi.ac.uk Tue Aug 19 11:03:04.982 2014: This node will be expelled from cluster GSS.ebi.ac.uk due to expel msg from (ebi5-220) Tue Aug 19 11:03:09.319 2014: Cluster Manager connection broke. Probing cluster GSS.ebi.ac.uk Tue Aug 19 11:03:10.321 2014: Unable to contact any quorum nodes during cluster probe. Tue Aug 19 11:03:10.322 2014: Lost membership in cluster GSS.ebi.ac.uk. Unmounting file systems. Tue Aug 19 11:03:10 BST 2014: mmcommon preunmount invoked. File system: gpfs1 Reason: SGPanic Tue Aug 19 11:03:12.066 2014: Connecting to gss02a Tue Aug 19 11:03:12.070 2014: Connected to gss02a Tue Aug 19 11:03:17.071 2014: Connecting to gss02b Tue Aug 19 11:03:17.072 2014: Connecting to gss03b Tue Aug 19 11:03:17.079 2014: Connecting to gss03a Tue Aug 19 11:03:17.080 2014: Connecting to gss01b Tue Aug 19 11:03:17.079 2014: Connecting to gss01a Tue Aug 19 11:04:23.105 2014: Connected to gss02b Tue Aug 19 11:04:23.107 2014: Connected to gss03b Tue Aug 19 11:04:23.112 2014: Connected to gss03a Tue Aug 19 11:04:23.115 2014: Connected to gss01b Tue Aug 19 11:04:23.121 2014: Connected to gss01a Tue Aug 19 11:12:28.992 2014: Node (gss02a in GSS.ebi.ac.uk) is now the Group Leader. *GSS02B ( NSD SERVER)* ... 
Tue Aug 19 11:03:17.070 2014: Killing connection from ** because the group is not ready for it to rejoin, err 46 Tue Aug 19 11:03:25.016 2014: Killing connection from because the group is not ready for it to rejoin, err 46 Tue Aug 19 11:03:28.080 2014: Killing connection from ** because the group is not ready for it to rejoin, err 46 Tue Aug 19 11:03:36.019 2014: Killing connection from because the group is not ready for it to rejoin, err 46 Tue Aug 19 11:03:39.083 2014: Killing connection from ** because the group is not ready for it to rejoin, err 46 Tue Aug 19 11:03:47.023 2014: Killing connection from because the group is not ready for it to rejoin, err 46 Tue Aug 19 11:03:50.088 2014: Killing connection from ** because the group is not ready for it to rejoin, err 46 Tue Aug 19 11:03:52.218 2014: Killing connection from because the group is not ready for it to rejoin, err 46 Tue Aug 19 11:03:58.030 2014: Killing connection from because the group is not ready for it to rejoin, err 46 Tue Aug 19 11:04:01.092 2014: Killing connection from ** because the group is not ready for it to rejoin, err 46 Tue Aug 19 11:04:03.220 2014: Killing connection from because the group is not ready for it to rejoin, err 46 Tue Aug 19 11:04:09.034 2014: Killing connection from because the group is not ready for it to rejoin, err 46 Tue Aug 19 11:04:12.096 2014: Killing connection from ** because the group is not ready for it to rejoin, err 46 Tue Aug 19 11:04:14.224 2014: Killing connection from because the group is not ready for it to rejoin, err 46 Tue Aug 19 11:04:20.037 2014: Killing connection from because the group is not ready for it to rejoin, err 46 Tue Aug 19 11:04:23.103 2014: Accepted and connected to ** ebi5-220 ... *GSS02a ( NSD SERVER)* Tue Aug 19 11:03:04.980 2014: Expel (gss02b) request from (ebi5-220 in ebi-cluster.ebi.ac.uk). Expelling: (ebi5-220 in ebi-cluster.ebi.ac.uk) Tue Aug 19 11:03:12.069 2014: Accepted and connected to ebi5-220 =============================================== *EXAMPLE 2*: *EBI5-038* Tue Aug 19 11:32:34.227 2014: *Disk lease period expired in cluster GSS.ebi.ac.uk. Attempting to reacquire lease.* Tue Aug 19 11:33:34.258 2014: *Lease is overdue. Probing cluster GSS.ebi.ac.uk* Tue Aug 19 11:35:24.265 2014: Close connection to gss02a (Connection reset by peer). Attempting reconnect. Tue Aug 19 11:35:24.865 2014: Close connection to ebi5-014 (Connection reset by peer). Attempting reconnect. ... LOT MORE RESETS BY PEER ... Tue Aug 19 11:35:25.096 2014: Close connection to ebi5-167 (Connection reset by peer). Attempting reconnect. Tue Aug 19 11:35:25.267 2014: Connecting to gss02a Tue Aug 19 11:35:25.268 2014: Close connection to gss02a (Connection failed because destination is still processing previous node failure) Tue Aug 19 11:35:26.267 2014: Retry connection to gss02a Tue Aug 19 11:35:26.268 2014: Close connection to gss02a (Connection failed because destination is still processing previous node failure) Tue Aug 19 11:36:24.276 2014: Unable to contact any quorum nodes during cluster probe. Tue Aug 19 11:36:24.277 2014: *Lost membership in cluster GSS.ebi.ac.uk. Unmounting file systems.* *GSS02a* Tue Aug 19 11:35:24.263 2014: Node (ebi5-038 in ebi-cluster.ebi.ac.uk) *is being expelled because of an expired lease.* Pings sent: 60. Replies received: 60. In example 1 seems that an NSD was not repliyng to the client, but the servers seems working fine.. how can i trace better ( to solve) the problem? 
In example 2 it seems to me that for some reason the manager is not renewing the leases in time. When this happens it is not a single client: loads of them fail to get their lease renewed. Why is this happening? How can I trace it back to the source of the problem?

Thanks in advance for any tips.

Regards,
Salvatore

-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From sdinardo at ebi.ac.uk Wed Aug 20 09:03:03 2014
From: sdinardo at ebi.ac.uk (Salvatore Di Nardo)
Date: Wed, 20 Aug 2014 09:03:03 +0100
Subject: [gpfsug-discuss] gpfs client expels
In-Reply-To: <53F454E3.40803@ebi.ac.uk>
References: <53EE0BB1.8000005@ebi.ac.uk> <53F454E3.40803@ebi.ac.uk>
Message-ID: <53F45637.8080000@ebi.ac.uk>

Another interesting case about a specific waiter: I was looking at the waiters on GSS until I found these (I got this info by collecting it from all the servers with a script I wrote, so I was able to catch the hanging connections while they were happening):

gss03b.ebi.ac.uk: *235.373993397* (MsgRecordCondvar), reason 'RPC wait' for getData on node 10.7.37.109
gss03b.ebi.ac.uk: *235.152271998* (MsgRecordCondvar), reason 'RPC wait' for getData on node 10.7.37.109
gss02a.ebi.ac.uk: *214.079093620* (MsgRecordCondvar), reason 'RPC wait' for tmMsgRevoke on node 10.7.34.109
gss02a.ebi.ac.uk: *213.580199240* (MsgRecordCondvar), reason 'RPC wait' for tmMsgRevoke on node 10.7.37.109
gss03b.ebi.ac.uk: *132.375138082* (MsgRecordCondvar), reason 'RPC wait' for getData on node 10.7.37.109
gss03b.ebi.ac.uk: *132.374973884* (MsgRecordCondvar), reason 'RPC wait' for commMsgCheckMessages on node 10.7.37.109

The numbers in bold are seconds. Looking at this page:

https://www.ibm.com/developerworks/community/wikis/home?lang=en#!/wiki/General+Parallel+File+System+%28GPFS%29/page/Interpreting+GPFS+Waiter+Information

the page claims this is probably network congestion, but I managed to log in to the client quickly enough, and there the waiters were:

[root at ebi5-236 ~]# mmdiag --waiters
=== mmdiag: waiters ===
0x7F6690073460 waiting 147.973009173 seconds, RangeRevokeWorkerThread: on ThCond 0x1801E43F6A0 (0xFFFFC9001E43F6A0) (LkObjCondvar), reason 'waiting for LX lock'
0x7F65100036D0 waiting 140.458589856 seconds, WritebehindWorkerThread: on ThCond 0x7F6500000F98 (0x7F6500000F98) (MsgRecordCondvar), reason 'RPC wait' for NSD I/O completion on node 10.7.28.35
0x7F63A0001080 waiting 245.153055801 seconds, WritebehindWorkerThread: on ThCond 0x7F65D8002CF8 (0x7F65D8002CF8) (MsgRecordCondvar), reason 'RPC wait' for NSD I/O completion on node 10.7.28.35
0x7F674C03D3D0 waiting 245.750977203 seconds, CleanBufferThread: on ThCond 0x7F64880079E8 (0x7F64880079E8) (LogFileBufferDescriptorCondvar), reason 'force wait for buffer write to complete'
0x7F674802E360 waiting 244.159861966 seconds, WritebehindWorkerThread: on ThCond 0x7F65E0002358 (0x7F65E0002358) (MsgRecordCondvar), reason 'RPC wait' for NSD I/O completion on node 10.7.28.35
0x7F674C038810 waiting 251.086748430 seconds, SGExceptionLogBufferFullThread: on ThCond 0x7F64EC001398 (0x7F64EC001398) (MsgRecordCondvar), reason 'RPC wait' for I/O completion on node 10.7.28.35
0x7F674C036230 waiting 139.556735095 seconds, CleanBufferThread: on ThCond 0x7F65CC004C78 (0x7F65CC004C78) (MsgRecordCondvar), reason 'RPC wait' for NSD I/O completion on node 10.7.28.35
0x7F674C031670 waiting 144.327593052 seconds, WritebehindWorkerThread: on ThCond 0x7F672402D1A8 (0x7F672402D1A8) (MsgRecordCondvar), reason 'RPC wait' for NSD I/O completion on node 10.7.28.35
0x7F674C02A4D0 waiting 145.202712821 seconds,
WritebehindWorkerThread: on ThCond 0x7F65440018F8 (0x7F65440018F8) (MsgRecordCondvar), reason 'RPC wait' for NSD I/O completion on node 10.7.28.35 0x7F674C0291E0 waiting 247.131569232 seconds, PrefetchWorkerThread: on ThCond 0x7F65740016C8 (0x7F65740016C8) (MsgRecordCondvar), reason 'RPC wait' for NSD I/O completion on node 10.7.28.35 0x7F6748025BD0 waiting 11.631381523 seconds, replyCleanupThread: on ThCond 0x7F65E000A1F8 (0x7F65E000A1F8) (MsgRecordCondvar), reason 'RPC wait' 0x7F6748022300 waiting 245.616267612 seconds, WritebehindWorkerThread: on ThCond 0x7F6470001468 (0x7F6470001468) (MsgRecordCondvar), reason 'RPC wait' for NSD I/O completion on node 10.7.28.35 0x7F6748021010 waiting 230.769670930 seconds, InodeAllocRevokeWorkerThread: on ThCond 0x7F64880079E8 (0x7F64880079E8) (LogFileBufferDescriptorCondvar), reason 'force wait for buffer write to complete' 0x7F674801B160 waiting 245.830554594 seconds, UnusedInodePrefetchThread: on ThCond 0x7F65B8004438 (0x7F65B8004438) (MsgRecordCondvar), reason 'RPC wait' for NSD I/O completion on node 10.7.28.35 0x7F674800A820 waiting 252.332932000 seconds, Msg handler getData: for poll on sock 109 0x7F63F4023090 waiting 253.073535042 seconds, WritebehindWorkerThread: on ThCond 0x7F65C4000CC8 (0x7F65C4000CC8) (MsgRecordCondvar), reason 'RPC wait' for NSD I/O completion on node 10.7.28.35 0x7F64A4000CE0 waiting 145.049659249 seconds, WritebehindWorkerThread: on ThCond 0x7F6560000A98 (0x7F6560000A98) (MsgRecordCondvar), reason 'RPC wait' for NSD I/O completion on node 10.7.28.35 0x7F6778006D00 waiting 142.124664264 seconds, WritebehindWorkerThread: on ThCond 0x7F63DC000C08 (0x7F63DC000C08) (MsgRecordCondvar), reason 'RPC wait' for NSD I/O completion on node 10.7.28.35 0x7F67780046D0 waiting 251.751439453 seconds, WritebehindWorkerThread: on ThCond 0x7F6454000A98 (0x7F6454000A98) (MsgRecordCondvar), reason 'RPC wait' for NSD I/O completion on node 10.7.28.35 0x7F67780E4B70 waiting 142.431051232 seconds, WritebehindWorkerThread: on ThCond 0x7F63C80010D8 (0x7F63C80010D8) (MsgRecordCondvar), reason 'RPC wait' for NSD I/O completion on node 10.7.28.35 0x7F67780E50D0 waiting 244.339624817 seconds, WritebehindWorkerThread: on ThCond 0x7F65BC001B98 (0x7F65BC001B98) (MsgRecordCondvar), reason 'RPC wait' for NSD I/O completion on node 10.7.28.35 0x7F6434000B40 waiting 145.343700410 seconds, WritebehindWorkerThread: on ThCond 0x7F63B00036E8 (0x7F63B00036E8) (MsgRecordCondvar), reason 'RPC wait' for NSD I/O completion on node 10.7.28.35 0x7F670C0187A0 waiting 244.903963969 seconds, WritebehindWorkerThread: on ThCond 0x7F65F0000FB8 (0x7F65F0000FB8) (MsgRecordCondvar), reason 'RPC wait' for NSD I/O completion on node 10.7.28.35 0x7F671C04E2F0 waiting 245.837137631 seconds, PrefetchWorkerThread: on ThCond 0x7F65A4000A98 (0x7F65A4000A98) (MsgRecordCondvar), reason 'RPC wait' for NSD I/O completion on node 10.7.28.35 0x7F671C04AA20 waiting 139.713993908 seconds, WritebehindWorkerThread: on ThCond 0x7F6454002478 (0x7F6454002478) (MsgRecordCondvar), reason 'RPC wait' for NSD I/O completion on node 10.7.28.35 0x7F671C049730 waiting 252.434187472 seconds, WritebehindWorkerThread: on ThCond 0x7F65F4003708 (0x7F65F4003708) (MsgRecordCondvar), reason 'RPC wait' for NSD I/O completion on node 10.7.28.35 0x7F671C044B70 waiting 131.515829048 seconds, Msg handler ccMsgPing: on ThCond 0x7F64DC1D4888 (0x7F64DC1D4888) (InuseCondvar), reason 'waiting for exclusive use of connection for sending msg' 0x7F6758008DE0 waiting 149.548547226 seconds, Msg handler getData: on ThCond 
0x7F645C002458 (0x7F645C002458) (InuseCondvar), reason 'waiting for exclusive use of connection for sending msg' 0x7F67580071D0 waiting 149.548543118 seconds, Msg handler commMsgCheckMessages: on ThCond 0x7F6450001C48 (0x7F6450001C48) (InuseCondvar), reason 'waiting for exclusive use of connection for sending msg' 0x7F65A40052B0 waiting 11.498507001 seconds, Msg handler ccMsgPing: on ThCond 0x7F644C103F88 (0x7F644C103F88) (InuseCondvar), reason 'waiting for exclusive use of connection for sending msg' 0x7F6448001620 waiting 139.844870446 seconds, WritebehindWorkerThread: on ThCond 0x7F65F0003098 (0x7F65F0003098) (MsgRecordCondvar), reason 'RPC wait' for NSD I/O completion on node 10.7.28.35 0x7F63F4000F80 waiting 245.044791905 seconds, WritebehindWorkerThread: on ThCond 0x7F6450001188 (0x7F6450001188) (MsgRecordCondvar), reason 'RPC wait' for NSD I/O completion on node 10.7.28.35 0x7F659C0033A0 waiting 243.464399305 seconds, PrefetchWorkerThread: on ThCond 0x7F6554002598 (0x7F6554002598) (MsgRecordCondvar), reason 'RPC wait' for NSD I/O completion on node 10.7.28.35 0x7F6514001690 waiting 245.826160463 seconds, PrefetchWorkerThread: on ThCond 0x7F65A4004558 (0x7F65A4004558) (MsgRecordCondvar), reason 'RPC wait' for NSD I/O completion on node 10.7.28.35 0x7F64800012B0 waiting 253.174835511 seconds, WritebehindWorkerThread: on ThCond 0x7F65E0000FB8 (0x7F65E0000FB8) (MsgRecordCondvar), reason 'RPC wait' for NSD I/O completion on node 10.7.28.35 0x7F6510000EE0 waiting 140.746696039 seconds, WritebehindWorkerThread: on ThCond 0x7F647C000CC8 (0x7F647C000CC8) (MsgRecordCondvar), reason 'RPC wait' for NSD I/O completion on node 10.7.28.35 0x7F6754001BB0 waiting 246.336055629 seconds, PrefetchWorkerThread: on ThCond 0x7F6594002498 (0x7F6594002498) (MsgRecordCondvar), reason 'RPC wait' for NSD I/O completion on node 10.7.28.35 0x7F6420000930 waiting 140.606777450 seconds, WritebehindWorkerThread: on ThCond 0x7F6578002498 (0x7F6578002498) (MsgRecordCondvar), reason 'RPC wait' for NSD I/O completion on node 10.7.28.35 0x7F6744009110 waiting 137.466372831 seconds, FileBlockReadFetchHandlerThread: on ThCond 0x7F65F4007158 (0x7F65F4007158) (MsgRecordCondvar), reason 'RPC wait' for NSD I/O completion on node 10.7.28.35 0x7F67280119F0 waiting 144.173427360 seconds, WritebehindWorkerThread: on ThCond 0x7F6504000AE8 (0x7F6504000AE8) (MsgRecordCondvar), reason 'RPC wait' for NSD I/O completion on node 10.7.28.35 0x7F672800BB40 waiting 145.804301887 seconds, WritebehindWorkerThread: on ThCond 0x7F6550001038 (0x7F6550001038) (MsgRecordCondvar), reason 'RPC wait' for NSD I/O completion on node 10.7.28.35 0x7F6728000910 waiting 252.601993452 seconds, WritebehindWorkerThread: on ThCond 0x7F6450000A98 (0x7F6450000A98) (MsgRecordCondvar), reason 'RPC wait' for NSD I/O completion on node 10.7.28.35 0x7F6744007E20 waiting 251.603329204 seconds, WritebehindWorkerThread: on ThCond 0x7F6570004C18 (0x7F6570004C18) (MsgRecordCondvar), reason 'RPC wait' for NSD I/O completion on node 10.7.28.35 0x7F64AC002EF0 waiting 139.205774422 seconds, FileBlockWriteFetchHandlerThread: on ThCond 0x18020AF0260 (0xFFFFC90020AF0260) (FetchFlowControlCondvar), reason 'wait for buffer for fetch' 0x7F6724013050 waiting 71.501580932 seconds, Msg handler ccMsgPing: on ThCond 0x7F6580006608 (0x7F6580006608) (InuseCondvar), reason 'waiting for exclusive use of connection for sending msg' 0x7F661C000DA0 waiting 245.654985276 seconds, PrefetchWorkerThread: on ThCond 0x7F6570005288 (0x7F6570005288) (MsgRecordCondvar), reason 'RPC wait' for NSD I/O 
completion on node 10.7.28.35
0x7F671C00F440 waiting 251.096002003 seconds, FileBlockReadFetchHandlerThread: on ThCond 0x7F65BC002878 (0x7F65BC002878) (MsgRecordCondvar), reason 'RPC wait' for NSD I/O completion on node 10.7.28.35
0x7F671C00E150 waiting 144.034006970 seconds, WritebehindWorkerThread: on ThCond 0x7F6528001548 (0x7F6528001548) (MsgRecordCondvar), reason 'RPC wait' for NSD I/O completion on node 10.7.28.35
0x7F67A02FCD20 waiting 142.324070945 seconds, WritebehindWorkerThread: on ThCond 0x7F6580002A98 (0x7F6580002A98) (MsgRecordCondvar), reason 'RPC wait' for NSD I/O completion on node 10.7.28.35
0x7F67A02FA330 waiting 200.670114385 seconds, EEWatchDogThread: on ThCond 0x7F65B0000A98 (0x7F65B0000A98) (MsgRecordCondvar), reason 'RPC wait'
0x7F67A02BF050 waiting 252.276161189 seconds, WritebehindWorkerThread: on ThCond 0x7F6584003998 (0x7F6584003998) (MsgRecordCondvar), reason 'RPC wait' for NSD I/O completion on node 10.7.28.35
0x7F67A0004160 waiting 251.173651822 seconds, SyncHandlerThread: on ThCond 0x7F64880079E8 (0x7F64880079E8) (LogFileBufferDescriptorCondvar), reason 'force wait on force active buffer write'

So from the client side it is the client that is waiting for the server. I also managed to ping, ssh and tcpdump between the two before the node got expelled and found that ping works fine and ssh works fine, but apart from my own tests there are 0 packets passing between them, LITERALLY. So there is no congestion and no network issue, yet the server waits for the client and the client waits for the server. This goes on until we reach 350 secs (10 times the lease time), and then the client gets expelled. There are no local I/O waiters indicating that GSS is struggling, and there is plenty of bandwidth and CPU, with no network congestion. It looks like some sort of deadlock to me, but how can this be explained and hopefully fixed?

Regards,
Salvatore

-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From chair at gpfsug.org Thu Aug 21 09:20:39 2014
From: chair at gpfsug.org (Jez Tucker (Chair))
Date: Thu, 21 Aug 2014 09:20:39 +0100
Subject: [gpfsug-discuss] gpfs client expels
In-Reply-To: <53F454E3.40803@ebi.ac.uk>
References: <53EE0BB1.8000005@ebi.ac.uk> <53F454E3.40803@ebi.ac.uk>
Message-ID: <53F5ABD7.80107@gpfsug.org>

Hi there,

I've seen this on several 'stock'? 'core'? GPFS systems (we need a better term now GSS is out) and seen ping 'working', but alongside ejections from the cluster. The GPFS internode 'ping' is somewhat more circumspect than unix ping - and rightly so.

In my experience this has _always_ been a network issue of one sort or another. If the network is experiencing issues, nodes will be ejected. Of course it could be unresponsive mmfsd or high loadavg, but I've seen that only twice in 10 years over many versions of GPFS.

You need to follow the logs through from each machine in time order to determine who could not see who and in what order. Your best way forward is to log a SEV2 case with IBM support, directly or via your OEM, and collect and supply a snap and traces as required by support.

Without knowing your full setup, it's hard to help further.

Jez

On 20/08/14 08:57, Salvatore Di Nardo wrote:
> Still problems.
Here some more detailed examples: > > *EXAMPLE 1:* > > *EBI5-220**( CLIENT)** > *Tue Aug 19 11:03:04.980 2014: *Timed out waiting for a > reply from node gss02b* > Tue Aug 19 11:03:04.981 2014: Request sent to > (gss02a in GSS.ebi.ac.uk) to expel (gss02b in > GSS.ebi.ac.uk) from cluster GSS.ebi.ac.uk > Tue Aug 19 11:03:04.982 2014: This node will be expelled > from cluster GSS.ebi.ac.uk due to expel msg from IP> (ebi5-220) > Tue Aug 19 11:03:09.319 2014: Cluster Manager connection > broke. Probing cluster GSS.ebi.ac.uk > Tue Aug 19 11:03:10.321 2014: Unable to contact any quorum > nodes during cluster probe. > Tue Aug 19 11:03:10.322 2014: Lost membership in cluster > GSS.ebi.ac.uk. Unmounting file systems. > Tue Aug 19 11:03:10 BST 2014: mmcommon preunmount > invoked. File system: gpfs1 Reason: SGPanic > Tue Aug 19 11:03:12.066 2014: Connecting to > gss02a > Tue Aug 19 11:03:12.070 2014: Connected to > gss02a > Tue Aug 19 11:03:17.071 2014: Connecting to > gss02b > Tue Aug 19 11:03:17.072 2014: Connecting to > gss03b > Tue Aug 19 11:03:17.079 2014: Connecting to > gss03a > Tue Aug 19 11:03:17.080 2014: Connecting to > gss01b > Tue Aug 19 11:03:17.079 2014: Connecting to > gss01a > Tue Aug 19 11:04:23.105 2014: Connected to > gss02b > Tue Aug 19 11:04:23.107 2014: Connected to > gss03b > Tue Aug 19 11:04:23.112 2014: Connected to > gss03a > Tue Aug 19 11:04:23.115 2014: Connected to > gss01b > Tue Aug 19 11:04:23.121 2014: Connected to > gss01a > Tue Aug 19 11:12:28.992 2014: Node (gss02a in > GSS.ebi.ac.uk) is now the Group Leader. > > *GSS02B ( NSD SERVER)* > ... > Tue Aug 19 11:03:17.070 2014: Killing connection from > ** because the group is not ready for it to > rejoin, err 46 > Tue Aug 19 11:03:25.016 2014: Killing connection from > because the group is not ready for it to > rejoin, err 46 > Tue Aug 19 11:03:28.080 2014: Killing connection from > ** because the group is not ready for it to > rejoin, err 46 > Tue Aug 19 11:03:36.019 2014: Killing connection from > because the group is not ready for it to > rejoin, err 46 > Tue Aug 19 11:03:39.083 2014: Killing connection from > ** because the group is not ready for it to > rejoin, err 46 > Tue Aug 19 11:03:47.023 2014: Killing connection from > because the group is not ready for it to > rejoin, err 46 > Tue Aug 19 11:03:50.088 2014: Killing connection from > ** because the group is not ready for it to > rejoin, err 46 > Tue Aug 19 11:03:52.218 2014: Killing connection from > because the group is not ready for it to > rejoin, err 46 > Tue Aug 19 11:03:58.030 2014: Killing connection from > because the group is not ready for it to > rejoin, err 46 > Tue Aug 19 11:04:01.092 2014: Killing connection from > ** because the group is not ready for it to > rejoin, err 46 > Tue Aug 19 11:04:03.220 2014: Killing connection from > because the group is not ready for it to > rejoin, err 46 > Tue Aug 19 11:04:09.034 2014: Killing connection from > because the group is not ready for it to > rejoin, err 46 > Tue Aug 19 11:04:12.096 2014: Killing connection from > ** because the group is not ready for it to > rejoin, err 46 > Tue Aug 19 11:04:14.224 2014: Killing connection from > because the group is not ready for it to > rejoin, err 46 > Tue Aug 19 11:04:20.037 2014: Killing connection from > because the group is not ready for it to > rejoin, err 46 > Tue Aug 19 11:04:23.103 2014: Accepted and connected to > ** ebi5-220 > ... 
> > *GSS02a ( NSD SERVER)* > Tue Aug 19 11:03:04.980 2014: Expel (gss02b) > request from (ebi5-220 in > ebi-cluster.ebi.ac.uk). Expelling: (ebi5-220 > in ebi-cluster.ebi.ac.uk) > Tue Aug 19 11:03:12.069 2014: Accepted and connected to > ebi5-220 > > > =============================================== > *EXAMPLE 2*: > > *EBI5-038* > Tue Aug 19 11:32:34.227 2014: *Disk lease period expired > in cluster GSS.ebi.ac.uk. Attempting to reacquire lease.* > Tue Aug 19 11:33:34.258 2014: *Lease is overdue. Probing > cluster GSS.ebi.ac.uk* > Tue Aug 19 11:35:24.265 2014: Close connection to IP> gss02a (Connection reset by peer). Attempting > reconnect. > Tue Aug 19 11:35:24.865 2014: Close connection to > ebi5-014 (Connection reset by > peer). Attempting reconnect. > ... > LOT MORE RESETS BY PEER > ... > Tue Aug 19 11:35:25.096 2014: Close connection to > ebi5-167 (Connection reset by > peer). Attempting reconnect. > Tue Aug 19 11:35:25.267 2014: Connecting to > gss02a > Tue Aug 19 11:35:25.268 2014: Close connection to IP> gss02a (Connection failed because destination > is still processing previous node failure) > Tue Aug 19 11:35:26.267 2014: Retry connection to IP> gss02a > Tue Aug 19 11:35:26.268 2014: Close connection to IP> gss02a (Connection failed because destination > is still processing previous node failure) > Tue Aug 19 11:36:24.276 2014: Unable to contact any quorum > nodes during cluster probe. > Tue Aug 19 11:36:24.277 2014: *Lost membership in cluster > GSS.ebi.ac.uk. Unmounting file systems.* > > *GSS02a* > Tue Aug 19 11:35:24.263 2014: Node (ebi5-038 > in ebi-cluster.ebi.ac.uk) *is being expelled because of an > expired lease.* Pings sent: 60. Replies received: 60. > > > > > In example 1 seems that an NSD was not repliyng to the client, but the > servers seems working fine.. how can i trace better ( to solve) the > problem? > > In example 2 it seems to me that for some reason the manager are not > renewing the lease in time. when this happens , its not a single client. > Loads of them fail to get the lease renewed. Why this is happening? > how can i trace to the source of the problem? > > > > Thanks in advance for any tips. > > Regards, > Salvatore > > > > > > > > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at gpfsug.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss -------------- next part -------------- An HTML attachment was scrubbed... URL: From sdinardo at ebi.ac.uk Thu Aug 21 10:04:47 2014 From: sdinardo at ebi.ac.uk (Salvatore Di Nardo) Date: Thu, 21 Aug 2014 10:04:47 +0100 Subject: [gpfsug-discuss] gpfs client expels In-Reply-To: <53F5ABD7.80107@gpfsug.org> References: <53EE0BB1.8000005@ebi.ac.uk> <53F454E3.40803@ebi.ac.uk> <53F5ABD7.80107@gpfsug.org> Message-ID: <53F5B62F.1060305@ebi.ac.uk> Thanks for the feedback, but we managed to find a scenario that excludes network problems. we have a file called */input_file/* of nearly 100GB: if from *client A* we do: cat input_file >> output_file it start copying.. and we see waiter goeg a bit up,secs but then they flushes back to 0, so we xcan say that the copy proceed well... if now we do the same from another client ( or just another shell on the same client) *client B* : cat input_file >> output_file ( in other words we are trying to write to the same destination) all the waiters gets up until one node get expelled. 
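Spelled out, the reproduction looks like this (a sketch: the paths are only examples, and input_file is the ~100GB file mentioned above):

# On client A, start the first append:
cat /gpfs/nobackup/sdinardo/input_file >> /gpfs/nobackup/sdinardo/output_file

# While the first copy is still running, on client B (or a second shell
# on the same client), append to the same destination file:
cat /gpfs/nobackup/sdinardo/input_file >> /gpfs/nobackup/sdinardo/output_file

# Meanwhile, watch the waiters on the NSD servers:
watch -n 5 /usr/lpp/mmfs/bin/mmdiag --waiters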
Now, while it is understandable that the destination file is locked for one of the "cat" processes, which therefore has to wait (and since the file is BIG, it has to wait for a while), it is not understandable why it stops renewing its lease. Why doesn't it just return a timeout error on the copy instead of expelling the node? We can reproduce this every time, and since our users do operations like this on files over 100GB each, you can imagine the result. Even if writing to the same destination at the same time is a bit silly, it is also quite common, for example when we want to dump logs to a log file and for some reason one of the writers keeps writing for a long time, keeping the file locked. Our expels are not due to network congestion, but to one write attempt having to wait for another one. What I really don't understand is why such an extreme measure as an expel is taken just because a process is waiting "too much time". I have a ticket opened with IBM for this and the issue is under investigation, but no luck so far.

Regards,
Salvatore

On 21/08/14 09:20, Jez Tucker (Chair) wrote:
> Hi there,
>
> I've seen this on several 'stock'? 'core'? GPFS systems (we need a better term now GSS is out) and seen ping 'working', but alongside ejections from the cluster.
> The GPFS internode 'ping' is somewhat more circumspect than unix ping - and rightly so.
>
> In my experience this has _always_ been a network issue of one sort or another. If the network is experiencing issues, nodes will be ejected.
> Of course it could be unresponsive mmfsd or high loadavg, but I've seen that only twice in 10 years over many versions of GPFS.
>
> You need to follow the logs through from each machine in time order to determine who could not see who and in what order.
> Your best way forward is to log a SEV2 case with IBM support, directly or via your OEM and collect and supply a snap and traces as required by support.
>
> Without knowing your full setup, it's hard to help further.
>
> Jez
>
> On 20/08/14 08:57, Salvatore Di Nardo wrote:
>> Still problems. Here some more detailed examples:
>>
>> *EXAMPLE 1:*
>>
>> *EBI5-220* *( CLIENT)*
>> Tue Aug 19 11:03:04.980 2014: *Timed out waiting for a reply from node gss02b*
>> Tue Aug 19 11:03:04.981 2014: Request sent to (gss02a in GSS.ebi.ac.uk) to expel (gss02b in GSS.ebi.ac.uk) from cluster GSS.ebi.ac.uk
>> Tue Aug 19 11:03:04.982 2014: This node will be expelled from cluster GSS.ebi.ac.uk due to expel msg from (ebi5-220)
>> Tue Aug 19 11:03:09.319 2014: Cluster Manager connection broke. Probing cluster GSS.ebi.ac.uk
>> Tue Aug 19 11:03:10.321 2014: Unable to contact any quorum nodes during cluster probe.
>> Tue Aug 19 11:03:10.322 2014: Lost membership in cluster GSS.ebi.ac.uk. Unmounting file systems.
>> Tue Aug 19 11:03:10 BST 2014: mmcommon preunmount invoked.
File system: gpfs1 Reason: SGPanic >> Tue Aug 19 11:03:12.066 2014: Connecting to >> gss02a >> Tue Aug 19 11:03:12.070 2014: Connected to >> gss02a >> Tue Aug 19 11:03:17.071 2014: Connecting to >> gss02b >> Tue Aug 19 11:03:17.072 2014: Connecting to >> gss03b >> Tue Aug 19 11:03:17.079 2014: Connecting to >> gss03a >> Tue Aug 19 11:03:17.080 2014: Connecting to >> gss01b >> Tue Aug 19 11:03:17.079 2014: Connecting to >> gss01a >> Tue Aug 19 11:04:23.105 2014: Connected to >> gss02b >> Tue Aug 19 11:04:23.107 2014: Connected to >> gss03b >> Tue Aug 19 11:04:23.112 2014: Connected to >> gss03a >> Tue Aug 19 11:04:23.115 2014: Connected to >> gss01b >> Tue Aug 19 11:04:23.121 2014: Connected to >> gss01a >> Tue Aug 19 11:12:28.992 2014: Node (gss02a in >> GSS.ebi.ac.uk) is now the Group Leader. >> >> *GSS02B ( NSD SERVER)* >> ... >> Tue Aug 19 11:03:17.070 2014: Killing connection from >> ** because the group is not ready for it to >> rejoin, err 46 >> Tue Aug 19 11:03:25.016 2014: Killing connection from >> because the group is not ready for it to >> rejoin, err 46 >> Tue Aug 19 11:03:28.080 2014: Killing connection from >> ** because the group is not ready for it to >> rejoin, err 46 >> Tue Aug 19 11:03:36.019 2014: Killing connection from >> because the group is not ready for it to >> rejoin, err 46 >> Tue Aug 19 11:03:39.083 2014: Killing connection from >> ** because the group is not ready for it to >> rejoin, err 46 >> Tue Aug 19 11:03:47.023 2014: Killing connection from >> because the group is not ready for it to >> rejoin, err 46 >> Tue Aug 19 11:03:50.088 2014: Killing connection from >> ** because the group is not ready for it to >> rejoin, err 46 >> Tue Aug 19 11:03:52.218 2014: Killing connection from >> because the group is not ready for it to >> rejoin, err 46 >> Tue Aug 19 11:03:58.030 2014: Killing connection from >> because the group is not ready for it to >> rejoin, err 46 >> Tue Aug 19 11:04:01.092 2014: Killing connection from >> ** because the group is not ready for it to >> rejoin, err 46 >> Tue Aug 19 11:04:03.220 2014: Killing connection from >> because the group is not ready for it to >> rejoin, err 46 >> Tue Aug 19 11:04:09.034 2014: Killing connection from >> because the group is not ready for it to >> rejoin, err 46 >> Tue Aug 19 11:04:12.096 2014: Killing connection from >> ** because the group is not ready for it to >> rejoin, err 46 >> Tue Aug 19 11:04:14.224 2014: Killing connection from >> because the group is not ready for it to >> rejoin, err 46 >> Tue Aug 19 11:04:20.037 2014: Killing connection from >> because the group is not ready for it to >> rejoin, err 46 >> Tue Aug 19 11:04:23.103 2014: Accepted and connected to >> ** ebi5-220 >> ... >> >> *GSS02a ( NSD SERVER)* >> Tue Aug 19 11:03:04.980 2014: Expel (gss02b) >> request from (ebi5-220 in >> ebi-cluster.ebi.ac.uk). Expelling: >> (ebi5-220 in ebi-cluster.ebi.ac.uk) >> Tue Aug 19 11:03:12.069 2014: Accepted and connected to >> ebi5-220 >> >> >> =============================================== >> *EXAMPLE 2*: >> >> *EBI5-038* >> Tue Aug 19 11:32:34.227 2014: *Disk lease period expired >> in cluster GSS.ebi.ac.uk. Attempting to reacquire lease.* >> Tue Aug 19 11:33:34.258 2014: *Lease is overdue. Probing >> cluster GSS.ebi.ac.uk* >> Tue Aug 19 11:35:24.265 2014: Close connection to > IP> gss02a (Connection reset by peer). Attempting >> reconnect. >> Tue Aug 19 11:35:24.865 2014: Close connection to >> ebi5-014 (Connection reset by >> peer). Attempting reconnect. >> ... >> LOT MORE RESETS BY PEER >> ... 
>> Tue Aug 19 11:35:25.096 2014: Close connection to >> ebi5-167 (Connection reset by >> peer). Attempting reconnect. >> Tue Aug 19 11:35:25.267 2014: Connecting to >> gss02a >> Tue Aug 19 11:35:25.268 2014: Close connection to > IP> gss02a (Connection failed because destination >> is still processing previous node failure) >> Tue Aug 19 11:35:26.267 2014: Retry connection to > IP> gss02a >> Tue Aug 19 11:35:26.268 2014: Close connection to > IP> gss02a (Connection failed because destination >> is still processing previous node failure) >> Tue Aug 19 11:36:24.276 2014: Unable to contact any >> quorum nodes during cluster probe. >> Tue Aug 19 11:36:24.277 2014: *Lost membership in cluster >> GSS.ebi.ac.uk. Unmounting file systems.* >> >> *GSS02a* >> Tue Aug 19 11:35:24.263 2014: Node >> (ebi5-038 in ebi-cluster.ebi.ac.uk) *is being expelled >> because of an expired lease.* Pings sent: 60. Replies >> received: 60. >> >> >> >> >> In example 1 seems that an NSD was not repliyng to the client, but >> the servers seems working fine.. how can i trace better ( to solve) >> the problem? >> >> In example 2 it seems to me that for some reason the manager are not >> renewing the lease in time. when this happens , its not a single client. >> Loads of them fail to get the lease renewed. Why this is happening? >> how can i trace to the source of the problem? >> >> >> >> Thanks in advance for any tips. >> >> Regards, >> Salvatore >> >> >> >> >> >> >> >> >> _______________________________________________ >> gpfsug-discuss mailing list >> gpfsug-discuss at gpfsug.org >> http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at gpfsug.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss -------------- next part -------------- An HTML attachment was scrubbed... URL: From bbanister at jumptrading.com Thu Aug 21 13:48:38 2014 From: bbanister at jumptrading.com (Bryan Banister) Date: Thu, 21 Aug 2014 12:48:38 +0000 Subject: [gpfsug-discuss] gpfs client expels In-Reply-To: <53F5B62F.1060305@ebi.ac.uk> References: <53EE0BB1.8000005@ebi.ac.uk> <53F454E3.40803@ebi.ac.uk> <53F5ABD7.80107@gpfsug.org>,<53F5B62F.1060305@ebi.ac.uk> Message-ID: <21BC488F0AEA2245B2C3E83FC0B33DBB8263D9@CHI-EXCHANGEW2.w2k.jumptrading.com> As I understand GPFS distributed locking semantics, GPFS will not allow one node to hold a write lock for a file indefinitely. Once Client B opens the file for writing it would have contacted the File System Manager to obtain the lock. The FS manager would have told Client B that Client A has the lock and that Client B would have to contact Client A and revoke the write lock token. If Client A does not respond to Client B's request to revoke the write token, then Client B will ask that Client A be expelled from the cluster for NOT adhering to the proper protocol for write lock contention. [cid:2fb2253c-3ffb-4ac6-88a8-d019b1a24f66] Have you checked the communication path between the two clients at this point? I could not follow the logs that you provided. You should definitely look at the exact sequence of log events on the two clients and the file system manager (as reported by mmlsmgr). 
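If it helps, this is roughly what I would capture on both clients and on
the manager node while you reproduce the two concurrent writes (commands
from memory, so double-check the options against the docs for your GPFS
level):

    # which node is the cluster / file system manager
    mmlsmgr

    # long-running waiters on each client while both cats are running
    mmdiag --waiters

    # then line up the GPFS logs from both clients and the manager in
    # time order
    grep "Tue Aug 19" /var/adm/ras/mmfs.log.latest

That should show whether the token revoke ever reaches Client A and how
long it sits in the waiters before the expel is requested.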
Hope that helps, -Bryan ________________________________ From: gpfsug-discuss-bounces at gpfsug.org [gpfsug-discuss-bounces at gpfsug.org] on behalf of Salvatore Di Nardo [sdinardo at ebi.ac.uk] Sent: Thursday, August 21, 2014 4:04 AM To: chair at gpfsug.org; gpfsug main discussion list Subject: Re: [gpfsug-discuss] gpfs client expels Thanks for the feedback, but we managed to find a scenario that excludes network problems. we have a file called input_file of nearly 100GB: if from client A we do: cat input_file >> output_file it start copying.. and we see waiter goeg a bit up,secs but then they flushes back to 0, so we xcan say that the copy proceed well... if now we do the same from another client ( or just another shell on the same client) client B : cat input_file >> output_file ( in other words we are trying to write to the same destination) all the waiters gets up until one node get expelled. Now, while its understandable that the destination file is locked for one of the "cat", so have to wait ( and since the file is BIG , have to wait for a while), its not understandable why it stop the renewal lease. Why its doen't return just a timeout error on the copy instead to expel the node? We can reproduce this every time, and since our users to operations like this on files over 100GB each you can imagine the result. As you can imagine even if its a bit silly to write at the same time to the same destination, its also quite common if we want to dump to a log file logs and for some reason one of the writers, write for a lot of time keeping the file locked. Our expels are not due to network congestion, but because a write attempts have to wait another one. What i really dont understand is why to take a so expreme mesure to expell jest because a process is waiteing "to too much time". I have ticket opened to IBM for this and the issue is under investigation, but no luck so far.. Regards, Salvatore On 21/08/14 09:20, Jez Tucker (Chair) wrote: Hi there, I've seen the on several 'stock'? 'core'? GPFS system (we need a better term now GSS is out) and seen ping 'working', but alongside ejections from the cluster. The GPFS internode 'ping' is somewhat more circumspect than unix ping - and rightly so. In my experience this has _always_ been a network issue of one sort of another. If the network is experiencing issues, nodes will be ejected. Of course it could be unresponsive mmfsd or high loadavg, but I've seen that only twice in 10 years over many versions of GPFS. You need to follow the logs through from each machine in time order to determine who could not see who and in what order. Your best way forward is to log a SEV2 case with IBM support, directly or via your OEM and collect and supply a snap and traces as required by support. Without knowing your full setup, it's hard to help further. Jez On 20/08/14 08:57, Salvatore Di Nardo wrote: Still problems. Here some more detailed examples: EXAMPLE 1: EBI5-220 ( CLIENT) Tue Aug 19 11:03:04.980 2014: Timed out waiting for a reply from node gss02b Tue Aug 19 11:03:04.981 2014: Request sent to (gss02a in GSS.ebi.ac.uk) to expel (gss02b in GSS.ebi.ac.uk) from cluster GSS.ebi.ac.uk Tue Aug 19 11:03:04.982 2014: This node will be expelled from cluster GSS.ebi.ac.uk due to expel msg from (ebi5-220) Tue Aug 19 11:03:09.319 2014: Cluster Manager connection broke. Probing cluster GSS.ebi.ac.uk Tue Aug 19 11:03:10.321 2014: Unable to contact any quorum nodes during cluster probe. Tue Aug 19 11:03:10.322 2014: Lost membership in cluster GSS.ebi.ac.uk. 
Unmounting file systems. Tue Aug 19 11:03:10 BST 2014: mmcommon preunmount invoked. File system: gpfs1 Reason: SGPanic Tue Aug 19 11:03:12.066 2014: Connecting to gss02a Tue Aug 19 11:03:12.070 2014: Connected to gss02a Tue Aug 19 11:03:17.071 2014: Connecting to gss02b Tue Aug 19 11:03:17.072 2014: Connecting to gss03b Tue Aug 19 11:03:17.079 2014: Connecting to gss03a Tue Aug 19 11:03:17.080 2014: Connecting to gss01b Tue Aug 19 11:03:17.079 2014: Connecting to gss01a Tue Aug 19 11:04:23.105 2014: Connected to gss02b Tue Aug 19 11:04:23.107 2014: Connected to gss03b Tue Aug 19 11:04:23.112 2014: Connected to gss03a Tue Aug 19 11:04:23.115 2014: Connected to gss01b Tue Aug 19 11:04:23.121 2014: Connected to gss01a Tue Aug 19 11:12:28.992 2014: Node (gss02a in GSS.ebi.ac.uk) is now the Group Leader. GSS02B ( NSD SERVER) ... Tue Aug 19 11:03:17.070 2014: Killing connection from because the group is not ready for it to rejoin, err 46 Tue Aug 19 11:03:25.016 2014: Killing connection from because the group is not ready for it to rejoin, err 46 Tue Aug 19 11:03:28.080 2014: Killing connection from because the group is not ready for it to rejoin, err 46 Tue Aug 19 11:03:36.019 2014: Killing connection from because the group is not ready for it to rejoin, err 46 Tue Aug 19 11:03:39.083 2014: Killing connection from because the group is not ready for it to rejoin, err 46 Tue Aug 19 11:03:47.023 2014: Killing connection from because the group is not ready for it to rejoin, err 46 Tue Aug 19 11:03:50.088 2014: Killing connection from because the group is not ready for it to rejoin, err 46 Tue Aug 19 11:03:52.218 2014: Killing connection from because the group is not ready for it to rejoin, err 46 Tue Aug 19 11:03:58.030 2014: Killing connection from because the group is not ready for it to rejoin, err 46 Tue Aug 19 11:04:01.092 2014: Killing connection from because the group is not ready for it to rejoin, err 46 Tue Aug 19 11:04:03.220 2014: Killing connection from because the group is not ready for it to rejoin, err 46 Tue Aug 19 11:04:09.034 2014: Killing connection from because the group is not ready for it to rejoin, err 46 Tue Aug 19 11:04:12.096 2014: Killing connection from because the group is not ready for it to rejoin, err 46 Tue Aug 19 11:04:14.224 2014: Killing connection from because the group is not ready for it to rejoin, err 46 Tue Aug 19 11:04:20.037 2014: Killing connection from because the group is not ready for it to rejoin, err 46 Tue Aug 19 11:04:23.103 2014: Accepted and connected to ebi5-220 ... GSS02a ( NSD SERVER) Tue Aug 19 11:03:04.980 2014: Expel (gss02b) request from (ebi5-220 in ebi-cluster.ebi.ac.uk). Expelling: (ebi5-220 in ebi-cluster.ebi.ac.uk) Tue Aug 19 11:03:12.069 2014: Accepted and connected to ebi5-220 =============================================== EXAMPLE 2: EBI5-038 Tue Aug 19 11:32:34.227 2014: Disk lease period expired in cluster GSS.ebi.ac.uk. Attempting to reacquire lease. Tue Aug 19 11:33:34.258 2014: Lease is overdue. Probing cluster GSS.ebi.ac.uk Tue Aug 19 11:35:24.265 2014: Close connection to gss02a (Connection reset by peer). Attempting reconnect. Tue Aug 19 11:35:24.865 2014: Close connection to ebi5-014 (Connection reset by peer). Attempting reconnect. ... LOT MORE RESETS BY PEER ... Tue Aug 19 11:35:25.096 2014: Close connection to ebi5-167 (Connection reset by peer). Attempting reconnect. 
Tue Aug 19 11:35:25.267 2014: Connecting to gss02a Tue Aug 19 11:35:25.268 2014: Close connection to gss02a (Connection failed because destination is still processing previous node failure) Tue Aug 19 11:35:26.267 2014: Retry connection to gss02a Tue Aug 19 11:35:26.268 2014: Close connection to gss02a (Connection failed because destination is still processing previous node failure) Tue Aug 19 11:36:24.276 2014: Unable to contact any quorum nodes during cluster probe. Tue Aug 19 11:36:24.277 2014: Lost membership in cluster GSS.ebi.ac.uk. Unmounting file systems. GSS02a Tue Aug 19 11:35:24.263 2014: Node (ebi5-038 in ebi-cluster.ebi.ac.uk) is being expelled because of an expired lease. Pings sent: 60. Replies received: 60. In example 1 seems that an NSD was not repliyng to the client, but the servers seems working fine.. how can i trace better ( to solve) the problem? In example 2 it seems to me that for some reason the manager are not renewing the lease in time. when this happens , its not a single client. Loads of them fail to get the lease renewed. Why this is happening? how can i trace to the source of the problem? Thanks in advance for any tips. Regards, Salvatore _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at gpfsug.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at gpfsug.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss ________________________________ Note: This email is for the confidential use of the named addressee(s) only and may contain proprietary, confidential or privileged information. If you are not the intended recipient, you are hereby notified that any review, dissemination or copying of this email is strictly prohibited, and to please notify the sender immediately and destroy this email and any attachments. Email transmission cannot be guaranteed to be secure or error-free. The Company, therefore, does not make any guarantees as to the completeness or accuracy of this email or any attachments. This email is for informational purposes only and does not constitute a recommendation, offer, request or solicitation of any kind to buy, sell, subscribe, redeem or perform any type of transaction of a financial product. -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: GPFS_Token_Protocol.png Type: image/png Size: 249179 bytes Desc: GPFS_Token_Protocol.png URL: From jbernard at jumptrading.com Thu Aug 21 13:52:05 2014 From: jbernard at jumptrading.com (Jon Bernard) Date: Thu, 21 Aug 2014 12:52:05 +0000 Subject: [gpfsug-discuss] gpfs client expels In-Reply-To: <21BC488F0AEA2245B2C3E83FC0B33DBB8263D9@CHI-EXCHANGEW2.w2k.jumptrading.com> References: <53EE0BB1.8000005@ebi.ac.uk> <53F454E3.40803@ebi.ac.uk> <53F5ABD7.80107@gpfsug.org>, <53F5B62F.1060305@ebi.ac.uk>, <21BC488F0AEA2245B2C3E83FC0B33DBB8263D9@CHI-EXCHANGEW2.w2k.jumptrading.com> Message-ID: Where is that from? On Aug 21, 2014, at 7:49, "Bryan Banister" > wrote: As I understand GPFS distributed locking semantics, GPFS will not allow one node to hold a write lock for a file indefinitely. Once Client B opens the file for writing it would have contacted the File System Manager to obtain the lock. The FS manager would have told Client B that Client A has the lock and that Client B would have to contact Client A and revoke the write lock token. 
If Client A does not respond to Client B's request to revoke the write token, then Client B will ask that Client A be expelled from the cluster for NOT adhering to the proper protocol for write lock contention. Have you checked the communication path between the two clients at this point? I could not follow the logs that you provided. You should definitely look at the exact sequence of log events on the two clients and the file system manager (as reported by mmlsmgr). Hope that helps, -Bryan ________________________________ From: gpfsug-discuss-bounces at gpfsug.org [gpfsug-discuss-bounces at gpfsug.org] on behalf of Salvatore Di Nardo [sdinardo at ebi.ac.uk] Sent: Thursday, August 21, 2014 4:04 AM To: chair at gpfsug.org; gpfsug main discussion list Subject: Re: [gpfsug-discuss] gpfs client expels Thanks for the feedback, but we managed to find a scenario that excludes network problems. we have a file called input_file of nearly 100GB: if from client A we do: cat input_file >> output_file it start copying.. and we see waiter goeg a bit up,secs but then they flushes back to 0, so we xcan say that the copy proceed well... if now we do the same from another client ( or just another shell on the same client) client B : cat input_file >> output_file ( in other words we are trying to write to the same destination) all the waiters gets up until one node get expelled. Now, while its understandable that the destination file is locked for one of the "cat", so have to wait ( and since the file is BIG , have to wait for a while), its not understandable why it stop the renewal lease. Why its doen't return just a timeout error on the copy instead to expel the node? We can reproduce this every time, and since our users to operations like this on files over 100GB each you can imagine the result. As you can imagine even if its a bit silly to write at the same time to the same destination, its also quite common if we want to dump to a log file logs and for some reason one of the writers, write for a lot of time keeping the file locked. Our expels are not due to network congestion, but because a write attempts have to wait another one. What i really dont understand is why to take a so expreme mesure to expell jest because a process is waiteing "to too much time". I have ticket opened to IBM for this and the issue is under investigation, but no luck so far.. Regards, Salvatore On 21/08/14 09:20, Jez Tucker (Chair) wrote: Hi there, I've seen the on several 'stock'? 'core'? GPFS system (we need a better term now GSS is out) and seen ping 'working', but alongside ejections from the cluster. The GPFS internode 'ping' is somewhat more circumspect than unix ping - and rightly so. In my experience this has _always_ been a network issue of one sort of another. If the network is experiencing issues, nodes will be ejected. Of course it could be unresponsive mmfsd or high loadavg, but I've seen that only twice in 10 years over many versions of GPFS. You need to follow the logs through from each machine in time order to determine who could not see who and in what order. Your best way forward is to log a SEV2 case with IBM support, directly or via your OEM and collect and supply a snap and traces as required by support. Without knowing your full setup, it's hard to help further. Jez On 20/08/14 08:57, Salvatore Di Nardo wrote: Still problems. 
Here some more detailed examples: EXAMPLE 1: EBI5-220 ( CLIENT) Tue Aug 19 11:03:04.980 2014: Timed out waiting for a reply from node gss02b Tue Aug 19 11:03:04.981 2014: Request sent to (gss02a in GSS.ebi.ac.uk) to expel (gss02b in GSS.ebi.ac.uk) from cluster GSS.ebi.ac.uk Tue Aug 19 11:03:04.982 2014: This node will be expelled from cluster GSS.ebi.ac.uk due to expel msg from (ebi5-220) Tue Aug 19 11:03:09.319 2014: Cluster Manager connection broke. Probing cluster GSS.ebi.ac.uk Tue Aug 19 11:03:10.321 2014: Unable to contact any quorum nodes during cluster probe. Tue Aug 19 11:03:10.322 2014: Lost membership in cluster GSS.ebi.ac.uk. Unmounting file systems. Tue Aug 19 11:03:10 BST 2014: mmcommon preunmount invoked. File system: gpfs1 Reason: SGPanic Tue Aug 19 11:03:12.066 2014: Connecting to gss02a Tue Aug 19 11:03:12.070 2014: Connected to gss02a Tue Aug 19 11:03:17.071 2014: Connecting to gss02b Tue Aug 19 11:03:17.072 2014: Connecting to gss03b Tue Aug 19 11:03:17.079 2014: Connecting to gss03a Tue Aug 19 11:03:17.080 2014: Connecting to gss01b Tue Aug 19 11:03:17.079 2014: Connecting to gss01a Tue Aug 19 11:04:23.105 2014: Connected to gss02b Tue Aug 19 11:04:23.107 2014: Connected to gss03b Tue Aug 19 11:04:23.112 2014: Connected to gss03a Tue Aug 19 11:04:23.115 2014: Connected to gss01b Tue Aug 19 11:04:23.121 2014: Connected to gss01a Tue Aug 19 11:12:28.992 2014: Node (gss02a in GSS.ebi.ac.uk) is now the Group Leader. GSS02B ( NSD SERVER) ... Tue Aug 19 11:03:17.070 2014: Killing connection from because the group is not ready for it to rejoin, err 46 Tue Aug 19 11:03:25.016 2014: Killing connection from because the group is not ready for it to rejoin, err 46 Tue Aug 19 11:03:28.080 2014: Killing connection from because the group is not ready for it to rejoin, err 46 Tue Aug 19 11:03:36.019 2014: Killing connection from because the group is not ready for it to rejoin, err 46 Tue Aug 19 11:03:39.083 2014: Killing connection from because the group is not ready for it to rejoin, err 46 Tue Aug 19 11:03:47.023 2014: Killing connection from because the group is not ready for it to rejoin, err 46 Tue Aug 19 11:03:50.088 2014: Killing connection from because the group is not ready for it to rejoin, err 46 Tue Aug 19 11:03:52.218 2014: Killing connection from because the group is not ready for it to rejoin, err 46 Tue Aug 19 11:03:58.030 2014: Killing connection from because the group is not ready for it to rejoin, err 46 Tue Aug 19 11:04:01.092 2014: Killing connection from because the group is not ready for it to rejoin, err 46 Tue Aug 19 11:04:03.220 2014: Killing connection from because the group is not ready for it to rejoin, err 46 Tue Aug 19 11:04:09.034 2014: Killing connection from because the group is not ready for it to rejoin, err 46 Tue Aug 19 11:04:12.096 2014: Killing connection from because the group is not ready for it to rejoin, err 46 Tue Aug 19 11:04:14.224 2014: Killing connection from because the group is not ready for it to rejoin, err 46 Tue Aug 19 11:04:20.037 2014: Killing connection from because the group is not ready for it to rejoin, err 46 Tue Aug 19 11:04:23.103 2014: Accepted and connected to ebi5-220 ... GSS02a ( NSD SERVER) Tue Aug 19 11:03:04.980 2014: Expel (gss02b) request from (ebi5-220 in ebi-cluster.ebi.ac.uk). 
Expelling: (ebi5-220 in ebi-cluster.ebi.ac.uk) Tue Aug 19 11:03:12.069 2014: Accepted and connected to ebi5-220 =============================================== EXAMPLE 2: EBI5-038 Tue Aug 19 11:32:34.227 2014: Disk lease period expired in cluster GSS.ebi.ac.uk. Attempting to reacquire lease. Tue Aug 19 11:33:34.258 2014: Lease is overdue. Probing cluster GSS.ebi.ac.uk Tue Aug 19 11:35:24.265 2014: Close connection to gss02a (Connection reset by peer). Attempting reconnect. Tue Aug 19 11:35:24.865 2014: Close connection to ebi5-014 (Connection reset by peer). Attempting reconnect. ... LOT MORE RESETS BY PEER ... Tue Aug 19 11:35:25.096 2014: Close connection to ebi5-167 (Connection reset by peer). Attempting reconnect. Tue Aug 19 11:35:25.267 2014: Connecting to gss02a Tue Aug 19 11:35:25.268 2014: Close connection to gss02a (Connection failed because destination is still processing previous node failure) Tue Aug 19 11:35:26.267 2014: Retry connection to gss02a Tue Aug 19 11:35:26.268 2014: Close connection to gss02a (Connection failed because destination is still processing previous node failure) Tue Aug 19 11:36:24.276 2014: Unable to contact any quorum nodes during cluster probe. Tue Aug 19 11:36:24.277 2014: Lost membership in cluster GSS.ebi.ac.uk. Unmounting file systems. GSS02a Tue Aug 19 11:35:24.263 2014: Node (ebi5-038 in ebi-cluster.ebi.ac.uk) is being expelled because of an expired lease. Pings sent: 60. Replies received: 60. In example 1 seems that an NSD was not repliyng to the client, but the servers seems working fine.. how can i trace better ( to solve) the problem? In example 2 it seems to me that for some reason the manager are not renewing the lease in time. when this happens , its not a single client. Loads of them fail to get the lease renewed. Why this is happening? how can i trace to the source of the problem? Thanks in advance for any tips. Regards, Salvatore _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at gpfsug.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at gpfsug.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss ________________________________ Note: This email is for the confidential use of the named addressee(s) only and may contain proprietary, confidential or privileged information. If you are not the intended recipient, you are hereby notified that any review, dissemination or copying of this email is strictly prohibited, and to please notify the sender immediately and destroy this email and any attachments. Email transmission cannot be guaranteed to be secure or error-free. The Company, therefore, does not make any guarantees as to the completeness or accuracy of this email or any attachments. This email is for informational purposes only and does not constitute a recommendation, offer, request or solicitation of any kind to buy, sell, subscribe, redeem or perform any type of transaction of a financial product. _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at gpfsug.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss ________________________________ Note: This email is for the confidential use of the named addressee(s) only and may contain proprietary, confidential or privileged information. 
If you are not the intended recipient, you are hereby notified that any review, dissemination or copying of this email is strictly prohibited, and to please notify the sender immediately and destroy this email and any attachments. Email transmission cannot be guaranteed to be secure or error-free. The Company, therefore, does not make any guarantees as to the completeness or accuracy of this email or any attachments. This email is for informational purposes only and does not constitute a recommendation, offer, request or solicitation of any kind to buy, sell, subscribe, redeem or perform any type of transaction of a financial product. -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: GPFS_Token_Protocol.png Type: image/png Size: 249179 bytes Desc: GPFS_Token_Protocol.png URL: From viccornell at gmail.com Thu Aug 21 14:03:14 2014 From: viccornell at gmail.com (Vic Cornell) Date: Thu, 21 Aug 2014 14:03:14 +0100 Subject: [gpfsug-discuss] gpfs client expels In-Reply-To: <53F5B62F.1060305@ebi.ac.uk> References: <53EE0BB1.8000005@ebi.ac.uk> <53F454E3.40803@ebi.ac.uk> <53F5ABD7.80107@gpfsug.org> <53F5B62F.1060305@ebi.ac.uk> Message-ID: <9B247872-CD75-4F86-A10E-33AAB6BD414A@gmail.com> Hi Salvatore, Are you using ethernet or infiniband as the GPFS interconnect to your clients? If 10/40GbE - do you have a separate admin network? I have seen behaviour similar to this where the storage traffic causes congestion and the "admin" traffic gets lost or delayed causing expels. Vic On 21 Aug 2014, at 10:04, Salvatore Di Nardo wrote: > Thanks for the feedback, but we managed to find a scenario that excludes network problems. > > we have a file called input_file of nearly 100GB: > > if from client A we do: > > cat input_file >> output_file > > it start copying.. and we see waiter goeg a bit up,secs but then they flushes back to 0, so we xcan say that the copy proceed well... > > > if now we do the same from another client ( or just another shell on the same client) client B : > > cat input_file >> output_file > > > ( in other words we are trying to write to the same destination) all the waiters gets up until one node get expelled. > > > Now, while its understandable that the destination file is locked for one of the "cat", so have to wait ( and since the file is BIG , have to wait for a while), its not understandable why it stop the renewal lease. > Why its doen't return just a timeout error on the copy instead to expel the node? We can reproduce this every time, and since our users to operations like this on files over 100GB each you can imagine the result. > > > > As you can imagine even if its a bit silly to write at the same time to the same destination, its also quite common if we want to dump to a log file logs and for some reason one of the writers, write for a lot of time keeping the file locked. > Our expels are not due to network congestion, but because a write attempts have to wait another one. What i really dont understand is why to take a so expreme mesure to expell jest because a process is waiteing "to too much time". > > > I have ticket opened to IBM for this and the issue is under investigation, but no luck so far.. > > Regards, > Salvatore > > > > On 21/08/14 09:20, Jez Tucker (Chair) wrote: >> Hi there, >> >> I've seen the on several 'stock'? 'core'? GPFS system (we need a better term now GSS is out) and seen ping 'working', but alongside ejections from the cluster. 
>> The GPFS internode 'ping' is somewhat more circumspect than unix ping - and rightly so. >> >> In my experience this has _always_ been a network issue of one sort of another. If the network is experiencing issues, nodes will be ejected. >> Of course it could be unresponsive mmfsd or high loadavg, but I've seen that only twice in 10 years over many versions of GPFS. >> >> You need to follow the logs through from each machine in time order to determine who could not see who and in what order. >> Your best way forward is to log a SEV2 case with IBM support, directly or via your OEM and collect and supply a snap and traces as required by support. >> >> Without knowing your full setup, it's hard to help further. >> >> Jez >> >> On 20/08/14 08:57, Salvatore Di Nardo wrote: >>> Still problems. Here some more detailed examples: >>> >>> EXAMPLE 1: >>> EBI5-220 ( CLIENT) >>> Tue Aug 19 11:03:04.980 2014: Timed out waiting for a reply from node gss02b >>> Tue Aug 19 11:03:04.981 2014: Request sent to (gss02a in GSS.ebi.ac.uk) to expel (gss02b in GSS.ebi.ac.uk) from cluster GSS.ebi.ac.uk >>> Tue Aug 19 11:03:04.982 2014: This node will be expelled from cluster GSS.ebi.ac.uk due to expel msg from (ebi5-220) >>> Tue Aug 19 11:03:09.319 2014: Cluster Manager connection broke. Probing cluster GSS.ebi.ac.uk >>> Tue Aug 19 11:03:10.321 2014: Unable to contact any quorum nodes during cluster probe. >>> Tue Aug 19 11:03:10.322 2014: Lost membership in cluster GSS.ebi.ac.uk. Unmounting file systems. >>> Tue Aug 19 11:03:10 BST 2014: mmcommon preunmount invoked. File system: gpfs1 Reason: SGPanic >>> Tue Aug 19 11:03:12.066 2014: Connecting to gss02a >>> Tue Aug 19 11:03:12.070 2014: Connected to gss02a >>> Tue Aug 19 11:03:17.071 2014: Connecting to gss02b >>> Tue Aug 19 11:03:17.072 2014: Connecting to gss03b >>> Tue Aug 19 11:03:17.079 2014: Connecting to gss03a >>> Tue Aug 19 11:03:17.080 2014: Connecting to gss01b >>> Tue Aug 19 11:03:17.079 2014: Connecting to gss01a >>> Tue Aug 19 11:04:23.105 2014: Connected to gss02b >>> Tue Aug 19 11:04:23.107 2014: Connected to gss03b >>> Tue Aug 19 11:04:23.112 2014: Connected to gss03a >>> Tue Aug 19 11:04:23.115 2014: Connected to gss01b >>> Tue Aug 19 11:04:23.121 2014: Connected to gss01a >>> Tue Aug 19 11:12:28.992 2014: Node (gss02a in GSS.ebi.ac.uk) is now the Group Leader. >>> >>> GSS02B ( NSD SERVER) >>> ... 
>>> Tue Aug 19 11:03:17.070 2014: Killing connection from because the group is not ready for it to rejoin, err 46 >>> Tue Aug 19 11:03:25.016 2014: Killing connection from because the group is not ready for it to rejoin, err 46 >>> Tue Aug 19 11:03:28.080 2014: Killing connection from because the group is not ready for it to rejoin, err 46 >>> Tue Aug 19 11:03:36.019 2014: Killing connection from because the group is not ready for it to rejoin, err 46 >>> Tue Aug 19 11:03:39.083 2014: Killing connection from because the group is not ready for it to rejoin, err 46 >>> Tue Aug 19 11:03:47.023 2014: Killing connection from because the group is not ready for it to rejoin, err 46 >>> Tue Aug 19 11:03:50.088 2014: Killing connection from because the group is not ready for it to rejoin, err 46 >>> Tue Aug 19 11:03:52.218 2014: Killing connection from because the group is not ready for it to rejoin, err 46 >>> Tue Aug 19 11:03:58.030 2014: Killing connection from because the group is not ready for it to rejoin, err 46 >>> Tue Aug 19 11:04:01.092 2014: Killing connection from because the group is not ready for it to rejoin, err 46 >>> Tue Aug 19 11:04:03.220 2014: Killing connection from because the group is not ready for it to rejoin, err 46 >>> Tue Aug 19 11:04:09.034 2014: Killing connection from because the group is not ready for it to rejoin, err 46 >>> Tue Aug 19 11:04:12.096 2014: Killing connection from because the group is not ready for it to rejoin, err 46 >>> Tue Aug 19 11:04:14.224 2014: Killing connection from because the group is not ready for it to rejoin, err 46 >>> Tue Aug 19 11:04:20.037 2014: Killing connection from because the group is not ready for it to rejoin, err 46 >>> Tue Aug 19 11:04:23.103 2014: Accepted and connected to ebi5-220 >>> ... >>> >>> GSS02a ( NSD SERVER) >>> Tue Aug 19 11:03:04.980 2014: Expel (gss02b) request from (ebi5-220 in ebi-cluster.ebi.ac.uk). Expelling: (ebi5-220 in ebi-cluster.ebi.ac.uk) >>> Tue Aug 19 11:03:12.069 2014: Accepted and connected to ebi5-220 >>> >>> >>> =============================================== >>> EXAMPLE 2: >>> >>> EBI5-038 >>> Tue Aug 19 11:32:34.227 2014: Disk lease period expired in cluster GSS.ebi.ac.uk. Attempting to reacquire lease. >>> Tue Aug 19 11:33:34.258 2014: Lease is overdue. Probing cluster GSS.ebi.ac.uk >>> Tue Aug 19 11:35:24.265 2014: Close connection to gss02a (Connection reset by peer). Attempting reconnect. >>> Tue Aug 19 11:35:24.865 2014: Close connection to ebi5-014 (Connection reset by peer). Attempting reconnect. >>> ... >>> LOT MORE RESETS BY PEER >>> ... >>> Tue Aug 19 11:35:25.096 2014: Close connection to ebi5-167 (Connection reset by peer). Attempting reconnect. >>> Tue Aug 19 11:35:25.267 2014: Connecting to gss02a >>> Tue Aug 19 11:35:25.268 2014: Close connection to gss02a (Connection failed because destination is still processing previous node failure) >>> Tue Aug 19 11:35:26.267 2014: Retry connection to gss02a >>> Tue Aug 19 11:35:26.268 2014: Close connection to gss02a (Connection failed because destination is still processing previous node failure) >>> Tue Aug 19 11:36:24.276 2014: Unable to contact any quorum nodes during cluster probe. >>> Tue Aug 19 11:36:24.277 2014: Lost membership in cluster GSS.ebi.ac.uk. Unmounting file systems. >>> >>> GSS02a >>> Tue Aug 19 11:35:24.263 2014: Node (ebi5-038 in ebi-cluster.ebi.ac.uk) is being expelled because of an expired lease. Pings sent: 60. Replies received: 60. 
>>>
>>> In example 1 seems that an NSD was not repliyng to the client, but
>>> the servers seems working fine.. how can i trace better ( to solve)
>>> the problem?
>>>
>>> In example 2 it seems to me that for some reason the manager are not
>>> renewing the lease in time. when this happens , its not a single client.
>>> Loads of them fail to get the lease renewed. Why this is happening?
>>> how can i trace to the source of the problem?
>>>
>>> Thanks in advance for any tips.
>>>
>>> Regards,
>>> Salvatore
>>>
>>> _______________________________________________
>>> gpfsug-discuss mailing list
>>> gpfsug-discuss at gpfsug.org
>>> http://gpfsug.org/mailman/listinfo/gpfsug-discuss
>>
>>
>> _______________________________________________
>> gpfsug-discuss mailing list
>> gpfsug-discuss at gpfsug.org
>> http://gpfsug.org/mailman/listinfo/gpfsug-discuss
>
> _______________________________________________
> gpfsug-discuss mailing list
> gpfsug-discuss at gpfsug.org
> http://gpfsug.org/mailman/listinfo/gpfsug-discuss
-------------- next part --------------
An HTML attachment was scrubbed...
URL: 
From sdinardo at ebi.ac.uk  Thu Aug 21 14:04:59 2014
From: sdinardo at ebi.ac.uk (Salvatore Di Nardo)
Date: Thu, 21 Aug 2014 14:04:59 +0100
Subject: [gpfsug-discuss] gpfs client expels
In-Reply-To: <21BC488F0AEA2245B2C3E83FC0B33DBB8263D9@CHI-EXCHANGEW2.w2k.jumptrading.com>
References: <53EE0BB1.8000005@ebi.ac.uk> <53F454E3.40803@ebi.ac.uk>
	<53F5ABD7.80107@gpfsug.org>, <53F5B62F.1060305@ebi.ac.uk>
	<21BC488F0AEA2245B2C3E83FC0B33DBB8263D9@CHI-EXCHANGEW2.w2k.jumptrading.com>
Message-ID: <53F5EE7B.2080306@ebi.ac.uk>

Thanks for the info... it helps a bit in understanding what's going on,
but I think you missed that Node A and Node B could also be the same
machine. If, for instance, I run the two copies on the same machine,
then Client B cannot have problems contacting Client A, since they are
the same machine. BTW, I did the same test using two separate clients
and the result is the same. Nonetheless, your description made me
understand a bit better what's going on.

Regards,
Salvatore

On 21/08/14 13:48, Bryan Banister wrote:
> As I understand GPFS distributed locking semantics, GPFS will not
> allow one node to hold a write lock for a file indefinitely. Once
> Client B opens the file for writing it would have contacted the File
> System Manager to obtain the lock. The FS manager would have told
> Client B that Client A has the lock and that Client B would have to
> contact Client A and revoke the write lock token. If Client A does
> not respond to Client B's request to revoke the write token, then
> Client B will ask that Client A be expelled from the cluster for NOT
> adhering to the proper protocol for write lock contention.
>
>
> Have you checked the communication path between the two clients at
> this point?
>
> I could not follow the logs that you provided. You should definitely
> look at the exact sequence of log events on the two clients and the
> file system manager (as reported by mmlsmgr).
>
> Hope that helps,
> -Bryan
>
> ------------------------------------------------------------------------
> *From:* gpfsug-discuss-bounces at gpfsug.org
> [gpfsug-discuss-bounces at gpfsug.org] on behalf of Salvatore Di Nardo
> [sdinardo at ebi.ac.uk]
> *Sent:* Thursday, August 21, 2014 4:04 AM
> *To:* chair at gpfsug.org; gpfsug main discussion list
> *Subject:* Re: [gpfsug-discuss] gpfs client expels
>
> Thanks for the feedback, but we managed to find a scenario that
> excludes network problems.
> > we have a file called */input_file/* of nearly 100GB: > > if from *client A* we do: > > cat input_file >> output_file > > it start copying.. and we see waiter goeg a bit up,secs but then they > flushes back to 0, so we xcan say that the copy proceed well... > > > if now we do the same from another client ( or just another shell on > the same client) *client B* : > > cat input_file >> output_file > > > ( in other words we are trying to write to the same destination) all > the waiters gets up until one node get expelled. > > > Now, while its understandable that the destination file is locked for > one of the "cat", so have to wait ( and since the file is BIG , have > to wait for a while), its not understandable why it stop the renewal > lease. > Why its doen't return just a timeout error on the copy instead to > expel the node? We can reproduce this every time, and since our users > to operations like this on files over 100GB each you can imagine the > result. > > > > As you can imagine even if its a bit silly to write at the same time > to the same destination, its also quite common if we want to dump to a > log file logs and for some reason one of the writers, write for a lot > of time keeping the file locked. > Our expels are not due to network congestion, but because a write > attempts have to wait another one. What i really dont understand is > why to take a so expreme mesure to expell jest because a process is > waiteing "to too much time". > > > I have ticket opened to IBM for this and the issue is under > investigation, but no luck so far.. > > Regards, > Salvatore > > > > On 21/08/14 09:20, Jez Tucker (Chair) wrote: >> Hi there, >> >> I've seen the on several 'stock'? 'core'? GPFS system (we need a >> better term now GSS is out) and seen ping 'working', but alongside >> ejections from the cluster. >> The GPFS internode 'ping' is somewhat more circumspect than unix ping >> - and rightly so. >> >> In my experience this has _always_ been a network issue of one sort >> of another. If the network is experiencing issues, nodes will be >> ejected. >> Of course it could be unresponsive mmfsd or high loadavg, but I've >> seen that only twice in 10 years over many versions of GPFS. >> >> You need to follow the logs through from each machine in time order >> to determine who could not see who and in what order. >> Your best way forward is to log a SEV2 case with IBM support, >> directly or via your OEM and collect and supply a snap and traces as >> required by support. >> >> Without knowing your full setup, it's hard to help further. >> >> Jez >> >> On 20/08/14 08:57, Salvatore Di Nardo wrote: >>> Still problems. Here some more detailed examples: >>> >>> *EXAMPLE 1:* >>> >>> *EBI5-220**( CLIENT)** >>> *Tue Aug 19 11:03:04.980 2014: *Timed out waiting for a >>> reply from node gss02b* >>> Tue Aug 19 11:03:04.981 2014: Request sent to >> IP> (gss02a in GSS.ebi.ac.uk) to expel >>> (gss02b in GSS.ebi.ac.uk) from cluster GSS.ebi.ac.uk >>> Tue Aug 19 11:03:04.982 2014: This node will be expelled >>> from cluster GSS.ebi.ac.uk due to expel msg from >>> (ebi5-220) >>> Tue Aug 19 11:03:09.319 2014: Cluster Manager connection >>> broke. Probing cluster GSS.ebi.ac.uk >>> Tue Aug 19 11:03:10.321 2014: Unable to contact any >>> quorum nodes during cluster probe. >>> Tue Aug 19 11:03:10.322 2014: Lost membership in cluster >>> GSS.ebi.ac.uk. Unmounting file systems. >>> Tue Aug 19 11:03:10 BST 2014: mmcommon preunmount >>> invoked. 
File system: gpfs1 Reason: SGPanic >>> Tue Aug 19 11:03:12.066 2014: Connecting to >>> gss02a >>> Tue Aug 19 11:03:12.070 2014: Connected to >>> gss02a >>> Tue Aug 19 11:03:17.071 2014: Connecting to >>> gss02b >>> Tue Aug 19 11:03:17.072 2014: Connecting to >>> gss03b >>> Tue Aug 19 11:03:17.079 2014: Connecting to >>> gss03a >>> Tue Aug 19 11:03:17.080 2014: Connecting to >>> gss01b >>> Tue Aug 19 11:03:17.079 2014: Connecting to >>> gss01a >>> Tue Aug 19 11:04:23.105 2014: Connected to >>> gss02b >>> Tue Aug 19 11:04:23.107 2014: Connected to >>> gss03b >>> Tue Aug 19 11:04:23.112 2014: Connected to >>> gss03a >>> Tue Aug 19 11:04:23.115 2014: Connected to >>> gss01b >>> Tue Aug 19 11:04:23.121 2014: Connected to >>> gss01a >>> Tue Aug 19 11:12:28.992 2014: Node (gss02a >>> in GSS.ebi.ac.uk) is now the Group Leader. >>> >>> *GSS02B ( NSD SERVER)* >>> ... >>> Tue Aug 19 11:03:17.070 2014: Killing connection from >>> ** because the group is not ready for it to >>> rejoin, err 46 >>> Tue Aug 19 11:03:25.016 2014: Killing connection from >>> because the group is not ready for it to >>> rejoin, err 46 >>> Tue Aug 19 11:03:28.080 2014: Killing connection from >>> ** because the group is not ready for it to >>> rejoin, err 46 >>> Tue Aug 19 11:03:36.019 2014: Killing connection from >>> because the group is not ready for it to >>> rejoin, err 46 >>> Tue Aug 19 11:03:39.083 2014: Killing connection from >>> ** because the group is not ready for it to >>> rejoin, err 46 >>> Tue Aug 19 11:03:47.023 2014: Killing connection from >>> because the group is not ready for it to >>> rejoin, err 46 >>> Tue Aug 19 11:03:50.088 2014: Killing connection from >>> ** because the group is not ready for it to >>> rejoin, err 46 >>> Tue Aug 19 11:03:52.218 2014: Killing connection from >>> because the group is not ready for it to >>> rejoin, err 46 >>> Tue Aug 19 11:03:58.030 2014: Killing connection from >>> because the group is not ready for it to >>> rejoin, err 46 >>> Tue Aug 19 11:04:01.092 2014: Killing connection from >>> ** because the group is not ready for it to >>> rejoin, err 46 >>> Tue Aug 19 11:04:03.220 2014: Killing connection from >>> because the group is not ready for it to >>> rejoin, err 46 >>> Tue Aug 19 11:04:09.034 2014: Killing connection from >>> because the group is not ready for it to >>> rejoin, err 46 >>> Tue Aug 19 11:04:12.096 2014: Killing connection from >>> ** because the group is not ready for it to >>> rejoin, err 46 >>> Tue Aug 19 11:04:14.224 2014: Killing connection from >>> because the group is not ready for it to >>> rejoin, err 46 >>> Tue Aug 19 11:04:20.037 2014: Killing connection from >>> because the group is not ready for it to >>> rejoin, err 46 >>> Tue Aug 19 11:04:23.103 2014: Accepted and connected to >>> ** ebi5-220 >>> ... >>> >>> *GSS02a ( NSD SERVER)* >>> Tue Aug 19 11:03:04.980 2014: Expel (gss02b) >>> request from (ebi5-220 in >>> ebi-cluster.ebi.ac.uk). Expelling: >>> (ebi5-220 in ebi-cluster.ebi.ac.uk) >>> Tue Aug 19 11:03:12.069 2014: Accepted and connected to >>> ebi5-220 >>> >>> >>> =============================================== >>> *EXAMPLE 2*: >>> >>> *EBI5-038* >>> Tue Aug 19 11:32:34.227 2014: *Disk lease period expired >>> in cluster GSS.ebi.ac.uk. Attempting to reacquire lease.* >>> Tue Aug 19 11:33:34.258 2014: *Lease is overdue. Probing >>> cluster GSS.ebi.ac.uk* >>> Tue Aug 19 11:35:24.265 2014: Close connection to >>> gss02a (Connection reset by peer). >>> Attempting reconnect. 
>>> Tue Aug 19 11:35:24.865 2014: Close connection to >>> ebi5-014 (Connection reset by >>> peer). Attempting reconnect. >>> ... >>> LOT MORE RESETS BY PEER >>> ... >>> Tue Aug 19 11:35:25.096 2014: Close connection to >>> ebi5-167 (Connection reset by >>> peer). Attempting reconnect. >>> Tue Aug 19 11:35:25.267 2014: Connecting to >>> gss02a >>> Tue Aug 19 11:35:25.268 2014: Close connection to >>> gss02a (Connection failed because >>> destination is still processing previous node failure) >>> Tue Aug 19 11:35:26.267 2014: Retry connection to >>> gss02a >>> Tue Aug 19 11:35:26.268 2014: Close connection to >>> gss02a (Connection failed because >>> destination is still processing previous node failure) >>> Tue Aug 19 11:36:24.276 2014: Unable to contact any >>> quorum nodes during cluster probe. >>> Tue Aug 19 11:36:24.277 2014: *Lost membership in >>> cluster GSS.ebi.ac.uk. Unmounting file systems.* >>> >>> *GSS02a* >>> Tue Aug 19 11:35:24.263 2014: Node >>> (ebi5-038 in ebi-cluster.ebi.ac.uk) *is being expelled >>> because of an expired lease.* Pings sent: 60. Replies >>> received: 60. >>> >>> >>> >>> >>> In example 1 seems that an NSD was not repliyng to the client, but >>> the servers seems working fine.. how can i trace better ( to solve) >>> the problem? >>> >>> In example 2 it seems to me that for some reason the manager are not >>> renewing the lease in time. when this happens , its not a single >>> client. >>> Loads of them fail to get the lease renewed. Why this is happening? >>> how can i trace to the source of the problem? >>> >>> >>> >>> Thanks in advance for any tips. >>> >>> Regards, >>> Salvatore >>> >>> >>> >>> >>> >>> >>> >>> >>> _______________________________________________ >>> gpfsug-discuss mailing list >>> gpfsug-discuss at gpfsug.org >>> http://gpfsug.org/mailman/listinfo/gpfsug-discuss >> >> >> >> _______________________________________________ >> gpfsug-discuss mailing list >> gpfsug-discuss at gpfsug.org >> http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > > ------------------------------------------------------------------------ > > Note: This email is for the confidential use of the named addressee(s) > only and may contain proprietary, confidential or privileged > information. If you are not the intended recipient, you are hereby > notified that any review, dissemination or copying of this email is > strictly prohibited, and to please notify the sender immediately and > destroy this email and any attachments. Email transmission cannot be > guaranteed to be secure or error-free. The Company, therefore, does > not make any guarantees as to the completeness or accuracy of this > email or any attachments. This email is for informational purposes > only and does not constitute a recommendation, offer, request or > solicitation of any kind to buy, sell, subscribe, redeem or perform > any type of transaction of a financial product. > > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at gpfsug.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... 
Name: not available
Type: image/png
Size: 249179 bytes
Desc: not available
URL: 
From sdinardo at ebi.ac.uk  Thu Aug 21 14:18:19 2014
From: sdinardo at ebi.ac.uk (Salvatore Di Nardo)
Date: Thu, 21 Aug 2014 14:18:19 +0100
Subject: [gpfsug-discuss] gpfs client expels
In-Reply-To: <9B247872-CD75-4F86-A10E-33AAB6BD414A@gmail.com>
References: <53EE0BB1.8000005@ebi.ac.uk> <53F454E3.40803@ebi.ac.uk>
	<53F5ABD7.80107@gpfsug.org> <53F5B62F.1060305@ebi.ac.uk>
	<9B247872-CD75-4F86-A10E-33AAB6BD414A@gmail.com>
Message-ID: <53F5F19B.1010603@ebi.ac.uk>

This is an interesting point!

We use ethernet (10G links on the clients), but we do not have a separate
network for the admin traffic.

Could you explain this a bit further? The clients and the servers are on
different subnets, so the packets are routed, and I don't see a practical
way to separate the traffic. The clients are blades in a chassis, so even
if I create two interfaces they will physically use the same cable to
reach the first switch. The clients (around 600 of them) are also spread
over different subnets.

I will forward this consideration to our network admins to see if we can
work on a dedicated network.

Thanks for your tip.

Regards,
Salvatore

On 21/08/14 14:03, Vic Cornell wrote:
> Hi Salvatore,
>
> Are you using ethernet or infiniband as the GPFS interconnect to your
> clients?
>
> If 10/40GbE - do you have a separate admin network?
>
> I have seen behaviour similar to this where the storage traffic causes
> congestion and the "admin" traffic gets lost or delayed causing expels.
>
> Vic
>
>
> On 21 Aug 2014, at 10:04, Salvatore Di Nardo wrote:
>
>> Thanks for the feedback, but we managed to find a scenario that
>> excludes network problems.
>>
>> we have a file called */input_file/* of nearly 100GB:
>>
>> if from *client A* we do:
>>
>> cat input_file >> output_file
>>
>> it start copying.. and we see waiter goeg a bit up,secs but then they
>> flushes back to 0, so we xcan say that the copy proceed well...
>>
>> if now we do the same from another client ( or just another shell on
>> the same client) *client B* :
>>
>> cat input_file >> output_file
>>
>> ( in other words we are trying to write to the same destination) all
>> the waiters gets up until one node get expelled.
>>
>> Now, while its understandable that the destination file is locked for
>> one of the "cat", so have to wait ( and since the file is BIG , have
>> to wait for a while), its not understandable why it stop the renewal
>> lease.
>> Why its doen't return just a timeout error on the copy instead to
>> expel the node? We can reproduce this every time, and since our users
>> to operations like this on files over 100GB each you can imagine the
>> result.
>>
>> As you can imagine even if its a bit silly to write at the same time
>> to the same destination, its also quite common if we want to dump to
>> a log file logs and for some reason one of the writers, write for a
>> lot of time keeping the file locked.
>> Our expels are not due to network congestion, but because a write
>> attempts have to wait another one. What i really dont understand is
>> why to take a so expreme mesure to expell jest because a process is
>> waiteing "to too much time".
>>
>> I have ticket opened to IBM for this and the issue is under
>> investigation, but no luck so far..
>>
>> Regards,
>> Salvatore
>>
>>
>> On 21/08/14 09:20, Jez Tucker (Chair) wrote:
>>> Hi there,
>>>
>>> I've seen the on several 'stock'? 'core'?
GPFS system (we need a >>> better term now GSS is out) and seen ping 'working', but alongside >>> ejections from the cluster. >>> The GPFS internode 'ping' is somewhat more circumspect than unix >>> ping - and rightly so. >>> >>> In my experience this has _always_ been a network issue of one sort >>> of another. If the network is experiencing issues, nodes will be >>> ejected. >>> Of course it could be unresponsive mmfsd or high loadavg, but I've >>> seen that only twice in 10 years over many versions of GPFS. >>> >>> You need to follow the logs through from each machine in time order >>> to determine who could not see who and in what order. >>> Your best way forward is to log a SEV2 case with IBM support, >>> directly or via your OEM and collect and supply a snap and traces as >>> required by support. >>> >>> Without knowing your full setup, it's hard to help further. >>> >>> Jez >>> >>> On 20/08/14 08:57, Salvatore Di Nardo wrote: >>>> Still problems. Here some more detailed examples: >>>> >>>> *EXAMPLE 1:* >>>> >>>> *EBI5-220**( CLIENT)** >>>> *Tue Aug 19 11:03:04.980 2014: *Timed out waiting for a >>>> reply from node gss02b* >>>> Tue Aug 19 11:03:04.981 2014: Request sent to >>> IP> (gss02a in GSS.ebi.ac.uk ) to >>>> expel (gss02b in GSS.ebi.ac.uk >>>> ) from cluster GSS.ebi.ac.uk >>>> >>>> Tue Aug 19 11:03:04.982 2014: This node will be >>>> expelled from cluster GSS.ebi.ac.uk >>>> due to expel msg from >>> IP> (ebi5-220) >>>> Tue Aug 19 11:03:09.319 2014: Cluster Manager >>>> connection broke. Probing cluster GSS.ebi.ac.uk >>>> >>>> Tue Aug 19 11:03:10.321 2014: Unable to contact any >>>> quorum nodes during cluster probe. >>>> Tue Aug 19 11:03:10.322 2014: Lost membership in >>>> cluster GSS.ebi.ac.uk . >>>> Unmounting file systems. >>>> Tue Aug 19 11:03:10 BST 2014: mmcommon preunmount >>>> invoked. File system: gpfs1 Reason: SGPanic >>>> Tue Aug 19 11:03:12.066 2014: Connecting to >>>> gss02a >>>> Tue Aug 19 11:03:12.070 2014: Connected to >>>> gss02a >>>> Tue Aug 19 11:03:17.071 2014: Connecting to >>>> gss02b >>>> Tue Aug 19 11:03:17.072 2014: Connecting to >>>> gss03b >>>> Tue Aug 19 11:03:17.079 2014: Connecting to >>>> gss03a >>>> Tue Aug 19 11:03:17.080 2014: Connecting to >>>> gss01b >>>> Tue Aug 19 11:03:17.079 2014: Connecting to >>>> gss01a >>>> Tue Aug 19 11:04:23.105 2014: Connected to >>>> gss02b >>>> Tue Aug 19 11:04:23.107 2014: Connected to >>>> gss03b >>>> Tue Aug 19 11:04:23.112 2014: Connected to >>>> gss03a >>>> Tue Aug 19 11:04:23.115 2014: Connected to >>>> gss01b >>>> Tue Aug 19 11:04:23.121 2014: Connected to >>>> gss01a >>>> Tue Aug 19 11:12:28.992 2014: Node (gss02a >>>> in GSS.ebi.ac.uk ) is now the >>>> Group Leader. >>>> >>>> *GSS02B ( NSD SERVER)* >>>> ... 
>>>> Tue Aug 19 11:03:17.070 2014: Killing connection from >>>> ** because the group is not ready for it >>>> to rejoin, err 46 >>>> Tue Aug 19 11:03:25.016 2014: Killing connection from >>>> because the group is not ready for it to >>>> rejoin, err 46 >>>> Tue Aug 19 11:03:28.080 2014: Killing connection from >>>> ** because the group is not ready for it >>>> to rejoin, err 46 >>>> Tue Aug 19 11:03:36.019 2014: Killing connection from >>>> because the group is not ready for it to >>>> rejoin, err 46 >>>> Tue Aug 19 11:03:39.083 2014: Killing connection from >>>> ** because the group is not ready for it >>>> to rejoin, err 46 >>>> Tue Aug 19 11:03:47.023 2014: Killing connection from >>>> because the group is not ready for it to >>>> rejoin, err 46 >>>> Tue Aug 19 11:03:50.088 2014: Killing connection from >>>> ** because the group is not ready for it >>>> to rejoin, err 46 >>>> Tue Aug 19 11:03:52.218 2014: Killing connection from >>>> because the group is not ready for it to >>>> rejoin, err 46 >>>> Tue Aug 19 11:03:58.030 2014: Killing connection from >>>> because the group is not ready for it to >>>> rejoin, err 46 >>>> Tue Aug 19 11:04:01.092 2014: Killing connection from >>>> ** because the group is not ready for it >>>> to rejoin, err 46 >>>> Tue Aug 19 11:04:03.220 2014: Killing connection from >>>> because the group is not ready for it to >>>> rejoin, err 46 >>>> Tue Aug 19 11:04:09.034 2014: Killing connection from >>>> because the group is not ready for it to >>>> rejoin, err 46 >>>> Tue Aug 19 11:04:12.096 2014: Killing connection from >>>> ** because the group is not ready for it >>>> to rejoin, err 46 >>>> Tue Aug 19 11:04:14.224 2014: Killing connection from >>>> because the group is not ready for it to >>>> rejoin, err 46 >>>> Tue Aug 19 11:04:20.037 2014: Killing connection from >>>> because the group is not ready for it to >>>> rejoin, err 46 >>>> Tue Aug 19 11:04:23.103 2014: Accepted and connected to >>>> ** ebi5-220 >>>> ... >>>> >>>> *GSS02a ( NSD SERVER)* >>>> Tue Aug 19 11:03:04.980 2014: Expel >>>> (gss02b) request from (ebi5-220 in >>>> ebi-cluster.ebi.ac.uk ). >>>> Expelling: (ebi5-220 in >>>> ebi-cluster.ebi.ac.uk ) >>>> Tue Aug 19 11:03:12.069 2014: Accepted and connected to >>>> ebi5-220 >>>> >>>> >>>> =============================================== >>>> *EXAMPLE 2*: >>>> >>>> *EBI5-038* >>>> Tue Aug 19 11:32:34.227 2014: *Disk lease period >>>> expired in cluster GSS.ebi.ac.uk >>>> . Attempting to reacquire lease.* >>>> Tue Aug 19 11:33:34.258 2014: *Lease is overdue. >>>> Probing cluster GSS.ebi.ac.uk * >>>> Tue Aug 19 11:35:24.265 2014: Close connection to >>>> gss02a (Connection reset by peer). >>>> Attempting reconnect. >>>> Tue Aug 19 11:35:24.865 2014: Close connection to >>>> ebi5-014 (Connection reset by >>>> peer). Attempting reconnect. >>>> ... >>>> LOT MORE RESETS BY PEER >>>> ... >>>> Tue Aug 19 11:35:25.096 2014: Close connection to >>>> ebi5-167 (Connection reset by >>>> peer). Attempting reconnect. >>>> Tue Aug 19 11:35:25.267 2014: Connecting to >>>> gss02a >>>> Tue Aug 19 11:35:25.268 2014: Close connection to >>>> gss02a (Connection failed because >>>> destination is still processing previous node failure) >>>> Tue Aug 19 11:35:26.267 2014: Retry connection to >>>> gss02a >>>> Tue Aug 19 11:35:26.268 2014: Close connection to >>>> gss02a (Connection failed because >>>> destination is still processing previous node failure) >>>> Tue Aug 19 11:36:24.276 2014: Unable to contact any >>>> quorum nodes during cluster probe. 
>>>> Tue Aug 19 11:36:24.277 2014: *Lost membership in >>>> cluster GSS.ebi.ac.uk . >>>> Unmounting file systems.* >>>> >>>> *GSS02a* >>>> Tue Aug 19 11:35:24.263 2014: Node >>>> (ebi5-038 in ebi-cluster.ebi.ac.uk >>>> ) *is being expelled >>>> because of an expired lease.* Pings sent: 60. Replies >>>> received: 60. >>>> >>>> >>>> >>>> >>>> In example 1 seems that an NSD was not repliyng to the client, but >>>> the servers seems working fine.. how can i trace better ( to solve) >>>> the problem? >>>> >>>> In example 2 it seems to me that for some reason the manager are >>>> not renewing the lease in time. when this happens , its not a >>>> single client. >>>> Loads of them fail to get the lease renewed. Why this is happening? >>>> how can i trace to the source of the problem? >>>> >>>> >>>> >>>> Thanks in advance for any tips. >>>> >>>> Regards, >>>> Salvatore >>>> >>>> >>>> >>>> >>>> >>>> >>>> >>>> >>>> _______________________________________________ >>>> gpfsug-discuss mailing list >>>> gpfsug-discuss atgpfsug.org >>>> http://gpfsug.org/mailman/listinfo/gpfsug-discuss >>> >>> >>> >>> _______________________________________________ >>> gpfsug-discuss mailing list >>> gpfsug-discuss atgpfsug.org >>> http://gpfsug.org/mailman/listinfo/gpfsug-discuss >> >> _______________________________________________ >> gpfsug-discuss mailing list >> gpfsug-discuss at gpfsug.org >> http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at gpfsug.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss -------------- next part -------------- An HTML attachment was scrubbed... URL: From service at metamodul.com Thu Aug 21 14:19:33 2014 From: service at metamodul.com (service at metamodul.com) Date: Thu, 21 Aug 2014 15:19:33 +0200 (CEST) Subject: [gpfsug-discuss] gpfs client expels In-Reply-To: <53F5B62F.1060305@ebi.ac.uk> References: <53EE0BB1.8000005@ebi.ac.uk> <53F454E3.40803@ebi.ac.uk> <53F5ABD7.80107@gpfsug.org> <53F5B62F.1060305@ebi.ac.uk> Message-ID: <1481989063.92260.1408627173332.open-xchange@oxbaltgw09.schlund.de> > Now, while its understandable that the destination file is locked for one of > the "cat", so have to wait If GPFS is posix compatible i do not understand why a cat should block the other cat completly meanings on a standard FS you can "cat" from many source to the same target. Of course the result is not predictable. >From this point of view i would expect that both "cat" would start writing immediately thus i would expect a GPFS bug. All imho. Hajo Note: You might test which the input_file in a different directory and i would test the behaviour if the output_file is on a local FS like /tmp. -------------- next part -------------- An HTML attachment was scrubbed... URL: From viccornell at gmail.com Thu Aug 21 14:22:22 2014 From: viccornell at gmail.com (Vic Cornell) Date: Thu, 21 Aug 2014 14:22:22 +0100 Subject: [gpfsug-discuss] gpfs client expels In-Reply-To: <53F5F19B.1010603@ebi.ac.uk> References: <53EE0BB1.8000005@ebi.ac.uk> <53F454E3.40803@ebi.ac.uk> <53F5ABD7.80107@gpfsug.org> <53F5B62F.1060305@ebi.ac.uk> <9B247872-CD75-4F86-A10E-33AAB6BD414A@gmail.com> <53F5F19B.1010603@ebi.ac.uk> Message-ID: <0F03996A-2008-4076-9A2B-B4B2BB89E959@gmail.com> For my system I always use a dedicated admin network - as described in the gpfs manuals - for a gpfs cluster on 10/40GbE where the system will be heavily loaded. The difference in the stability of the system is very noticeable. 
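For what it's worth, GPFS does let you express that kind of split in configuration. A rough sketch of the two relevant knobs, with made-up host and subnet names (check the Administration Guide for your release before applying anything):

    # Give a node a second hostname that resolves on the management network and
    # have GPFS use it for administrative (ssh/scp) command traffic only:
    mmchnode --admin-interface=gss01a-mgmt.example.com -N gss01a

    # Tell the daemons which subnet(s) to prefer for node-to-node traffic
    # (the subnet below is only an example value):
    mmchconfig subnets="10.30.0.0"

    # mmlscluster then shows the admin node name alongside the daemon node name:
    mmlscluster

Whether either knob helps depends on the cabling described in Salvatore's reply quoted further down: a separate interface only buys you something if it really is a separate physical path.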
Not sure how/if this would work on GSS - IBM ought to know :-) Vic On 21 Aug 2014, at 14:18, Salvatore Di Nardo wrote: > This is an interesting point! > > We use ethernet ( 10g links on the clients) but we dont have a separate network for the admin network. > > Could you explain this a bit further, because the clients and the servers we have are on different subnet so the packet are routed.. I don't see a practical way to separate them. The clients are blades in a chassis so even if i create 2 interfaces, they will physically use the came "cable" to go to the first switch. even the clients ( 600 clients) have different subsets. > > I will forward this consideration to our network admin , so see if we can work on a dedicated network. > > thanks for your tip. > > Regards, > Salvatore > > > > > On 21/08/14 14:03, Vic Cornell wrote: >> Hi Salvatore, >> >> Are you using ethernet or infiniband as the GPFS interconnect to your clients? >> >> If 10/40GbE - do you have a separate admin network? >> >> I have seen behaviour similar to this where the storage traffic causes congestion and the "admin" traffic gets lost or delayed causing expels. >> >> Vic >> >> >> >> On 21 Aug 2014, at 10:04, Salvatore Di Nardo wrote: >> >>> Thanks for the feedback, but we managed to find a scenario that excludes network problems. >>> >>> we have a file called input_file of nearly 100GB: >>> >>> if from client A we do: >>> >>> cat input_file >> output_file >>> >>> it start copying.. and we see waiter goeg a bit up,secs but then they flushes back to 0, so we xcan say that the copy proceed well... >>> >>> >>> if now we do the same from another client ( or just another shell on the same client) client B : >>> >>> cat input_file >> output_file >>> >>> >>> ( in other words we are trying to write to the same destination) all the waiters gets up until one node get expelled. >>> >>> >>> Now, while its understandable that the destination file is locked for one of the "cat", so have to wait ( and since the file is BIG , have to wait for a while), its not understandable why it stop the renewal lease. >>> Why its doen't return just a timeout error on the copy instead to expel the node? We can reproduce this every time, and since our users to operations like this on files over 100GB each you can imagine the result. >>> >>> >>> >>> As you can imagine even if its a bit silly to write at the same time to the same destination, its also quite common if we want to dump to a log file logs and for some reason one of the writers, write for a lot of time keeping the file locked. >>> Our expels are not due to network congestion, but because a write attempts have to wait another one. What i really dont understand is why to take a so expreme mesure to expell jest because a process is waiteing "to too much time". >>> >>> >>> I have ticket opened to IBM for this and the issue is under investigation, but no luck so far.. >>> >>> Regards, >>> Salvatore >>> >>> >>> >>> On 21/08/14 09:20, Jez Tucker (Chair) wrote: >>>> Hi there, >>>> >>>> I've seen the on several 'stock'? 'core'? GPFS system (we need a better term now GSS is out) and seen ping 'working', but alongside ejections from the cluster. >>>> The GPFS internode 'ping' is somewhat more circumspect than unix ping - and rightly so. >>>> >>>> In my experience this has _always_ been a network issue of one sort of another. If the network is experiencing issues, nodes will be ejected. 
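Going back to the double-append scenario quoted above (two cat ... >> output_file against the same GPFS file), it is easy to wrap in a small script so the waiters can be watched while it runs. Paths below are placeholders, not the real EBI file names, and Hajo's local-filesystem comparison is noted at the end:

    #!/bin/bash
    SRC=/gpfs1/scratch/input_file     # the ~100GB source file (example path)
    DST=/gpfs1/scratch/output_file    # shared destination on GPFS (example path)

    cat "$SRC" >> "$DST" &  PID1=$!
    cat "$SRC" >> "$DST" &  PID2=$!   # or start this one from a second client

    # Watch the waiters while both appenders run (needs root on a cluster node):
    while kill -0 "$PID1" 2>/dev/null; do
        /usr/lpp/mmfs/bin/mmdiag --waiters | head -20
        sleep 10
    done
    wait "$PID1" "$PID2"

    # Re-run with DST on a local filesystem (e.g. /tmp) to confirm, per Hajo's
    # note, that the stall is specific to the shared destination.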
>>>> Of course it could be unresponsive mmfsd or high loadavg, but I've seen that only twice in 10 years over many versions of GPFS. >>>> >>>> You need to follow the logs through from each machine in time order to determine who could not see who and in what order. >>>> Your best way forward is to log a SEV2 case with IBM support, directly or via your OEM and collect and supply a snap and traces as required by support. >>>> >>>> Without knowing your full setup, it's hard to help further. >>>> >>>> Jez >>>> >>>> On 20/08/14 08:57, Salvatore Di Nardo wrote: >>>>> Still problems. Here some more detailed examples: >>>>> >>>>> EXAMPLE 1: >>>>> EBI5-220 ( CLIENT) >>>>> Tue Aug 19 11:03:04.980 2014: Timed out waiting for a reply from node gss02b >>>>> Tue Aug 19 11:03:04.981 2014: Request sent to (gss02a in GSS.ebi.ac.uk) to expel (gss02b in GSS.ebi.ac.uk) from cluster GSS.ebi.ac.uk >>>>> Tue Aug 19 11:03:04.982 2014: This node will be expelled from cluster GSS.ebi.ac.uk due to expel msg from (ebi5-220) >>>>> Tue Aug 19 11:03:09.319 2014: Cluster Manager connection broke. Probing cluster GSS.ebi.ac.uk >>>>> Tue Aug 19 11:03:10.321 2014: Unable to contact any quorum nodes during cluster probe. >>>>> Tue Aug 19 11:03:10.322 2014: Lost membership in cluster GSS.ebi.ac.uk. Unmounting file systems. >>>>> Tue Aug 19 11:03:10 BST 2014: mmcommon preunmount invoked. File system: gpfs1 Reason: SGPanic >>>>> Tue Aug 19 11:03:12.066 2014: Connecting to gss02a >>>>> Tue Aug 19 11:03:12.070 2014: Connected to gss02a >>>>> Tue Aug 19 11:03:17.071 2014: Connecting to gss02b >>>>> Tue Aug 19 11:03:17.072 2014: Connecting to gss03b >>>>> Tue Aug 19 11:03:17.079 2014: Connecting to gss03a >>>>> Tue Aug 19 11:03:17.080 2014: Connecting to gss01b >>>>> Tue Aug 19 11:03:17.079 2014: Connecting to gss01a >>>>> Tue Aug 19 11:04:23.105 2014: Connected to gss02b >>>>> Tue Aug 19 11:04:23.107 2014: Connected to gss03b >>>>> Tue Aug 19 11:04:23.112 2014: Connected to gss03a >>>>> Tue Aug 19 11:04:23.115 2014: Connected to gss01b >>>>> Tue Aug 19 11:04:23.121 2014: Connected to gss01a >>>>> Tue Aug 19 11:12:28.992 2014: Node (gss02a in GSS.ebi.ac.uk) is now the Group Leader. >>>>> >>>>> GSS02B ( NSD SERVER) >>>>> ... 
>>>>> Tue Aug 19 11:03:17.070 2014: Killing connection from because the group is not ready for it to rejoin, err 46 >>>>> Tue Aug 19 11:03:25.016 2014: Killing connection from because the group is not ready for it to rejoin, err 46 >>>>> Tue Aug 19 11:03:28.080 2014: Killing connection from because the group is not ready for it to rejoin, err 46 >>>>> Tue Aug 19 11:03:36.019 2014: Killing connection from because the group is not ready for it to rejoin, err 46 >>>>> Tue Aug 19 11:03:39.083 2014: Killing connection from because the group is not ready for it to rejoin, err 46 >>>>> Tue Aug 19 11:03:47.023 2014: Killing connection from because the group is not ready for it to rejoin, err 46 >>>>> Tue Aug 19 11:03:50.088 2014: Killing connection from because the group is not ready for it to rejoin, err 46 >>>>> Tue Aug 19 11:03:52.218 2014: Killing connection from because the group is not ready for it to rejoin, err 46 >>>>> Tue Aug 19 11:03:58.030 2014: Killing connection from because the group is not ready for it to rejoin, err 46 >>>>> Tue Aug 19 11:04:01.092 2014: Killing connection from because the group is not ready for it to rejoin, err 46 >>>>> Tue Aug 19 11:04:03.220 2014: Killing connection from because the group is not ready for it to rejoin, err 46 >>>>> Tue Aug 19 11:04:09.034 2014: Killing connection from because the group is not ready for it to rejoin, err 46 >>>>> Tue Aug 19 11:04:12.096 2014: Killing connection from because the group is not ready for it to rejoin, err 46 >>>>> Tue Aug 19 11:04:14.224 2014: Killing connection from because the group is not ready for it to rejoin, err 46 >>>>> Tue Aug 19 11:04:20.037 2014: Killing connection from because the group is not ready for it to rejoin, err 46 >>>>> Tue Aug 19 11:04:23.103 2014: Accepted and connected to ebi5-220 >>>>> ... >>>>> >>>>> GSS02a ( NSD SERVER) >>>>> Tue Aug 19 11:03:04.980 2014: Expel (gss02b) request from (ebi5-220 in ebi-cluster.ebi.ac.uk). Expelling: (ebi5-220 in ebi-cluster.ebi.ac.uk) >>>>> Tue Aug 19 11:03:12.069 2014: Accepted and connected to ebi5-220 >>>>> >>>>> >>>>> =============================================== >>>>> EXAMPLE 2: >>>>> >>>>> EBI5-038 >>>>> Tue Aug 19 11:32:34.227 2014: Disk lease period expired in cluster GSS.ebi.ac.uk. Attempting to reacquire lease. >>>>> Tue Aug 19 11:33:34.258 2014: Lease is overdue. Probing cluster GSS.ebi.ac.uk >>>>> Tue Aug 19 11:35:24.265 2014: Close connection to gss02a (Connection reset by peer). Attempting reconnect. >>>>> Tue Aug 19 11:35:24.865 2014: Close connection to ebi5-014 (Connection reset by peer). Attempting reconnect. >>>>> ... >>>>> LOT MORE RESETS BY PEER >>>>> ... >>>>> Tue Aug 19 11:35:25.096 2014: Close connection to ebi5-167 (Connection reset by peer). Attempting reconnect. >>>>> Tue Aug 19 11:35:25.267 2014: Connecting to gss02a >>>>> Tue Aug 19 11:35:25.268 2014: Close connection to gss02a (Connection failed because destination is still processing previous node failure) >>>>> Tue Aug 19 11:35:26.267 2014: Retry connection to gss02a >>>>> Tue Aug 19 11:35:26.268 2014: Close connection to gss02a (Connection failed because destination is still processing previous node failure) >>>>> Tue Aug 19 11:36:24.276 2014: Unable to contact any quorum nodes during cluster probe. >>>>> Tue Aug 19 11:36:24.277 2014: Lost membership in cluster GSS.ebi.ac.uk. Unmounting file systems. >>>>> >>>>> GSS02a >>>>> Tue Aug 19 11:35:24.263 2014: Node (ebi5-038 in ebi-cluster.ebi.ac.uk) is being expelled because of an expired lease. Pings sent: 60. 
Replies received: 60. >>>>> >>>>> >>>>> >>>>> In example 1 seems that an NSD was not repliyng to the client, but the servers seems working fine.. how can i trace better ( to solve) the problem? >>>>> >>>>> In example 2 it seems to me that for some reason the manager are not renewing the lease in time. when this happens , its not a single client. >>>>> Loads of them fail to get the lease renewed. Why this is happening? how can i trace to the source of the problem? >>>>> >>>>> >>>>> >>>>> Thanks in advance for any tips. >>>>> >>>>> Regards, >>>>> Salvatore >>>>> >>>>> >>>>> >>>>> >>>>> >>>>> >>>>> >>>>> >>>>> _______________________________________________ >>>>> gpfsug-discuss mailing list >>>>> gpfsug-discuss at gpfsug.org >>>>> http://gpfsug.org/mailman/listinfo/gpfsug-discuss >>>> >>>> >>>> >>>> _______________________________________________ >>>> gpfsug-discuss mailing list >>>> gpfsug-discuss at gpfsug.org >>>> http://gpfsug.org/mailman/listinfo/gpfsug-discuss >>> >>> _______________________________________________ >>> gpfsug-discuss mailing list >>> gpfsug-discuss at gpfsug.org >>> http://gpfsug.org/mailman/listinfo/gpfsug-discuss >> >> >> >> _______________________________________________ >> gpfsug-discuss mailing list >> gpfsug-discuss at gpfsug.org >> http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at gpfsug.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss -------------- next part -------------- An HTML attachment was scrubbed... URL: From sdinardo at ebi.ac.uk Fri Aug 22 10:37:42 2014 From: sdinardo at ebi.ac.uk (Salvatore Di Nardo) Date: Fri, 22 Aug 2014 10:37:42 +0100 Subject: [gpfsug-discuss] gpfs client expels, fs hangind and waiters In-Reply-To: <53EE0BB1.8000005@ebi.ac.uk> References: <53EE0BB1.8000005@ebi.ac.uk> Message-ID: <53F70F66.2010405@ebi.ac.uk> Hello everyone, Just to let you know, we found the cause of our problems. We discovered that not all of the recommend kernel setting was configured on the clients ( on server was everything ok, but the clients had some setting missing ), and IBM support pointed to this document that describes perfectly our issues and the fix wich suggest to raise some parameters even higher than the standard "best practice" : http://www-947.ibm.com/support/entry/portal/docdisplay?lndocid=migr-5091222 Thanks to everyone for the replies. Regards, Salvatore From ewahl at osc.edu Mon Aug 25 19:55:08 2014 From: ewahl at osc.edu (Ed Wahl) Date: Mon, 25 Aug 2014 18:55:08 +0000 Subject: [gpfsug-discuss] CNFS using NFS over RDMA? Message-ID: Anyone out there doing CNFS with NFS over RDMA? Is this even possible? We currently have been delivering some CNFS services using TCP over IB, but that layer tends to have a large number of bugs all the time. Like to take a look at moving back down to verbs... Ed Wahl OSC -------------- next part -------------- An HTML attachment was scrubbed... URL: From zander at ebi.ac.uk Fri Aug 1 14:44:49 2014 From: zander at ebi.ac.uk (Zander Mears) Date: Fri, 01 Aug 2014 14:44:49 +0100 Subject: [gpfsug-discuss] Hello! In-Reply-To: <53D981EF.3020000@gpfsug.org> References: <53D8C897.9000902@ebi.ac.uk> <53D981EF.3020000@gpfsug.org> Message-ID: <53DB99D1.8050304@ebi.ac.uk> Hi Jez We're just monitoring the standard OS stuff, some interface errors, throughput, number of network and gpfs connections due to previous issues. We don't really know as yet what is good to monitor GPFS wise. 
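On the "what is good to monitor GPFS wise" question, two cheap per-node numbers that map well onto the expel discussions earlier in the thread are the outstanding waiter count and the per-filesystem I/O counters. A sketch that could sit behind a Zabbix UserParameter (the key names are invented, it must run as root on a cluster node, and the mmpmon field layout is worth double-checking on your release):

    #!/bin/bash
    # 1) Outstanding waiters right now -- expels are usually preceded by a pile-up:
    WAITERS=$(/usr/lpp/mmfs/bin/mmdiag --waiters 2>/dev/null | grep -c ' seconds')
    echo "gpfs.waiters:${WAITERS}"

    # 2) Per-filesystem byte counters from mmpmon's parseable output
    #    (_fs_ = filesystem name, _br_ = bytes read, _bw_ = bytes written):
    echo fs_io_s | /usr/lpp/mmfs/bin/mmpmon -p -s 2>/dev/null | awk '
        $1 == "_fs_io_s_" {
            for (i = 1; i <= NF; i++) {
                if ($i == "_fs_") fs = $(i+1)
                if ($i == "_br_") br = $(i+1)
                if ($i == "_bw_") bw = $(i+1)
            }
            print "gpfs.bytes_read[" fs "]:" br
            print "gpfs.bytes_written[" fs "]:" bw
        }'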
cheers Zander On 31/07/2014 00:38, Jez Tucker (Chair) wrote: > Hi Zander, > > We have a git repository. Would you be interested in adding any > Zabbix custom metrics gathering to GPFS to it? > > https://github.com/gpfsug/gpfsug-tools > > Best, > > Jez From sfadden at us.ibm.com Tue Aug 5 18:55:20 2014 From: sfadden at us.ibm.com (Scott Fadden) Date: Tue, 5 Aug 2014 10:55:20 -0700 Subject: [gpfsug-discuss] GPFS and Lustre on same node Message-ID: Is anyone running GPFS and Lustre on the same nodes. I have seen it work, I have heard people are doing it, I am looking for some confirmation. Thanks Scott Fadden GPFS Technical Marketing Phone: (503) 880-5833 sfadden at us.ibm.com http://www.ibm.com/systems/gpfs -------------- next part -------------- An HTML attachment was scrubbed... URL: From u.sibiller at science-computing.de Wed Aug 6 08:46:31 2014 From: u.sibiller at science-computing.de (Ulrich Sibiller) Date: Wed, 06 Aug 2014 09:46:31 +0200 Subject: [gpfsug-discuss] GPFS and Lustre on same node In-Reply-To: References: Message-ID: <53E1DD57.90103@science-computing.de> Am 05.08.2014 19:55, schrieb Scott Fadden: > Is anyone running GPFS and Lustre on the same nodes. I have seen it work, I have heard people are > doing it, I am looking for some confirmation. I have some nodes running lustre 2.1.6 or 2.5.58 and gpfs 3.5.0.17 on RHEL5.8 and RHEL6.5. None of them are servers. Kind regards, Ulrich Sibiller -- ______________________________________creating IT solutions Dipl.-Inf. Ulrich Sibiller science + computing ag System Administration Hagellocher Weg 73 mail nfz at science-computing.de 72070 Tuebingen, Germany hotline +49 7071 9457 674 http://www.science-computing.de -- Vorstandsvorsitzender/Chairman of the board of management: Gerd-Lothar Leonhart Vorstand/Board of Management: Dr. Bernd Finkbeiner, Michael Heinrichs, Dr. Arno Steitz Vorsitzender des Aufsichtsrats/ Chairman of the Supervisory Board: Philippe Miltin Sitz/Registered Office: Tuebingen Registergericht/Registration Court: Stuttgart Registernummer/Commercial Register No.: HRB 382196 From frederik.ferner at diamond.ac.uk Wed Aug 6 10:19:35 2014 From: frederik.ferner at diamond.ac.uk (Frederik Ferner) Date: Wed, 6 Aug 2014 10:19:35 +0100 Subject: [gpfsug-discuss] GPFS and Lustre on same node In-Reply-To: References: Message-ID: <53E1F327.1000605@diamond.ac.uk> On 05/08/14 18:55, Scott Fadden wrote: > Is anyone running GPFS and Lustre on the same nodes. I have seen it > work, I have heard people are doing it, I am looking for some confirmation. Most of our compute cluster nodes are clients for Lustre and GPFS at the same time. Lustre 1.8.9-wc1 and GPFS 3.5.0.11. Nothing shared on servers (GPFS NSD server or Lustre OSS/MDS servers). HTH, Frederik -- Frederik Ferner Senior Computer Systems Administrator phone: +44 1235 77 8624 Diamond Light Source Ltd. mob: +44 7917 08 5110 (Apologies in advance for the lines below. Some bits are a legal requirement and I have no control over them.) -- This e-mail and any attachments may contain confidential, copyright and or privileged material, and are for the use of the intended addressee only. If you are not the intended addressee or an authorised recipient of the addressee please notify us of receipt by returning the e-mail and do not use, copy, retain, distribute or disclose the information in or attached to the e-mail. Any opinions expressed within this e-mail are those of the individual and not necessarily of Diamond Light Source Ltd. Diamond Light Source Ltd. 
cannot guarantee that this e-mail or any attachments are free from viruses and we cannot accept liability for any damage which you may sustain as a result of software viruses which may be transmitted in or with the message. Diamond Light Source Limited (company no. 4375679). Registered in England and Wales with its registered office at Diamond House, Harwell Science and Innovation Campus, Didcot, Oxfordshire, OX11 0DE, United Kingdom From sdinardo at ebi.ac.uk Wed Aug 6 10:57:44 2014 From: sdinardo at ebi.ac.uk (Salvatore Di Nardo) Date: Wed, 06 Aug 2014 10:57:44 +0100 Subject: [gpfsug-discuss] GPFS and Lustre on same node In-Reply-To: <53E1F327.1000605@diamond.ac.uk> References: <53E1F327.1000605@diamond.ac.uk> Message-ID: <53E1FC18.6080707@ebi.ac.uk> Sorry for this little ot, but recetly i'm looking to Lustre to understand how it is comparable to GPFS in terms of performance, reliability and easy to use. Could anyone share their experience ? My company just recently got a first GPFS system , based on IBM GSS, but while its good performance wise, there are few unresolved problems and the IBM support is almost unexistent, so I'm starting to wonder if its work to look somewhere else eventual future purchases. Salvatore On 06/08/14 10:19, Frederik Ferner wrote: > On 05/08/14 18:55, Scott Fadden wrote: >> Is anyone running GPFS and Lustre on the same nodes. I have seen it >> work, I have heard people are doing it, I am looking for some >> confirmation. > > Most of our compute cluster nodes are clients for Lustre and GPFS at > the same time. Lustre 1.8.9-wc1 and GPFS 3.5.0.11. Nothing shared on > servers (GPFS NSD server or Lustre OSS/MDS servers). > > HTH, > Frederik > From chair at gpfsug.org Wed Aug 6 11:19:24 2014 From: chair at gpfsug.org (Jez Tucker (Chair)) Date: Wed, 06 Aug 2014 11:19:24 +0100 Subject: [gpfsug-discuss] GPFS and Lustre on same node In-Reply-To: <53E1FC18.6080707@ebi.ac.uk> References: <53E1F327.1000605@diamond.ac.uk> <53E1FC18.6080707@ebi.ac.uk> Message-ID: <53E2012C.9040402@gpfsug.org> "IBM support is almost unexistent" I don't find that at all. Do you log directly via ESC or via your OEM/integrator or are you only referring to GSS support rather than pure GPFS? If you are having response issues, your IBM rep (or a few folks on here) can accelerate issues for you. Jez On 06/08/14 10:57, Salvatore Di Nardo wrote: > Sorry for this little ot, but recetly i'm looking to Lustre to > understand how it is comparable to GPFS in terms of performance, > reliability and easy to use. > Could anyone share their experience ? > > My company just recently got a first GPFS system , based on IBM GSS, > but while its good performance wise, there are few unresolved problems > and the IBM support is almost unexistent, so I'm starting to wonder if > its work to look somewhere else eventual future purchases. > > > Salvatore > > On 06/08/14 10:19, Frederik Ferner wrote: >> On 05/08/14 18:55, Scott Fadden wrote: >>> Is anyone running GPFS and Lustre on the same nodes. I have seen it >>> work, I have heard people are doing it, I am looking for some >>> confirmation. >> >> Most of our compute cluster nodes are clients for Lustre and GPFS at >> the same time. Lustre 1.8.9-wc1 and GPFS 3.5.0.11. Nothing shared on >> servers (GPFS NSD server or Lustre OSS/MDS servers). 
>> >> HTH, >> Frederik >> > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at gpfsug.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss From service at metamodul.com Wed Aug 6 14:26:47 2014 From: service at metamodul.com (service at metamodul.com) Date: Wed, 6 Aug 2014 15:26:47 +0200 (CEST) Subject: [gpfsug-discuss] Hi , i am new to this list Message-ID: <1366482624.222989.1407331607965.open-xchange@oxbaltgw55.schlund.de> Hi @ALL i am Hajo Ehlers , an AIX and GPFS specialist ( Unix System Engineer ). You find me at the IBM GPFS Forum and sometimes at news:c.u.a and I am addicted to cluster filesystems My latest idee is an SAP-HANA light system ( DBMS on an in-memory cluster posix FS ) which could be extended to a "reinvented" Cluster based AS/400 ^_^ I wrote also a small script to do a sequential backup of GPFS filesystems since i got never used to mmbackup - i named it "pdsmc" for parallel dsmc". Cheers Hajo BTW: Please let me know - service (at) metamodul (dot) com - In case somebody is looking for a GPFS specialist. -------------- next part -------------- An HTML attachment was scrubbed... URL: From sdinardo at ebi.ac.uk Fri Aug 8 10:53:36 2014 From: sdinardo at ebi.ac.uk (Salvatore Di Nardo) Date: Fri, 08 Aug 2014 10:53:36 +0100 Subject: [gpfsug-discuss] GPFS and Lustre on same node In-Reply-To: <53E2012C.9040402@gpfsug.org> References: <53E1F327.1000605@diamond.ac.uk> <53E1FC18.6080707@ebi.ac.uk> <53E2012C.9040402@gpfsug.org> Message-ID: <53E49E20.1090905@ebi.ac.uk> Well, i didn't wanted to start a rant against IBM, and I'm referring specifically to GSS. Since GSS its an appliance, we have to refer to GSS support for both hardware and software issues. Hardware support in total crap. It took 1 mounth of chasing and shouting to get a drawer replacement that was causing some issues. Meanwhile 10 disks in that drawer got faulty. Finally we got the drawer replace but the disks are still faulty. Now its 3 days i'm triing to get them fixed or replaced ( its not clear if they disks are broken of they was just marked to be replaced because of the drawer). Right now i dont have any answer about how to put them online ( mmchcarrier don't work because it recognize that the disk where not replaced) There are also few other cases ( gpfs related) open that are still not answered. I have no experience with direct GPFS support, but if i open a case to GSS for a GPFS problem, the cases seems never get an answer. The only reason that GSS is working its because _*I*_**installed it spending few months studying gpfs. So now I'm wondering if its worth at all rely in future on the whole appliance concept. I'm wondering if in future its better just purchase the hardware and install GPFS by our own, or in alternatively even try Lustre. Now, skipping all this GSS rant, which have nothing to do with the file system anyway and going back to my question: Could someone point the main differences between GPFS and Lustre? I found some documentation about Lustre and i'm going to have a look, but oddly enough have not found any practical comparison between them. On 06/08/14 11:19, Jez Tucker (Chair) wrote: > "IBM support is almost unexistent" > > I don't find that at all. > Do you log directly via ESC or via your OEM/integrator or are you only > referring to GSS support rather than pure GPFS? > > If you are having response issues, your IBM rep (or a few folks on > here) can accelerate issues for you. 
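Hajo's "pdsmc" mentioned earlier in the thread is not posted here, but the general idea of fanning the TSM client out across the top of a GPFS filesystem (instead of using mmbackup) can be sketched roughly as below. The filesystem path and degree of parallelism are placeholders, and this is only a guess at the approach, not his script:

    #!/bin/bash
    FS=/gpfs1        # filesystem to back up (example path)
    PARALLEL=4       # how many dsmc processes to run at once

    # One incremental backup per top-level directory, $PARALLEL at a time:
    find "$FS" -mindepth 1 -maxdepth 1 -type d -print0 |
        xargs -0 -P "$PARALLEL" -I{} dsmc incremental {}/ -subdir=yes -quiet

    # Pick up any files sitting directly in the filesystem root:
    dsmc incremental "$FS"/ -quiet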
> > Jez > > > On 06/08/14 10:57, Salvatore Di Nardo wrote: >> Sorry for this little ot, but recetly i'm looking to Lustre to >> understand how it is comparable to GPFS in terms of performance, >> reliability and easy to use. >> Could anyone share their experience ? >> >> My company just recently got a first GPFS system , based on IBM GSS, >> but while its good performance wise, there are few unresolved >> problems and the IBM support is almost unexistent, so I'm starting to >> wonder if its work to look somewhere else eventual future purchases. >> >> >> Salvatore >> >> On 06/08/14 10:19, Frederik Ferner wrote: >>> On 05/08/14 18:55, Scott Fadden wrote: >>>> Is anyone running GPFS and Lustre on the same nodes. I have seen it >>>> work, I have heard people are doing it, I am looking for some >>>> confirmation. >>> >>> Most of our compute cluster nodes are clients for Lustre and GPFS at >>> the same time. Lustre 1.8.9-wc1 and GPFS 3.5.0.11. Nothing shared on >>> servers (GPFS NSD server or Lustre OSS/MDS servers). >>> >>> HTH, >>> Frederik >>> >> >> _______________________________________________ >> gpfsug-discuss mailing list >> gpfsug-discuss at gpfsug.org >> http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at gpfsug.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss -------------- next part -------------- An HTML attachment was scrubbed... URL: From jpro at bas.ac.uk Fri Aug 8 12:40:00 2014 From: jpro at bas.ac.uk (Jeremy Robst) Date: Fri, 8 Aug 2014 12:40:00 +0100 (BST) Subject: [gpfsug-discuss] GPFS and Lustre on same node In-Reply-To: <53E49E20.1090905@ebi.ac.uk> References: <53E1F327.1000605@diamond.ac.uk> <53E1FC18.6080707@ebi.ac.uk> <53E2012C.9040402@gpfsug.org> <53E49E20.1090905@ebi.ac.uk> Message-ID: On Fri, 8 Aug 2014, Salvatore Di Nardo wrote: > Now, skipping all this GSS rant, which have nothing to do with the file > system anyway? and? going back to my question: > > Could someone point the main differences between GPFS and Lustre? I'm looking at making the same decision here - to buy GPFS or to roll our own Lustre configuration. I'm in the process of setting up test systems, and so far the main difference seems to be in the that in GPFS each server sees the full filesystem, and so you can run other applications (e.g backup) on a GPFS server whereas the Luste OSS (object storage servers) see only a portion of the storage (the filesystem is striped across the OSSes), so you need a Lustre client to mount the full filesystem for things like backup. However I have very little practical experience of either and would also be interested in any comments. Thanks Jeremy -- jpro at bas.ac.uk | (work) 01223 221402 (fax) 01223 362616 Unix System Administrator - British Antarctic Survey #include From keith at ocf.co.uk Fri Aug 8 14:12:39 2014 From: keith at ocf.co.uk (Keith Vickers) Date: Fri, 8 Aug 2014 14:12:39 +0100 Subject: [gpfsug-discuss] GPFS and Lustre on same node Message-ID: http://www.pdsw.org/pdsw10/resources/posters/parallelNASFSs.pdf Has a good direct apples to apples comparison between Lustre and GPFS. It's pretty much abstractable from the hardware used. 
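To make Jeremy's point concrete: on a GPFS NSD server the filesystem is just a POSIX mount, so backup-style work such as a policy-driven file scan can run on the server itself, which a Lustre OSS cannot offer. A minimal sketch of such a scan, assuming a filesystem mounted at /gpfs1 (the generated list-file naming should be checked against the ILM documentation for your release):

    # Policy that simply lists every file, with no external program attached:
    cat > /tmp/listall.pol <<'EOF'
    RULE EXTERNAL LIST 'allfiles' EXEC ''
    RULE 'list-everything' LIST 'allfiles'
    EOF

    # -I defer keeps the generated candidate list(s) on disk, typically as
    # /tmp/scan.list.allfiles, ready to feed a backup or migration tool:
    mmapplypolicy /gpfs1 -P /tmp/listall.pol -I defer -f /tmp/scan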
Keith Vickers Business Development Manager OCF plc Mobile: 07974 397863 From sergi.more at bsc.es Fri Aug 8 14:14:33 2014 From: sergi.more at bsc.es (=?ISO-8859-1?Q?Sergi_Mor=E9_Codina?=) Date: Fri, 08 Aug 2014 15:14:33 +0200 Subject: [gpfsug-discuss] GPFS and Lustre on same node In-Reply-To: References: <53E1F327.1000605@diamond.ac.uk> <53E1FC18.6080707@ebi.ac.uk> <53E2012C.9040402@gpfsug.org> <53E49E20.1090905@ebi.ac.uk> Message-ID: <53E4CD39.7080808@bsc.es> Hi all, About main differences between GPFS and Lustre, here you have some bits from our experience: -Reliability: GPFS its been proved to be more stable and reliable. Also offers more flexibility in terms of fail-over. It have no restriction in number of servers. As far as I know, an NSD can have as many secondary servers as you want (we are using 8). -Metadata: In Lustre each file system is restricted to two servers. No restriction in GPFS. -Updates: In GPFS you can update the whole storage cluster without stopping production, one server at a time. -Server/Client role: As Jeremy said, in GPFS every server act as a client as well. Useful for administrative tasks. -Troubleshooting: Problems with GPFS are easier to track down. Logs are more clear, and offers better tools than Lustre. -Support: No problems at all with GPFS support. It is true that it could take time to go up within all support levels, but we always got a good solution. Quite different in terms of hardware. IBM support quality has drop a lot since about last year an a half. Really slow and tedious process to get replacements. Moreover, we keep receiving bad "certified reutilitzed parts" hardware, which slow the whole process even more. These are the main differences I would stand out after some years of experience with both file systems, but do not take it as a fact. PD: Salvatore, I would suggest you to contact Jordi Valls. He joined EBI a couple of months ago, and has experience working with both file systems here at BSC. Best Regards, Sergi. On 08/08/2014 01:40 PM, Jeremy Robst wrote: > On Fri, 8 Aug 2014, Salvatore Di Nardo wrote: > >> Now, skipping all this GSS rant, which have nothing to do with the file >> system anyway and going back to my question: >> >> Could someone point the main differences between GPFS and Lustre? > > I'm looking at making the same decision here - to buy GPFS or to roll > our own Lustre configuration. I'm in the process of setting up test > systems, and so far the main difference seems to be in the that in GPFS > each server sees the full filesystem, and so you can run other > applications (e.g backup) on a GPFS server whereas the Luste OSS (object > storage servers) see only a portion of the storage (the filesystem is > striped across the OSSes), so you need a Lustre client to mount the full > filesystem for things like backup. > > However I have very little practical experience of either and would also > be interested in any comments. 
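Sergi's "one server at a time" update amounts to draining and restarting GPFS on each server in turn, roughly as sketched below. The node names just follow the GSS examples from earlier in the thread, and the package/portability-layer step is deliberately left vague because it depends on release and distro:

    #!/bin/bash
    for NODE in gss01a gss01b gss02a gss02b gss03a gss03b; do
        # Stop GPFS on this node only; clients keep going via the other NSD servers.
        mmshutdown -N "$NODE"

        # Update the GPFS packages and rebuild the portability layer on $NODE here
        # (e.g. over ssh with yum/rpm), then bring the daemon back:
        mmstartup -N "$NODE"

        # Wait for the node to report 'active' before touching the next one:
        until mmgetstate -N "$NODE" | grep -q active; do
            sleep 10
        done
    done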
> > Thanks > > Jeremy > > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at gpfsug.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > -- ------------------------------------------------------------------------ Sergi More Codina Barcelona Supercomputing Center Centro Nacional de Supercomputacion WWW: http://www.bsc.es Tel: +34-93-405 42 27 e-mail: sergi.more at bsc.es Fax: +34-93-413 77 21 ------------------------------------------------------------------------ WARNING / LEGAL TEXT: This message is intended only for the use of the individual or entity to which it is addressed and may contain information which is privileged, confidential, proprietary, or exempt from disclosure under applicable law. If you are not the intended recipient or the person responsible for delivering the message to the intended recipient, you are strictly prohibited from disclosing, distributing, copying, or in any way using this message. If you have received this communication in error, please notify the sender and destroy and delete any copies you may have received. http://www.bsc.es/disclaimer.htm -------------- next part -------------- A non-text attachment was scrubbed... Name: smime.p7s Type: application/pkcs7-signature Size: 3242 bytes Desc: S/MIME Cryptographic Signature URL: From viccornell at gmail.com Fri Aug 8 18:15:30 2014 From: viccornell at gmail.com (Vic Cornell) Date: Fri, 8 Aug 2014 18:15:30 +0100 Subject: [gpfsug-discuss] GPFS and Lustre on same node In-Reply-To: <53E4CD39.7080808@bsc.es> References: <53E1F327.1000605@diamond.ac.uk> <53E1FC18.6080707@ebi.ac.uk> <53E2012C.9040402@gpfsug.org> <53E49E20.1090905@ebi.ac.uk> <53E4CD39.7080808@bsc.es> Message-ID: <4001D2D9-5E74-4EF9-908F-5B0E3443EA5B@gmail.com> Disclaimers - I work for DDN - we sell lustre and GPFS. I know GPFS much better than I know Lustre. The biggest difference we find between GPFS and Lustre is that GPFS - can usually achieve 90% of the bandwidth available to a single client with a single thread. Lustre needs multiple parallel streams to saturate - say an Infiniband connection. Lustre is often faster than GPFS and often has superior metadata performance - particularly where lots of files are created in a single directory. GPFS can support Windows - Lustre cannot. I think GPFS is better integrated and easier to deploy than Lustre - some people disagree with me. Regards, Vic On 8 Aug 2014, at 14:14, Sergi Mor? Codina wrote: > Hi all, > > About main differences between GPFS and Lustre, here you have some bits from our experience: > > -Reliability: GPFS its been proved to be more stable and reliable. Also offers more flexibility in terms of fail-over. It have no restriction in number of servers. As far as I know, an NSD can have as many secondary servers as you want (we are using 8). > > -Metadata: In Lustre each file system is restricted to two servers. No restriction in GPFS. > > -Updates: In GPFS you can update the whole storage cluster without stopping production, one server at a time. > > -Server/Client role: As Jeremy said, in GPFS every server act as a client as well. Useful for administrative tasks. > > -Troubleshooting: Problems with GPFS are easier to track down. Logs are more clear, and offers better tools than Lustre. > > -Support: No problems at all with GPFS support. It is true that it could take time to go up within all support levels, but we always got a good solution. Quite different in terms of hardware. 
IBM support quality has drop a lot since about last year an a half. Really slow and tedious process to get replacements. Moreover, we keep receiving bad "certified reutilitzed parts" hardware, which slow the whole process even more. > > > These are the main differences I would stand out after some years of experience with both file systems, but do not take it as a fact. > > PD: Salvatore, I would suggest you to contact Jordi Valls. He joined EBI a couple of months ago, and has experience working with both file systems here at BSC. > > Best Regards, > Sergi. > > > On 08/08/2014 01:40 PM, Jeremy Robst wrote: >> On Fri, 8 Aug 2014, Salvatore Di Nardo wrote: >> >>> Now, skipping all this GSS rant, which have nothing to do with the file >>> system anyway and going back to my question: >>> >>> Could someone point the main differences between GPFS and Lustre? >> >> I'm looking at making the same decision here - to buy GPFS or to roll >> our own Lustre configuration. I'm in the process of setting up test >> systems, and so far the main difference seems to be in the that in GPFS >> each server sees the full filesystem, and so you can run other >> applications (e.g backup) on a GPFS server whereas the Luste OSS (object >> storage servers) see only a portion of the storage (the filesystem is >> striped across the OSSes), so you need a Lustre client to mount the full >> filesystem for things like backup. >> >> However I have very little practical experience of either and would also >> be interested in any comments. >> >> Thanks >> >> Jeremy >> >> >> _______________________________________________ >> gpfsug-discuss mailing list >> gpfsug-discuss at gpfsug.org >> http://gpfsug.org/mailman/listinfo/gpfsug-discuss >> > > > -- > > ------------------------------------------------------------------------ > > Sergi More Codina > Barcelona Supercomputing Center > Centro Nacional de Supercomputacion > WWW: http://www.bsc.es Tel: +34-93-405 42 27 > e-mail: sergi.more at bsc.es Fax: +34-93-413 77 21 > > ------------------------------------------------------------------------ > > WARNING / LEGAL TEXT: This message is intended only for the use of the > individual or entity to which it is addressed and may contain > information which is privileged, confidential, proprietary, or exempt > from disclosure under applicable law. If you are not the intended > recipient or the person responsible for delivering the message to the > intended recipient, you are strictly prohibited from disclosing, > distributing, copying, or in any way using this message. If you have > received this communication in error, please notify the sender and > destroy and delete any copies you may have received. > > http://www.bsc.es/disclaimer.htm > > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at gpfsug.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss From oehmes at us.ibm.com Fri Aug 8 20:09:44 2014 From: oehmes at us.ibm.com (Sven Oehme) Date: Fri, 8 Aug 2014 12:09:44 -0700 Subject: [gpfsug-discuss] GPFS and Lustre on same node In-Reply-To: <4001D2D9-5E74-4EF9-908F-5B0E3443EA5B@gmail.com> References: <53E1F327.1000605@diamond.ac.uk> <53E1FC18.6080707@ebi.ac.uk> <53E2012C.9040402@gpfsug.org> <53E49E20.1090905@ebi.ac.uk> <53E4CD39.7080808@bsc.es> <4001D2D9-5E74-4EF9-908F-5B0E3443EA5B@gmail.com> Message-ID: Vic, Sergi, you can not compare Lustre and GPFS without providing a clear usecase as otherwise you compare apple with oranges. 
the reason for this is quite simple, Lustre plays well in pretty much one usecase - HPC, GPFS on the other hand is used in many forms of deployments from Storage for Virtual Machines, HPC, Scale-Out NAS, Solutions in digital media, to hosting some of the biggest, most business critical Transactional database installations in the world. you look at 2 products with completely different usability spectrum, functions and features unless as said above you narrow it down to a very specific usecase with a lot of details. even just HPC has a very large spectrum and not everybody is working in a single directory, which is the main scale point for Lustre compared to GPFS and the reason is obvious, if you have only 1 active metadata server (which is what 99% of all lustre systems run) some operations like single directory contention is simpler to make fast, but only up to the limit of your one node, but what happens when you need to go beyond that and only a real distributed architecture can support your workload ? for example look at most chip design workloads, which is a form of HPC, it is something thats extremely metadata and small file dominated, you talk about 100's of millions (in some cases even billions) of files, majority of them <4k, the rest larger files , majority of it with random access patterns that benefit from massive client side caching and distributed data coherency models supported by GPFS token manager infrastructure across 10's or 100's of metadata server and 1000's of compute nodes. you also need to look at the rich feature set GPFS provides, which not all may be important for some environments but are for others like Snapshot, Clones, Hierarchical Storage Management (ILM) , Local Cache acceleration (LROC), Global Namespace Wan Integration (AFM), Encryption, etc just to name a few. Sven ------------------------------------------ Sven Oehme Scalable Storage Research email: oehmes at us.ibm.com Phone: +1 (408) 824-8904 IBM Almaden Research Lab ------------------------------------------ From: Vic Cornell To: gpfsug main discussion list Date: 08/08/2014 10:16 AM Subject: Re: [gpfsug-discuss] GPFS and Lustre on same node Sent by: gpfsug-discuss-bounces at gpfsug.org Disclaimers - I work for DDN - we sell lustre and GPFS. I know GPFS much better than I know Lustre. The biggest difference we find between GPFS and Lustre is that GPFS - can usually achieve 90% of the bandwidth available to a single client with a single thread. Lustre needs multiple parallel streams to saturate - say an Infiniband connection. Lustre is often faster than GPFS and often has superior metadata performance - particularly where lots of files are created in a single directory. GPFS can support Windows - Lustre cannot. I think GPFS is better integrated and easier to deploy than Lustre - some people disagree with me. Regards, Vic On 8 Aug 2014, at 14:14, Sergi Mor? Codina wrote: > Hi all, > > About main differences between GPFS and Lustre, here you have some bits from our experience: > > -Reliability: GPFS its been proved to be more stable and reliable. Also offers more flexibility in terms of fail-over. It have no restriction in number of servers. As far as I know, an NSD can have as many secondary servers as you want (we are using 8). > > -Metadata: In Lustre each file system is restricted to two servers. No restriction in GPFS. > > -Updates: In GPFS you can update the whole storage cluster without stopping production, one server at a time. 
> > -Server/Client role: As Jeremy said, in GPFS every server act as a client as well. Useful for administrative tasks. > > -Troubleshooting: Problems with GPFS are easier to track down. Logs are more clear, and offers better tools than Lustre. > > -Support: No problems at all with GPFS support. It is true that it could take time to go up within all support levels, but we always got a good solution. Quite different in terms of hardware. IBM support quality has drop a lot since about last year an a half. Really slow and tedious process to get replacements. Moreover, we keep receiving bad "certified reutilitzed parts" hardware, which slow the whole process even more. > > > These are the main differences I would stand out after some years of experience with both file systems, but do not take it as a fact. > > PD: Salvatore, I would suggest you to contact Jordi Valls. He joined EBI a couple of months ago, and has experience working with both file systems here at BSC. > > Best Regards, > Sergi. > > > On 08/08/2014 01:40 PM, Jeremy Robst wrote: >> On Fri, 8 Aug 2014, Salvatore Di Nardo wrote: >> >>> Now, skipping all this GSS rant, which have nothing to do with the file >>> system anyway and going back to my question: >>> >>> Could someone point the main differences between GPFS and Lustre? >> >> I'm looking at making the same decision here - to buy GPFS or to roll >> our own Lustre configuration. I'm in the process of setting up test >> systems, and so far the main difference seems to be in the that in GPFS >> each server sees the full filesystem, and so you can run other >> applications (e.g backup) on a GPFS server whereas the Luste OSS (object >> storage servers) see only a portion of the storage (the filesystem is >> striped across the OSSes), so you need a Lustre client to mount the full >> filesystem for things like backup. >> >> However I have very little practical experience of either and would also >> be interested in any comments. >> >> Thanks >> >> Jeremy >> >> >> _______________________________________________ >> gpfsug-discuss mailing list >> gpfsug-discuss at gpfsug.org >> http://gpfsug.org/mailman/listinfo/gpfsug-discuss >> > > > -- > > ------------------------------------------------------------------------ > > Sergi More Codina > Barcelona Supercomputing Center > Centro Nacional de Supercomputacion > WWW: http://www.bsc.es Tel: +34-93-405 42 27 > e-mail: sergi.more at bsc.es Fax: +34-93-413 77 21 > > ------------------------------------------------------------------------ > > WARNING / LEGAL TEXT: This message is intended only for the use of the > individual or entity to which it is addressed and may contain > information which is privileged, confidential, proprietary, or exempt > from disclosure under applicable law. If you are not the intended > recipient or the person responsible for delivering the message to the > intended recipient, you are strictly prohibited from disclosing, > distributing, copying, or in any way using this message. If you have > received this communication in error, please notify the sender and > destroy and delete any copies you may have received. 
> > http://www.bsc.es/disclaimer.htm > > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at gpfsug.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at gpfsug.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss -------------- next part -------------- An HTML attachment was scrubbed... URL: From kraemerf at de.ibm.com Sat Aug 9 15:03:02 2014 From: kraemerf at de.ibm.com (Frank Kraemer) Date: Sat, 9 Aug 2014 16:03:02 +0200 Subject: [gpfsug-discuss] GPFS and Lustre In-Reply-To: References: Message-ID: Vic, Sergi, from my point of view for real High-End workloads the complete I/O stack needs to be fine tuned and well understood in order to provide a good system to the users. - Application(s) + I/O Lib(s) + MPI + Parallel Filesystem (e.g. GPFS) + Hardware (Networks, Servers, Disks, etc.) One of the best solutions to bring your application very efficently to work with a Parallel FS is Sionlib from FZ Juelich: Sionlib is a scalable I/O library for the parallel access to task-local files. The library not only supports writing and reading binary data to or from from several thousands of processors into a single or a small number of physical files but also provides for global open and close functions to access SIONlib file in parallel. SIONlib provides different interfaces: parallel access using MPI, OpenMp, or their combination and sequential access for post-processing utilities. http://www.fz-juelich.de/ias/jsc/EN/Expertise/Support/Software/SIONlib/_node.html http://apps.fz-juelich.de/jsc/sionlib/html/sionlib_tutorial_2013.pdf -frank- P.S. Nice blog from Nils https://www.ibm.com/developerworks/community/blogs/storageneers/entry/scale_out_backup_with_tsm_and_gss_performance_test_results?lang=en Frank Kraemer IBM Consulting IT Specialist / Client Technical Architect Hechtsheimer Str. 2, 55131 Mainz mailto:kraemerf at de.ibm.com voice: +49171-3043699 IBM Germany From ewahl at osc.edu Mon Aug 11 14:55:48 2014 From: ewahl at osc.edu (Ed Wahl) Date: Mon, 11 Aug 2014 13:55:48 +0000 Subject: [gpfsug-discuss] GPFS and Lustre In-Reply-To: References: , Message-ID: In a similar vein, IBM has an application transparent "File Cache Library" as well. I believe it IS licensed and the only requirement is that it is for use on IBM hardware only. Saw some presentations that mention it in some BioSci talks @SC13 and the numbers for a couple of selected small read applications were awesome. I probably have the contact info for it around here somewhere. In addition to the pdf/user manual. Ed Wahl Ohio Supercomputer Center ________________________________________ From: gpfsug-discuss-bounces at gpfsug.org [gpfsug-discuss-bounces at gpfsug.org] on behalf of Frank Kraemer [kraemerf at de.ibm.com] Sent: Saturday, August 09, 2014 10:03 AM To: gpfsug-discuss at gpfsug.org Subject: Re: [gpfsug-discuss] GPFS and Lustre Vic, Sergi, from my point of view for real High-End workloads the complete I/O stack needs to be fine tuned and well understood in order to provide a good system to the users. - Application(s) + I/O Lib(s) + MPI + Parallel Filesystem (e.g. GPFS) + Hardware (Networks, Servers, Disks, etc.) One of the best solutions to bring your application very efficently to work with a Parallel FS is Sionlib from FZ Juelich: Sionlib is a scalable I/O library for the parallel access to task-local files. 
The library not only supports writing and reading binary data to or from from several thousands of processors into a single or a small number of physical files but also provides for global open and close functions to access SIONlib file in parallel. SIONlib provides different interfaces: parallel access using MPI, OpenMp, or their combination and sequential access for post-processing utilities. http://www.fz-juelich.de/ias/jsc/EN/Expertise/Support/Software/SIONlib/_node.html http://apps.fz-juelich.de/jsc/sionlib/html/sionlib_tutorial_2013.pdf -frank- P.S. Nice blog from Nils https://www.ibm.com/developerworks/community/blogs/storageneers/entry/scale_out_backup_with_tsm_and_gss_performance_test_results?lang=en Frank Kraemer IBM Consulting IT Specialist / Client Technical Architect Hechtsheimer Str. 2, 55131 Mainz mailto:kraemerf at de.ibm.com voice: +49171-3043699 IBM Germany _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at gpfsug.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss From sabujp at gmail.com Tue Aug 12 23:16:22 2014 From: sabujp at gmail.com (Sabuj Pattanayek) Date: Tue, 12 Aug 2014 17:16:22 -0500 Subject: [gpfsug-discuss] reduce cnfs failover time to a few seconds Message-ID: Hi all, Is there anyway to reduce CNFS failover time to just a few seconds? Currently it seems like it's taking 5 - 10 minutes. We're using virtual ip's, i.e. interface bond1.1550:0 has one of the cnfs vips, so it should be fast, but it takes a long time and sometimes causes processes to crash due to NFS timeouts (some have 600 second soft mount timeouts). We've also noticed that it sometimes takes even longer unless the cnfs system on which we're calling mmshutdown is completely shutdown and isn't returning pings. Even 1 min seems too long. For comparison, I'm running ctdb + samba on the other NSDs and it's able to failover in a few seconds after mmshutdown completes. Thanks, Sabuj -------------- next part -------------- An HTML attachment was scrubbed... URL: From sdinardo at ebi.ac.uk Fri Aug 15 14:31:29 2014 From: sdinardo at ebi.ac.uk (Salvatore Di Nardo) Date: Fri, 15 Aug 2014 14:31:29 +0100 Subject: [gpfsug-discuss] gpfs client expels, fs hangind and waiters Message-ID: <53EE0BB1.8000005@ebi.ac.uk> Hello people, Its quite a bit of time that i'm triing to solve a problem to our GPFS system, without much luck so i think its time to ask some help. *First of a bit of introduction:** * Our GPFS system is made by 3xgss-26, In other words its made with 6x servers ( 4x10g links each) and several disk enclosures SAS attacked. The todal amount of spare its roughly 2PB, and the disks are SATA ( except few SSD dedicated to logtip ). My metadata and on dedicated vdisks, but both data and metadata vdiosks are in the same declustered arrays and recovery groups, so in the end they share the same spindles. The clients its a LSF farm configured as another cluster ( standard multiclustering configuration) of roughly 600 nodes . *The issue:** * Recently we became aware that when some massive io request has been done we experience a lot of client expells. Heres an example of our logs: Fri Aug 15 12:40:24.680 2014: Expel 10.7.28.34 (gss03a) request from 172.16.4.138 (ebi3-138 in ebi-cluster.ebi.ac.uk). Expelling: 172.16.4.138 (ebi3-138 in ebi-cluster.ebi.ac.uk) Fri Aug 15 12:40:41.652 2014: Expel 10.7.28.66 (gss02b) request from 10.7.34.38 (ebi5-037 in ebi-cluster.ebi.ac.uk). 
Expelling: 10.7.34.38 (ebi5-037 in ebi-cluster.ebi.ac.uk) Fri Aug 15 12:40:45.754 2014: Expel 10.7.28.3 (gss01b) request from 172.16.4.58 (ebi3-058 in ebi-cluster.ebi.ac.uk). Expelling: 172.16.4.58 (ebi3-058 in ebi-cluster.ebi.ac.uk) Fri Aug 15 12:40:52.305 2014: Expel 10.7.28.66 (gss02b) request from 10.7.34.68 (ebi5-067 in ebi-cluster.ebi.ac.uk). Expelling: 10.7.34.68 (ebi5-067 in ebi-cluster.ebi.ac.uk) Fri Aug 15 12:41:17.069 2014: Expel 10.7.28.35 (gss03b) request from 172.16.4.161 (ebi3-161 in ebi-cluster.ebi.ac.uk). Expelling: 172.16.4.161 (ebi3-161 in ebi-cluster.ebi.ac.uk) Fri Aug 15 12:41:23.555 2014: Expel 10.7.28.67 (gss02a) request from 172.16.4.136 (ebi3-136 in ebi-cluster.ebi.ac.uk). Expelling: 172.16.4.136 (ebi3-136 in ebi-cluster.ebi.ac.uk) Fri Aug 15 12:41:54.258 2014: Expel 10.7.28.34 (gss03a) request from 10.7.34.22 (ebi5-021 in ebi-cluster.ebi.ac.uk). Expelling: 10.7.34.22 (ebi5-021 in ebi-cluster.ebi.ac.uk) Fri Aug 15 12:41:54.540 2014: Expel 10.7.28.66 (gss02b) request from 10.7.34.57 (ebi5-056 in ebi-cluster.ebi.ac.uk). Expelling: 10.7.34.57 (ebi5-056 in ebi-cluster.ebi.ac.uk) Fri Aug 15 12:42:57.288 2014: Expel 10.7.35.5 (ebi5-132 in ebi-cluster.ebi.ac.uk) request from 10.7.28.34 (gss03a). Expelling: 10.7.35.5 (ebi5-132 in ebi-cluster.ebi.ac.uk) Fri Aug 15 12:43:24.327 2014: Expel 10.7.28.34 (gss03a) request from 10.7.37.99 (ebi5-226 in ebi-cluster.ebi.ac.uk). Expelling: 10.7.37.99 (ebi5-226 in ebi-cluster.ebi.ac.uk) Fri Aug 15 12:44:54.202 2014: Expel 10.7.28.67 (gss02a) request from 172.16.4.165 (ebi3-165 in ebi-cluster.ebi.ac.uk). Expelling: 172.16.4.165 (ebi3-165 in ebi-cluster.ebi.ac.uk) Fri Aug 15 13:15:54.450 2014: Expel 10.7.28.34 (gss03a) request from 10.7.37.89 (ebi5-216 in ebi-cluster.ebi.ac.uk). Expelling: 10.7.37.89 (ebi5-216 in ebi-cluster.ebi.ac.uk) Fri Aug 15 13:20:16.524 2014: Expel 10.7.28.3 (gss01b) request from 172.16.4.55 (ebi3-055 in ebi-cluster.ebi.ac.uk). Expelling: 172.16.4.55 (ebi3-055 in ebi-cluster.ebi.ac.uk) Fri Aug 15 13:26:54.177 2014: Expel 10.7.28.34 (gss03a) request from 10.7.34.64 (ebi5-063 in ebi-cluster.ebi.ac.uk). Expelling: 10.7.34.64 (ebi5-063 in ebi-cluster.ebi.ac.uk) Fri Aug 15 13:27:53.900 2014: Expel 10.7.28.3 (gss01b) request from 10.7.35.15 (ebi5-142 in ebi-cluster.ebi.ac.uk). Expelling: 10.7.35.15 (ebi5-142 in ebi-cluster.ebi.ac.uk) Fri Aug 15 13:28:24.297 2014: Expel 10.7.28.67 (gss02a) request from 172.16.4.50 (ebi3-050 in ebi-cluster.ebi.ac.uk). Expelling: 172.16.4.50 (ebi3-050 in ebi-cluster.ebi.ac.uk) Fri Aug 15 13:29:23.913 2014: Expel 10.7.28.3 (gss01b) request from 172.16.4.156 (ebi3-156 in ebi-cluster.ebi.ac.uk). Expelling: 172.16.4.156 (ebi3-156 in ebi-cluster.ebi.ac.uk) at the same time we experience also long waiters queue (1000+ lines). 
An example in case of massive writes ( dd ) : 0x7F522E1EEF90 waiting 1.861233182 seconds, NSDThread: on ThCond 0x7F5158019B08 (0x7F5158019B08) (MsgRecordCondvar), reason 'RPC wait' for getData on node 10.7.37.101 0x7F522E1EC9B0 waiting 1.490567470 seconds, NSDThread: on ThCond 0x7F50F4038BA8 (0x7F50F4038BA8) (MsgRecordCondvar), reason 'RPC wait' for getData on node 10.7.34.45 0x7F522E1EB6C0 waiting 1.077098046 seconds, NSDThread: on ThCond 0x7F50B40011F8 (0x7F50B40011F8) (MsgRecordCondvar), reason 'RPC wait' for getData on node 172.16.4.156 0x7F522E1EA3D0 waiting 7.714968554 seconds, NSDThread: on ThCond 0x7F50BC0078B8 (0x7F50BC0078B8) (MsgRecordCondvar), reason 'RPC wait' for getData on node 10.7.34.107 0x7F522E1E90E0 waiting 4.774379417 seconds, NSDThread: on ThCond 0x7F506801B1F8 (0x7F506801B1F8) (MsgRecordCondvar), reason 'RPC wait' for getData on node 10.7.34.23 0x7F522E1E7DF0 waiting 0.746172444 seconds, NSDThread: on ThCond 0x7F5094007D78 (0x7F5094007D78) (MsgRecordCondvar), reason 'RPC wait' for getData on node 10.7.34.84 0x7F522E1E6B00 waiting 1.553030487 seconds, NSDThread: on ThCond 0x7F51C0004C78 (0x7F51C0004C78) (MsgRecordCondvar), reason 'RPC wait' for getData on node 10.7.34.63 0x7F522E1E5810 waiting 2.165307633 seconds, NSDThread: on ThCond 0x7F5178016A08 (0x7F5178016A08) (MsgRecordCondvar), reason 'RPC wait' for getData on node 10.7.34.29 0x7F522E1E4520 waiting 1.128089273 seconds, NSDThread: on ThCond 0x7F5074004D98 (0x7F5074004D98) (MsgRecordCondvar), reason 'RPC wait' for getData on node 10.7.34.61 0x7F522E1E3230 waiting 2.515214328 seconds, NSDThread: on ThCond 0x7F51F400EF08 (0x7F51F400EF08) (MsgRecordCondvar), reason 'RPC wait' for getData on node 10.7.37.90 0x7F522E1E1F40 waiting*162.966840834* seconds, NSDThread: on ThCond 0x7F51840207A8 (0x7F51840207A8) (MsgRecordCondvar), reason 'RPC wait' for getData on node 10.7.34.97 0x7F522E1E0C50 waiting 1.140787288 seconds, NSDThread: on ThCond 0x7F51AC005C08 (0x7F51AC005C08) (MsgRecordCondvar), reason 'RPC wait' for getData on node 10.7.37.94 0x7F522E1DF960 waiting 41.907415248 seconds, NSDThread: on ThCond 0x7F5160019038 (0x7F5160019038) (MsgRecordCondvar), reason 'RPC wait' for getData on node 172.16.4.143 0x7F522E1DE670 waiting 0.466560418 seconds, NSDThread: on ThCond 0x7F513802B258 (0x7F513802B258) (MsgRecordCondvar), reason 'RPC wait' for getData on node 172.16.4.168 0x7F522E1DD380 waiting 3.102803621 seconds, NSDThread: on ThCond 0x7F516C0106C8 (0x7F516C0106C8) (MsgRecordCondvar), reason 'RPC wait' for getData on node 10.7.37.91 0x7F522E1DC090 waiting 2.751614295 seconds, NSDThread: on ThCond 0x7F504C0011F8 (0x7F504C0011F8) (MsgRecordCondvar), reason 'RPC wait' for getData on node 10.7.35.25 0x7F522E1DADA0 waiting 5.083691891 seconds, NSDThread: on ThCond 0x7F507401BE88 (0x7F507401BE88) (MsgRecordCondvar), reason 'RPC wait' for getData on node 10.7.34.61 0x7F522E1D9AB0 waiting 2.263374184 seconds, NSDThread: on ThCond 0x7F5080003B98 (0x7F5080003B98) (MsgRecordCondvar), reason 'RPC wait' for getData on node 10.7.35.36 0x7F522E1D87C0 waiting 0.206989639 seconds, NSDThread: on ThCond 0x7F505801F0D8 (0x7F505801F0D8) (MsgRecordCondvar), reason 'RPC wait' for getData on node 10.7.34.55 0x7F522E1D74D0 waiting *41.841279897* seconds, NSDThread: on ThCond 0x7F5194008B88 (0x7F5194008B88) (MsgRecordCondvar), reason 'RPC wait' for getData on node 172.16.4.143 0x7F522E1D61E0 waiting 5.618652361 seconds, NSDThread: on ThCond 0x1BAB868 (0x1BAB868) (MsgRecordCondvar), reason 'RPC wait' for getData on node 10.7.35.59 0x7F522E1D4EF0 
waiting 6.185658427 seconds, NSDThread: on ThCond 0x7F513802AAE8 (0x7F513802AAE8) (MsgRecordCondvar), reason 'RPC wait' for getData on node 10.7.35.6 0x7F522E1D3C00 waiting 2.652370892 seconds, NSDThread: on ThCond 0x7F5130004C78 (0x7F5130004C78) (MsgRecordCondvar), reason 'RPC wait' for getData on node 10.7.34.45 0x7F522E1D2910 waiting 11.396142225 seconds, NSDThread: on ThCond 0x7F51A401C0C8 (0x7F51A401C0C8) (MsgRecordCondvar), reason 'RPC wait' for getData on node 172.16.4.169 0x7F522E1D1620 waiting 63.710723043 seconds, NSDThread: on ThCond 0x7F5038004D08 (0x7F5038004D08) (MsgRecordCondvar), reason 'RPC wait' for getData on node 10.7.37.120 or for massive reads: 0x7FBCE69A8C20 waiting 29.262629530 seconds, NSDThread: on ThCond 0x7FBBF0045D40 (0x7FBBF0045D40) (VdiskLogAppendCondvar), reason 'wait for permission to append to log' 0x7FBCE699CEC0 waiting 29.260869141 seconds, NSDThread: on ThCond 0x7FBBF0045D40 (0x7FBBF0045D40) (VdiskLogAppendCondvar), reason 'wait for permission to append to log' 0x7FBCE698C5A0 waiting 29.124824888 seconds, NSDThread: on ThCond 0x7FBBF0045D40 (0x7FBBF0045D40) (VdiskLogAppendCondvar), reason 'wait for permission to append to log' 0x7FBCE6984110 waiting 22.729479654 seconds, NSDThread: on ThCond 0x7FBBF0045D40 (0x7FBBF0045D40) (VdiskLogAppendCondvar), reason 'wait for permission to append to log' 0x7FBCE69512C0 waiting 29.272805926 seconds, NSDThread: on ThCond 0x7FBBF0045D40 (0x7FBBF0045D40) (VdiskLogAppendCondvar), reason 'wait for permission to append to log' 0x7FBCE69409A0 waiting 28.833650198 seconds, NSDThread: on ThCond 0x18033B74D48 (0xFFFFC90033B74D48) (LeaseWaitCondvar), reason 'Waiting to acquire disklease' 0x7FBCE6924320 waiting 29.237067128 seconds, NSDThread: on ThCond 0x7FBBF0045D40 (0x7FBBF0045D40) (VdiskLogAppendCondvar), reason 'wait for permission to append to log' 0x7FBCE6921D40 waiting 29.237953228 seconds, NSDThread: on ThCond 0x7FBBF0045D40 (0x7FBBF0045D40) (VdiskLogAppendCondvar), reason 'wait for permission to append to log' 0x7FBCE6915FE0 waiting 29.046721161 seconds, NSDThread: on ThCond 0x18033B74D48 (0xFFFFC90033B74D48) (LeaseWaitCondvar), reason 'Waiting to acquire disklease' 0x7FBCE6913A00 waiting 29.264534710 seconds, NSDThread: on ThCond 0x7FBBF0045D40 (0x7FBBF0045D40) (VdiskLogAppendCondvar), reason 'wait for permission to append to log' 0x7FBCE6900B00 waiting 29.267691105 seconds, NSDThread: on ThCond 0x7FBBF0045D40 (0x7FBBF0045D40) (VdiskLogAppendCondvar), reason 'wait for permission to append to log' 0x7FBCE68F7380 waiting 29.266402464 seconds, NSDThread: on ThCond 0x7FBBF0045D40 (0x7FBBF0045D40) (VdiskLogAppendCondvar), reason 'wait for permission to append to log' 0x7FBCE68D2870 waiting 29.276298231 seconds, NSDThread: on ThCond 0x7FBBF0045D40 (0x7FBBF0045D40) (VdiskLogAppendCondvar), reason 'wait for permission to append to log' 0x7FBCE68BADB0 waiting 28.665700576 seconds, NSDThread: on ThCond 0x7FBBF0045D40 (0x7FBBF0045D40) (VdiskLogAppendCondvar), reason 'wait for permission to append to log' 0x7FBCE68B61F0 waiting 29.236878611 seconds, NSDThread: on ThCond 0x7FBBF0045D40 (0x7FBBF0045D40) (VdiskLogAppendCondvar), reason 'wait for permission to append to log' 0x7FBCE6885980 waiting *144*.530487248 seconds, NSDThread: on ThMutex 0x1803396A670 (0xFFFFC9003396A670) (DiskSchedulingMutex) 0x7FBCE68833A0 waiting 29.231066610 seconds, NSDThread: on ThCond 0x7FBBF0045D40 (0x7FBBF0045D40) (VdiskLogAppendCondvar), reason 'wait for permission to append to log' 0x7FBCE68820B0 waiting 29.269954514 seconds, NSDThread: on ThCond 
0x7FBBF0045D40 (0x7FBBF0045D40) (VdiskLogAppendCondvar), reason 'wait for permission to append to log' 0x7FBCE686A5F0 waiting *140*.662994256 seconds, NSDThread: on ThMutex 0x180339A3140 (0xFFFFC900339A3140) (DiskSchedulingMutex) 0x7FBCE6864740 waiting 29.254180742 seconds, NSDThread: on ThCond 0x7FBBF0045D40 (0x7FBBF0045D40) (VdiskLogAppendCondvar), reason 'wait for permission to append to log' 0x7FBCE683FC30 waiting 29.271840565 seconds, NSDThread: on ThCond 0x7FBBF0045D40 (0x7FBBF0045D40) (VdiskLogAppendCondvar), reason 'wait for permission to append to log' 0x7FBCE682E020 waiting 29.200969209 seconds, NSDThread: on ThCond 0x7FBBF0045D40 (0x7FBBF0045D40) (VdiskLogAppendCondvar), reason 'wait for permission to append to log' 0x7FBCE6825B90 waiting 19.136732919 seconds, NSDThread: on ThCond 0x7FBBF0045D40 (0x7FBBF0045D40) (VdiskLogAppendCondvar), reason 'wait for permission to append to log' 0x7FBCE6805C40 waiting 29.236055550 seconds, NSDThread: on ThCond 0x7FBBF0045D40 (0x7FBBF0045D40) (VdiskLogAppendCondvar), reason 'wait for permission to append to log' 0x7FBCE67FEAA0 waiting 29.283264161 seconds, NSDThread: on ThCond 0x7FBBF0045D40 (0x7FBBF0045D40) (VdiskLogAppendCondvar), reason 'wait for permission to append to log' 0x7FBCE67FC4C0 waiting 29.268992663 seconds, NSDThread: on ThCond 0x7FBBF0045D40 (0x7FBBF0045D40) (VdiskLogAppendCondvar), reason 'wait for permission to append to log' 0x7FBCE67DFE40 waiting 29.150900786 seconds, NSDThread: on ThCond 0x7FBBF0045D40 (0x7FBBF0045D40) (VdiskLogAppendCondvar), reason 'wait for permission to append to log' 0x7FBCE67D2DF0 waiting 29.199058463 seconds, NSDThread: on ThCond 0x7FBBF0045D40 (0x7FBBF0045D40) (VdiskLogAppendCondvar), reason 'wait for permission to append to log' 0x7FBCE67D1B00 waiting 29.203199738 seconds, NSDThread: on ThCond 0x7FBBF0045D40 (0x7FBBF0045D40) (VdiskLogAppendCondvar), reason 'wait for permission to append to log' 0x7FBCE67768D0 waiting 29.208231742 seconds, NSDThread: on ThCond 0x7FBBF0045D40 (0x7FBBF0045D40) (VdiskLogAppendCondvar), reason 'wait for permission to append to log' 0x7FBCE6768590 waiting 5.228192589 seconds, NSDThread: on ThCond 0x7FBBF0045D40 (0x7FBBF0045D40) (VdiskLogAppendCondvar), reason 'wait for permission to append to log' 0x7FBCE67672A0 waiting 29.252839376 seconds, NSDThread: on ThCond 0x7FBBF0045D40 (0x7FBBF0045D40) (VdiskLogAppendCondvar), reason 'wait for permission to append to log' 0x7FBCE6757C70 waiting 28.869359044 seconds, NSDThread: on ThCond 0x7FBBF0045D40 (0x7FBBF0045D40) (VdiskLogAppendCondvar), reason 'wait for permission to append to log' 0x7FBCE6748640 waiting 29.289284179 seconds, NSDThread: on ThCond 0x7FBBF0045D40 (0x7FBBF0045D40) (VdiskLogAppendCondvar), reason 'wait for permission to append to log' 0x7FBCE6734450 waiting 29.253591817 seconds, NSDThread: on ThCond 0x7FBBF0045D40 (0x7FBBF0045D40) (VdiskLogAppendCondvar), reason 'wait for permission to append to log' 0x7FBCE6730B80 waiting 29.289987273 seconds, NSDThread: on ThCond 0x7FBBF0045D40 (0x7FBBF0045D40) (VdiskLogAppendCondvar), reason 'wait for permission to append to log' 0x7FBCE6720260 waiting 26.597589551 seconds, NSDThread: on ThCond 0x7FBBF0045D40 (0x7FBBF0045D40) (VdiskLogAppendCondvar), reason 'wait for permission to append to log' 0x7FBCE66F32C0 waiting 29.177692849 seconds, NSDThread: on ThCond 0x7FBBF0045D40 (0x7FBBF0045D40) (VdiskLogAppendCondvar), reason 'wait for permission to append to log' 0x7FBCE66E3C90 waiting 29.160268518 seconds, NSDThread: on ThCond 0x7FBBF0045D40 (0x7FBBF0045D40) 
(VdiskLogAppendCondvar), reason 'wait for permission to append to log' 0x7FBCE66CC1D0 waiting 5.334330188 seconds, NSDThread: on ThCond 0x7FBBF0045D40 (0x7FBBF0045D40) (VdiskLogAppendCondvar), reason 'wait for permission to append to log' 0x7FBCE66B3420 waiting 34.274433161 seconds, NSDThread: on ThMutex 0x180339A3140 (0xFFFFC900339A3140) (DiskSchedulingMutex) 0x7FBCE668E910 waiting 27.699999488 seconds, NSDThread: on ThCond 0x7FBBF0045D40 (0x7FBBF0045D40) (VdiskLogAppendCondvar), reason 'wait for permission to append to log' 0x7FBCE6689D50 waiting 34.279090465 seconds, NSDThread: on ThMutex 0x180339A3140 (0xFFFFC900339A3140) (DiskSchedulingMutex) 0x7FBCE66805D0 waiting 24.688626241 seconds, NSDThread: on ThCond 0x7FBBF0045D40 (0x7FBBF0045D40) (VdiskLogAppendCondvar), reason 'wait for permission to append to log' 0x7FBCE6675B60 waiting 35.367745840 seconds, NSDThread: on ThCond 0x18033B74D48 (0xFFFFC90033B74D48) (LeaseWaitCondvar), reason 'Waiting to acquire disklease' 0x7FBCE665E0A0 waiting 29.235994598 seconds, NSDThread: on ThCond 0x7FBBF0045D40 (0x7FBBF0045D40) (VdiskLogAppendCondvar), reason 'wait for permission to append to log' 0x7FBCE663CE60 waiting 29.162911979 seconds, NSDThread: on ThCond 0x7FBBF0045D40 (0x7FBBF0045D40) (VdiskLogAppendCondvar), reason 'wait for permission to append to log' Another example with mmfsadm in case of massive reads: [root at gss02b ~]# mmfsadm dump waiters 0x7F519000AEA0 waiting 28.915010347 seconds, replyCleanupThread: on ThCond 0x7F51101B27B8 (0x7F51101B27B8) (MsgRecordCondvar), reason 'RPC wait' 0x7F511C012A10 waiting 279.522206863 seconds, Msg handler commMsgCheckMessages: on ThCond 0x7F52000095F8 (0x7F52000095F8) (InuseCondvar), reason 'waiting for exclusive use of connection for sending msg' 0x7F5120000B80 waiting 279.524782437 seconds, Msg handler commMsgCheckMessages: on ThCond 0x7F5214000EE8 (0x7F5214000EE8) (InuseCondvar), reason 'waiting for exclusive use of connection for sending msg' 0x7F5154006310 waiting 138.164386224 seconds, Msg handler commMsgCheckMessages: on ThCond 0x7F5174003F08 (0x7F5174003F08) (InuseCondvar), reason 'waiting for exclusive use of connection for sending msg' 0x7F522E1EB6C0 waiting 23.060703000 seconds, NSDThread: for poll on sock 85 0x7F522E1E6B00 waiting 0.068456104 seconds, NSDThread: on ThCond 0x7F50CC00E478 (0x7F50CC00E478) (InuseCondvar), reason 'waiting for exclusive use of connection for sending msg' 0x7F522E1D0330 waiting 17.207907857 seconds, NSDThread: on ThCond 0x7F5078001688 (0x7F5078001688) (InuseCondvar), reason 'waiting for exclusive use of connection for sending msg' 0x7F522E1BFA10 waiting 0.181011711 seconds, NSDThread: on ThCond 0x7F504000E558 (0x7F504000E558) (InuseCondvar), reason 'waiting for exclusive use of connection for sending msg' 0x7F522E1B4FA0 waiting 0.021780338 seconds, NSDThread: on ThCond 0x7F522000E488 (0x7F522000E488) (InuseCondvar), reason 'waiting for exclusive use of connection for sending msg' 0x7F522E1B3CB0 waiting 0.794718000 seconds, NSDThread: for poll on sock 799 0x7F522E186D10 waiting 0.191606803 seconds, NSDThread: on ThCond 0x7F5184015D58 (0x7F5184015D58) (InuseCondvar), reason 'waiting for exclusive use of connection for sending msg' 0x7F522E184730 waiting 0.025562000 seconds, NSDThread: for poll on sock 867 0x7F522E12CDD0 waiting 0.008921000 seconds, NSDThread: for poll on sock 543 0x7F522E126F20 waiting 1.459531000 seconds, NSDThread: for poll on sock 983 0x7F522E10F460 waiting 17.177936972 seconds, NSDThread: on ThCond 0x7F51EC002CE8 (0x7F51EC002CE8) 
(InuseCondvar), reason 'waiting for exclusive use of connection for sending msg' 0x7F522E101120 waiting 17.232580316 seconds, NSDThread: on ThCond 0x7F51BC005BB8 (0x7F51BC005BB8) (InuseCondvar), reason 'waiting for exclusive use of connection for sending msg' 0x7F522E0F1AF0 waiting 438.556030000 seconds, NSDThread: for poll on sock 496 0x7F522E0E7080 waiting 393.702839774 seconds, NSDThread: on ThCond 0x7F5164013668 (0x7F5164013668) (InuseCondvar), reason 'waiting for exclusive use of connection for sending msg' 0x7F522E09DA60 waiting 52.746984660 seconds, NSDThread: on ThCond 0x7F506C008858 (0x7F506C008858) (InuseCondvar), reason 'waiting for exclusive use of connection for sending msg' 0x7F522E084CB0 waiting 23.096688206 seconds, NSDThread: on ThCond 0x7F521C008E18 (0x7F521C008E18) (InuseCondvar), reason 'waiting for exclusive use of connection for sending msg' 0x7F522E0839C0 waiting 0.093456000 seconds, NSDThread: for poll on sock 962 0x7F522E076970 waiting 2.236659731 seconds, NSDThread: on ThCond 0x7F51E0027538 (0x7F51E0027538) (InuseCondvar), reason 'waiting for exclusive use of connection for sending msg' 0x7F522E044E10 waiting 52.752497765 seconds, NSDThread: on ThCond 0x7F513802BDD8 (0x7F513802BDD8) (InuseCondvar), reason 'waiting for exclusive use of connection for sending msg' 0x7F522E033200 waiting 16.157355796 seconds, NSDThread: on ThCond 0x7F5104240D58 (0x7F5104240D58) (InuseCondvar), reason 'waiting for exclusive use of connection for sending msg' 0x7F522E02AD70 waiting 436.025203220 seconds, NSDThread: on ThCond 0x7F50E0016C28 (0x7F50E0016C28) (InuseCondvar), reason 'waiting for exclusive use of connection for sending msg' 0x7F522E01A450 waiting 393.673252777 seconds, NSDThread: on ThCond 0x7F50A8009C18 (0x7F50A8009C18) (InuseCondvar), reason 'waiting for exclusive use of connection for sending msg' 0x7F522DFE0460 waiting 1.781358358 seconds, NSDThread: on ThCond 0x7F51E0027638 (0x7F51E0027638) (InuseCondvar), reason 'waiting for exclusive use of connection for sending msg' 0x7F522DF99420 waiting 0.038405427 seconds, NSDThread: on ThCond 0x7F50F0172B18 (0x7F50F0172B18) (InuseCondvar), reason 'waiting for exclusive use of connection for sending msg' 0x7F522DF7CDA0 waiting 438.204625355 seconds, NSDThread: on ThCond 0x7F50900023D8 (0x7F50900023D8) (InuseCondvar), reason 'waiting for exclusive use of connection for sending msg' 0x7F522DF76EF0 waiting 435.903645734 seconds, NSDThread: on ThCond 0x7F5084004BC8 (0x7F5084004BC8) (InuseCondvar), reason 'waiting for exclusive use of connection for sending msg' 0x7F522DF74910 waiting 21.749325022 seconds, NSDThread: on ThCond 0x7F507C011F48 (0x7F507C011F48) (InuseCondvar), reason 'waiting for exclusive use of connection for sending msg' 0x7F522DF71040 waiting 1.027274000 seconds, NSDThread: for poll on sock 866 0x7F522DF536D0 waiting 52.953847324 seconds, NSDThread: on ThCond 0x7F5200006FF8 (0x7F5200006FF8) (InuseCondvar), reason 'waiting for exclusive use of connection for sending msg' 0x7F522DF510F0 waiting 0.039278000 seconds, NSDThread: for poll on sock 837 0x7F522DF4EB10 waiting 0.085745937 seconds, NSDThread: on ThCond 0x7F51F0006828 (0x7F51F0006828) (InuseCondvar), reason 'waiting for exclusive use of connection for sending msg' 0x7F522DF4C530 waiting 21.850733000 seconds, NSDThread: for poll on sock 986 0x7F522DF4B240 waiting 0.054739884 seconds, NSDThread: on ThCond 0x7F51EC0168D8 (0x7F51EC0168D8) (InuseCondvar), reason 'waiting for exclusive use of connection for sending msg' 0x7F522DF48C60 waiting 0.186409714 seconds, 
NSDThread: on ThCond 0x7F51E4000908 (0x7F51E4000908) (InuseCondvar), reason 'waiting for exclusive use of connection for sending msg' 0x7F522DF41AC0 waiting 438.942861290 seconds, NSDThread: on ThCond 0x7F51CC010168 (0x7F51CC010168) (InuseCondvar), reason 'waiting for exclusive use of connection for sending msg' 0x7F522DF3F4E0 waiting 0.060235106 seconds, NSDThread: on ThCond 0x7F51C400A438 (0x7F51C400A438) (InuseCondvar), reason 'waiting for exclusive use of connection for sending msg' 0x7F522DF22E60 waiting 0.361288000 seconds, NSDThread: for poll on sock 518 0x7F522DF21B70 waiting 0.060722464 seconds, NSDThread: on ThCond 0x7F51580162D8 (0x7F51580162D8) (InuseCondvar), reason 'waiting for exclusive use of connection for sending msg' 0x7F522DF12540 waiting 23.077564448 seconds, NSDThread: on ThCond 0x7F512C13E1E8 (0x7F512C13E1E8) (InuseCondvar), reason 'waiting for exclusive use of connection for sending msg' 0x7F522DEFD060 waiting 0.723370000 seconds, NSDThread: for poll on sock 503 0x7F522DEE09E0 waiting 1.565799175 seconds, NSDThread: on ThCond 0x7F5084004D58 (0x7F5084004D58) (InuseCondvar), reason 'waiting for exclusive use of connection for sending msg' 0x7F522DEDF6F0 waiting 22.063017342 seconds, NSDThread: on ThCond 0x7F5078003E08 (0x7F5078003E08) (InuseCondvar), reason 'waiting for exclusive use of connection for sending msg' 0x7F522DEDD110 waiting 0.049108780 seconds, NSDThread: on ThCond 0x7F5070001D78 (0x7F5070001D78) (InuseCondvar), reason 'waiting for exclusive use of connection for sending msg' 0x7F522DEDAB30 waiting 229.603224376 seconds, NSDThread: on ThCond 0x7F50680221B8 (0x7F50680221B8) (InuseCondvar), reason 'waiting for exclusive use of connection for sending msg' 0x7F522DED7260 waiting 0.071855457 seconds, NSDThread: on ThCond 0x7F506400A5A8 (0x7F506400A5A8) (InuseCondvar), reason 'waiting for exclusive use of connection for sending msg' 0x7F522DED5F70 waiting 0.648324000 seconds, NSDThread: for poll on sock 766 0x7F522DEC3070 waiting 1.809205756 seconds, NSDThread: on ThCond 0x7F522000E518 (0x7F522000E518) (InuseCondvar), reason 'waiting for exclusive use of connection for sending msg' 0x7F522DEB1460 waiting 436.017396645 seconds, NSDThread: on ThCond 0x7F51E4000978 (0x7F51E4000978) (InuseCondvar), reason 'waiting for exclusive use of connection for sending msg' 0x7F522DEAC8A0 waiting 393.734102000 seconds, NSDThread: for poll on sock 609 0x7F522DEA3120 waiting 17.960778837 seconds, NSDThread: on ThCond 0x7F51B4001708 (0x7F51B4001708) (InuseCondvar), reason 'waiting for exclusive use of connection for sending msg' 0x7F522DE86AA0 waiting 23.112060045 seconds, NSDThread: on ThCond 0x7F5154096118 (0x7F5154096118) (InuseCondvar), reason 'waiting for exclusive use of connection for sending msg' 0x7F522DE64570 waiting 0.076167410 seconds, NSDThread: on ThCond 0x7F50D8005EF8 (0x7F50D8005EF8) (InuseCondvar), reason 'waiting for exclusive use of connection for sending msg' 0x7F522DE1AF50 waiting 17.460836000 seconds, NSDThread: for poll on sock 737 0x7F522DE104E0 waiting 0.205037000 seconds, NSDThread: for poll on sock 865 0x7F522DDB8B80 waiting 0.106192000 seconds, NSDThread: for poll on sock 78 0x7F522DDA36A0 waiting 0.738921180 seconds, NSDThread: on ThCond 0x7F505400E048 (0x7F505400E048) (InuseCondvar), reason 'waiting for exclusive use of connection for sending msg' 0x7F522DD9C500 waiting 0.731118367 seconds, NSDThread: on ThCond 0x7F503C00B518 (0x7F503C00B518) (InuseCondvar), reason 'waiting for exclusive use of connection for sending msg' 0x7F522DD89600 waiting 
229.609363000 seconds, NSDThread: for poll on sock 515
0x7F522DD567B0 waiting 1.508489195 seconds, NSDThread: on ThCond 0x7F514C021F88 (0x7F514C021F88) (InuseCondvar), reason 'waiting for exclusive use of connection for sending msg'

Another thing worth mentioning is that the filesystem is totally unresponsive. Even a simple "cd" into a directory or an ls of a directory just hangs for several minutes (literally). This also happens if I try from the NSD servers.

*A few things I have looked into:*

* Our network seems fine; there might be some bottleneck in parts of it, which could explain the waiters, but it doesn't explain why at some point the clients ask to expel the NSD servers. It also doesn't explain why the FS is slow even on the NSD servers themselves.

* Disk bottleneck? I don't think so. CPU usage (and I/O wait) on the NSD servers is very low, and mmdiag --iohist seems to confirm that the operations on the disks are reasonably fast:

=== mmdiag: iohist ===

I/O history:

I/O start time RW Buf type disk:sectorNum nSec time ms Type Device/NSD ID NSD server
--------------- -- ----------- ----------------- ----- ------- ---- ------------------ ---------------
13:54:29.209276 W data 34:5066338808 2056 88.307 lcl sdtu
13:54:29.209277 W data 55:5095698936 2056 27.592 lcl sdaab
13:54:29.209278 W data 171:5104087544 2056 22.801 lcl sdtg
13:54:29.209279 W data 116:5011812856 2056 65.983 lcl sdqr
13:54:29.209280 W data 98:4860817912 2056 17.892 lcl sddl
13:54:29.209281 W data 159:4999229944 2056 21.324 lcl sdjg
13:54:29.209282 W data 84:5049561592 2056 31.932 lcl sdqz
13:54:29.209283 W data 8:5003424248 2056 30.912 lcl sdcw
13:54:29.209284 W data 23:4965675512 2056 27.366 lcl sdpt
13:54:29.297715 W vdiskMDLog 2:144008496 1 0.236 lcl sdkr
13:54:29.297717 W vdiskMDLog 0:331703600 1 0.230 lcl sdcm
13:54:29.297718 W vdiskMDLog 1:273769776 1 0.241 lcl sdbp
13:54:29.244902 W data 51:3857589752 2056 35.566 lcl sdyi
13:54:29.244904 W data 10:3773703672 2056 28.512 lcl sdma
13:54:29.244905 W data 48:3639485944 2056 24.124 lcl sdel
13:54:29.244906 W data 25:3777897976 2056 18.691 lcl sdgt
13:54:29.244908 W data 91:3832423928 2056 20.699 lcl sdlc
13:54:29.244909 W data 115:3723372024 2056 30.783 lcl sdho
13:54:29.244910 W data 173:3882755576 2056 53.241 lcl sdti
13:54:29.244911 W data 42:3782092280 2056 22.785 lcl sddz
13:54:29.244912 W data 45:3647874552 2056 24.289 lcl sdei
13:54:29.244913 W data 32:3652068856 2056 17.220 lcl sdbn
13:54:29.244914 W data 39:3677234680 2056 26.017 lcl sddw
13:54:29.298273 W vdiskMDLog 2:144008497 1 2.522 lcl sduf
13:54:29.298274 W vdiskMDLog 0:331703601 1 1.025 lcl sdlo
13:54:29.298275 W vdiskMDLog 1:273769777 1 2.586 lcl sdtt
13:54:29.288275 W data 27:2249588200 2056 20.071 lcl sdhb
13:54:29.288279 W data 33:2224422376 2056 19.682 lcl sdts
13:54:29.288281 W data 47:2115370472 2056 21.667 lcl sdwo
13:54:29.288282 W data 82:2316697064 2056 21.524 lcl sdxy
13:54:29.288283 W data 85:2232810984 2056 17.467 lcl sdra
13:54:29.288285 W data 30:2127953384 2056 18.475 lcl sdqg
13:54:29.288286 W data 67:1876295144 2056 16.383 lcl sdmx
13:54:29.288287 W data 64:2127953384 2056 21.908 lcl sduh
13:54:29.288288 W data 38:2253782504 2056 19.775 lcl sddv
13:54:29.288290 W data 15:2207645160 2056 20.599 lcl sdet
13:54:29.288291 W data 157:2283142632 2056 21.198 lcl sdiy

* Bonding problem on the interfaces? The Mellanox (interface card producer) drivers and firmware have been updated, and we even tested the system with a single link (without bonding).

Could someone help me with this? (A simplified sketch of how this information is being collected from the servers is below.)
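A rough sketch of that collection loop (simplified, not the exact script; the hostnames are our GSS NSD servers):

  for h in gss01a gss01b gss02a gss02b gss03a gss03b; do
      echo "==== $h ===="
      ssh $h 'mmdiag --waiters; mmdiag --iohist | tail -40'
  done

This is run every few seconds while the problem is happening, so the long waiters can be caught while they grow.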
In particular:

* What exactly do the clients look at to decide that another node is unresponsive? Ping? I don't think so, because both the NSD servers and the clients can be pinged, so what do they look at? If someone can also specify which port they use, I can try to tcpdump what exactly is causing this expel.

* How can I monitor metadata operations to understand EXACTLY where the bottleneck causing this is:

[sdinardo at ebi5-001 ~]$ time ls /gpfs/nobackup/sdinardo
1 ebi3-054.ebi.ac.uk ebi3-154 ebi5-019.ebi.ac.uk ebi5-052 ebi5-101 ebi5-156 ebi5-197 ebi5-228 ebi5-262.ebi.ac.uk
10 ebi3-055 ebi3-155 ebi5-021.ebi.ac.uk ebi5-053 ebi5-104.ebi.ac.uk ebi5-160.ebi.ac.uk ebi5-198 ebi5-229 ebi5-263
2 ebi3-056.ebi.ac.uk ebi3-156 ebi5-022 ebi5-054.ebi.ac.uk ebi5-106 ebi5-161 ebi5-200 ebi5-230.ebi.ac.uk ebi5-264
3 ebi3-057 ebi3-157 ebi5-023 ebi5-056 ebi5-109 ebi5-162.ebi.ac.uk ebi5-201 ebi5-231.ebi.ac.uk ebi5-265
4 ebi3-058 ebi3-158.ebi.ac.uk ebi5-024.ebi.ac.uk ebi5-057 ebi5-110.ebi.ac.uk ebi5-163.ebi.ac.uk ebi5-202.ebi.ac.uk ebi5-232 ebi5-266.ebi.ac.uk
5 ebi3-059.ebi.ac.uk ebi3-160 ebi5-025 ebi5-060 ebi5-111.ebi.ac.uk ebi5-164 ebi5-204 ebi5-233 ebi5-267
6 ebi3-132 ebi3-161.ebi.ac.uk ebi5-026 ebi5-061.ebi.ac.uk ebi5-112.ebi.ac.uk ebi5-165 ebi5-205 ebi5-234 ebi5-269.ebi.ac.uk
7 ebi3-133 ebi3-163.ebi.ac.uk ebi5-028 ebi5-062.ebi.ac.uk ebi5-129.ebi.ac.uk ebi5-166 ebi5-206.ebi.ac.uk ebi5-236 ebi5-270
8 ebi3-134 ebi3-165 ebi5-030 ebi5-064 ebi5-131.ebi.ac.uk ebi5-169.ebi.ac.uk ebi5-207 ebi5-237 ebi5-271
9 ebi3-135 ebi3-166.ebi.ac.uk ebi5-031 ebi5-065 ebi5-132 ebi5-170.ebi.ac.uk ebi5-209 ebi5-239.ebi.ac.uk launcher.sh

_*real 21m14.948s*_ ( WTH ?!?!?!)
user 0m0.004s
sys 0m0.014s

I know these questions are not easy to answer and I need to dig more, but it would be very helpful if someone could give me some hints about where to look. My GPFS skills are limited since this is our first system and it has been in production for just a few months, and things started to worsen only recently. In the past we could get over 200Gb/s (both read and write) without any issue. Now some clients get expelled even when the data throughput is at 4-5Gb/s.

Thanks in advance for any help.

Regards,
Salvatore
-------------- next part --------------
An HTML attachment was scrubbed...
URL:
From mail at arif-ali.co.uk Tue Aug 19 11:18:10 2014 From: mail at arif-ali.co.uk (Arif Ali) Date: Tue, 19 Aug 2014 11:18:10 +0100 Subject: [gpfsug-discuss] gpfsug Maintenance Message-ID:
Hi all,

You may be aware that the website has been down for about a week now. This is due to the amount of traffic to the website and the number of people on the mailing list; we had seen a few issues on the system.

In order to counter the issues, we are moving to a new system, both to avoid future problems and for ease of management. We are hoping to do this tonight (between 20:00 - 23:00 BST). If this causes an issue for anyone, then please let me know.

I will, as part of the move over, be sending a few test mails to make sure that the mailing list is working correctly.

Thanks for your patience

--
Arif Ali
gpfsug Admin

IRC: arif-ali at freenode
LinkedIn: http://uk.linkedin.com/in/arifali
-------------- next part --------------
An HTML attachment was scrubbed...
URL:
From sdinardo at ebi.ac.uk Tue Aug 19 12:11:00 2014 From: sdinardo at ebi.ac.uk (Salvatore Di Nardo) Date: Tue, 19 Aug 2014 12:11:00 +0100 Subject: [gpfsug-discuss] gpfs client expels In-Reply-To: <53EE0BB1.8000005@ebi.ac.uk> References: <53EE0BB1.8000005@ebi.ac.uk> Message-ID: <53F330C4.808@ebi.ac.uk>
Still problems.
Here some more detailed examples: *EXAMPLE 1:* *EBI5-220**( CLIENT)** *Tue Aug 19 11:03:04.980 2014: *Timed out waiting for a reply from node gss02b* Tue Aug 19 11:03:04.981 2014: Request sent to (gss02a in GSS.ebi.ac.uk) to expel (gss02b in GSS.ebi.ac.uk) from cluster GSS.ebi.ac.uk Tue Aug 19 11:03:04.982 2014: This node will be expelled from cluster GSS.ebi.ac.uk due to expel msg from (ebi5-220) Tue Aug 19 11:03:09.319 2014: Cluster Manager connection broke. Probing cluster GSS.ebi.ac.uk Tue Aug 19 11:03:10.321 2014: Unable to contact any quorum nodes during cluster probe. Tue Aug 19 11:03:10.322 2014: Lost membership in cluster GSS.ebi.ac.uk. Unmounting file systems. Tue Aug 19 11:03:10 BST 2014: mmcommon preunmount invoked. File system: gpfs1 Reason: SGPanic Tue Aug 19 11:03:12.066 2014: Connecting to gss02a Tue Aug 19 11:03:12.070 2014: Connected to gss02a Tue Aug 19 11:03:17.071 2014: Connecting to gss02b Tue Aug 19 11:03:17.072 2014: Connecting to gss03b Tue Aug 19 11:03:17.079 2014: Connecting to gss03a Tue Aug 19 11:03:17.080 2014: Connecting to gss01b Tue Aug 19 11:03:17.079 2014: Connecting to gss01a Tue Aug 19 11:04:23.105 2014: Connected to gss02b Tue Aug 19 11:04:23.107 2014: Connected to gss03b Tue Aug 19 11:04:23.112 2014: Connected to gss03a Tue Aug 19 11:04:23.115 2014: Connected to gss01b Tue Aug 19 11:04:23.121 2014: Connected to gss01a Tue Aug 19 11:12:28.992 2014: Node (gss02a in GSS.ebi.ac.uk) is now the Group Leader. *GSS02B ( NSD SERVER)* ... Tue Aug 19 11:03:17.070 2014: Killing connection from ** because the group is not ready for it to rejoin, err 46 Tue Aug 19 11:03:25.016 2014: Killing connection from because the group is not ready for it to rejoin, err 46 Tue Aug 19 11:03:28.080 2014: Killing connection from ** because the group is not ready for it to rejoin, err 46 Tue Aug 19 11:03:36.019 2014: Killing connection from because the group is not ready for it to rejoin, err 46 Tue Aug 19 11:03:39.083 2014: Killing connection from ** because the group is not ready for it to rejoin, err 46 Tue Aug 19 11:03:47.023 2014: Killing connection from because the group is not ready for it to rejoin, err 46 Tue Aug 19 11:03:50.088 2014: Killing connection from ** because the group is not ready for it to rejoin, err 46 Tue Aug 19 11:03:52.218 2014: Killing connection from because the group is not ready for it to rejoin, err 46 Tue Aug 19 11:03:58.030 2014: Killing connection from because the group is not ready for it to rejoin, err 46 Tue Aug 19 11:04:01.092 2014: Killing connection from ** because the group is not ready for it to rejoin, err 46 Tue Aug 19 11:04:03.220 2014: Killing connection from because the group is not ready for it to rejoin, err 46 Tue Aug 19 11:04:09.034 2014: Killing connection from because the group is not ready for it to rejoin, err 46 Tue Aug 19 11:04:12.096 2014: Killing connection from ** because the group is not ready for it to rejoin, err 46 Tue Aug 19 11:04:14.224 2014: Killing connection from because the group is not ready for it to rejoin, err 46 Tue Aug 19 11:04:20.037 2014: Killing connection from because the group is not ready for it to rejoin, err 46 Tue Aug 19 11:04:23.103 2014: Accepted and connected to ** ebi5-220 ... *GSS02a ( NSD SERVER)* Tue Aug 19 11:03:04.980 2014: Expel (gss02b) request from (ebi5-220 in ebi-cluster.ebi.ac.uk). 
Expelling: (ebi5-220 in ebi-cluster.ebi.ac.uk) Tue Aug 19 11:03:12.069 2014: Accepted and connected to ebi5-220 =============================================== *EXAMPLE 2*: *EBI5-038* Tue Aug 19 11:32:34.227 2014: *Disk lease period expired in cluster GSS.ebi.ac.uk. Attempting to reacquire lease.* Tue Aug 19 11:33:34.258 2014: *Lease is overdue. Probing cluster GSS.ebi.ac.uk* Tue Aug 19 11:35:24.265 2014: Close connection to gss02a (Connection reset by peer). Attempting reconnect. Tue Aug 19 11:35:24.865 2014: Close connection to ebi5-014 (Connection reset by peer). Attempting reconnect. ... LOT MORE RESETS BY PEER ... Tue Aug 19 11:35:25.096 2014: Close connection to ebi5-167 (Connection reset by peer). Attempting reconnect. Tue Aug 19 11:35:25.267 2014: Connecting to gss02a Tue Aug 19 11:35:25.268 2014: Close connection to gss02a (Connection failed because destination is still processing previous node failure) Tue Aug 19 11:35:26.267 2014: Retry connection to gss02a Tue Aug 19 11:35:26.268 2014: Close connection to gss02a (Connection failed because destination is still processing previous node failure) Tue Aug 19 11:36:24.276 2014: Unable to contact any quorum nodes during cluster probe. Tue Aug 19 11:36:24.277 2014: *Lost membership in cluster GSS.ebi.ac.uk. Unmounting file systems.* *GSS02a* Tue Aug 19 11:35:24.263 2014: Node (ebi5-038 in ebi-cluster.ebi.ac.uk) *is being expelled because of an expired lease.* Pings sent: 60. Replies received: 60. In example 1 seems that an NSD was not repliyng to the client, but the servers seems working fine.. how can i trace better ( to solve) the problem? In example 2 it seems to me that for some reason the manager are not renewing the lease in time. when this happens , its not a single client. Loads of them fail to get the lease renewed. Why this is happening? how can i trace to the source of the problem? Thanks in advance for any tips. Regards, Salvatore -------------- next part -------------- An HTML attachment was scrubbed... URL: From mail at arif-ali.co.uk Tue Aug 19 20:59:47 2014 From: mail at arif-ali.co.uk (Arif Ali) Date: Tue, 19 Aug 2014 20:59:47 +0100 Subject: [gpfsug-discuss] gpfsug Maintenance In-Reply-To: References: Message-ID: This is a test mail to the mailing list please do not reply -- Arif Ali IRC: arif-ali at freenode LinkedIn: http://uk.linkedin.com/in/arifali On 19 August 2014 11:18, Arif Ali wrote: > Hi all, > > You may be aware that the website has been down for about a week now. This > is due to the amount of traffic to the website and the amount of people on > the mailing list, we had seen a few issues on the system. > > In order to counter the issues, we are moving to a new system to counter > any future issues, and ease of management. We are hoping to do this tonight > ( between 20:00 - 23:00 BST). If this causes an issue for anyone, then > please let me know. > > I will, as part of the move over, will be sending a few test mails to make > sure that mailing list is working correctly. > > Thanks for your patience > > -- > Arif Ali > gpfsug Admin > > IRC: arif-ali at freenode > LinkedIn: http://uk.linkedin.com/in/arifali > -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From mail at arif-ali.co.uk Tue Aug 19 23:41:48 2014 From: mail at arif-ali.co.uk (Arif Ali) Date: Tue, 19 Aug 2014 23:41:48 +0100 Subject: [gpfsug-discuss] gpfsug Maintenance In-Reply-To: References: Message-ID: Thanks for all your patience, The service should all be back up again -- Arif Ali IRC: arif-ali at freenode LinkedIn: http://uk.linkedin.com/in/arifali On 19 August 2014 20:59, Arif Ali wrote: > This is a test mail to the mailing list > > please do not reply > > -- > Arif Ali > > IRC: arif-ali at freenode > LinkedIn: http://uk.linkedin.com/in/arifali > > > On 19 August 2014 11:18, Arif Ali wrote: > >> Hi all, >> >> You may be aware that the website has been down for about a week now. >> This is due to the amount of traffic to the website and the amount of >> people on the mailing list, we had seen a few issues on the system. >> >> In order to counter the issues, we are moving to a new system to counter >> any future issues, and ease of management. We are hoping to do this tonight >> ( between 20:00 - 23:00 BST). If this causes an issue for anyone, then >> please let me know. >> >> I will, as part of the move over, will be sending a few test mails to >> make sure that mailing list is working correctly. >> >> Thanks for your patience >> >> -- >> Arif Ali >> gpfsug Admin >> >> IRC: arif-ali at freenode >> LinkedIn: http://uk.linkedin.com/in/arifali >> > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From sdinardo at ebi.ac.uk Wed Aug 20 08:57:23 2014 From: sdinardo at ebi.ac.uk (Salvatore Di Nardo) Date: Wed, 20 Aug 2014 08:57:23 +0100 Subject: [gpfsug-discuss] gpfs client expels In-Reply-To: <53EE0BB1.8000005@ebi.ac.uk> References: <53EE0BB1.8000005@ebi.ac.uk> Message-ID: <53F454E3.40803@ebi.ac.uk> Still problems. Here some more detailed examples: *EXAMPLE 1:* *EBI5-220**( CLIENT)** *Tue Aug 19 11:03:04.980 2014: *Timed out waiting for a reply from node gss02b* Tue Aug 19 11:03:04.981 2014: Request sent to (gss02a in GSS.ebi.ac.uk) to expel (gss02b in GSS.ebi.ac.uk) from cluster GSS.ebi.ac.uk Tue Aug 19 11:03:04.982 2014: This node will be expelled from cluster GSS.ebi.ac.uk due to expel msg from (ebi5-220) Tue Aug 19 11:03:09.319 2014: Cluster Manager connection broke. Probing cluster GSS.ebi.ac.uk Tue Aug 19 11:03:10.321 2014: Unable to contact any quorum nodes during cluster probe. Tue Aug 19 11:03:10.322 2014: Lost membership in cluster GSS.ebi.ac.uk. Unmounting file systems. Tue Aug 19 11:03:10 BST 2014: mmcommon preunmount invoked. File system: gpfs1 Reason: SGPanic Tue Aug 19 11:03:12.066 2014: Connecting to gss02a Tue Aug 19 11:03:12.070 2014: Connected to gss02a Tue Aug 19 11:03:17.071 2014: Connecting to gss02b Tue Aug 19 11:03:17.072 2014: Connecting to gss03b Tue Aug 19 11:03:17.079 2014: Connecting to gss03a Tue Aug 19 11:03:17.080 2014: Connecting to gss01b Tue Aug 19 11:03:17.079 2014: Connecting to gss01a Tue Aug 19 11:04:23.105 2014: Connected to gss02b Tue Aug 19 11:04:23.107 2014: Connected to gss03b Tue Aug 19 11:04:23.112 2014: Connected to gss03a Tue Aug 19 11:04:23.115 2014: Connected to gss01b Tue Aug 19 11:04:23.121 2014: Connected to gss01a Tue Aug 19 11:12:28.992 2014: Node (gss02a in GSS.ebi.ac.uk) is now the Group Leader. *GSS02B ( NSD SERVER)* ... 
Tue Aug 19 11:03:17.070 2014: Killing connection from ** because the group is not ready for it to rejoin, err 46 Tue Aug 19 11:03:25.016 2014: Killing connection from because the group is not ready for it to rejoin, err 46 Tue Aug 19 11:03:28.080 2014: Killing connection from ** because the group is not ready for it to rejoin, err 46 Tue Aug 19 11:03:36.019 2014: Killing connection from because the group is not ready for it to rejoin, err 46 Tue Aug 19 11:03:39.083 2014: Killing connection from ** because the group is not ready for it to rejoin, err 46 Tue Aug 19 11:03:47.023 2014: Killing connection from because the group is not ready for it to rejoin, err 46 Tue Aug 19 11:03:50.088 2014: Killing connection from ** because the group is not ready for it to rejoin, err 46 Tue Aug 19 11:03:52.218 2014: Killing connection from because the group is not ready for it to rejoin, err 46 Tue Aug 19 11:03:58.030 2014: Killing connection from because the group is not ready for it to rejoin, err 46 Tue Aug 19 11:04:01.092 2014: Killing connection from ** because the group is not ready for it to rejoin, err 46 Tue Aug 19 11:04:03.220 2014: Killing connection from because the group is not ready for it to rejoin, err 46 Tue Aug 19 11:04:09.034 2014: Killing connection from because the group is not ready for it to rejoin, err 46 Tue Aug 19 11:04:12.096 2014: Killing connection from ** because the group is not ready for it to rejoin, err 46 Tue Aug 19 11:04:14.224 2014: Killing connection from because the group is not ready for it to rejoin, err 46 Tue Aug 19 11:04:20.037 2014: Killing connection from because the group is not ready for it to rejoin, err 46 Tue Aug 19 11:04:23.103 2014: Accepted and connected to ** ebi5-220 ... *GSS02a ( NSD SERVER)* Tue Aug 19 11:03:04.980 2014: Expel (gss02b) request from (ebi5-220 in ebi-cluster.ebi.ac.uk). Expelling: (ebi5-220 in ebi-cluster.ebi.ac.uk) Tue Aug 19 11:03:12.069 2014: Accepted and connected to ebi5-220 =============================================== *EXAMPLE 2*: *EBI5-038* Tue Aug 19 11:32:34.227 2014: *Disk lease period expired in cluster GSS.ebi.ac.uk. Attempting to reacquire lease.* Tue Aug 19 11:33:34.258 2014: *Lease is overdue. Probing cluster GSS.ebi.ac.uk* Tue Aug 19 11:35:24.265 2014: Close connection to gss02a (Connection reset by peer). Attempting reconnect. Tue Aug 19 11:35:24.865 2014: Close connection to ebi5-014 (Connection reset by peer). Attempting reconnect. ... LOT MORE RESETS BY PEER ... Tue Aug 19 11:35:25.096 2014: Close connection to ebi5-167 (Connection reset by peer). Attempting reconnect. Tue Aug 19 11:35:25.267 2014: Connecting to gss02a Tue Aug 19 11:35:25.268 2014: Close connection to gss02a (Connection failed because destination is still processing previous node failure) Tue Aug 19 11:35:26.267 2014: Retry connection to gss02a Tue Aug 19 11:35:26.268 2014: Close connection to gss02a (Connection failed because destination is still processing previous node failure) Tue Aug 19 11:36:24.276 2014: Unable to contact any quorum nodes during cluster probe. Tue Aug 19 11:36:24.277 2014: *Lost membership in cluster GSS.ebi.ac.uk. Unmounting file systems.* *GSS02a* Tue Aug 19 11:35:24.263 2014: Node (ebi5-038 in ebi-cluster.ebi.ac.uk) *is being expelled because of an expired lease.* Pings sent: 60. Replies received: 60. In example 1 seems that an NSD was not repliyng to the client, but the servers seems working fine.. how can i trace better ( to solve) the problem? 
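One thing I could try, to trace example 1 further, is to capture the GPFS traffic between the client and gss02b while the problem builds up. Assuming the daemons are on the default GPFS TCP port 1191 (tscTcpPort, if it has not been changed here) and that the traffic goes over the bonded interface (bond0 is an assumption), something like:

  tcpdump -i bond0 -s 0 -w /tmp/ebi5-220_gss02b.pcap 'host gss02b and tcp port 1191'

run on the client before the expel should at least show whether the reply from gss02b ever makes it onto the wire.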
In example 2, it seems to me that for some reason the manager is not renewing the lease in time. When this happens, it is not a single client: loads of them fail to get the lease renewed. Why is this happening? How can I trace it back to the source of the problem?

Thanks in advance for any tips.

Regards,
Salvatore
-------------- next part --------------
An HTML attachment was scrubbed...
URL:
From sdinardo at ebi.ac.uk Wed Aug 20 09:03:03 2014 From: sdinardo at ebi.ac.uk (Salvatore Di Nardo) Date: Wed, 20 Aug 2014 09:03:03 +0100 Subject: [gpfsug-discuss] gpfs client expels In-Reply-To: <53F454E3.40803@ebi.ac.uk> References: <53EE0BB1.8000005@ebi.ac.uk> <53F454E3.40803@ebi.ac.uk> Message-ID: <53F45637.8080000@ebi.ac.uk>
Another interesting case about a specific waiter: I was looking at the waiters on GSS until I found these (I got this info by collecting it from all the servers with a script I wrote, so I was able to trace the hanging connections while they were happening):

gss03b.ebi.ac.uk:*235.373993397*(MsgRecordCondvar), reason 'RPC wait' for getData on node 10.7.37.109
gss03b.ebi.ac.uk:*235.152271998*(MsgRecordCondvar), reason 'RPC wait' for getData on node 10.7.37.109
gss02a.ebi.ac.uk:*214.079093620 *(MsgRecordCondvar), reason 'RPC wait' for tmMsgRevoke on node 10.7.34.109
gss02a.ebi.ac.uk:*213.580199240 *(MsgRecordCondvar), reason 'RPC wait' for tmMsgRevoke on node 10.7.37.109
gss03b.ebi.ac.uk:*132.375138082*(MsgRecordCondvar), reason 'RPC wait' for getData on node 10.7.37.109
gss03b.ebi.ac.uk:*132.374973884 *(MsgRecordCondvar), reason 'RPC wait' for commMsgCheckMessages on node 10.7.37.109

The bolded numbers are seconds. I was looking at this page:

https://www.ibm.com/developerworks/community/wikis/home?lang=en#!/wiki/General+Parallel+File+System+%28GPFS%29/page/Interpreting+GPFS+Waiter+Information

The web page claims this is probably network congestion, but I managed to log in to the client quickly enough, and there the waiters were:

[root at ebi5-236 ~]# mmdiag --waiters
=== mmdiag: waiters ===
0x7F6690073460 waiting 147.973009173 seconds, RangeRevokeWorkerThread: on ThCond 0x1801E43F6A0 (0xFFFFC9001E43F6A0) (LkObjCondvar), reason 'waiting for LX lock'
0x7F65100036D0 waiting 140.458589856 seconds, WritebehindWorkerThread: on ThCond 0x7F6500000F98 (0x7F6500000F98) (MsgRecordCondvar), reason 'RPC wait' for NSD I/O completion on node 10.7.28.35
0x7F63A0001080 waiting 245.153055801 seconds,
WritebehindWorkerThread: on ThCond 0x7F65440018F8 (0x7F65440018F8) (MsgRecordCondvar), reason 'RPC wait' for NSD I/O completion on node 10.7.28.35 0x7F674C0291E0 waiting 247.131569232 seconds, PrefetchWorkerThread: on ThCond 0x7F65740016C8 (0x7F65740016C8) (MsgRecordCondvar), reason 'RPC wait' for NSD I/O completion on node 10.7.28.35 0x7F6748025BD0 waiting 11.631381523 seconds, replyCleanupThread: on ThCond 0x7F65E000A1F8 (0x7F65E000A1F8) (MsgRecordCondvar), reason 'RPC wait' 0x7F6748022300 waiting 245.616267612 seconds, WritebehindWorkerThread: on ThCond 0x7F6470001468 (0x7F6470001468) (MsgRecordCondvar), reason 'RPC wait' for NSD I/O completion on node 10.7.28.35 0x7F6748021010 waiting 230.769670930 seconds, InodeAllocRevokeWorkerThread: on ThCond 0x7F64880079E8 (0x7F64880079E8) (LogFileBufferDescriptorCondvar), reason 'force wait for buffer write to complete' 0x7F674801B160 waiting 245.830554594 seconds, UnusedInodePrefetchThread: on ThCond 0x7F65B8004438 (0x7F65B8004438) (MsgRecordCondvar), reason 'RPC wait' for NSD I/O completion on node 10.7.28.35 0x7F674800A820 waiting 252.332932000 seconds, Msg handler getData: for poll on sock 109 0x7F63F4023090 waiting 253.073535042 seconds, WritebehindWorkerThread: on ThCond 0x7F65C4000CC8 (0x7F65C4000CC8) (MsgRecordCondvar), reason 'RPC wait' for NSD I/O completion on node 10.7.28.35 0x7F64A4000CE0 waiting 145.049659249 seconds, WritebehindWorkerThread: on ThCond 0x7F6560000A98 (0x7F6560000A98) (MsgRecordCondvar), reason 'RPC wait' for NSD I/O completion on node 10.7.28.35 0x7F6778006D00 waiting 142.124664264 seconds, WritebehindWorkerThread: on ThCond 0x7F63DC000C08 (0x7F63DC000C08) (MsgRecordCondvar), reason 'RPC wait' for NSD I/O completion on node 10.7.28.35 0x7F67780046D0 waiting 251.751439453 seconds, WritebehindWorkerThread: on ThCond 0x7F6454000A98 (0x7F6454000A98) (MsgRecordCondvar), reason 'RPC wait' for NSD I/O completion on node 10.7.28.35 0x7F67780E4B70 waiting 142.431051232 seconds, WritebehindWorkerThread: on ThCond 0x7F63C80010D8 (0x7F63C80010D8) (MsgRecordCondvar), reason 'RPC wait' for NSD I/O completion on node 10.7.28.35 0x7F67780E50D0 waiting 244.339624817 seconds, WritebehindWorkerThread: on ThCond 0x7F65BC001B98 (0x7F65BC001B98) (MsgRecordCondvar), reason 'RPC wait' for NSD I/O completion on node 10.7.28.35 0x7F6434000B40 waiting 145.343700410 seconds, WritebehindWorkerThread: on ThCond 0x7F63B00036E8 (0x7F63B00036E8) (MsgRecordCondvar), reason 'RPC wait' for NSD I/O completion on node 10.7.28.35 0x7F670C0187A0 waiting 244.903963969 seconds, WritebehindWorkerThread: on ThCond 0x7F65F0000FB8 (0x7F65F0000FB8) (MsgRecordCondvar), reason 'RPC wait' for NSD I/O completion on node 10.7.28.35 0x7F671C04E2F0 waiting 245.837137631 seconds, PrefetchWorkerThread: on ThCond 0x7F65A4000A98 (0x7F65A4000A98) (MsgRecordCondvar), reason 'RPC wait' for NSD I/O completion on node 10.7.28.35 0x7F671C04AA20 waiting 139.713993908 seconds, WritebehindWorkerThread: on ThCond 0x7F6454002478 (0x7F6454002478) (MsgRecordCondvar), reason 'RPC wait' for NSD I/O completion on node 10.7.28.35 0x7F671C049730 waiting 252.434187472 seconds, WritebehindWorkerThread: on ThCond 0x7F65F4003708 (0x7F65F4003708) (MsgRecordCondvar), reason 'RPC wait' for NSD I/O completion on node 10.7.28.35 0x7F671C044B70 waiting 131.515829048 seconds, Msg handler ccMsgPing: on ThCond 0x7F64DC1D4888 (0x7F64DC1D4888) (InuseCondvar), reason 'waiting for exclusive use of connection for sending msg' 0x7F6758008DE0 waiting 149.548547226 seconds, Msg handler getData: on ThCond 
0x7F645C002458 (0x7F645C002458) (InuseCondvar), reason 'waiting for exclusive use of connection for sending msg' 0x7F67580071D0 waiting 149.548543118 seconds, Msg handler commMsgCheckMessages: on ThCond 0x7F6450001C48 (0x7F6450001C48) (InuseCondvar), reason 'waiting for exclusive use of connection for sending msg' 0x7F65A40052B0 waiting 11.498507001 seconds, Msg handler ccMsgPing: on ThCond 0x7F644C103F88 (0x7F644C103F88) (InuseCondvar), reason 'waiting for exclusive use of connection for sending msg' 0x7F6448001620 waiting 139.844870446 seconds, WritebehindWorkerThread: on ThCond 0x7F65F0003098 (0x7F65F0003098) (MsgRecordCondvar), reason 'RPC wait' for NSD I/O completion on node 10.7.28.35 0x7F63F4000F80 waiting 245.044791905 seconds, WritebehindWorkerThread: on ThCond 0x7F6450001188 (0x7F6450001188) (MsgRecordCondvar), reason 'RPC wait' for NSD I/O completion on node 10.7.28.35 0x7F659C0033A0 waiting 243.464399305 seconds, PrefetchWorkerThread: on ThCond 0x7F6554002598 (0x7F6554002598) (MsgRecordCondvar), reason 'RPC wait' for NSD I/O completion on node 10.7.28.35 0x7F6514001690 waiting 245.826160463 seconds, PrefetchWorkerThread: on ThCond 0x7F65A4004558 (0x7F65A4004558) (MsgRecordCondvar), reason 'RPC wait' for NSD I/O completion on node 10.7.28.35 0x7F64800012B0 waiting 253.174835511 seconds, WritebehindWorkerThread: on ThCond 0x7F65E0000FB8 (0x7F65E0000FB8) (MsgRecordCondvar), reason 'RPC wait' for NSD I/O completion on node 10.7.28.35 0x7F6510000EE0 waiting 140.746696039 seconds, WritebehindWorkerThread: on ThCond 0x7F647C000CC8 (0x7F647C000CC8) (MsgRecordCondvar), reason 'RPC wait' for NSD I/O completion on node 10.7.28.35 0x7F6754001BB0 waiting 246.336055629 seconds, PrefetchWorkerThread: on ThCond 0x7F6594002498 (0x7F6594002498) (MsgRecordCondvar), reason 'RPC wait' for NSD I/O completion on node 10.7.28.35 0x7F6420000930 waiting 140.606777450 seconds, WritebehindWorkerThread: on ThCond 0x7F6578002498 (0x7F6578002498) (MsgRecordCondvar), reason 'RPC wait' for NSD I/O completion on node 10.7.28.35 0x7F6744009110 waiting 137.466372831 seconds, FileBlockReadFetchHandlerThread: on ThCond 0x7F65F4007158 (0x7F65F4007158) (MsgRecordCondvar), reason 'RPC wait' for NSD I/O completion on node 10.7.28.35 0x7F67280119F0 waiting 144.173427360 seconds, WritebehindWorkerThread: on ThCond 0x7F6504000AE8 (0x7F6504000AE8) (MsgRecordCondvar), reason 'RPC wait' for NSD I/O completion on node 10.7.28.35 0x7F672800BB40 waiting 145.804301887 seconds, WritebehindWorkerThread: on ThCond 0x7F6550001038 (0x7F6550001038) (MsgRecordCondvar), reason 'RPC wait' for NSD I/O completion on node 10.7.28.35 0x7F6728000910 waiting 252.601993452 seconds, WritebehindWorkerThread: on ThCond 0x7F6450000A98 (0x7F6450000A98) (MsgRecordCondvar), reason 'RPC wait' for NSD I/O completion on node 10.7.28.35 0x7F6744007E20 waiting 251.603329204 seconds, WritebehindWorkerThread: on ThCond 0x7F6570004C18 (0x7F6570004C18) (MsgRecordCondvar), reason 'RPC wait' for NSD I/O completion on node 10.7.28.35 0x7F64AC002EF0 waiting 139.205774422 seconds, FileBlockWriteFetchHandlerThread: on ThCond 0x18020AF0260 (0xFFFFC90020AF0260) (FetchFlowControlCondvar), reason 'wait for buffer for fetch' 0x7F6724013050 waiting 71.501580932 seconds, Msg handler ccMsgPing: on ThCond 0x7F6580006608 (0x7F6580006608) (InuseCondvar), reason 'waiting for exclusive use of connection for sending msg' 0x7F661C000DA0 waiting 245.654985276 seconds, PrefetchWorkerThread: on ThCond 0x7F6570005288 (0x7F6570005288) (MsgRecordCondvar), reason 'RPC wait' for NSD I/O 
completion on node 10.7.28.35
0x7F671C00F440 waiting 251.096002003 seconds, FileBlockReadFetchHandlerThread: on ThCond 0x7F65BC002878 (0x7F65BC002878) (MsgRecordCondvar), reason 'RPC wait' for NSD I/O completion on node 10.7.28.35
0x7F671C00E150 waiting 144.034006970 seconds, WritebehindWorkerThread: on ThCond 0x7F6528001548 (0x7F6528001548) (MsgRecordCondvar), reason 'RPC wait' for NSD I/O completion on node 10.7.28.35
0x7F67A02FCD20 waiting 142.324070945 seconds, WritebehindWorkerThread: on ThCond 0x7F6580002A98 (0x7F6580002A98) (MsgRecordCondvar), reason 'RPC wait' for NSD I/O completion on node 10.7.28.35
0x7F67A02FA330 waiting 200.670114385 seconds, EEWatchDogThread: on ThCond 0x7F65B0000A98 (0x7F65B0000A98) (MsgRecordCondvar), reason 'RPC wait'
0x7F67A02BF050 waiting 252.276161189 seconds, WritebehindWorkerThread: on ThCond 0x7F6584003998 (0x7F6584003998) (MsgRecordCondvar), reason 'RPC wait' for NSD I/O completion on node 10.7.28.35
0x7F67A0004160 waiting 251.173651822 seconds, SyncHandlerThread: on ThCond 0x7F64880079E8 (0x7F64880079E8) (LogFileBufferDescriptorCondvar), reason 'force wait on force active buffer write'

So from the client side, it is the client that is waiting for the server. I also managed to ping, ssh, and tcpdump between the two before the node got expelled, and found that ping works fine, ssh works fine, and besides my own test traffic there are zero packets passing between them, LITERALLY. So there is no congestion and no network issue, but the server waits for the client and the client waits for the server. This goes on until we reach 350 seconds (10 times the lease time), and then the client gets expelled. There are no local I/O waiters that would indicate GSS is struggling, there is plenty of bandwidth and CPU available, and no network congestion.

It seems like some sort of deadlock to me, but how can this be explained and hopefully fixed?

Regards,
Salvatore
-------------- next part --------------
An HTML attachment was scrubbed...
URL:
From chair at gpfsug.org Thu Aug 21 09:20:39 2014 From: chair at gpfsug.org (Jez Tucker (Chair)) Date: Thu, 21 Aug 2014 09:20:39 +0100 Subject: [gpfsug-discuss] gpfs client expels In-Reply-To: <53F454E3.40803@ebi.ac.uk> References: <53EE0BB1.8000005@ebi.ac.uk> <53F454E3.40803@ebi.ac.uk> Message-ID: <53F5ABD7.80107@gpfsug.org>
Hi there,

I've seen this on several 'stock'? 'core'? GPFS systems (we need a better term now GSS is out) and seen ping 'working', but alongside ejections from the cluster. The GPFS internode 'ping' is somewhat more circumspect than unix ping - and rightly so.

In my experience this has _always_ been a network issue of one sort or another. If the network is experiencing issues, nodes will be ejected. Of course it could be an unresponsive mmfsd or high loadavg, but I've seen that only twice in 10 years over many versions of GPFS.

You need to follow the logs through from each machine in time order to determine who could not see whom and in what order. Your best way forward is to log a SEV2 case with IBM support, directly or via your OEM, and collect and supply a snap and traces as required by support.

Without knowing your full setup, it's hard to help further.

Jez

On 20/08/14 08:57, Salvatore Di Nardo wrote:
> Still problems.
Here some more detailed examples: > > *EXAMPLE 1:* > > *EBI5-220**( CLIENT)** > *Tue Aug 19 11:03:04.980 2014: *Timed out waiting for a > reply from node gss02b* > Tue Aug 19 11:03:04.981 2014: Request sent to > (gss02a in GSS.ebi.ac.uk) to expel (gss02b in > GSS.ebi.ac.uk) from cluster GSS.ebi.ac.uk > Tue Aug 19 11:03:04.982 2014: This node will be expelled > from cluster GSS.ebi.ac.uk due to expel msg from IP> (ebi5-220) > Tue Aug 19 11:03:09.319 2014: Cluster Manager connection > broke. Probing cluster GSS.ebi.ac.uk > Tue Aug 19 11:03:10.321 2014: Unable to contact any quorum > nodes during cluster probe. > Tue Aug 19 11:03:10.322 2014: Lost membership in cluster > GSS.ebi.ac.uk. Unmounting file systems. > Tue Aug 19 11:03:10 BST 2014: mmcommon preunmount > invoked. File system: gpfs1 Reason: SGPanic > Tue Aug 19 11:03:12.066 2014: Connecting to > gss02a > Tue Aug 19 11:03:12.070 2014: Connected to > gss02a > Tue Aug 19 11:03:17.071 2014: Connecting to > gss02b > Tue Aug 19 11:03:17.072 2014: Connecting to > gss03b > Tue Aug 19 11:03:17.079 2014: Connecting to > gss03a > Tue Aug 19 11:03:17.080 2014: Connecting to > gss01b > Tue Aug 19 11:03:17.079 2014: Connecting to > gss01a > Tue Aug 19 11:04:23.105 2014: Connected to > gss02b > Tue Aug 19 11:04:23.107 2014: Connected to > gss03b > Tue Aug 19 11:04:23.112 2014: Connected to > gss03a > Tue Aug 19 11:04:23.115 2014: Connected to > gss01b > Tue Aug 19 11:04:23.121 2014: Connected to > gss01a > Tue Aug 19 11:12:28.992 2014: Node (gss02a in > GSS.ebi.ac.uk) is now the Group Leader. > > *GSS02B ( NSD SERVER)* > ... > Tue Aug 19 11:03:17.070 2014: Killing connection from > ** because the group is not ready for it to > rejoin, err 46 > Tue Aug 19 11:03:25.016 2014: Killing connection from > because the group is not ready for it to > rejoin, err 46 > Tue Aug 19 11:03:28.080 2014: Killing connection from > ** because the group is not ready for it to > rejoin, err 46 > Tue Aug 19 11:03:36.019 2014: Killing connection from > because the group is not ready for it to > rejoin, err 46 > Tue Aug 19 11:03:39.083 2014: Killing connection from > ** because the group is not ready for it to > rejoin, err 46 > Tue Aug 19 11:03:47.023 2014: Killing connection from > because the group is not ready for it to > rejoin, err 46 > Tue Aug 19 11:03:50.088 2014: Killing connection from > ** because the group is not ready for it to > rejoin, err 46 > Tue Aug 19 11:03:52.218 2014: Killing connection from > because the group is not ready for it to > rejoin, err 46 > Tue Aug 19 11:03:58.030 2014: Killing connection from > because the group is not ready for it to > rejoin, err 46 > Tue Aug 19 11:04:01.092 2014: Killing connection from > ** because the group is not ready for it to > rejoin, err 46 > Tue Aug 19 11:04:03.220 2014: Killing connection from > because the group is not ready for it to > rejoin, err 46 > Tue Aug 19 11:04:09.034 2014: Killing connection from > because the group is not ready for it to > rejoin, err 46 > Tue Aug 19 11:04:12.096 2014: Killing connection from > ** because the group is not ready for it to > rejoin, err 46 > Tue Aug 19 11:04:14.224 2014: Killing connection from > because the group is not ready for it to > rejoin, err 46 > Tue Aug 19 11:04:20.037 2014: Killing connection from > because the group is not ready for it to > rejoin, err 46 > Tue Aug 19 11:04:23.103 2014: Accepted and connected to > ** ebi5-220 > ... 
>
> *GSS02a ( NSD SERVER)*
> Tue Aug 19 11:03:04.980 2014: Expel (gss02b) request from (ebi5-220 in ebi-cluster.ebi.ac.uk). Expelling: (ebi5-220 in ebi-cluster.ebi.ac.uk)
> Tue Aug 19 11:03:12.069 2014: Accepted and connected to ebi5-220
>
> ===============================================
> *EXAMPLE 2*:
>
> *EBI5-038*
> Tue Aug 19 11:32:34.227 2014: *Disk lease period expired in cluster GSS.ebi.ac.uk. Attempting to reacquire lease.*
> Tue Aug 19 11:33:34.258 2014: *Lease is overdue. Probing cluster GSS.ebi.ac.uk*
> Tue Aug 19 11:35:24.265 2014: Close connection to gss02a (Connection reset by peer). Attempting reconnect.
> Tue Aug 19 11:35:24.865 2014: Close connection to ebi5-014 (Connection reset by peer). Attempting reconnect.
> ...
> LOT MORE RESETS BY PEER
> ...
> Tue Aug 19 11:35:25.096 2014: Close connection to ebi5-167 (Connection reset by peer). Attempting reconnect.
> Tue Aug 19 11:35:25.267 2014: Connecting to gss02a
> Tue Aug 19 11:35:25.268 2014: Close connection to gss02a (Connection failed because destination is still processing previous node failure)
> Tue Aug 19 11:35:26.267 2014: Retry connection to gss02a
> Tue Aug 19 11:35:26.268 2014: Close connection to gss02a (Connection failed because destination is still processing previous node failure)
> Tue Aug 19 11:36:24.276 2014: Unable to contact any quorum nodes during cluster probe.
> Tue Aug 19 11:36:24.277 2014: *Lost membership in cluster GSS.ebi.ac.uk. Unmounting file systems.*
>
> *GSS02a*
> Tue Aug 19 11:35:24.263 2014: Node (ebi5-038 in ebi-cluster.ebi.ac.uk) *is being expelled because of an expired lease.* Pings sent: 60. Replies received: 60.
>
> In example 1 seems that an NSD was not repliyng to the client, but the servers seems working fine.. how can i trace better ( to solve) the problem?
>
> In example 2 it seems to me that for some reason the manager are not renewing the lease in time. when this happens , its not a single client. Loads of them fail to get the lease renewed. Why this is happening? how can i trace to the source of the problem?
>
> Thanks in advance for any tips.
>
> Regards,
> Salvatore
>
> _______________________________________________
> gpfsug-discuss mailing list
> gpfsug-discuss at gpfsug.org
> http://gpfsug.org/mailman/listinfo/gpfsug-discuss
-------------- next part --------------
An HTML attachment was scrubbed...
URL:
From sdinardo at ebi.ac.uk Thu Aug 21 10:04:47 2014 From: sdinardo at ebi.ac.uk (Salvatore Di Nardo) Date: Thu, 21 Aug 2014 10:04:47 +0100 Subject: [gpfsug-discuss] gpfs client expels In-Reply-To: <53F5ABD7.80107@gpfsug.org> References: <53EE0BB1.8000005@ebi.ac.uk> <53F454E3.40803@ebi.ac.uk> <53F5ABD7.80107@gpfsug.org> Message-ID: <53F5B62F.1060305@ebi.ac.uk>
Thanks for the feedback, but we managed to find a scenario that excludes network problems. We have a file called */input_file/* of nearly 100GB. If from *client A* we do:

cat input_file >> output_file

it starts copying, and we see the waiters go up a bit (a few seconds), but then they flush back to 0, so we can say the copy proceeds well. If we now do the same from another client (or just another shell on the same client), *client B*:

cat input_file >> output_file

(in other words, we are trying to write to the same destination), all the waiters go up until one node gets expelled.
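Condensed, the reproduction and the way we watch it look roughly like this (paths simplified; the monitoring loop is just an example, not an exact transcript):

  # client A
  cat input_file >> output_file &
  # client B (or a second shell), started while the first copy is still running
  cat input_file >> output_file &
  # meanwhile, on the clients and on the GSS servers
  while true; do date; mmdiag --waiters | head -20; sleep 10; done

The waiters belonging to the second writer stop flushing back to 0 and keep growing until the expel happens, as described above.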
Now, while it is understandable that the destination file is locked by one of the "cat" processes, so the other has to wait (and since the file is BIG, it has to wait for a while), it is not understandable why this stops the lease renewal. Why doesn't it just return a timeout error on the copy instead of expelling the node? We can reproduce this every time, and since our users do operations like this on files over 100GB each, you can imagine the result.

As you can imagine, even if it is a bit silly to write to the same destination at the same time, it is also quite common: for example several writers dumping to the same log file while one of them writes for a long time, keeping the file locked. Our expels are not due to network congestion, but to a write attempt having to wait for another one. What I really don't understand is why such an extreme measure as an expel is taken just because a process is waiting "too much time".

I have a ticket opened with IBM for this and the issue is under investigation, but no luck so far.

Regards,
Salvatore

On 21/08/14 09:20, Jez Tucker (Chair) wrote:
> Hi there,
>
> I've seen the on several 'stock'? 'core'? GPFS system (we need a better term now GSS is out) and seen ping 'working', but alongside ejections from the cluster.
> The GPFS internode 'ping' is somewhat more circumspect than unix ping - and rightly so.
>
> In my experience this has _always_ been a network issue of one sort of another. If the network is experiencing issues, nodes will be ejected.
> Of course it could be unresponsive mmfsd or high loadavg, but I've seen that only twice in 10 years over many versions of GPFS.
>
> You need to follow the logs through from each machine in time order to determine who could not see who and in what order.
> Your best way forward is to log a SEV2 case with IBM support, directly or via your OEM and collect and supply a snap and traces as required by support.
>
> Without knowing your full setup, it's hard to help further.
>
> Jez
>
> On 20/08/14 08:57, Salvatore Di Nardo wrote:
>> Still problems. Here some more detailed examples:
>> [...]
-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From bbanister at jumptrading.com Thu Aug 21 13:48:38 2014
From: bbanister at jumptrading.com (Bryan Banister)
Date: Thu, 21 Aug 2014 12:48:38 +0000
Subject: [gpfsug-discuss] gpfs client expels
In-Reply-To: <53F5B62F.1060305@ebi.ac.uk>
References: <53EE0BB1.8000005@ebi.ac.uk> <53F454E3.40803@ebi.ac.uk> <53F5ABD7.80107@gpfsug.org>,<53F5B62F.1060305@ebi.ac.uk>
Message-ID: <21BC488F0AEA2245B2C3E83FC0B33DBB8263D9@CHI-EXCHANGEW2.w2k.jumptrading.com>

As I understand GPFS distributed locking semantics, GPFS will not allow one node to hold a write lock for a file indefinitely. Once Client B opens the file for writing it would have contacted the File System Manager to obtain the lock. The FS manager would have told Client B that Client A has the lock and that Client B would have to contact Client A and revoke the write lock token. If Client A does not respond to Client B's request to revoke the write token, then Client B will ask that Client A be expelled from the cluster for NOT adhering to the proper protocol for write lock contention. (See the attached GPFS_Token_Protocol.png diagram.)

Have you checked the communication path between the two clients at this point?

I could not follow the logs that you provided. You should definitely look at the exact sequence of log events on the two clients and the file system manager (as reported by mmlsmgr).
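Something along these lines, run on both clients and on the manager node around the time of the test, is usually enough to line the events up (a rough sketch only -- the commands and log path assume a default GPFS 3.5 install, adjust to your environment):

  # Which node is the file system manager for the affected file system?
  /usr/lpp/mmfs/bin/mmlsmgr gpfs1

  # Snapshot the long waiters and the tail of the GPFS log on each node
  /usr/lpp/mmfs/bin/mmdiag --waiters > /tmp/waiters.$(hostname).$(date +%s)
  tail -n 200 /var/adm/ras/mmfs.log.latest > /tmp/mmfslog.$(hostname).$(date +%s)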
Hope that helps,
-Bryan

________________________________
From: gpfsug-discuss-bounces at gpfsug.org [gpfsug-discuss-bounces at gpfsug.org] on behalf of Salvatore Di Nardo [sdinardo at ebi.ac.uk]
Sent: Thursday, August 21, 2014 4:04 AM
To: chair at gpfsug.org; gpfsug main discussion list
Subject: Re: [gpfsug-discuss] gpfs client expels

[...]
_______________________________________________
gpfsug-discuss mailing list
gpfsug-discuss at gpfsug.org
http://gpfsug.org/mailman/listinfo/gpfsug-discuss

________________________________

Note: This email is for the confidential use of the named addressee(s) only and may contain proprietary, confidential or privileged information. If you are not the intended recipient, you are hereby notified that any review, dissemination or copying of this email is strictly prohibited, and to please notify the sender immediately and destroy this email and any attachments. Email transmission cannot be guaranteed to be secure or error-free. The Company, therefore, does not make any guarantees as to the completeness or accuracy of this email or any attachments. This email is for informational purposes only and does not constitute a recommendation, offer, request or solicitation of any kind to buy, sell, subscribe, redeem or perform any type of transaction of a financial product.
-------------- next part --------------
An HTML attachment was scrubbed...
URL: 
-------------- next part --------------
A non-text attachment was scrubbed...
Name: GPFS_Token_Protocol.png
Type: image/png
Size: 249179 bytes
Desc: GPFS_Token_Protocol.png
URL: 

From jbernard at jumptrading.com Thu Aug 21 13:52:05 2014
From: jbernard at jumptrading.com (Jon Bernard)
Date: Thu, 21 Aug 2014 12:52:05 +0000
Subject: [gpfsug-discuss] gpfs client expels
In-Reply-To: <21BC488F0AEA2245B2C3E83FC0B33DBB8263D9@CHI-EXCHANGEW2.w2k.jumptrading.com>
References: <53EE0BB1.8000005@ebi.ac.uk> <53F454E3.40803@ebi.ac.uk> <53F5ABD7.80107@gpfsug.org>, <53F5B62F.1060305@ebi.ac.uk>, <21BC488F0AEA2245B2C3E83FC0B33DBB8263D9@CHI-EXCHANGEW2.w2k.jumptrading.com>
Message-ID: 

Where is that from?

On Aug 21, 2014, at 7:49, "Bryan Banister" > wrote:

[...]
_______________________________________________
gpfsug-discuss mailing list
gpfsug-discuss at gpfsug.org
http://gpfsug.org/mailman/listinfo/gpfsug-discuss
-------------- next part --------------
An HTML attachment was scrubbed...
URL: 
-------------- next part --------------
A non-text attachment was scrubbed...
Name: GPFS_Token_Protocol.png
Type: image/png
Size: 249179 bytes
Desc: GPFS_Token_Protocol.png
URL: 

From viccornell at gmail.com Thu Aug 21 14:03:14 2014
From: viccornell at gmail.com (Vic Cornell)
Date: Thu, 21 Aug 2014 14:03:14 +0100
Subject: [gpfsug-discuss] gpfs client expels
In-Reply-To: <53F5B62F.1060305@ebi.ac.uk>
References: <53EE0BB1.8000005@ebi.ac.uk> <53F454E3.40803@ebi.ac.uk> <53F5ABD7.80107@gpfsug.org> <53F5B62F.1060305@ebi.ac.uk>
Message-ID: <9B247872-CD75-4F86-A10E-33AAB6BD414A@gmail.com>

Hi Salvatore,

Are you using ethernet or infiniband as the GPFS interconnect to your clients?

If 10/40GbE - do you have a separate admin network?

I have seen behaviour similar to this where the storage traffic causes congestion and the "admin" traffic gets lost or delayed causing expels.

Vic

On 21 Aug 2014, at 10:04, Salvatore Di Nardo wrote:

> [...]
-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From sdinardo at ebi.ac.uk Thu Aug 21 14:04:59 2014
From: sdinardo at ebi.ac.uk (Salvatore Di Nardo)
Date: Thu, 21 Aug 2014 14:04:59 +0100
Subject: [gpfsug-discuss] gpfs client expels
In-Reply-To: <21BC488F0AEA2245B2C3E83FC0B33DBB8263D9@CHI-EXCHANGEW2.w2k.jumptrading.com>
References: <53EE0BB1.8000005@ebi.ac.uk> <53F454E3.40803@ebi.ac.uk> <53F5ABD7.80107@gpfsug.org>, <53F5B62F.1060305@ebi.ac.uk> <21BC488F0AEA2245B2C3E83FC0B33DBB8263D9@CHI-EXCHANGEW2.w2k.jumptrading.com>
Message-ID: <53F5EE7B.2080306@ebi.ac.uk>

Thanks for the info. It helps a bit in understanding what is going on, but I think you missed the point that Node A and Node B can also be the same machine. If, for instance, I run the two copies on the same machine, Client B cannot have problems contacting Client A, since they are the same machine. BTW, I did the same test using two separate clients and the result is the same. Nonetheless, your description has made me understand a bit better what is going on.

Regards,
Salvatore

On 21/08/14 13:48, Bryan Banister wrote:
> As I understand GPFS distributed locking semantics, GPFS will not allow one node to hold a write lock for a file indefinitely. Once Client B opens the file for writing it would have contacted the File System Manager to obtain the lock. The FS manager would have told Client B that Client A has the lock and that Client B would have to contact Client A and revoke the write lock token. If Client A does not respond to Client B's request to revoke the write token, then Client B will ask that Client A be expelled from the cluster for NOT adhering to the proper protocol for write lock contention.
>
> Have you checked the communication path between the two clients at this point?
>
> I could not follow the logs that you provided. You should definitely look at the exact sequence of log events on the two clients and the file system manager (as reported by mmlsmgr).
> > we have a file called */input_file/* of nearly 100GB: > > if from *client A* we do: > > cat input_file >> output_file > > it start copying.. and we see waiter goeg a bit up,secs but then they > flushes back to 0, so we xcan say that the copy proceed well... > > > if now we do the same from another client ( or just another shell on > the same client) *client B* : > > cat input_file >> output_file > > > ( in other words we are trying to write to the same destination) all > the waiters gets up until one node get expelled. > > > Now, while its understandable that the destination file is locked for > one of the "cat", so have to wait ( and since the file is BIG , have > to wait for a while), its not understandable why it stop the renewal > lease. > Why its doen't return just a timeout error on the copy instead to > expel the node? We can reproduce this every time, and since our users > to operations like this on files over 100GB each you can imagine the > result. > > > > As you can imagine even if its a bit silly to write at the same time > to the same destination, its also quite common if we want to dump to a > log file logs and for some reason one of the writers, write for a lot > of time keeping the file locked. > Our expels are not due to network congestion, but because a write > attempts have to wait another one. What i really dont understand is > why to take a so expreme mesure to expell jest because a process is > waiteing "to too much time". > > > I have ticket opened to IBM for this and the issue is under > investigation, but no luck so far.. > > Regards, > Salvatore > > > > On 21/08/14 09:20, Jez Tucker (Chair) wrote: >> Hi there, >> >> I've seen the on several 'stock'? 'core'? GPFS system (we need a >> better term now GSS is out) and seen ping 'working', but alongside >> ejections from the cluster. >> The GPFS internode 'ping' is somewhat more circumspect than unix ping >> - and rightly so. >> >> In my experience this has _always_ been a network issue of one sort >> of another. If the network is experiencing issues, nodes will be >> ejected. >> Of course it could be unresponsive mmfsd or high loadavg, but I've >> seen that only twice in 10 years over many versions of GPFS. >> >> You need to follow the logs through from each machine in time order >> to determine who could not see who and in what order. >> Your best way forward is to log a SEV2 case with IBM support, >> directly or via your OEM and collect and supply a snap and traces as >> required by support. >> >> Without knowing your full setup, it's hard to help further. >> >> Jez >> >> On 20/08/14 08:57, Salvatore Di Nardo wrote: >>> Still problems. Here some more detailed examples: >>> >>> *EXAMPLE 1:* >>> >>> *EBI5-220**( CLIENT)** >>> *Tue Aug 19 11:03:04.980 2014: *Timed out waiting for a >>> reply from node gss02b* >>> Tue Aug 19 11:03:04.981 2014: Request sent to >> IP> (gss02a in GSS.ebi.ac.uk) to expel >>> (gss02b in GSS.ebi.ac.uk) from cluster GSS.ebi.ac.uk >>> Tue Aug 19 11:03:04.982 2014: This node will be expelled >>> from cluster GSS.ebi.ac.uk due to expel msg from >>> (ebi5-220) >>> Tue Aug 19 11:03:09.319 2014: Cluster Manager connection >>> broke. Probing cluster GSS.ebi.ac.uk >>> Tue Aug 19 11:03:10.321 2014: Unable to contact any >>> quorum nodes during cluster probe. >>> Tue Aug 19 11:03:10.322 2014: Lost membership in cluster >>> GSS.ebi.ac.uk. Unmounting file systems. >>> Tue Aug 19 11:03:10 BST 2014: mmcommon preunmount >>> invoked. 
File system: gpfs1 Reason: SGPanic >>> Tue Aug 19 11:03:12.066 2014: Connecting to >>> gss02a >>> Tue Aug 19 11:03:12.070 2014: Connected to >>> gss02a >>> Tue Aug 19 11:03:17.071 2014: Connecting to >>> gss02b >>> Tue Aug 19 11:03:17.072 2014: Connecting to >>> gss03b >>> Tue Aug 19 11:03:17.079 2014: Connecting to >>> gss03a >>> Tue Aug 19 11:03:17.080 2014: Connecting to >>> gss01b >>> Tue Aug 19 11:03:17.079 2014: Connecting to >>> gss01a >>> Tue Aug 19 11:04:23.105 2014: Connected to >>> gss02b >>> Tue Aug 19 11:04:23.107 2014: Connected to >>> gss03b >>> Tue Aug 19 11:04:23.112 2014: Connected to >>> gss03a >>> Tue Aug 19 11:04:23.115 2014: Connected to >>> gss01b >>> Tue Aug 19 11:04:23.121 2014: Connected to >>> gss01a >>> Tue Aug 19 11:12:28.992 2014: Node (gss02a >>> in GSS.ebi.ac.uk) is now the Group Leader. >>> >>> *GSS02B ( NSD SERVER)* >>> ... >>> Tue Aug 19 11:03:17.070 2014: Killing connection from >>> ** because the group is not ready for it to >>> rejoin, err 46 >>> Tue Aug 19 11:03:25.016 2014: Killing connection from >>> because the group is not ready for it to >>> rejoin, err 46 >>> Tue Aug 19 11:03:28.080 2014: Killing connection from >>> ** because the group is not ready for it to >>> rejoin, err 46 >>> Tue Aug 19 11:03:36.019 2014: Killing connection from >>> because the group is not ready for it to >>> rejoin, err 46 >>> Tue Aug 19 11:03:39.083 2014: Killing connection from >>> ** because the group is not ready for it to >>> rejoin, err 46 >>> Tue Aug 19 11:03:47.023 2014: Killing connection from >>> because the group is not ready for it to >>> rejoin, err 46 >>> Tue Aug 19 11:03:50.088 2014: Killing connection from >>> ** because the group is not ready for it to >>> rejoin, err 46 >>> Tue Aug 19 11:03:52.218 2014: Killing connection from >>> because the group is not ready for it to >>> rejoin, err 46 >>> Tue Aug 19 11:03:58.030 2014: Killing connection from >>> because the group is not ready for it to >>> rejoin, err 46 >>> Tue Aug 19 11:04:01.092 2014: Killing connection from >>> ** because the group is not ready for it to >>> rejoin, err 46 >>> Tue Aug 19 11:04:03.220 2014: Killing connection from >>> because the group is not ready for it to >>> rejoin, err 46 >>> Tue Aug 19 11:04:09.034 2014: Killing connection from >>> because the group is not ready for it to >>> rejoin, err 46 >>> Tue Aug 19 11:04:12.096 2014: Killing connection from >>> ** because the group is not ready for it to >>> rejoin, err 46 >>> Tue Aug 19 11:04:14.224 2014: Killing connection from >>> because the group is not ready for it to >>> rejoin, err 46 >>> Tue Aug 19 11:04:20.037 2014: Killing connection from >>> because the group is not ready for it to >>> rejoin, err 46 >>> Tue Aug 19 11:04:23.103 2014: Accepted and connected to >>> ** ebi5-220 >>> ... >>> >>> *GSS02a ( NSD SERVER)* >>> Tue Aug 19 11:03:04.980 2014: Expel (gss02b) >>> request from (ebi5-220 in >>> ebi-cluster.ebi.ac.uk). Expelling: >>> (ebi5-220 in ebi-cluster.ebi.ac.uk) >>> Tue Aug 19 11:03:12.069 2014: Accepted and connected to >>> ebi5-220 >>> >>> >>> =============================================== >>> *EXAMPLE 2*: >>> >>> *EBI5-038* >>> Tue Aug 19 11:32:34.227 2014: *Disk lease period expired >>> in cluster GSS.ebi.ac.uk. Attempting to reacquire lease.* >>> Tue Aug 19 11:33:34.258 2014: *Lease is overdue. Probing >>> cluster GSS.ebi.ac.uk* >>> Tue Aug 19 11:35:24.265 2014: Close connection to >>> gss02a (Connection reset by peer). >>> Attempting reconnect. 
>>> Tue Aug 19 11:35:24.865 2014: Close connection to >>> ebi5-014 (Connection reset by >>> peer). Attempting reconnect. >>> ... >>> LOT MORE RESETS BY PEER >>> ... >>> Tue Aug 19 11:35:25.096 2014: Close connection to >>> ebi5-167 (Connection reset by >>> peer). Attempting reconnect. >>> Tue Aug 19 11:35:25.267 2014: Connecting to >>> gss02a >>> Tue Aug 19 11:35:25.268 2014: Close connection to >>> gss02a (Connection failed because >>> destination is still processing previous node failure) >>> Tue Aug 19 11:35:26.267 2014: Retry connection to >>> gss02a >>> Tue Aug 19 11:35:26.268 2014: Close connection to >>> gss02a (Connection failed because >>> destination is still processing previous node failure) >>> Tue Aug 19 11:36:24.276 2014: Unable to contact any >>> quorum nodes during cluster probe. >>> Tue Aug 19 11:36:24.277 2014: *Lost membership in >>> cluster GSS.ebi.ac.uk. Unmounting file systems.* >>> >>> *GSS02a* >>> Tue Aug 19 11:35:24.263 2014: Node >>> (ebi5-038 in ebi-cluster.ebi.ac.uk) *is being expelled >>> because of an expired lease.* Pings sent: 60. Replies >>> received: 60. >>> >>> >>> >>> >>> In example 1 seems that an NSD was not repliyng to the client, but >>> the servers seems working fine.. how can i trace better ( to solve) >>> the problem? >>> >>> In example 2 it seems to me that for some reason the manager are not >>> renewing the lease in time. when this happens , its not a single >>> client. >>> Loads of them fail to get the lease renewed. Why this is happening? >>> how can i trace to the source of the problem? >>> >>> >>> >>> Thanks in advance for any tips. >>> >>> Regards, >>> Salvatore >>> >>> >>> >>> >>> >>> >>> >>> >>> _______________________________________________ >>> gpfsug-discuss mailing list >>> gpfsug-discuss at gpfsug.org >>> http://gpfsug.org/mailman/listinfo/gpfsug-discuss >> >> >> >> _______________________________________________ >> gpfsug-discuss mailing list >> gpfsug-discuss at gpfsug.org >> http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > > ------------------------------------------------------------------------ > > Note: This email is for the confidential use of the named addressee(s) > only and may contain proprietary, confidential or privileged > information. If you are not the intended recipient, you are hereby > notified that any review, dissemination or copying of this email is > strictly prohibited, and to please notify the sender immediately and > destroy this email and any attachments. Email transmission cannot be > guaranteed to be secure or error-free. The Company, therefore, does > not make any guarantees as to the completeness or accuracy of this > email or any attachments. This email is for informational purposes > only and does not constitute a recommendation, offer, request or > solicitation of any kind to buy, sell, subscribe, redeem or perform > any type of transaction of a financial product. > > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at gpfsug.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... 
Name: not available
Type: image/png
Size: 249179 bytes
Desc: not available
URL: 

From sdinardo at ebi.ac.uk Thu Aug 21 14:18:19 2014
From: sdinardo at ebi.ac.uk (Salvatore Di Nardo)
Date: Thu, 21 Aug 2014 14:18:19 +0100
Subject: [gpfsug-discuss] gpfs client expels
In-Reply-To: <9B247872-CD75-4F86-A10E-33AAB6BD414A@gmail.com>
References: <53EE0BB1.8000005@ebi.ac.uk> <53F454E3.40803@ebi.ac.uk> <53F5ABD7.80107@gpfsug.org> <53F5B62F.1060305@ebi.ac.uk> <9B247872-CD75-4F86-A10E-33AAB6BD414A@gmail.com>
Message-ID: <53F5F19B.1010603@ebi.ac.uk>

This is an interesting point! We use ethernet (10G links on the clients), but we do not have a separate admin network. Could you explain this a bit further? The clients and the servers are on different subnets, so the packets are routed, and I do not see a practical way to separate them. The clients are blades in a chassis, so even if I create two interfaces they will physically use the same "cable" to reach the first switch. The clients (around 600 of them) are also spread across several subnets. I will forward this consideration to our network admins to see if we can work on a dedicated network.

Thanks for your tip.

Regards,
Salvatore

On 21/08/14 14:03, Vic Cornell wrote:
> Hi Salvatore,
>
> Are you using ethernet or infiniband as the GPFS interconnect to your clients?
>
> If 10/40GbE - do you have a separate admin network?
>
> I have seen behaviour similar to this where the storage traffic causes congestion and the "admin" traffic gets lost or delayed causing expels.
>
> Vic
>
> On 21 Aug 2014, at 10:04, Salvatore Di Nardo wrote:
>
>> [...]
>>
>> On 21/08/14 09:20, Jez Tucker (Chair) wrote:
>>> Hi there,
>>>
>>> I've seen the on several 'stock'? 'core'?
GPFS system (we need a >>> better term now GSS is out) and seen ping 'working', but alongside >>> ejections from the cluster. >>> The GPFS internode 'ping' is somewhat more circumspect than unix >>> ping - and rightly so. >>> >>> In my experience this has _always_ been a network issue of one sort >>> of another. If the network is experiencing issues, nodes will be >>> ejected. >>> Of course it could be unresponsive mmfsd or high loadavg, but I've >>> seen that only twice in 10 years over many versions of GPFS. >>> >>> You need to follow the logs through from each machine in time order >>> to determine who could not see who and in what order. >>> Your best way forward is to log a SEV2 case with IBM support, >>> directly or via your OEM and collect and supply a snap and traces as >>> required by support. >>> >>> Without knowing your full setup, it's hard to help further. >>> >>> Jez >>> >>> On 20/08/14 08:57, Salvatore Di Nardo wrote: >>>> Still problems. Here some more detailed examples: >>>> >>>> *EXAMPLE 1:* >>>> >>>> *EBI5-220**( CLIENT)** >>>> *Tue Aug 19 11:03:04.980 2014: *Timed out waiting for a >>>> reply from node gss02b* >>>> Tue Aug 19 11:03:04.981 2014: Request sent to >>> IP> (gss02a in GSS.ebi.ac.uk ) to >>>> expel (gss02b in GSS.ebi.ac.uk >>>> ) from cluster GSS.ebi.ac.uk >>>> >>>> Tue Aug 19 11:03:04.982 2014: This node will be >>>> expelled from cluster GSS.ebi.ac.uk >>>> due to expel msg from >>> IP> (ebi5-220) >>>> Tue Aug 19 11:03:09.319 2014: Cluster Manager >>>> connection broke. Probing cluster GSS.ebi.ac.uk >>>> >>>> Tue Aug 19 11:03:10.321 2014: Unable to contact any >>>> quorum nodes during cluster probe. >>>> Tue Aug 19 11:03:10.322 2014: Lost membership in >>>> cluster GSS.ebi.ac.uk . >>>> Unmounting file systems. >>>> Tue Aug 19 11:03:10 BST 2014: mmcommon preunmount >>>> invoked. File system: gpfs1 Reason: SGPanic >>>> Tue Aug 19 11:03:12.066 2014: Connecting to >>>> gss02a >>>> Tue Aug 19 11:03:12.070 2014: Connected to >>>> gss02a >>>> Tue Aug 19 11:03:17.071 2014: Connecting to >>>> gss02b >>>> Tue Aug 19 11:03:17.072 2014: Connecting to >>>> gss03b >>>> Tue Aug 19 11:03:17.079 2014: Connecting to >>>> gss03a >>>> Tue Aug 19 11:03:17.080 2014: Connecting to >>>> gss01b >>>> Tue Aug 19 11:03:17.079 2014: Connecting to >>>> gss01a >>>> Tue Aug 19 11:04:23.105 2014: Connected to >>>> gss02b >>>> Tue Aug 19 11:04:23.107 2014: Connected to >>>> gss03b >>>> Tue Aug 19 11:04:23.112 2014: Connected to >>>> gss03a >>>> Tue Aug 19 11:04:23.115 2014: Connected to >>>> gss01b >>>> Tue Aug 19 11:04:23.121 2014: Connected to >>>> gss01a >>>> Tue Aug 19 11:12:28.992 2014: Node (gss02a >>>> in GSS.ebi.ac.uk ) is now the >>>> Group Leader. >>>> >>>> *GSS02B ( NSD SERVER)* >>>> ... 
>>>> Tue Aug 19 11:03:17.070 2014: Killing connection from >>>> ** because the group is not ready for it >>>> to rejoin, err 46 >>>> Tue Aug 19 11:03:25.016 2014: Killing connection from >>>> because the group is not ready for it to >>>> rejoin, err 46 >>>> Tue Aug 19 11:03:28.080 2014: Killing connection from >>>> ** because the group is not ready for it >>>> to rejoin, err 46 >>>> Tue Aug 19 11:03:36.019 2014: Killing connection from >>>> because the group is not ready for it to >>>> rejoin, err 46 >>>> Tue Aug 19 11:03:39.083 2014: Killing connection from >>>> ** because the group is not ready for it >>>> to rejoin, err 46 >>>> Tue Aug 19 11:03:47.023 2014: Killing connection from >>>> because the group is not ready for it to >>>> rejoin, err 46 >>>> Tue Aug 19 11:03:50.088 2014: Killing connection from >>>> ** because the group is not ready for it >>>> to rejoin, err 46 >>>> Tue Aug 19 11:03:52.218 2014: Killing connection from >>>> because the group is not ready for it to >>>> rejoin, err 46 >>>> Tue Aug 19 11:03:58.030 2014: Killing connection from >>>> because the group is not ready for it to >>>> rejoin, err 46 >>>> Tue Aug 19 11:04:01.092 2014: Killing connection from >>>> ** because the group is not ready for it >>>> to rejoin, err 46 >>>> Tue Aug 19 11:04:03.220 2014: Killing connection from >>>> because the group is not ready for it to >>>> rejoin, err 46 >>>> Tue Aug 19 11:04:09.034 2014: Killing connection from >>>> because the group is not ready for it to >>>> rejoin, err 46 >>>> Tue Aug 19 11:04:12.096 2014: Killing connection from >>>> ** because the group is not ready for it >>>> to rejoin, err 46 >>>> Tue Aug 19 11:04:14.224 2014: Killing connection from >>>> because the group is not ready for it to >>>> rejoin, err 46 >>>> Tue Aug 19 11:04:20.037 2014: Killing connection from >>>> because the group is not ready for it to >>>> rejoin, err 46 >>>> Tue Aug 19 11:04:23.103 2014: Accepted and connected to >>>> ** ebi5-220 >>>> ... >>>> >>>> *GSS02a ( NSD SERVER)* >>>> Tue Aug 19 11:03:04.980 2014: Expel >>>> (gss02b) request from (ebi5-220 in >>>> ebi-cluster.ebi.ac.uk ). >>>> Expelling: (ebi5-220 in >>>> ebi-cluster.ebi.ac.uk ) >>>> Tue Aug 19 11:03:12.069 2014: Accepted and connected to >>>> ebi5-220 >>>> >>>> >>>> =============================================== >>>> *EXAMPLE 2*: >>>> >>>> *EBI5-038* >>>> Tue Aug 19 11:32:34.227 2014: *Disk lease period >>>> expired in cluster GSS.ebi.ac.uk >>>> . Attempting to reacquire lease.* >>>> Tue Aug 19 11:33:34.258 2014: *Lease is overdue. >>>> Probing cluster GSS.ebi.ac.uk * >>>> Tue Aug 19 11:35:24.265 2014: Close connection to >>>> gss02a (Connection reset by peer). >>>> Attempting reconnect. >>>> Tue Aug 19 11:35:24.865 2014: Close connection to >>>> ebi5-014 (Connection reset by >>>> peer). Attempting reconnect. >>>> ... >>>> LOT MORE RESETS BY PEER >>>> ... >>>> Tue Aug 19 11:35:25.096 2014: Close connection to >>>> ebi5-167 (Connection reset by >>>> peer). Attempting reconnect. >>>> Tue Aug 19 11:35:25.267 2014: Connecting to >>>> gss02a >>>> Tue Aug 19 11:35:25.268 2014: Close connection to >>>> gss02a (Connection failed because >>>> destination is still processing previous node failure) >>>> Tue Aug 19 11:35:26.267 2014: Retry connection to >>>> gss02a >>>> Tue Aug 19 11:35:26.268 2014: Close connection to >>>> gss02a (Connection failed because >>>> destination is still processing previous node failure) >>>> Tue Aug 19 11:36:24.276 2014: Unable to contact any >>>> quorum nodes during cluster probe. 
>>>> Tue Aug 19 11:36:24.277 2014: *Lost membership in >>>> cluster GSS.ebi.ac.uk . >>>> Unmounting file systems.* >>>> >>>> *GSS02a* >>>> Tue Aug 19 11:35:24.263 2014: Node >>>> (ebi5-038 in ebi-cluster.ebi.ac.uk >>>> ) *is being expelled >>>> because of an expired lease.* Pings sent: 60. Replies >>>> received: 60. >>>> >>>> >>>> >>>> In example 1 seems that an NSD was not repliyng to the client, but >>>> the servers seems working fine.. how can i trace better ( to solve) >>>> the problem? >>>> >>>> In example 2 it seems to me that for some reason the manager are >>>> not renewing the lease in time. when this happens , its not a >>>> single client. >>>> Loads of them fail to get the lease renewed. Why this is happening? >>>> how can i trace to the source of the problem? >>>> >>>> >>>> >>>> Thanks in advance for any tips. >>>> >>>> Regards, >>>> Salvatore >>>> >>>> >>>> >>>> >>>> >>>> >>>> >>>> >>>> _______________________________________________ >>>> gpfsug-discuss mailing list >>>> gpfsug-discuss atgpfsug.org >>>> http://gpfsug.org/mailman/listinfo/gpfsug-discuss >>> >>> >>> >>> _______________________________________________ >>> gpfsug-discuss mailing list >>> gpfsug-discuss atgpfsug.org >>> http://gpfsug.org/mailman/listinfo/gpfsug-discuss >> >> _______________________________________________ >> gpfsug-discuss mailing list >> gpfsug-discuss at gpfsug.org >> http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at gpfsug.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss
-------------- next part -------------- An HTML attachment was scrubbed... URL:
From service at metamodul.com Thu Aug 21 14:19:33 2014 From: service at metamodul.com (service at metamodul.com) Date: Thu, 21 Aug 2014 15:19:33 +0200 (CEST) Subject: [gpfsug-discuss] gpfs client expels In-Reply-To: <53F5B62F.1060305@ebi.ac.uk> References: <53EE0BB1.8000005@ebi.ac.uk> <53F454E3.40803@ebi.ac.uk> <53F5ABD7.80107@gpfsug.org> <53F5B62F.1060305@ebi.ac.uk> Message-ID: <1481989063.92260.1408627173332.open-xchange@oxbaltgw09.schlund.de>
> Now, while its understandable that the destination file is locked for one of > the "cat", so have to wait
If GPFS is POSIX compatible, I do not understand why one cat should completely block the other; on a standard FS you can "cat" from many sources to the same target. Of course the result is not predictable. From this point of view I would expect that both "cat" processes would start writing immediately, thus I would expect a GPFS bug. All IMHO.
Hajo
Note: You might test with the input_file in a different directory, and I would also test the behaviour when the output_file is on a local FS like /tmp.
-------------- next part -------------- An HTML attachment was scrubbed... URL:
From viccornell at gmail.com Thu Aug 21 14:22:22 2014 From: viccornell at gmail.com (Vic Cornell) Date: Thu, 21 Aug 2014 14:22:22 +0100 Subject: [gpfsug-discuss] gpfs client expels In-Reply-To: <53F5F19B.1010603@ebi.ac.uk> References: <53EE0BB1.8000005@ebi.ac.uk> <53F454E3.40803@ebi.ac.uk> <53F5ABD7.80107@gpfsug.org> <53F5B62F.1060305@ebi.ac.uk> <9B247872-CD75-4F86-A10E-33AAB6BD414A@gmail.com> <53F5F19B.1010603@ebi.ac.uk> Message-ID: <0F03996A-2008-4076-9A2B-B4B2BB89E959@gmail.com>
For my system I always use a dedicated admin network - as described in the gpfs manuals - for a gpfs cluster on 10/40GbE where the system will be heavily loaded. The difference in the stability of the system is very noticeable.
Not sure how/if this would work on GSS - IBM ought to know :-) Vic On 21 Aug 2014, at 14:18, Salvatore Di Nardo wrote: > This is an interesting point! > > We use ethernet ( 10g links on the clients) but we dont have a separate network for the admin network. > > Could you explain this a bit further, because the clients and the servers we have are on different subnet so the packet are routed.. I don't see a practical way to separate them. The clients are blades in a chassis so even if i create 2 interfaces, they will physically use the came "cable" to go to the first switch. even the clients ( 600 clients) have different subsets. > > I will forward this consideration to our network admin , so see if we can work on a dedicated network. > > thanks for your tip. > > Regards, > Salvatore > > > > > On 21/08/14 14:03, Vic Cornell wrote: >> Hi Salvatore, >> >> Are you using ethernet or infiniband as the GPFS interconnect to your clients? >> >> If 10/40GbE - do you have a separate admin network? >> >> I have seen behaviour similar to this where the storage traffic causes congestion and the "admin" traffic gets lost or delayed causing expels. >> >> Vic >> >> >> >> On 21 Aug 2014, at 10:04, Salvatore Di Nardo wrote: >> >>> Thanks for the feedback, but we managed to find a scenario that excludes network problems. >>> >>> we have a file called input_file of nearly 100GB: >>> >>> if from client A we do: >>> >>> cat input_file >> output_file >>> >>> it start copying.. and we see waiter goeg a bit up,secs but then they flushes back to 0, so we xcan say that the copy proceed well... >>> >>> >>> if now we do the same from another client ( or just another shell on the same client) client B : >>> >>> cat input_file >> output_file >>> >>> >>> ( in other words we are trying to write to the same destination) all the waiters gets up until one node get expelled. >>> >>> >>> Now, while its understandable that the destination file is locked for one of the "cat", so have to wait ( and since the file is BIG , have to wait for a while), its not understandable why it stop the renewal lease. >>> Why its doen't return just a timeout error on the copy instead to expel the node? We can reproduce this every time, and since our users to operations like this on files over 100GB each you can imagine the result. >>> >>> >>> >>> As you can imagine even if its a bit silly to write at the same time to the same destination, its also quite common if we want to dump to a log file logs and for some reason one of the writers, write for a lot of time keeping the file locked. >>> Our expels are not due to network congestion, but because a write attempts have to wait another one. What i really dont understand is why to take a so expreme mesure to expell jest because a process is waiteing "to too much time". >>> >>> >>> I have ticket opened to IBM for this and the issue is under investigation, but no luck so far.. >>> >>> Regards, >>> Salvatore >>> >>> >>> >>> On 21/08/14 09:20, Jez Tucker (Chair) wrote: >>>> Hi there, >>>> >>>> I've seen the on several 'stock'? 'core'? GPFS system (we need a better term now GSS is out) and seen ping 'working', but alongside ejections from the cluster. >>>> The GPFS internode 'ping' is somewhat more circumspect than unix ping - and rightly so. >>>> >>>> In my experience this has _always_ been a network issue of one sort of another. If the network is experiencing issues, nodes will be ejected. 
>>>> Of course it could be unresponsive mmfsd or high loadavg, but I've seen that only twice in 10 years over many versions of GPFS. >>>> >>>> You need to follow the logs through from each machine in time order to determine who could not see who and in what order. >>>> Your best way forward is to log a SEV2 case with IBM support, directly or via your OEM and collect and supply a snap and traces as required by support. >>>> >>>> Without knowing your full setup, it's hard to help further. >>>> >>>> Jez >>>> >>>> On 20/08/14 08:57, Salvatore Di Nardo wrote: >>>>> Still problems. Here some more detailed examples: >>>>> >>>>> EXAMPLE 1: >>>>> EBI5-220 ( CLIENT) >>>>> Tue Aug 19 11:03:04.980 2014: Timed out waiting for a reply from node gss02b >>>>> Tue Aug 19 11:03:04.981 2014: Request sent to (gss02a in GSS.ebi.ac.uk) to expel (gss02b in GSS.ebi.ac.uk) from cluster GSS.ebi.ac.uk >>>>> Tue Aug 19 11:03:04.982 2014: This node will be expelled from cluster GSS.ebi.ac.uk due to expel msg from (ebi5-220) >>>>> Tue Aug 19 11:03:09.319 2014: Cluster Manager connection broke. Probing cluster GSS.ebi.ac.uk >>>>> Tue Aug 19 11:03:10.321 2014: Unable to contact any quorum nodes during cluster probe. >>>>> Tue Aug 19 11:03:10.322 2014: Lost membership in cluster GSS.ebi.ac.uk. Unmounting file systems. >>>>> Tue Aug 19 11:03:10 BST 2014: mmcommon preunmount invoked. File system: gpfs1 Reason: SGPanic >>>>> Tue Aug 19 11:03:12.066 2014: Connecting to gss02a >>>>> Tue Aug 19 11:03:12.070 2014: Connected to gss02a >>>>> Tue Aug 19 11:03:17.071 2014: Connecting to gss02b >>>>> Tue Aug 19 11:03:17.072 2014: Connecting to gss03b >>>>> Tue Aug 19 11:03:17.079 2014: Connecting to gss03a >>>>> Tue Aug 19 11:03:17.080 2014: Connecting to gss01b >>>>> Tue Aug 19 11:03:17.079 2014: Connecting to gss01a >>>>> Tue Aug 19 11:04:23.105 2014: Connected to gss02b >>>>> Tue Aug 19 11:04:23.107 2014: Connected to gss03b >>>>> Tue Aug 19 11:04:23.112 2014: Connected to gss03a >>>>> Tue Aug 19 11:04:23.115 2014: Connected to gss01b >>>>> Tue Aug 19 11:04:23.121 2014: Connected to gss01a >>>>> Tue Aug 19 11:12:28.992 2014: Node (gss02a in GSS.ebi.ac.uk) is now the Group Leader. >>>>> >>>>> GSS02B ( NSD SERVER) >>>>> ... 
>>>>> Tue Aug 19 11:03:17.070 2014: Killing connection from because the group is not ready for it to rejoin, err 46 >>>>> Tue Aug 19 11:03:25.016 2014: Killing connection from because the group is not ready for it to rejoin, err 46 >>>>> Tue Aug 19 11:03:28.080 2014: Killing connection from because the group is not ready for it to rejoin, err 46 >>>>> Tue Aug 19 11:03:36.019 2014: Killing connection from because the group is not ready for it to rejoin, err 46 >>>>> Tue Aug 19 11:03:39.083 2014: Killing connection from because the group is not ready for it to rejoin, err 46 >>>>> Tue Aug 19 11:03:47.023 2014: Killing connection from because the group is not ready for it to rejoin, err 46 >>>>> Tue Aug 19 11:03:50.088 2014: Killing connection from because the group is not ready for it to rejoin, err 46 >>>>> Tue Aug 19 11:03:52.218 2014: Killing connection from because the group is not ready for it to rejoin, err 46 >>>>> Tue Aug 19 11:03:58.030 2014: Killing connection from because the group is not ready for it to rejoin, err 46 >>>>> Tue Aug 19 11:04:01.092 2014: Killing connection from because the group is not ready for it to rejoin, err 46 >>>>> Tue Aug 19 11:04:03.220 2014: Killing connection from because the group is not ready for it to rejoin, err 46 >>>>> Tue Aug 19 11:04:09.034 2014: Killing connection from because the group is not ready for it to rejoin, err 46 >>>>> Tue Aug 19 11:04:12.096 2014: Killing connection from because the group is not ready for it to rejoin, err 46 >>>>> Tue Aug 19 11:04:14.224 2014: Killing connection from because the group is not ready for it to rejoin, err 46 >>>>> Tue Aug 19 11:04:20.037 2014: Killing connection from because the group is not ready for it to rejoin, err 46 >>>>> Tue Aug 19 11:04:23.103 2014: Accepted and connected to ebi5-220 >>>>> ... >>>>> >>>>> GSS02a ( NSD SERVER) >>>>> Tue Aug 19 11:03:04.980 2014: Expel (gss02b) request from (ebi5-220 in ebi-cluster.ebi.ac.uk). Expelling: (ebi5-220 in ebi-cluster.ebi.ac.uk) >>>>> Tue Aug 19 11:03:12.069 2014: Accepted and connected to ebi5-220 >>>>> >>>>> >>>>> =============================================== >>>>> EXAMPLE 2: >>>>> >>>>> EBI5-038 >>>>> Tue Aug 19 11:32:34.227 2014: Disk lease period expired in cluster GSS.ebi.ac.uk. Attempting to reacquire lease. >>>>> Tue Aug 19 11:33:34.258 2014: Lease is overdue. Probing cluster GSS.ebi.ac.uk >>>>> Tue Aug 19 11:35:24.265 2014: Close connection to gss02a (Connection reset by peer). Attempting reconnect. >>>>> Tue Aug 19 11:35:24.865 2014: Close connection to ebi5-014 (Connection reset by peer). Attempting reconnect. >>>>> ... >>>>> LOT MORE RESETS BY PEER >>>>> ... >>>>> Tue Aug 19 11:35:25.096 2014: Close connection to ebi5-167 (Connection reset by peer). Attempting reconnect. >>>>> Tue Aug 19 11:35:25.267 2014: Connecting to gss02a >>>>> Tue Aug 19 11:35:25.268 2014: Close connection to gss02a (Connection failed because destination is still processing previous node failure) >>>>> Tue Aug 19 11:35:26.267 2014: Retry connection to gss02a >>>>> Tue Aug 19 11:35:26.268 2014: Close connection to gss02a (Connection failed because destination is still processing previous node failure) >>>>> Tue Aug 19 11:36:24.276 2014: Unable to contact any quorum nodes during cluster probe. >>>>> Tue Aug 19 11:36:24.277 2014: Lost membership in cluster GSS.ebi.ac.uk. Unmounting file systems. >>>>> >>>>> GSS02a >>>>> Tue Aug 19 11:35:24.263 2014: Node (ebi5-038 in ebi-cluster.ebi.ac.uk) is being expelled because of an expired lease. Pings sent: 60. 
Replies received: 60. >>>>> >>>>> >>>>> >>>>> In example 1 seems that an NSD was not repliyng to the client, but the servers seems working fine.. how can i trace better ( to solve) the problem? >>>>> >>>>> In example 2 it seems to me that for some reason the manager are not renewing the lease in time. when this happens , its not a single client. >>>>> Loads of them fail to get the lease renewed. Why this is happening? how can i trace to the source of the problem? >>>>> >>>>> >>>>> >>>>> Thanks in advance for any tips. >>>>> >>>>> Regards, >>>>> Salvatore >>>>> >>>>> >>>>> >>>>> >>>>> >>>>> >>>>> >>>>> >>>>> _______________________________________________ >>>>> gpfsug-discuss mailing list >>>>> gpfsug-discuss at gpfsug.org >>>>> http://gpfsug.org/mailman/listinfo/gpfsug-discuss >>>> >>>> >>>> >>>> _______________________________________________ >>>> gpfsug-discuss mailing list >>>> gpfsug-discuss at gpfsug.org >>>> http://gpfsug.org/mailman/listinfo/gpfsug-discuss >>> >>> _______________________________________________ >>> gpfsug-discuss mailing list >>> gpfsug-discuss at gpfsug.org >>> http://gpfsug.org/mailman/listinfo/gpfsug-discuss >> >> >> >> _______________________________________________ >> gpfsug-discuss mailing list >> gpfsug-discuss at gpfsug.org >> http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at gpfsug.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss
-------------- next part -------------- An HTML attachment was scrubbed... URL:
From sdinardo at ebi.ac.uk Fri Aug 22 10:37:42 2014 From: sdinardo at ebi.ac.uk (Salvatore Di Nardo) Date: Fri, 22 Aug 2014 10:37:42 +0100 Subject: [gpfsug-discuss] gpfs client expels, fs hangind and waiters In-Reply-To: <53EE0BB1.8000005@ebi.ac.uk> References: <53EE0BB1.8000005@ebi.ac.uk> Message-ID: <53F70F66.2010405@ebi.ac.uk>
Hello everyone,
Just to let you know, we found the cause of our problems. We discovered that not all of the recommended kernel settings were configured on the clients ( on the servers everything was ok, but the clients had some settings missing ). IBM support pointed us to this document, which describes our issues perfectly; the fix it suggests raises some parameters even higher than the standard "best practice": http://www-947.ibm.com/support/entry/portal/docdisplay?lndocid=migr-5091222
Thanks to everyone for the replies.
Regards, Salvatore
From ewahl at osc.edu Mon Aug 25 19:55:08 2014 From: ewahl at osc.edu (Ed Wahl) Date: Mon, 25 Aug 2014 18:55:08 +0000 Subject: [gpfsug-discuss] CNFS using NFS over RDMA? Message-ID:
Anyone out there doing CNFS with NFS over RDMA? Is this even possible? We have been delivering some CNFS services using TCP over IB, but that layer tends to have a large number of bugs all the time. I'd like to take a look at moving back down to verbs...
Ed Wahl OSC
-------------- next part -------------- An HTML attachment was scrubbed... URL:
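On the NFS-over-RDMA question: the kernel NFS server that CNFS drives can, in principle, be given an RDMA listener alongside its TCP one. The lines below are only a hedged sketch of the stock Linux interface, not a tested CNFS recipe; the module name (svcrdma), the port (20049) and, above all, whether the CNFS monitoring scripts keep such a listener when they restart nfsd are assumptions to verify against your own kernel and GPFS level.

  # Hedged sketch, untested with CNFS: add an RDMA listener to the in-kernel NFS server.
  # nfsd must already be running; svcrdma is the server-side transport (xprtrdma is the client side).
  modprobe svcrdma
  echo "rdma 20049" > /proc/fs/nfsd/portlist    # standard NFS/RDMA port, added alongside TCP
  cat /proc/fs/nfsd/portlist                    # confirm the rdma listener registered

  # A client mount against that listener would look something like this (cnfs-ip is a placeholder):
  # mount -o rdma,port=20049 cnfs-ip:/gpfs /mnt/gpfs

Whether CNFS failover moves that listener together with the rest of its IP takeover is exactly the open question above, so treat this as something to try on a test cluster rather than as production guidance.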