From bdeluca at gmail.com Wed Apr 3 10:57:05 2013 From: bdeluca at gmail.com (Ben De Luca) Date: Wed, 3 Apr 2013 10:57:05 +0100 Subject: [gpfsug-discuss] mmbackup and management classes Message-ID: Hi gpfsusers, My first post to the list, Hi! We use TSM for our backups of our GPFS filesystems, and we are looking at using the mmbackup script for launching our backups. From conversations with other people we hear that support for management classes may not be completely available in mmbackup? I wondered if anyone could comment on using mmbackup, and what is and is not supported. Any gotchas? -bd -------------- next part -------------- An HTML attachment was scrubbed... URL: From AHMADYH at sa.ibm.com Wed Apr 3 13:04:47 2013 From: AHMADYH at sa.ibm.com (Ahmad Y Hussein) Date: Wed, 3 Apr 2013 16:04:47 +0400 Subject: [gpfsug-discuss] AUTO: Ahmad Y Hussein is out of the office (returning 04/08/2013) Message-ID: I am out of the office until 04/08/2013. Dear Sender; I am in a customer engagement with extremely limited email access, I will respond to your emails as soon as I can. For urgent cases please call me on my mobile (+966542001289). Thank you for understanding. Regards; Ahmad Y Hussein Note: This is an automated response to your message "gpfsug-discuss Digest, Vol 16, Issue 1" sent on 03/04/2013 15:00:02. This is the only notification you will receive while this person is away. From chris_stone at uk.ibm.com Wed Apr 3 16:08:39 2013 From: chris_stone at uk.ibm.com (Chris Stone) Date: Wed, 3 Apr 2013 16:08:39 +0100 Subject: [gpfsug-discuss] AUTO: Chris Stone/UK/IBM is out of the office until 16/08/2004. (returning 11/04/2013) Message-ID: I am out of the office until 11/04/2013. In an emergency please contact my manager Aniket Patel on: +44 (0) 7736 017 418 Note: This is an automated response to your message "[gpfsug-discuss] mmbackup and management classes" sent on 03/04/2013 10:57:05. This is the only notification you will receive while this person is away. From ANDREWD at uk.ibm.com Wed Apr 3 16:10:26 2013 From: ANDREWD at uk.ibm.com (Andrew Downes1) Date: Wed, 3 Apr 2013 16:10:26 +0100 Subject: [gpfsug-discuss] AUTO: Andrew Downes is out of the office (returning 08/04/2013) Message-ID: I am out of the office until 08/04/2013. If anything is too urgent to wait for my return please contact Matt Ayres mailto:m_ayres at uk.ibm.com 44-7710-981527 In case of urgency, please contact our manager Dave Shave-Wall mailto:dave_shavewall at uk.ibm.com 44-7740-921623 Note: This is an automated response to your message "gpfsug-discuss Digest, Vol 16, Issue 1" sent on 03/04/2013 12:00:02. This is the only notification you will receive while this person is away. From ashish.thandavan at cs.ox.ac.uk Thu Apr 11 10:58:41 2013 From: ashish.thandavan at cs.ox.ac.uk (Ashish Thandavan) Date: Thu, 11 Apr 2013 10:58:41 +0100 Subject: [gpfsug-discuss] Register now: Spring GPFS User Group arranged In-Reply-To: References: Message-ID: <51668951.7040506@cs.ox.ac.uk> Dear Claire, I trust you are well! If there are any spaces left, could you please register me for the event? Thank you! Regards, Ash On 25/03/13 14:38, Claire Robson wrote: > > Dear All, > > The next meeting date is set for *Wednesday 24th April* and will be > taking place at the fantastic Dolby Studios in London (Dolby Europe > Limited, 4--6 Soho Square, London W1D 3PZ). > > *Getting to Dolby Europe Limited, Soho Square, London* > > Leave the Tottenham Court Road tube station by the South Oxford Street > exit [Exit 1]. > > Turn left onto Oxford Street.
> > After about 50m turn left into Soho Street. > > Turn right into Soho Square. > > 4-6 Soho Square is directly in front of you. > > Our tentative agenda is as follows: > > 10:30 Arrivals and refreshments > > 11:00 Introductions and committee updates > > Jez Tucker, Group Chair & Claire Robson, Group Secretary > > 11:05 GPFS OpenStack Integration > > Prasenhit Sarkar, IBM Almaden Research Labs > > GPFS FPO > > Dinesh Subhraveti, IBM Almaden Research Labs > > 11:45 SAMBA 4.0 & CTDB 2.0 > > Michael Adams, SAMBA Development Team > > 12:15 SAMBA & GPFS Integration > > Volker Lendecke, SAMBA Development Team > > 13:00 Lunch (Buffet provided) > > 14:00 GPFS Native RAID & LTFS > > Jim Roche, IBM > > 14:45 User Stories > > 15:45 Group discussion: Challenges, experiences and questions & > Committee matters > > Led by Jez Tucker, Group Chairperson > > 16:00 Close > > We will be starting at 11:00am and concluding at 4pm but some of the > speaker timings may alter slightly. I will be posting further details > on what the presentations cover over the coming week or so. > > We hope you can make it for what will be a really interesting day of > GPFS discussions. *Please register with me if you would like to > attend* -- registrations are based on a first come first served basis. > > Best regards, > > *Claire Robson* > > GPFS User Group Secreatry > > Tel: 0114 257 2200 > > Mob: 07508 033896 > > Fax: 0114 257 0022 > > Web: _www.gpfsug.org _ > > > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at gpfsug.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss -- ------------------------- Ashish Thandavan UNIX Support Computing Officer Department of Computer Science University of Oxford Wolfson Building Parks Road Oxford OX1 3QD Phone: 01865 610733 Email: ashish.thandavan at cs.ox.ac.uk -------------- next part -------------- An HTML attachment was scrubbed... URL: From orlando.richards at ed.ac.uk Fri Apr 12 13:37:52 2013 From: orlando.richards at ed.ac.uk (Orlando Richards) Date: Fri, 12 Apr 2013 13:37:52 +0100 Subject: [gpfsug-discuss] CTDB woes Message-ID: <51680020.4040509@ed.ac.uk> Hi folks, We've long been using CTDB and Samba for our NAS service, servicing ~500 users. We've been suffering from some problems with the CTDB performance over the last few weeks, likely triggered either by an upgrade of samba from 3.5 to 3.6 (and enabling of SMB2 as a result), or possibly by additional users coming on with a new workload. We run CTDB 1.0.114.4-1 (from sernet) and samba3-3.6.12-44 (again, from sernet). Before we roll back, we'd like to make sure we can't fix the problem and stick with Samba 3.6 (and we don't even know that a roll back would fix the issue). The symptoms are a complete freeze of the service for CIFS users for 10-60 seconds, and on the servers a corresponding spawning of large numbers of CTDB processes, which seem to be created in a "big bang", and then do what they do and exit in the subsequent 10-60 seconds. We also serve up NFS from the same ctdb-managed frontends, and GPFS from the cluster - and these are both fine throughout. This was happening 5-10 times per hour, not at exact intervals though. When we added a third node to the CTDB cluster, it "got worse", and when we dropped the CTDB cluster down to a single node and everything started behaving fine - which is where we are now. So, I've got a bunch of questions! - does anyone know why ctdb would be spawning these processes, and if there's anything we can do to stop it needing to do it? 
- has anyone done any more general performance / config optimisation of CTDB? And - more generally - does anyone else actually use ctdb/samba/gpfs on the scale of ~500 users or higher? If so - how do you find it? -- -- Dr Orlando Richards Information Services IT Infrastructure Division Unix Section Tel: 0131 650 4994 The University of Edinburgh is a charitable body, registered in Scotland, with registration number SC005336. From Tobias.Kuebler at sva.de Fri Apr 12 14:03:58 2013 From: Tobias.Kuebler at sva.de (Tobias.Kuebler at sva.de) Date: Fri, 12 Apr 2013 15:03:58 +0200 Subject: [gpfsug-discuss] AUTO: Tobias Kuebler is out of the office (returning Mon, 04/15/2013) Message-ID: I am out of the office from Thu, 04/11/2013 until Mon, 04/15/2013. Thank you for your message. Incoming e-mails will not be forwarded during my absence, but I will try to answer them as soon as possible after my return. For urgent matters, please contact your responsible sales representative. Note: This is an automated response to your message "[gpfsug-discuss] CTDB woes" sent on 12.04.2013 14:37:52. This is the only notification you will receive while this person is away. -------------- next part -------------- An HTML attachment was scrubbed... URL: From orlando.richards at ed.ac.uk Fri Apr 12 16:43:44 2013 From: orlando.richards at ed.ac.uk (Orlando Richards) Date: Fri, 12 Apr 2013 16:43:44 +0100 Subject: [gpfsug-discuss] CTDB woes In-Reply-To: References: <51680020.4040509@ed.ac.uk> Message-ID: <51682BB0.7010507@ed.ac.uk> On 12/04/13 15:43, Bob Cregan wrote: > Hi Orlando, > We use ctdb/samba for CIFS, and CNFS for NFS > (GPFS version 3.4.0-13) . Current versions are > > ctdb - 1.0.99 > samba 3.5.15 > > Both compiled from source. We have about 300+ users normally. > We have suspicions that 3.6 has put additional "chatter" into the ctdb database stream, which has pushed us over the edge. Barry Evans has found that the clustered locking databases, in particular, prove to be a scalability/usability limit for ctdb. > We have had no issues with this setup apart from CNFS which had 2 or 3 > bad moments over the last year . These have gone away since we have > fixed a bug with our 10G NIC drivers (emulex cards , kernel module > be2net) which lead to occasional dropped packets for jumbo frames. There > have been no issues with samba/ctdb > > The only comment I can make is that during initial investigations into > an upgrade of samba to 3.6.x we discovered that the 3.6 code would not > compile against ctdb 1.0.99 (compilation requires the ctdb source ) > with error messages like: > > configure: checking whether cluster support is available > checking for ctdb.h... yes > checking for ctdb_private.h... yes > checking for CTDB_CONTROL_TRANS3_COMMIT declaration... yes > checking for CTDB_CONTROL_SCHEDULE_FOR_DELETION declaration... no > configure: error: "cluster support not available: support for > SCHEDULE_FOR_DELETION control missing" > > > What occurs to me is that this message seems to indicate that it is > possible to run a ctdb version that is incompatible with samba 3.6. > That would imply that an upgrade to a higher version of ctdb might > help, of course it might not and make backing out harder. Certainly 1.0.114 builds fine - I've not tried 2.0, I'm too scared! The versioning in CTDB has proved hard for me to fathom... > > A compile against ctdb 2.0 works fine.
We will soon be running in this > upgrade, but I'm waiting to see what the samba people say at the UG > meeting first! > It has to be said - the timing is good! Cheers, Orlando > > Thanks > > Bob > > > On 12 April 2013 13:37, Orlando Richards > wrote: > > Hi folks, ac > > We've long been using CTDB and Samba for our NAS service, servicing > ~500 users. We've been suffering from some problems with the CTDB > performance over the last few weeks, likely triggered either by an > upgrade of samba from 3.5 to 3.6 (and enabling of SMB2 as a result), > or possibly by additional users coming on with a new workload. > > We run CTDB 1.0.114.4-1 (from sernet) and samba3-3.6.12-44 (again, > from sernet). Before we roll back, we'd like to make sure we can't > fix the problem and stick with Samba 3.6 (and we don't even know > that a roll back would fix the issue). > > The symptoms are a complete freeze of the service for CIFS users for > 10-60 seconds, and on the servers a corresponding spawning of large > numbers of CTDB processes, which seem to be created in a "big bang", > and then do what they do and exit in the subsequent 10-60 seconds. > > We also serve up NFS from the same ctdb-managed frontends, and GPFS > from the cluster - and these are both fine throughout. > > This was happening 5-10 times per hour, not at exact intervals > though. When we added a third node to the CTDB cluster, it "got > worse", and when we dropped the CTDB cluster down to a single node > and everything started behaving fine - which is where we are now. > > So, I've got a bunch of questions! > > - does anyone know why ctdb would be spawning these processes, and > if there's anything we can do to stop it needing to do it? > - has anyone done any more general performance / config > optimisation of CTDB? > > And - more generally - does anyone else actually use ctdb/samba/gpfs > on the scale of ~500 users or higher? If so - how do you find it? > > > -- > -- > Dr Orlando Richards > Information Services > IT Infrastructure Division > Unix Section > Tel: 0131 650 4994 > > The University of Edinburgh is a charitable body, registered in > Scotland, with registration number SC005336. > _________________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at gpfsug.org > http://gpfsug.org/mailman/__listinfo/gpfsug-discuss > > > > > > -- > > Bob Cregan > > Senior Storage Systems Administrator > > ACRC > > Bristol University > > Tel: +44 (0) 117 331 4406 > > skype: bobcregan > > Mobile: +44 (0) 7712388129 > -- -- Dr Orlando Richards Information Services IT Infrastructure Division Unix Section Tel: 0131 650 4994 The University of Edinburgh is a charitable body, registered in Scotland, with registration number SC005336. From viccornell at gmail.com Fri Apr 12 19:44:16 2013 From: viccornell at gmail.com (Vic Cornell) Date: Fri, 12 Apr 2013 19:44:16 +0100 Subject: [gpfsug-discuss] CTDB woes In-Reply-To: <51682BB0.7010507@ed.ac.uk> References: <51680020.4040509@ed.ac.uk> <51682BB0.7010507@ed.ac.uk> Message-ID: <271DA6EE-D64D-4DBC-9DFE-4335E55102D4@gmail.com> Have you tried putting the ctdb files onto a separate gpfs filesystem? Vic Cornell viccornell at gmail.com On 12 Apr 2013, at 16:43, Orlando Richards wrote: > On 12/04/13 15:43, Bob Cregan wrote: >> Hi Orlando, >> We use ctdb/samba for CIFS, and CNFS for NFS >> (GPFS version 3.4.0-13) . Current versions are >> >> ctdb - 1.0.99 >> samba 3.5.15 >> >> Both compiled from source. We have about 300+ users normally. 
>> > > We have suspicions that 3.6 has put additional "chatter" into the ctdb database stream, which has pushed us over the edge. Barry Evans has found that the clustered locking databases, in particular, prove to be a scalability/usability limit for ctdb. > > >> We have had no issues with this setup apart from CNFS which had 2 or 3 >> bad moments over the last year . These have gone away since we have >> fixed a bug with our 10G NIC drivers (emulex cards , kernel module >> be2net) which lead to occasional dropped packets for jumbo frames. There >> have been no issues with samba/ctdb >> >> The only comment I can make is that during initial investigations into >> an upgrade of samba to 3.6.x we discovered that the 3.6 code would not >> compile against ctdb 1.0.99 (compilation requires tthe ctdb source ) >> with error messages like: >> >> configure: checking whether cluster support is available >> checking for ctdb.h... yes >> checking for ctdb_private.h... yes >> checking for CTDB_CONTROL_TRANS3_COMMIT declaration... yes >> checking for CTDB_CONTROL_SCHEDULE_FOR_DELETION declaration... no >> configure: error: "cluster support not available: support for >> SCHEDULE_FOR_DELETION control missing" >> >> >> What occurs to me is that this message seems to indicate that it is >> possible to run a ctdb version that is incompatible with samba 3.6. >> That would imply that an upgrade to a higher version of ctdb might >> help, of course it might not and make backing out harder. > > Certainly 1.0.114 builds fine - I've not tried 2.0, I'm too scared! The versioning in CTDB has proved hard for me to fathom... > >> >> A compile against ctdb 2.0 works fine. We will soon be running in this >> upgrade, but I'm waiting to see what the samba people say at the UG >> meeting first! >> > > It has to be said - the timing is good! > Cheers, > Orlando > >> >> Thanks >> >> Bob >> >> >> On 12 April 2013 13:37, Orlando Richards > > wrote: >> >> Hi folks, ac >> >> We've long been using CTDB and Samba for our NAS service, servicing >> ~500 users. We've been suffering from some problems with the CTDB >> performance over the last few weeks, likely triggered either by an >> upgrade of samba from 3.5 to 3.6 (and enabling of SMB2 as a result), >> or possibly by additional users coming on with a new workload. >> >> We run CTDB 1.0.114.4-1 (from sernet) and samba3-3.6.12-44 (again, >> from sernet). Before we roll back, we'd like to make sure we can't >> fix the problem and stick with Samba 3.6 (and we don't even know >> that a roll back would fix the issue). >> >> The symptoms are a complete freeze of the service for CIFS users for >> 10-60 seconds, and on the servers a corresponding spawning of large >> numbers of CTDB processes, which seem to be created in a "big bang", >> and then do what they do and exit in the subsequent 10-60 seconds. >> >> We also serve up NFS from the same ctdb-managed frontends, and GPFS >> from the cluster - and these are both fine throughout. >> >> This was happening 5-10 times per hour, not at exact intervals >> though. When we added a third node to the CTDB cluster, it "got >> worse", and when we dropped the CTDB cluster down to a single node >> and everything started behaving fine - which is where we are now. >> >> So, I've got a bunch of questions! >> >> - does anyone know why ctdb would be spawning these processes, and >> if there's anything we can do to stop it needing to do it? >> - has anyone done any more general performance / config >> optimisation of CTDB? 
>> >> And - more generally - does anyone else actually use ctdb/samba/gpfs >> on the scale of ~500 users or higher? If so - how do you find it? >> >> >> -- >> -- >> Dr Orlando Richards >> Information Services >> IT Infrastructure Division >> Unix Section >> Tel: 0131 650 4994 >> >> The University of Edinburgh is a charitable body, registered in >> Scotland, with registration number SC005336. >> _________________________________________________ >> gpfsug-discuss mailing list >> gpfsug-discuss at gpfsug.org >> http://gpfsug.org/mailman/__listinfo/gpfsug-discuss >> >> >> >> >> >> -- >> >> Bob Cregan >> >> Senior Storage Systems Administrator >> >> ACRC >> >> Bristol University >> >> Tel: +44 (0) 117 331 4406 >> >> skype: bobcregan >> >> Mobile: +44 (0) 7712388129 >> > > > -- > -- > Dr Orlando Richards > Information Services > IT Infrastructure Division > Unix Section > Tel: 0131 650 4994 > > The University of Edinburgh is a charitable body, registered in Scotland, with registration number SC005336. > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at gpfsug.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss From orlando.richards at ed.ac.uk Mon Apr 15 10:54:39 2013 From: orlando.richards at ed.ac.uk (Orlando Richards) Date: Mon, 15 Apr 2013 10:54:39 +0100 Subject: [gpfsug-discuss] CTDB woes In-Reply-To: <271DA6EE-D64D-4DBC-9DFE-4335E55102D4@gmail.com> References: <51680020.4040509@ed.ac.uk> <51682BB0.7010507@ed.ac.uk> <271DA6EE-D64D-4DBC-9DFE-4335E55102D4@gmail.com> Message-ID: <516BCE5F.8010309@ed.ac.uk> On 12/04/13 19:44, Vic Cornell wrote: > Have you tried putting the ctdb files onto a separate gpfs filesystem? No - but considered it. However, the only "live" CTDB file that sits on GPFS is the reclock file, which - I think - is only used as the heartbeat between nodes and for the recovery process. Now, there's mileage in insulating that, certainly, but I don't think that's what we're suffering from here. On a positive note - we took the steps this morning to re-initialise the ctdb databases from current data, and things seem to be stable today so far. Basically - shut down ctdb on all but one node. On all but that node, do: mv /var/ctdb/ /var/ctdb.save.date then start up ctdb on those nodes. Once they've come up, shut down ctdb on the last node, move /var/ctdb out the way, and restart. That brings them all up with freshly compacted databases. Also, from the samba-technical mailing list came the advice to use a more recent ctdb - specifically, 1.2.61. I've got that built and ready to go (and a rebuilt samba compiled against it too), but if things prove to be stable after today's compacting, then we will probably leave it at that and not deploy this. Interesting that 2.0 wasn't suggested for "stable", and that the current "dev" version is 2.1. For reference, here's the start of the thread: https://lists.samba.org/archive/samba-technical/2013-April/091525.html -- Orlando. > > On 12 Apr 2013, at 16:43, Orlando Richards wrote: > >> On 12/04/13 15:43, Bob Cregan wrote: >>> Hi Orlando, >>> We use ctdb/samba for CIFS, and CNFS for NFS >>> (GPFS version 3.4.0-13) . Current versions are >>> >>> ctdb - 1.0.99 >>> samba 3.5.15 >>> >>> Both compiled from source. We have about 300+ users normally. >>> >> >> We have suspicions that 3.6 has put additional "chatter" into the ctdb database stream, which has pushed us over the edge. Barry Evans has found that the clustered locking databases, in particular, prove to be a scalability/usability limit for ctdb. 
>> >> >>> We have had no issues with this setup apart from CNFS which had 2 or 3 >>> bad moments over the last year . These have gone away since we have >>> fixed a bug with our 10G NIC drivers (emulex cards , kernel module >>> be2net) which lead to occasional dropped packets for jumbo frames. There >>> have been no issues with samba/ctdb >>> >>> The only comment I can make is that during initial investigations into >>> an upgrade of samba to 3.6.x we discovered that the 3.6 code would not >>> compile against ctdb 1.0.99 (compilation requires tthe ctdb source ) >>> with error messages like: >>> >>> configure: checking whether cluster support is available >>> checking for ctdb.h... yes >>> checking for ctdb_private.h... yes >>> checking for CTDB_CONTROL_TRANS3_COMMIT declaration... yes >>> checking for CTDB_CONTROL_SCHEDULE_FOR_DELETION declaration... no >>> configure: error: "cluster support not available: support for >>> SCHEDULE_FOR_DELETION control missing" >>> >>> >>> What occurs to me is that this message seems to indicate that it is >>> possible to run a ctdb version that is incompatible with samba 3.6. >>> That would imply that an upgrade to a higher version of ctdb might >>> help, of course it might not and make backing out harder. >> >> Certainly 1.0.114 builds fine - I've not tried 2.0, I'm too scared! The versioning in CTDB has proved hard for me to fathom... >> >>> >>> A compile against ctdb 2.0 works fine. We will soon be running in this >>> upgrade, but I'm waiting to see what the samba people say at the UG >>> meeting first! >>> >> >> It has to be said - the timing is good! >> Cheers, >> Orlando >> >>> >>> Thanks >>> >>> Bob >>> >>> >>> On 12 April 2013 13:37, Orlando Richards >> > wrote: >>> >>> Hi folks, ac >>> >>> We've long been using CTDB and Samba for our NAS service, servicing >>> ~500 users. We've been suffering from some problems with the CTDB >>> performance over the last few weeks, likely triggered either by an >>> upgrade of samba from 3.5 to 3.6 (and enabling of SMB2 as a result), >>> or possibly by additional users coming on with a new workload. >>> >>> We run CTDB 1.0.114.4-1 (from sernet) and samba3-3.6.12-44 (again, >>> from sernet). Before we roll back, we'd like to make sure we can't >>> fix the problem and stick with Samba 3.6 (and we don't even know >>> that a roll back would fix the issue). >>> >>> The symptoms are a complete freeze of the service for CIFS users for >>> 10-60 seconds, and on the servers a corresponding spawning of large >>> numbers of CTDB processes, which seem to be created in a "big bang", >>> and then do what they do and exit in the subsequent 10-60 seconds. >>> >>> We also serve up NFS from the same ctdb-managed frontends, and GPFS >>> from the cluster - and these are both fine throughout. >>> >>> This was happening 5-10 times per hour, not at exact intervals >>> though. When we added a third node to the CTDB cluster, it "got >>> worse", and when we dropped the CTDB cluster down to a single node >>> and everything started behaving fine - which is where we are now. >>> >>> So, I've got a bunch of questions! >>> >>> - does anyone know why ctdb would be spawning these processes, and >>> if there's anything we can do to stop it needing to do it? >>> - has anyone done any more general performance / config >>> optimisation of CTDB? >>> >>> And - more generally - does anyone else actually use ctdb/samba/gpfs >>> on the scale of ~500 users or higher? If so - how do you find it? 
>>> >>> >>> -- >>> -- >>> Dr Orlando Richards >>> Information Services >>> IT Infrastructure Division >>> Unix Section >>> Tel: 0131 650 4994 >>> >>> The University of Edinburgh is a charitable body, registered in >>> Scotland, with registration number SC005336. >>> _________________________________________________ >>> gpfsug-discuss mailing list >>> gpfsug-discuss at gpfsug.org >>> http://gpfsug.org/mailman/__listinfo/gpfsug-discuss >>> >>> >>> >>> >>> >>> -- >>> >>> Bob Cregan >>> >>> Senior Storage Systems Administrator >>> >>> ACRC >>> >>> Bristol University >>> >>> Tel: +44 (0) 117 331 4406 >>> >>> skype: bobcregan >>> >>> Mobile: +44 (0) 7712388129 >>> >> >> >> -- >> -- >> Dr Orlando Richards >> Information Services >> IT Infrastructure Division >> Unix Section >> Tel: 0131 650 4994 >> >> The University of Edinburgh is a charitable body, registered in Scotland, with registration number SC005336. >> _______________________________________________ >> gpfsug-discuss mailing list >> gpfsug-discuss at gpfsug.org >> http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at gpfsug.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > -- -- Dr Orlando Richards Information Services IT Infrastructure Division Unix Section Tel: 0131 650 4994 The University of Edinburgh is a charitable body, registered in Scotland, with registration number SC005336. From crobson at ocf.co.uk Mon Apr 15 15:04:38 2013 From: crobson at ocf.co.uk (Claire Robson) Date: Mon, 15 Apr 2013 15:04:38 +0100 Subject: [gpfsug-discuss] Latest agenda and places still available Message-ID: Dear All, Thank you to those who have expressed an interest in next Wednesday's GPFS user group meeting in London and registered a place. There are a few places still available, please register with me if you would like to attend. This is the latest agenda for the day: 10:30 Arrivals and refreshments 11:00 Introductions and committee updates Jez Tucker, Group Chair & Claire Robson, Group Secretary 11:05 GPFS FPO Dinesh Subhraveti, IBM Almaden Research Labs 12:00 SAMBA 4.0 & CTDB 2.0 Michael Adams, SAMBA Development Team 13:00 Lunch (Buffet provided) 13:45 GPFS OpenStack Integration Dinesh Subhraveti, IBM Almaden Research Labs 14:15 SAMBA & GPFS Integration Volker Lendecke, SAMBA Development Team 15:15 Refreshments break 15:30 GPFS Native RAID & LTFS Jim Roche, IBM 16:00 Group discussion: Questions & Committee matters Led by Jez Tucker, Group Chairperson 16:05 Close I look forward to seeing many of you next week. Kind regards, Claire Robson GPFS user group Secetary Tel: 0114 257 2200 Mob: 07508 033896 Web: www.gpfsug.org -------------- next part -------------- An HTML attachment was scrubbed... URL: From AHMADYH at sa.ibm.com Tue Apr 16 13:08:58 2013 From: AHMADYH at sa.ibm.com (Ahmad Y Hussein) Date: Tue, 16 Apr 2013 16:08:58 +0400 Subject: [gpfsug-discuss] AUTO: Ahmad Y Hussein is out of the office (returning 04/29/2013) Message-ID: I am out of the office until 04/29/2013. Dear Sender; I am in a customer engagement with extremely limited email access, I will respond to your emails as soon as i can. For Urjent cases please call me on my mobile (+966542001289). Thank you for understanding. Regards; Ahmad Y Hussein Note: This is an automated response to your message "gpfsug-discuss Digest, Vol 16, Issue 6" sent on 16/04/2013 15:00:02. This is the only notification you will receive while this person is away. 
From orlando.richards at ed.ac.uk Wed Apr 17 11:30:32 2013 From: orlando.richards at ed.ac.uk (Orlando Richards) Date: Wed, 17 Apr 2013 11:30:32 +0100 Subject: [gpfsug-discuss] CTDB woes In-Reply-To: <516BCE5F.8010309@ed.ac.uk> References: <51680020.4040509@ed.ac.uk> <51682BB0.7010507@ed.ac.uk> <271DA6EE-D64D-4DBC-9DFE-4335E55102D4@gmail.com> <516BCE5F.8010309@ed.ac.uk> Message-ID: <516E79C8.8090603@ed.ac.uk> Hi All - an update to this, After re-initialising the databases on Monday, things did seem to be running better, but ultimately we got back to suffering from spikes in ctdb processes and corresponding "pauses" in service. We fell back to a single node again for Tuesday (and things were stable once again), and this morning rolled out CTDB 1.2.61 (plus a 3.6.12 samba which was rebuilt against CTDB 1.2.61 headers). Things seem to be stable for now - more so than on Monday. For the record - one metric I'm watching is the number of ctdb processes running (this would spike to > 1000 under the failure conditions). It's currently sitting consistently at 3 processes, with occasional blips of 5-7 processes. -- Orlando On 15/04/13 10:54, Orlando Richards wrote: > On 12/04/13 19:44, Vic Cornell wrote: >> Have you tried putting the ctdb files onto a separate gpfs filesystem? > > No - but considered it. However, the only "live" CTDB file that sits on > GPFS is the reclock file, which - I think - is only used as the > heartbeat between nodes and for the recovery process. Now, there's > mileage in insulating that, certainly, but I don't think that's what > we're suffering from here. > > On a positive note - we took the steps this morning to re-initialise the > ctdb databases from current data, and things seem to be stable today so > far. > > Basically - shut down ctdb on all but one node. On all but that node, do: > mv /var/ctdb/ /var/ctdb.save.date > > then start up ctdb on those nodes. Once they've come up, shut down ctdb > on the last node, move /var/ctdb out the way, and restart. That brings > them all up with freshly compacted databases. > > Also, from the samba-technical mailing list came the advice to use a > more recent ctdb - specifically, 1.2.61. I've got that built and ready > to go (and a rebuilt samba compiled against it too), but if things prove > to be stable after today's compacting, then we will probably leave it at > that and not deploy this. > > Interesting that 2.0 wasn't suggested for "stable", and that the current > "dev" version is 2.1. > > For reference, here's the start of the thread: > https://lists.samba.org/archive/samba-technical/2013-April/091525.html > > -- > Orlando. > > > >> >> On 12 Apr 2013, at 16:43, Orlando Richards >> wrote: >> >>> On 12/04/13 15:43, Bob Cregan wrote: >>>> Hi Orlando, >>>> We use ctdb/samba for CIFS, and CNFS for NFS >>>> (GPFS version 3.4.0-13) . Current versions are >>>> >>>> ctdb - 1.0.99 >>>> samba 3.5.15 >>>> >>>> Both compiled from source. We have about 300+ users normally. >>>> >>> >>> We have suspicions that 3.6 has put additional "chatter" into the >>> ctdb database stream, which has pushed us over the edge. Barry Evans >>> has found that the clustered locking databases, in particular, prove >>> to be a scalability/usability limit for ctdb. >>> >>> >>>> We have had no issues with this setup apart from CNFS which had 2 or 3 >>>> bad moments over the last year . These have gone away since we have >>>> fixed a bug with our 10G NIC drivers (emulex cards , kernel module >>>> be2net) which lead to occasional dropped packets for jumbo frames. 
>>>> There >>>> have been no issues with samba/ctdb >>>> >>>> The only comment I can make is that during initial investigations into >>>> an upgrade of samba to 3.6.x we discovered that the 3.6 code would not >>>> compile against ctdb 1.0.99 (compilation requires tthe ctdb source ) >>>> with error messages like: >>>> >>>> configure: checking whether cluster support is available >>>> checking for ctdb.h... yes >>>> checking for ctdb_private.h... yes >>>> checking for CTDB_CONTROL_TRANS3_COMMIT declaration... yes >>>> checking for CTDB_CONTROL_SCHEDULE_FOR_DELETION declaration... no >>>> configure: error: "cluster support not available: support for >>>> SCHEDULE_FOR_DELETION control missing" >>>> >>>> >>>> What occurs to me is that this message seems to indicate that it is >>>> possible to run a ctdb version that is incompatible with samba 3.6. >>>> That would imply that an upgrade to a higher version of ctdb might >>>> help, of course it might not and make backing out harder. >>> >>> Certainly 1.0.114 builds fine - I've not tried 2.0, I'm too scared! >>> The versioning in CTDB has proved hard for me to fathom... >>> >>>> >>>> A compile against ctdb 2.0 works fine. We will soon be running in this >>>> upgrade, but I'm waiting to see what the samba people say at the UG >>>> meeting first! >>>> >>> >>> It has to be said - the timing is good! >>> Cheers, >>> Orlando >>> >>>> >>>> Thanks >>>> >>>> Bob >>>> >>>> >>>> On 12 April 2013 13:37, Orlando Richards >>> > wrote: >>>> >>>> Hi folks, ac >>>> >>>> We've long been using CTDB and Samba for our NAS service, servicing >>>> ~500 users. We've been suffering from some problems with the CTDB >>>> performance over the last few weeks, likely triggered either by an >>>> upgrade of samba from 3.5 to 3.6 (and enabling of SMB2 as a >>>> result), >>>> or possibly by additional users coming on with a new workload. >>>> >>>> We run CTDB 1.0.114.4-1 (from sernet) and samba3-3.6.12-44 (again, >>>> from sernet). Before we roll back, we'd like to make sure we can't >>>> fix the problem and stick with Samba 3.6 (and we don't even know >>>> that a roll back would fix the issue). >>>> >>>> The symptoms are a complete freeze of the service for CIFS users >>>> for >>>> 10-60 seconds, and on the servers a corresponding spawning of large >>>> numbers of CTDB processes, which seem to be created in a "big >>>> bang", >>>> and then do what they do and exit in the subsequent 10-60 seconds. >>>> >>>> We also serve up NFS from the same ctdb-managed frontends, and GPFS >>>> from the cluster - and these are both fine throughout. >>>> >>>> This was happening 5-10 times per hour, not at exact intervals >>>> though. When we added a third node to the CTDB cluster, it "got >>>> worse", and when we dropped the CTDB cluster down to a single node >>>> and everything started behaving fine - which is where we are now. >>>> >>>> So, I've got a bunch of questions! >>>> >>>> - does anyone know why ctdb would be spawning these processes, >>>> and >>>> if there's anything we can do to stop it needing to do it? >>>> - has anyone done any more general performance / config >>>> optimisation of CTDB? >>>> >>>> And - more generally - does anyone else actually use >>>> ctdb/samba/gpfs >>>> on the scale of ~500 users or higher? If so - how do you find it? 
>>>> >>>> >>>> -- >>>> -- >>>> Dr Orlando Richards >>>> Information Services >>>> IT Infrastructure Division >>>> Unix Section >>>> Tel: 0131 650 4994 >>>> >>>> The University of Edinburgh is a charitable body, registered in >>>> Scotland, with registration number SC005336. >>>> _________________________________________________ >>>> gpfsug-discuss mailing list >>>> gpfsug-discuss at gpfsug.org >>>> http://gpfsug.org/mailman/__listinfo/gpfsug-discuss >>>> >>>> >>>> >>>> >>>> >>>> -- >>>> >>>> Bob Cregan >>>> >>>> Senior Storage Systems Administrator >>>> >>>> ACRC >>>> >>>> Bristol University >>>> >>>> Tel: +44 (0) 117 331 4406 >>>> >>>> skype: bobcregan >>>> >>>> Mobile: +44 (0) 7712388129 >>>> >>> >>> >>> -- >>> -- >>> Dr Orlando Richards >>> Information Services >>> IT Infrastructure Division >>> Unix Section >>> Tel: 0131 650 4994 >>> >>> The University of Edinburgh is a charitable body, registered in >>> Scotland, with registration number SC005336. >>> _______________________________________________ >>> gpfsug-discuss mailing list >>> gpfsug-discuss at gpfsug.org >>> http://gpfsug.org/mailman/listinfo/gpfsug-discuss >> >> _______________________________________________ >> gpfsug-discuss mailing list >> gpfsug-discuss at gpfsug.org >> http://gpfsug.org/mailman/listinfo/gpfsug-discuss >> > > -- -- Dr Orlando Richards Information Services IT Infrastructure Division Unix Section Tel: 0131 650 4994 The University of Edinburgh is a charitable body, registered in Scotland, with registration number SC005336. From orlando.richards at ed.ac.uk Mon Apr 22 15:52:55 2013 From: orlando.richards at ed.ac.uk (Orlando Richards) Date: Mon, 22 Apr 2013 15:52:55 +0100 Subject: [gpfsug-discuss] CTDB woes In-Reply-To: <516E79C8.8090603@ed.ac.uk> References: <51680020.4040509@ed.ac.uk> <51682BB0.7010507@ed.ac.uk> <271DA6EE-D64D-4DBC-9DFE-4335E55102D4@gmail.com> <516BCE5F.8010309@ed.ac.uk> <516E79C8.8090603@ed.ac.uk> Message-ID: <51754EC7.8000600@ed.ac.uk> On 17/04/13 11:30, Orlando Richards wrote: > Hi All - an update to this, > > After re-initialising the databases on Monday, things did seem to be > running better, but ultimately we got back to suffering from spikes in > ctdb processes and corresponding "pauses" in service. We fell back to a > single node again for Tuesday (and things were stable once again), and > this morning rolled out CTDB 1.2.61 (plus a 3.6.12 samba which was > rebuilt against CTDB 1.2.61 headers). > > Things seem to be stable for now - more so than on Monday. > > For the record - one metric I'm watching is the number of ctdb processes > running (this would spike to > 1000 under the failure conditions). It's > currently sitting consistently at 3 processes, with occasional blips of > 5-7 processes. > Hi all, Looks like things have been running fine since we upgraded ctdb last Wednesday, so I think it's safe to say that we've found a fix for our problem in CTDB 1.2.61. Thanks for all the input! If anyone wants more info, feel free to get in touch. -- Orlando > -- > Orlando > > > > > > On 15/04/13 10:54, Orlando Richards wrote: >> On 12/04/13 19:44, Vic Cornell wrote: >>> Have you tried putting the ctdb files onto a separate gpfs filesystem? >> >> No - but considered it. However, the only "live" CTDB file that sits on >> GPFS is the reclock file, which - I think - is only used as the >> heartbeat between nodes and for the recovery process. Now, there's >> mileage in insulating that, certainly, but I don't think that's what >> we're suffering from here. 
>> >> On a positive note - we took the steps this morning to re-initialise the >> ctdb databases from current data, and things seem to be stable today so >> far. >> >> Basically - shut down ctdb on all but one node. On all but that node, do: >> mv /var/ctdb/ /var/ctdb.save.date >> >> then start up ctdb on those nodes. Once they've come up, shut down ctdb >> on the last node, move /var/ctdb out the way, and restart. That brings >> them all up with freshly compacted databases. >> >> Also, from the samba-technical mailing list came the advice to use a >> more recent ctdb - specifically, 1.2.61. I've got that built and ready >> to go (and a rebuilt samba compiled against it too), but if things prove >> to be stable after today's compacting, then we will probably leave it at >> that and not deploy this. >> >> Interesting that 2.0 wasn't suggested for "stable", and that the current >> "dev" version is 2.1. >> >> For reference, here's the start of the thread: >> https://lists.samba.org/archive/samba-technical/2013-April/091525.html >> >> -- >> Orlando. >> >> >> >>> >>> On 12 Apr 2013, at 16:43, Orlando Richards >>> wrote: >>> >>>> On 12/04/13 15:43, Bob Cregan wrote: >>>>> Hi Orlando, >>>>> We use ctdb/samba for CIFS, and CNFS for NFS >>>>> (GPFS version 3.4.0-13) . Current versions are >>>>> >>>>> ctdb - 1.0.99 >>>>> samba 3.5.15 >>>>> >>>>> Both compiled from source. We have about 300+ users normally. >>>>> >>>> >>>> We have suspicions that 3.6 has put additional "chatter" into the >>>> ctdb database stream, which has pushed us over the edge. Barry Evans >>>> has found that the clustered locking databases, in particular, prove >>>> to be a scalability/usability limit for ctdb. >>>> >>>> >>>>> We have had no issues with this setup apart from CNFS which had 2 or 3 >>>>> bad moments over the last year . These have gone away since we have >>>>> fixed a bug with our 10G NIC drivers (emulex cards , kernel module >>>>> be2net) which lead to occasional dropped packets for jumbo frames. >>>>> There >>>>> have been no issues with samba/ctdb >>>>> >>>>> The only comment I can make is that during initial investigations into >>>>> an upgrade of samba to 3.6.x we discovered that the 3.6 code would not >>>>> compile against ctdb 1.0.99 (compilation requires tthe ctdb source ) >>>>> with error messages like: >>>>> >>>>> configure: checking whether cluster support is available >>>>> checking for ctdb.h... yes >>>>> checking for ctdb_private.h... yes >>>>> checking for CTDB_CONTROL_TRANS3_COMMIT declaration... yes >>>>> checking for CTDB_CONTROL_SCHEDULE_FOR_DELETION declaration... no >>>>> configure: error: "cluster support not available: support for >>>>> SCHEDULE_FOR_DELETION control missing" >>>>> >>>>> >>>>> What occurs to me is that this message seems to indicate that it is >>>>> possible to run a ctdb version that is incompatible with samba 3.6. >>>>> That would imply that an upgrade to a higher version of ctdb might >>>>> help, of course it might not and make backing out harder. >>>> >>>> Certainly 1.0.114 builds fine - I've not tried 2.0, I'm too scared! >>>> The versioning in CTDB has proved hard for me to fathom... >>>> >>>>> >>>>> A compile against ctdb 2.0 works fine. We will soon be running in this >>>>> upgrade, but I'm waiting to see what the samba people say at the UG >>>>> meeting first! >>>>> >>>> >>>> It has to be said - the timing is good! 
>>>> Cheers, >>>> Orlando >>>> >>>>> >>>>> Thanks >>>>> >>>>> Bob >>>>> >>>>> >>>>> On 12 April 2013 13:37, Orlando Richards >>>> > wrote: >>>>> >>>>> Hi folks, ac >>>>> >>>>> We've long been using CTDB and Samba for our NAS service, >>>>> servicing >>>>> ~500 users. We've been suffering from some problems with the CTDB >>>>> performance over the last few weeks, likely triggered either by an >>>>> upgrade of samba from 3.5 to 3.6 (and enabling of SMB2 as a >>>>> result), >>>>> or possibly by additional users coming on with a new workload. >>>>> >>>>> We run CTDB 1.0.114.4-1 (from sernet) and samba3-3.6.12-44 (again, >>>>> from sernet). Before we roll back, we'd like to make sure we can't >>>>> fix the problem and stick with Samba 3.6 (and we don't even know >>>>> that a roll back would fix the issue). >>>>> >>>>> The symptoms are a complete freeze of the service for CIFS users >>>>> for >>>>> 10-60 seconds, and on the servers a corresponding spawning of >>>>> large >>>>> numbers of CTDB processes, which seem to be created in a "big >>>>> bang", >>>>> and then do what they do and exit in the subsequent 10-60 seconds. >>>>> >>>>> We also serve up NFS from the same ctdb-managed frontends, and >>>>> GPFS >>>>> from the cluster - and these are both fine throughout. >>>>> >>>>> This was happening 5-10 times per hour, not at exact intervals >>>>> though. When we added a third node to the CTDB cluster, it "got >>>>> worse", and when we dropped the CTDB cluster down to a single node >>>>> and everything started behaving fine - which is where we are now. >>>>> >>>>> So, I've got a bunch of questions! >>>>> >>>>> - does anyone know why ctdb would be spawning these processes, >>>>> and >>>>> if there's anything we can do to stop it needing to do it? >>>>> - has anyone done any more general performance / config >>>>> optimisation of CTDB? >>>>> >>>>> And - more generally - does anyone else actually use >>>>> ctdb/samba/gpfs >>>>> on the scale of ~500 users or higher? If so - how do you find it? >>>>> >>>>> >>>>> -- >>>>> -- >>>>> Dr Orlando Richards >>>>> Information Services >>>>> IT Infrastructure Division >>>>> Unix Section >>>>> Tel: 0131 650 4994 >>>>> >>>>> The University of Edinburgh is a charitable body, registered in >>>>> Scotland, with registration number SC005336. >>>>> _________________________________________________ >>>>> gpfsug-discuss mailing list >>>>> gpfsug-discuss at gpfsug.org >>>>> http://gpfsug.org/mailman/__listinfo/gpfsug-discuss >>>>> >>>>> >>>>> >>>>> >>>>> >>>>> -- >>>>> >>>>> Bob Cregan >>>>> >>>>> Senior Storage Systems Administrator >>>>> >>>>> ACRC >>>>> >>>>> Bristol University >>>>> >>>>> Tel: +44 (0) 117 331 4406 >>>>> >>>>> skype: bobcregan >>>>> >>>>> Mobile: +44 (0) 7712388129 >>>>> >>>> >>>> >>>> -- >>>> -- >>>> Dr Orlando Richards >>>> Information Services >>>> IT Infrastructure Division >>>> Unix Section >>>> Tel: 0131 650 4994 >>>> >>>> The University of Edinburgh is a charitable body, registered in >>>> Scotland, with registration number SC005336. 
>>>> _______________________________________________ >>>> gpfsug-discuss mailing list >>>> gpfsug-discuss at gpfsug.org >>>> http://gpfsug.org/mailman/listinfo/gpfsug-discuss >>> >>> _______________________________________________ >>> gpfsug-discuss mailing list >>> gpfsug-discuss at gpfsug.org >>> http://gpfsug.org/mailman/listinfo/gpfsug-discuss >>> >> >> > > -- -- Dr Orlando Richards Information Services IT Infrastructure Division Unix Section Tel: 0131 650 4994 The University of Edinburgh is a charitable body, registered in Scotland, with registration number SC005336. From pete at realisestudio.com Thu Apr 25 10:38:07 2013 From: pete at realisestudio.com (Pete Smith) Date: Thu, 25 Apr 2013 10:38:07 +0100 Subject: [gpfsug-discuss] Test cluster - some questions Message-ID: Hi all Good to see lots of you at the user group meeting yesterday. Great work, Jez! We're setting up a test cluster here at Realise, with a view to moving our main storage over from Gluster. We're running the test cluster on Isilon hardware ... a couple of 1920 nodes that we were using for home dirs. Each node has dual gigabit ethernet ports, and dual infiniband ports. Single dual-core Xeon proc and and 4GB RAM. All good stuff and should make a nice test rig. I have a few questions! 1. We're on centos6.4.x86_64. What's the easiest way to go from 3.3.blah to 3.5? 2. I'm having trouble assigning NSDs. I have a descfile which looks like: #DiskName:PrimaryServer:BackupServer:DiskUsage:FailureGroup:DesiredName:StoragePool /dev/sdc1:gpfs001.realisestudio.com::dataAndMetadata:1 but the command "mmcrnsd -F /tmp/descfile -v no" just craps out with mmcrnsd: Processing disk sdc1 mmcrnsd: Node gpfs001.realisestudio.com does not have a GPFS server license designation. mmcrnsd: Error found while checking disk descriptor /dev/sdc1:gpfs001.realisestudio.com::dataAndMetadata:1 mmcrnsd: Command failed. Examine previous error messages to determine cause. Any help pointing me gently in the right direction would be much appreciated. :-) TIA -- Pete Smith DevOp/System Administrator Realise Studio 12/13 Poland Street, London W1F 8QB T. +44 (0)20 7165 9644 realisestudio.com -------------- next part -------------- An HTML attachment was scrubbed... URL: From orlando.richards at ed.ac.uk Thu Apr 25 10:48:30 2013 From: orlando.richards at ed.ac.uk (Orlando Richards) Date: Thu, 25 Apr 2013 10:48:30 +0100 Subject: [gpfsug-discuss] Test cluster - some questions In-Reply-To: References: Message-ID: <5178FBEE.4070200@ed.ac.uk> On 25/04/13 10:38, Pete Smith wrote: > Hi all > > Good to see lots of you at the user group meeting yesterday. Great work, > Jez! > > We're setting up a test cluster here at Realise, with a view to moving > our main storage over from Gluster. > > We're running the test cluster on Isilon hardware ... a couple of 1920 > nodes that we were using for home dirs. Each node has dual gigabit > ethernet ports, and dual infiniband ports. Single dual-core Xeon proc > and and 4GB RAM. All good stuff and should make a nice test rig. > > I have a few questions! > > 1. We're on centos6.4.x86_64. What's the easiest way to go from > 3.3.blah to 3.5? > 2. I'm having trouble assigning NSDs. 
I have a descfile which looks like: > > #DiskName:PrimaryServer:BackupServer:DiskUsage:FailureGroup:DesiredName:StoragePool > /dev/sdc1:gpfs001.realisestudio.com::dataAndMetadata:1 > > but the command > > "mmcrnsd -F /tmp/descfile -v no" > > just craps out with > > mmcrnsd: Processing disk sdc1 > mmcrnsd: Node gpfs001.realisestudio.com > does not have a GPFS server license > designation. > mmcrnsd: Error found while checking disk descriptor > /dev/sdc1:gpfs001.realisestudio.com::dataAndMetadata:1 > mmcrnsd: Command failed. Examine previous error messages to determine > cause. > mmchlicense server -N gpfs001.realisestudio.com should sort that one out. > Any help pointing me gently in the right direction would be much > appreciated. :-) > > TIA > > -- > Pete Smith > DevOp/System Administrator > Realise Studio > 12/13 Poland Street, London W1F 8QB > T. +44 (0)20 7165 9644 > > realisestudio.com > > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at gpfsug.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > -- -- Dr Orlando Richards Information Services IT Infrastructure Division Unix Section Tel: 0131 650 4994 The University of Edinburgh is a charitable body, registered in Scotland, with registration number SC005336. From pete at realisestudio.com Thu Apr 25 11:05:36 2013 From: pete at realisestudio.com (Pete Smith) Date: Thu, 25 Apr 2013 11:05:36 +0100 Subject: [gpfsug-discuss] Test cluster - some questions In-Reply-To: <5178FBEE.4070200@ed.ac.uk> References: <5178FBEE.4070200@ed.ac.uk> Message-ID: Thanks Orlando. Much appreciated. On 25 April 2013 10:48, Orlando Richards wrote: > On 25/04/13 10:38, Pete Smith wrote: > >> Hi all >> >> Good to see lots of you at the user group meeting yesterday. Great work, >> Jez! >> >> We're setting up a test cluster here at Realise, with a view to moving >> our main storage over from Gluster. >> >> We're running the test cluster on Isilon hardware ... a couple of 1920 >> nodes that we were using for home dirs. Each node has dual gigabit >> ethernet ports, and dual infiniband ports. Single dual-core Xeon proc >> and and 4GB RAM. All good stuff and should make a nice test rig. >> >> I have a few questions! >> >> 1. We're on centos6.4.x86_64. What's the easiest way to go from >> 3.3.blah to 3.5? >> 2. I'm having trouble assigning NSDs. I have a descfile which looks like: >> >> #DiskName:PrimaryServer:**BackupServer:DiskUsage:** >> FailureGroup:DesiredName:**StoragePool >> /dev/sdc1:gpfs001.**realisestudio.com::**dataAndMetadata:1 >> >> but the command >> >> "mmcrnsd -F /tmp/descfile -v no" >> >> just craps out with >> >> mmcrnsd: Processing disk sdc1 >> mmcrnsd: Node gpfs001.realisestudio.com >> > >> does not have a GPFS server license >> designation. >> mmcrnsd: Error found while checking disk descriptor >> /dev/sdc1:gpfs001.**realisestudio.com::**dataAndMetadata:1 >> mmcrnsd: Command failed. Examine previous error messages to determine >> cause. >> >> > mmchlicense server -N gpfs001.realisestudio.com should sort that one out. > > > Any help pointing me gently in the right direction would be much >> appreciated. :-) >> >> TIA >> >> -- >> Pete Smith >> DevOp/System Administrator >> Realise Studio >> 12/13 Poland Street, London W1F 8QB >> T. 
+44 (0)20 7165 9644 >> >> realisestudio.com >> >> >> ______________________________**_________________ >> gpfsug-discuss mailing list >> gpfsug-discuss at gpfsug.org >> http://gpfsug.org/mailman/**listinfo/gpfsug-discuss >> >> > > -- > -- > Dr Orlando Richards > Information Services > IT Infrastructure Division > Unix Section > Tel: 0131 650 4994 > > The University of Edinburgh is a charitable body, registered in Scotland, > with registration number SC005336. > ______________________________**_________________ > gpfsug-discuss mailing list > gpfsug-discuss at gpfsug.org > http://gpfsug.org/mailman/**listinfo/gpfsug-discuss > -- Pete Smith DevOp/System Administrator Realise Studio 12/13 Poland Street, London W1F 8QB T. +44 (0)20 7165 9644 realisestudio.com -------------- next part -------------- An HTML attachment was scrubbed... URL: From pete at realisestudio.com Fri Apr 26 16:06:38 2013 From: pete at realisestudio.com (Pete Smith) Date: Fri, 26 Apr 2013 16:06:38 +0100 Subject: [gpfsug-discuss] GPS Native RAID on linux? Message-ID: Hi I thought from the presentation that this was available on linux ... but documentation seems to indicate IBM GSS only? -- Pete Smith DevOp/System Administrator Realise Studio 12/13 Poland Street, London W1F 8QB T. +44 (0)20 7165 9644 realisestudio.com -------------- next part -------------- An HTML attachment was scrubbed... URL: From stuartb at 4gh.net Tue Apr 30 21:50:38 2013 From: stuartb at 4gh.net (Stuart Barkley) Date: Tue, 30 Apr 2013 16:50:38 -0400 (EDT) Subject: [gpfsug-discuss] Test cluster - some questions In-Reply-To: References: Message-ID: On Thu, 25 Apr 2013 at 05:38 -0000, Pete Smith wrote: > 1. We're on centos6.4.x86_64. What's the easiest way to go from > 3.3.blah to 3.5? We are in transition to 3.5 on our original GPFS installation. Two of four servers are now at GPFS 3.4.XX/CentOS 6.4. Two servers are still at 3.3.YY/CentOS 5.4. The compute nodes are all to 3.4.XX/CentOS 6.4. The data center is remotely located and it is a pain to get physical access. Once we get the last two nodes upgraded, we expect to go to GPFS 3.5 fairly quickly (we already have 3.5 running on a newer GPFS installation). My understanding is that you need to step through 3.4 during a migration from 3.3 to 3.5. Stuart -- I've never been lost; I was once bewildered for three days, but never lost! -- Daniel Boone
Regards; Ahmad Y Hussein Note: This is an automated response to your message "gpfsug-discuss Digest, Vol 16, Issue 1" sent on 03/04/2013 15:00:02. This is the only notification you will receive while this person is away. From chris_stone at uk.ibm.com Wed Apr 3 16:08:39 2013 From: chris_stone at uk.ibm.com (Chris Stone) Date: Wed, 3 Apr 2013 16:08:39 +0100 Subject: [gpfsug-discuss] AUTO: Chris Stone/UK/IBM is out of the office until 16/08/2004. (returning 11/04/2013) Message-ID: I am out of the office until 11/04/2013. In an emergency please contact my manager Aniket Patel on :+44 (0) 7736 017 418 Note: This is an automated response to your message "[gpfsug-discuss] mmbackup and management classes" sent on 03/04/2013 10:57:05. This is the only notification you will receive while this person is away. From ANDREWD at uk.ibm.com Wed Apr 3 16:10:26 2013 From: ANDREWD at uk.ibm.com (Andrew Downes1) Date: Wed, 3 Apr 2013 16:10:26 +0100 Subject: [gpfsug-discuss] AUTO: Andrew Downes is out of the office (returning 08/04/2013) Message-ID: I am out of the office until 08/04/2013. If anything is too urgent to wait for my return please contact Matt Ayres mailto:m_ayres at uk.ibm.com 44-7710-981527 In case of urgency, please contact our manager Dave Shave-Wall mailto:dave_shavewall at uk.ibm.com 44-7740-921623 Note: This is an automated response to your message "gpfsug-discuss Digest, Vol 16, Issue 1" sent on 03/04/2013 12:00:02. This is the only notification you will receive while this person is away. From ashish.thandavan at cs.ox.ac.uk Thu Apr 11 10:58:41 2013 From: ashish.thandavan at cs.ox.ac.uk (Ashish Thandavan) Date: Thu, 11 Apr 2013 10:58:41 +0100 Subject: [gpfsug-discuss] Register now: Spring GPFS User Group arranged In-Reply-To: References: Message-ID: <51668951.7040506@cs.ox.ac.uk> Dear Claire, I trust you are well! If there are any spaces left, could you please register me for the event? Thank you! Regards, Ash On 25/03/13 14:38, Claire Robson wrote: > > Dear All, > > The next meeting date is set for *Wednesday 24^th April* and will be > taking place at the fantastic Dolby Studios in London (Dolby Europe > Limited, 4--6 Soho Square, London W1D 3PZ). > > *Getting to Dolby Europe Limited, Soho Square, London* > > Leave the Tottenham Court Road tube station by the South Oxford Street > exit [Exit 1]. > > Turn left onto Oxford Street. > > After about 50m turn left into Soho Street. > > Turn right into Soho Square. > > 4-6 Soho Square is directly in front of you. > > Our tentative agenda is as follows: > > 10:30 Arrivals and refreshments > > 11:00 Introductions and committee updates > > Jez Tucker, Group Chair & Claire Robson, Group Secretary > > 11:05 GPFS OpenStack Integration > > Prasenhit Sarkar, IBM Almaden Research Labs > > GPFS FPO > > Dinesh Subhraveti, IBM Almaden Research Labs > > 11:45 SAMBA 4.0 & CTDB 2.0 > > Michael Adams, SAMBA Development Team > > 12:15 SAMBA & GPFS Integration > > Volker Lendecke, SAMBA Development Team > > 13:00 Lunch (Buffet provided) > > 14:00 GPFS Native RAID & LTFS > > Jim Roche, IBM > > 14:45 User Stories > > 15:45 Group discussion: Challenges, experiences and questions & > Committee matters > > Led by Jez Tucker, Group Chairperson > > 16:00 Close > > We will be starting at 11:00am and concluding at 4pm but some of the > speaker timings may alter slightly. I will be posting further details > on what the presentations cover over the coming week or so. > > We hope you can make it for what will be a really interesting day of > GPFS discussions. 
From orlando.richards at ed.ac.uk Fri Apr 12 13:37:52 2013 From: orlando.richards at ed.ac.uk (Orlando Richards) Date: Fri, 12 Apr 2013 13:37:52 +0100 Subject: [gpfsug-discuss] CTDB woes Message-ID: <51680020.4040509@ed.ac.uk> Hi folks, We've long been using CTDB and Samba for our NAS service, servicing ~500 users. We've been suffering from some problems with the CTDB performance over the last few weeks, likely triggered either by an upgrade of samba from 3.5 to 3.6 (and enabling of SMB2 as a result), or possibly by additional users coming on with a new workload. We run CTDB 1.0.114.4-1 (from sernet) and samba3-3.6.12-44 (again, from sernet). Before we roll back, we'd like to make sure we can't fix the problem and stick with Samba 3.6 (and we don't even know that a roll back would fix the issue). The symptoms are a complete freeze of the service for CIFS users for 10-60 seconds, and on the servers a corresponding spawning of large numbers of CTDB processes, which seem to be created in a "big bang", and then do what they do and exit in the subsequent 10-60 seconds. We also serve up NFS from the same ctdb-managed frontends, and GPFS from the cluster - and these are both fine throughout. This was happening 5-10 times per hour, not at exact intervals though. When we added a third node to the CTDB cluster, it "got worse", and when we dropped the CTDB cluster down to a single node, everything started behaving fine - which is where we are now. So, I've got a bunch of questions! - does anyone know why ctdb would be spawning these processes, and if there's anything we can do to stop it needing to do it? - has anyone done any more general performance / config optimisation of CTDB? And - more generally - does anyone else actually use ctdb/samba/gpfs on the scale of ~500 users or higher? If so - how do you find it? -- -- Dr Orlando Richards Information Services IT Infrastructure Division Unix Section Tel: 0131 650 4994 The University of Edinburgh is a charitable body, registered in Scotland, with registration number SC005336.
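One rough way to put numbers on the process storms described above is simply to log how many ctdb processes each frontend has over time and see whether the spikes line up with the client freezes. A minimal sketch, assuming the spawned helpers all show up with "ctdb" in their process name and picking an arbitrary log location:

    # sample the ctdb process count every 5 seconds with a timestamp
    while true; do
        echo "$(date '+%Y-%m-%d %H:%M:%S') $(pgrep -c ctdb)"
        sleep 5
    done >> /var/tmp/ctdb-proc-count.log

A jump from a handful of processes to several hundred in a single sample is the sort of event being described in this thread.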
From Tobias.Kuebler at sva.de Fri Apr 12 14:03:58 2013 From: Tobias.Kuebler at sva.de (Tobias.Kuebler at sva.de) Date: Fri, 12 Apr 2013 15:03:58 +0200 Subject: [gpfsug-discuss] AUTO: Tobias Kuebler is out of the office (returning Mon, 04/15/2013) Message-ID: I am out of the office from Thu, 04/11/2013 until Mon, 04/15/2013. Thank you for your message. Incoming e-mails will not be forwarded while I am away, but I will try to answer them as soon as possible after my return. In urgent cases please contact your responsible sales representative. Note: This is an automated response to your message "[gpfsug-discuss] CTDB woes" sent on 12.04.2013 14:37:52. This is the only notification you will receive while this person is away. -------------- next part -------------- An HTML attachment was scrubbed... URL:
From orlando.richards at ed.ac.uk Fri Apr 12 16:43:44 2013 From: orlando.richards at ed.ac.uk (Orlando Richards) Date: Fri, 12 Apr 2013 16:43:44 +0100 Subject: [gpfsug-discuss] CTDB woes In-Reply-To: References: <51680020.4040509@ed.ac.uk> Message-ID: <51682BB0.7010507@ed.ac.uk> On 12/04/13 15:43, Bob Cregan wrote: > Hi Orlando, > We use ctdb/samba for CIFS, and CNFS for NFS > (GPFS version 3.4.0-13). Current versions are > > ctdb - 1.0.99 > samba 3.5.15 > > Both compiled from source. We have about 300+ users normally. > We have suspicions that 3.6 has put additional "chatter" into the ctdb database stream, which has pushed us over the edge. Barry Evans has found that the clustered locking databases, in particular, prove to be a scalability/usability limit for ctdb. > We have had no issues with this setup apart from CNFS which had 2 or 3 > bad moments over the last year. These have gone away since we have > fixed a bug with our 10G NIC drivers (emulex cards, kernel module > be2net) which lead to occasional dropped packets for jumbo frames. There > have been no issues with samba/ctdb > > The only comment I can make is that during initial investigations into > an upgrade of samba to 3.6.x we discovered that the 3.6 code would not > compile against ctdb 1.0.99 (compilation requires the ctdb source) > with error messages like: > > configure: checking whether cluster support is available > checking for ctdb.h... yes > checking for ctdb_private.h... yes > checking for CTDB_CONTROL_TRANS3_COMMIT declaration... yes > checking for CTDB_CONTROL_SCHEDULE_FOR_DELETION declaration... no > configure: error: "cluster support not available: support for > SCHEDULE_FOR_DELETION control missing" > > > What occurs to me is that this message seems to indicate that it is > possible to run a ctdb version that is incompatible with samba 3.6. > That would imply that an upgrade to a higher version of ctdb might > help, of course it might not and make backing out harder. Certainly 1.0.114 builds fine - I've not tried 2.0, I'm too scared! The versioning in CTDB has proved hard for me to fathom... > > A compile against ctdb 2.0 works fine. We will soon be running in this > upgrade, but I'm waiting to see what the samba people say at the UG > meeting first! > It has to be said - the timing is good! Cheers, Orlando > > Thanks > > Bob > > > On 12 April 2013 13:37, Orlando Richards > wrote: > > Hi folks, > > We've long been using CTDB and Samba for our NAS service, servicing > ~500 users. We've been suffering from some problems with the CTDB > performance over the last few weeks, likely triggered either by an > upgrade of samba from 3.5 to 3.6 (and enabling of SMB2 as a result), > or possibly by additional users coming on with a new workload. > > We run CTDB 1.0.114.4-1 (from sernet) and samba3-3.6.12-44 (again, > from sernet). Before we roll back, we'd like to make sure we can't > fix the problem and stick with Samba 3.6 (and we don't even know > that a roll back would fix the issue).
> > The symptoms are a complete freeze of the service for CIFS users for > 10-60 seconds, and on the servers a corresponding spawning of large > numbers of CTDB processes, which seem to be created in a "big bang", > and then do what they do and exit in the subsequent 10-60 seconds. > > We also serve up NFS from the same ctdb-managed frontends, and GPFS > from the cluster - and these are both fine throughout. > > This was happening 5-10 times per hour, not at exact intervals > though. When we added a third node to the CTDB cluster, it "got > worse", and when we dropped the CTDB cluster down to a single node > and everything started behaving fine - which is where we are now. > > So, I've got a bunch of questions! > > - does anyone know why ctdb would be spawning these processes, and > if there's anything we can do to stop it needing to do it? > - has anyone done any more general performance / config > optimisation of CTDB? > > And - more generally - does anyone else actually use ctdb/samba/gpfs > on the scale of ~500 users or higher? If so - how do you find it? > > > -- > -- > Dr Orlando Richards > Information Services > IT Infrastructure Division > Unix Section > Tel: 0131 650 4994 > > The University of Edinburgh is a charitable body, registered in > Scotland, with registration number SC005336. > _________________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at gpfsug.org > http://gpfsug.org/mailman/__listinfo/gpfsug-discuss > > > > > > -- > > Bob Cregan > > Senior Storage Systems Administrator > > ACRC > > Bristol University > > Tel: +44 (0) 117 331 4406 > > skype: bobcregan > > Mobile: +44 (0) 7712388129 > -- -- Dr Orlando Richards Information Services IT Infrastructure Division Unix Section Tel: 0131 650 4994 The University of Edinburgh is a charitable body, registered in Scotland, with registration number SC005336. From viccornell at gmail.com Fri Apr 12 19:44:16 2013 From: viccornell at gmail.com (Vic Cornell) Date: Fri, 12 Apr 2013 19:44:16 +0100 Subject: [gpfsug-discuss] CTDB woes In-Reply-To: <51682BB0.7010507@ed.ac.uk> References: <51680020.4040509@ed.ac.uk> <51682BB0.7010507@ed.ac.uk> Message-ID: <271DA6EE-D64D-4DBC-9DFE-4335E55102D4@gmail.com> Have you tried putting the ctdb files onto a separate gpfs filesystem? Vic Cornell viccornell at gmail.com On 12 Apr 2013, at 16:43, Orlando Richards wrote: > On 12/04/13 15:43, Bob Cregan wrote: >> Hi Orlando, >> We use ctdb/samba for CIFS, and CNFS for NFS >> (GPFS version 3.4.0-13) . Current versions are >> >> ctdb - 1.0.99 >> samba 3.5.15 >> >> Both compiled from source. We have about 300+ users normally. >> > > We have suspicions that 3.6 has put additional "chatter" into the ctdb database stream, which has pushed us over the edge. Barry Evans has found that the clustered locking databases, in particular, prove to be a scalability/usability limit for ctdb. > > >> We have had no issues with this setup apart from CNFS which had 2 or 3 >> bad moments over the last year . These have gone away since we have >> fixed a bug with our 10G NIC drivers (emulex cards , kernel module >> be2net) which lead to occasional dropped packets for jumbo frames. 
There >> have been no issues with samba/ctdb >> >> The only comment I can make is that during initial investigations into >> an upgrade of samba to 3.6.x we discovered that the 3.6 code would not >> compile against ctdb 1.0.99 (compilation requires tthe ctdb source ) >> with error messages like: >> >> configure: checking whether cluster support is available >> checking for ctdb.h... yes >> checking for ctdb_private.h... yes >> checking for CTDB_CONTROL_TRANS3_COMMIT declaration... yes >> checking for CTDB_CONTROL_SCHEDULE_FOR_DELETION declaration... no >> configure: error: "cluster support not available: support for >> SCHEDULE_FOR_DELETION control missing" >> >> >> What occurs to me is that this message seems to indicate that it is >> possible to run a ctdb version that is incompatible with samba 3.6. >> That would imply that an upgrade to a higher version of ctdb might >> help, of course it might not and make backing out harder. > > Certainly 1.0.114 builds fine - I've not tried 2.0, I'm too scared! The versioning in CTDB has proved hard for me to fathom... > >> >> A compile against ctdb 2.0 works fine. We will soon be running in this >> upgrade, but I'm waiting to see what the samba people say at the UG >> meeting first! >> > > It has to be said - the timing is good! > Cheers, > Orlando > >> >> Thanks >> >> Bob >> >> >> On 12 April 2013 13:37, Orlando Richards > > wrote: >> >> Hi folks, ac >> >> We've long been using CTDB and Samba for our NAS service, servicing >> ~500 users. We've been suffering from some problems with the CTDB >> performance over the last few weeks, likely triggered either by an >> upgrade of samba from 3.5 to 3.6 (and enabling of SMB2 as a result), >> or possibly by additional users coming on with a new workload. >> >> We run CTDB 1.0.114.4-1 (from sernet) and samba3-3.6.12-44 (again, >> from sernet). Before we roll back, we'd like to make sure we can't >> fix the problem and stick with Samba 3.6 (and we don't even know >> that a roll back would fix the issue). >> >> The symptoms are a complete freeze of the service for CIFS users for >> 10-60 seconds, and on the servers a corresponding spawning of large >> numbers of CTDB processes, which seem to be created in a "big bang", >> and then do what they do and exit in the subsequent 10-60 seconds. >> >> We also serve up NFS from the same ctdb-managed frontends, and GPFS >> from the cluster - and these are both fine throughout. >> >> This was happening 5-10 times per hour, not at exact intervals >> though. When we added a third node to the CTDB cluster, it "got >> worse", and when we dropped the CTDB cluster down to a single node >> and everything started behaving fine - which is where we are now. >> >> So, I've got a bunch of questions! >> >> - does anyone know why ctdb would be spawning these processes, and >> if there's anything we can do to stop it needing to do it? >> - has anyone done any more general performance / config >> optimisation of CTDB? >> >> And - more generally - does anyone else actually use ctdb/samba/gpfs >> on the scale of ~500 users or higher? If so - how do you find it? >> >> >> -- >> -- >> Dr Orlando Richards >> Information Services >> IT Infrastructure Division >> Unix Section >> Tel: 0131 650 4994 >> >> The University of Edinburgh is a charitable body, registered in >> Scotland, with registration number SC005336. 
>> _________________________________________________ >> gpfsug-discuss mailing list >> gpfsug-discuss at gpfsug.org >> http://gpfsug.org/mailman/__listinfo/gpfsug-discuss >> >> >> >> >> >> -- >> >> Bob Cregan >> >> Senior Storage Systems Administrator >> >> ACRC >> >> Bristol University >> >> Tel: +44 (0) 117 331 4406 >> >> skype: bobcregan >> >> Mobile: +44 (0) 7712388129 >> > > > -- > -- > Dr Orlando Richards > Information Services > IT Infrastructure Division > Unix Section > Tel: 0131 650 4994 > > The University of Edinburgh is a charitable body, registered in Scotland, with registration number SC005336. > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at gpfsug.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss From orlando.richards at ed.ac.uk Mon Apr 15 10:54:39 2013 From: orlando.richards at ed.ac.uk (Orlando Richards) Date: Mon, 15 Apr 2013 10:54:39 +0100 Subject: [gpfsug-discuss] CTDB woes In-Reply-To: <271DA6EE-D64D-4DBC-9DFE-4335E55102D4@gmail.com> References: <51680020.4040509@ed.ac.uk> <51682BB0.7010507@ed.ac.uk> <271DA6EE-D64D-4DBC-9DFE-4335E55102D4@gmail.com> Message-ID: <516BCE5F.8010309@ed.ac.uk> On 12/04/13 19:44, Vic Cornell wrote: > Have you tried putting the ctdb files onto a separate gpfs filesystem? No - but considered it. However, the only "live" CTDB file that sits on GPFS is the reclock file, which - I think - is only used as the heartbeat between nodes and for the recovery process. Now, there's mileage in insulating that, certainly, but I don't think that's what we're suffering from here. On a positive note - we took the steps this morning to re-initialise the ctdb databases from current data, and things seem to be stable today so far. Basically - shut down ctdb on all but one node. On all but that node, do: mv /var/ctdb/ /var/ctdb.save.date then start up ctdb on those nodes. Once they've come up, shut down ctdb on the last node, move /var/ctdb out the way, and restart. That brings them all up with freshly compacted databases. Also, from the samba-technical mailing list came the advice to use a more recent ctdb - specifically, 1.2.61. I've got that built and ready to go (and a rebuilt samba compiled against it too), but if things prove to be stable after today's compacting, then we will probably leave it at that and not deploy this. Interesting that 2.0 wasn't suggested for "stable", and that the current "dev" version is 2.1. For reference, here's the start of the thread: https://lists.samba.org/archive/samba-technical/2013-April/091525.html -- Orlando. > > On 12 Apr 2013, at 16:43, Orlando Richards wrote: > >> On 12/04/13 15:43, Bob Cregan wrote: >>> Hi Orlando, >>> We use ctdb/samba for CIFS, and CNFS for NFS >>> (GPFS version 3.4.0-13) . Current versions are >>> >>> ctdb - 1.0.99 >>> samba 3.5.15 >>> >>> Both compiled from source. We have about 300+ users normally. >>> >> >> We have suspicions that 3.6 has put additional "chatter" into the ctdb database stream, which has pushed us over the edge. Barry Evans has found that the clustered locking databases, in particular, prove to be a scalability/usability limit for ctdb. >> >> >>> We have had no issues with this setup apart from CNFS which had 2 or 3 >>> bad moments over the last year . These have gone away since we have >>> fixed a bug with our 10G NIC drivers (emulex cards , kernel module >>> be2net) which lead to occasional dropped packets for jumbo frames. 
There >>> have been no issues with samba/ctdb >>> >>> The only comment I can make is that during initial investigations into >>> an upgrade of samba to 3.6.x we discovered that the 3.6 code would not >>> compile against ctdb 1.0.99 (compilation requires tthe ctdb source ) >>> with error messages like: >>> >>> configure: checking whether cluster support is available >>> checking for ctdb.h... yes >>> checking for ctdb_private.h... yes >>> checking for CTDB_CONTROL_TRANS3_COMMIT declaration... yes >>> checking for CTDB_CONTROL_SCHEDULE_FOR_DELETION declaration... no >>> configure: error: "cluster support not available: support for >>> SCHEDULE_FOR_DELETION control missing" >>> >>> >>> What occurs to me is that this message seems to indicate that it is >>> possible to run a ctdb version that is incompatible with samba 3.6. >>> That would imply that an upgrade to a higher version of ctdb might >>> help, of course it might not and make backing out harder. >> >> Certainly 1.0.114 builds fine - I've not tried 2.0, I'm too scared! The versioning in CTDB has proved hard for me to fathom... >> >>> >>> A compile against ctdb 2.0 works fine. We will soon be running in this >>> upgrade, but I'm waiting to see what the samba people say at the UG >>> meeting first! >>> >> >> It has to be said - the timing is good! >> Cheers, >> Orlando >> >>> >>> Thanks >>> >>> Bob >>> >>> >>> On 12 April 2013 13:37, Orlando Richards >> > wrote: >>> >>> Hi folks, ac >>> >>> We've long been using CTDB and Samba for our NAS service, servicing >>> ~500 users. We've been suffering from some problems with the CTDB >>> performance over the last few weeks, likely triggered either by an >>> upgrade of samba from 3.5 to 3.6 (and enabling of SMB2 as a result), >>> or possibly by additional users coming on with a new workload. >>> >>> We run CTDB 1.0.114.4-1 (from sernet) and samba3-3.6.12-44 (again, >>> from sernet). Before we roll back, we'd like to make sure we can't >>> fix the problem and stick with Samba 3.6 (and we don't even know >>> that a roll back would fix the issue). >>> >>> The symptoms are a complete freeze of the service for CIFS users for >>> 10-60 seconds, and on the servers a corresponding spawning of large >>> numbers of CTDB processes, which seem to be created in a "big bang", >>> and then do what they do and exit in the subsequent 10-60 seconds. >>> >>> We also serve up NFS from the same ctdb-managed frontends, and GPFS >>> from the cluster - and these are both fine throughout. >>> >>> This was happening 5-10 times per hour, not at exact intervals >>> though. When we added a third node to the CTDB cluster, it "got >>> worse", and when we dropped the CTDB cluster down to a single node >>> and everything started behaving fine - which is where we are now. >>> >>> So, I've got a bunch of questions! >>> >>> - does anyone know why ctdb would be spawning these processes, and >>> if there's anything we can do to stop it needing to do it? >>> - has anyone done any more general performance / config >>> optimisation of CTDB? >>> >>> And - more generally - does anyone else actually use ctdb/samba/gpfs >>> on the scale of ~500 users or higher? If so - how do you find it? >>> >>> >>> -- >>> -- >>> Dr Orlando Richards >>> Information Services >>> IT Infrastructure Division >>> Unix Section >>> Tel: 0131 650 4994 >>> >>> The University of Edinburgh is a charitable body, registered in >>> Scotland, with registration number SC005336. 
>>> _______________________________________________ >>> gpfsug-discuss mailing list >>> gpfsug-discuss at gpfsug.org >>> http://gpfsug.org/mailman/listinfo/gpfsug-discuss >>> >>> >>> >>> >>> -- >>> >>> Bob Cregan >>> >>> Senior Storage Systems Administrator >>> >>> ACRC >>> >>> Bristol University >>> >>> Tel: +44 (0) 117 331 4406 >>> >>> skype: bobcregan >>> >>> Mobile: +44 (0) 7712388129 >>> >> >> >> -- >> -- >> Dr Orlando Richards >> Information Services >> IT Infrastructure Division >> Unix Section >> Tel: 0131 650 4994 >> >> The University of Edinburgh is a charitable body, registered in Scotland, with registration number SC005336. >> _______________________________________________ >> gpfsug-discuss mailing list >> gpfsug-discuss at gpfsug.org >> http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at gpfsug.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > -- -- Dr Orlando Richards Information Services IT Infrastructure Division Unix Section Tel: 0131 650 4994 The University of Edinburgh is a charitable body, registered in Scotland, with registration number SC005336.
From crobson at ocf.co.uk Mon Apr 15 15:04:38 2013 From: crobson at ocf.co.uk (Claire Robson) Date: Mon, 15 Apr 2013 15:04:38 +0100 Subject: [gpfsug-discuss] Latest agenda and places still available Message-ID: Dear All, Thank you to those who have expressed an interest in next Wednesday's GPFS user group meeting in London and registered a place. There are a few places still available, please register with me if you would like to attend. This is the latest agenda for the day: 10:30 Arrivals and refreshments 11:00 Introductions and committee updates Jez Tucker, Group Chair & Claire Robson, Group Secretary 11:05 GPFS FPO Dinesh Subhraveti, IBM Almaden Research Labs 12:00 SAMBA 4.0 & CTDB 2.0 Michael Adams, SAMBA Development Team 13:00 Lunch (Buffet provided) 13:45 GPFS OpenStack Integration Dinesh Subhraveti, IBM Almaden Research Labs 14:15 SAMBA & GPFS Integration Volker Lendecke, SAMBA Development Team 15:15 Refreshments break 15:30 GPFS Native RAID & LTFS Jim Roche, IBM 16:00 Group discussion: Questions & Committee matters Led by Jez Tucker, Group Chairperson 16:05 Close I look forward to seeing many of you next week. Kind regards, Claire Robson GPFS user group Secretary Tel: 0114 257 2200 Mob: 07508 033896 Web: www.gpfsug.org -------------- next part -------------- An HTML attachment was scrubbed... URL:
From AHMADYH at sa.ibm.com Tue Apr 16 13:08:58 2013 From: AHMADYH at sa.ibm.com (Ahmad Y Hussein) Date: Tue, 16 Apr 2013 16:08:58 +0400 Subject: [gpfsug-discuss] AUTO: Ahmad Y Hussein is out of the office (returning 04/29/2013) Message-ID: I am out of the office until 04/29/2013. Dear Sender; I am in a customer engagement with extremely limited email access, I will respond to your emails as soon as I can. For urgent cases please call me on my mobile (+966542001289). Thank you for understanding. Regards; Ahmad Y Hussein Note: This is an automated response to your message "gpfsug-discuss Digest, Vol 16, Issue 6" sent on 16/04/2013 15:00:02. This is the only notification you will receive while this person is away.
From orlando.richards at ed.ac.uk Wed Apr 17 11:30:32 2013 From: orlando.richards at ed.ac.uk (Orlando Richards) Date: Wed, 17 Apr 2013 11:30:32 +0100 Subject: [gpfsug-discuss] CTDB woes In-Reply-To: <516BCE5F.8010309@ed.ac.uk> References: <51680020.4040509@ed.ac.uk> <51682BB0.7010507@ed.ac.uk> <271DA6EE-D64D-4DBC-9DFE-4335E55102D4@gmail.com> <516BCE5F.8010309@ed.ac.uk> Message-ID: <516E79C8.8090603@ed.ac.uk> Hi All - an update to this, After re-initialising the databases on Monday, things did seem to be running better, but ultimately we got back to suffering from spikes in ctdb processes and corresponding "pauses" in service. We fell back to a single node again for Tuesday (and things were stable once again), and this morning rolled out CTDB 1.2.61 (plus a 3.6.12 samba which was rebuilt against CTDB 1.2.61 headers). Things seem to be stable for now - more so than on Monday. For the record - one metric I'm watching is the number of ctdb processes running (this would spike to > 1000 under the failure conditions). It's currently sitting consistently at 3 processes, with occasional blips of 5-7 processes. -- Orlando On 15/04/13 10:54, Orlando Richards wrote: > On 12/04/13 19:44, Vic Cornell wrote: >> Have you tried putting the ctdb files onto a separate gpfs filesystem? > > No - but considered it. However, the only "live" CTDB file that sits on > GPFS is the reclock file, which - I think - is only used as the > heartbeat between nodes and for the recovery process. Now, there's > mileage in insulating that, certainly, but I don't think that's what > we're suffering from here. > > On a positive note - we took the steps this morning to re-initialise the > ctdb databases from current data, and things seem to be stable today so > far. > > Basically - shut down ctdb on all but one node. On all but that node, do: > mv /var/ctdb/ /var/ctdb.save.date > > then start up ctdb on those nodes. Once they've come up, shut down ctdb > on the last node, move /var/ctdb out the way, and restart. That brings > them all up with freshly compacted databases. > > Also, from the samba-technical mailing list came the advice to use a > more recent ctdb - specifically, 1.2.61. I've got that built and ready > to go (and a rebuilt samba compiled against it too), but if things prove > to be stable after today's compacting, then we will probably leave it at > that and not deploy this. > > Interesting that 2.0 wasn't suggested for "stable", and that the current > "dev" version is 2.1. > > For reference, here's the start of the thread: > https://lists.samba.org/archive/samba-technical/2013-April/091525.html > > -- > Orlando. > > > >> >> On 12 Apr 2013, at 16:43, Orlando Richards >> wrote: >> >>> On 12/04/13 15:43, Bob Cregan wrote: >>>> Hi Orlando, >>>> We use ctdb/samba for CIFS, and CNFS for NFS >>>> (GPFS version 3.4.0-13) . Current versions are >>>> >>>> ctdb - 1.0.99 >>>> samba 3.5.15 >>>> >>>> Both compiled from source. We have about 300+ users normally. >>>> >>> >>> We have suspicions that 3.6 has put additional "chatter" into the >>> ctdb database stream, which has pushed us over the edge. Barry Evans >>> has found that the clustered locking databases, in particular, prove >>> to be a scalability/usability limit for ctdb. >>> >>> >>>> We have had no issues with this setup apart from CNFS which had 2 or 3 >>>> bad moments over the last year . These have gone away since we have >>>> fixed a bug with our 10G NIC drivers (emulex cards , kernel module >>>> be2net) which lead to occasional dropped packets for jumbo frames. 
>>>> There >>>> have been no issues with samba/ctdb >>>> >>>> The only comment I can make is that during initial investigations into >>>> an upgrade of samba to 3.6.x we discovered that the 3.6 code would not >>>> compile against ctdb 1.0.99 (compilation requires tthe ctdb source ) >>>> with error messages like: >>>> >>>> configure: checking whether cluster support is available >>>> checking for ctdb.h... yes >>>> checking for ctdb_private.h... yes >>>> checking for CTDB_CONTROL_TRANS3_COMMIT declaration... yes >>>> checking for CTDB_CONTROL_SCHEDULE_FOR_DELETION declaration... no >>>> configure: error: "cluster support not available: support for >>>> SCHEDULE_FOR_DELETION control missing" >>>> >>>> >>>> What occurs to me is that this message seems to indicate that it is >>>> possible to run a ctdb version that is incompatible with samba 3.6. >>>> That would imply that an upgrade to a higher version of ctdb might >>>> help, of course it might not and make backing out harder. >>> >>> Certainly 1.0.114 builds fine - I've not tried 2.0, I'm too scared! >>> The versioning in CTDB has proved hard for me to fathom... >>> >>>> >>>> A compile against ctdb 2.0 works fine. We will soon be running in this >>>> upgrade, but I'm waiting to see what the samba people say at the UG >>>> meeting first! >>>> >>> >>> It has to be said - the timing is good! >>> Cheers, >>> Orlando >>> >>>> >>>> Thanks >>>> >>>> Bob >>>> >>>> >>>> On 12 April 2013 13:37, Orlando Richards >>> > wrote: >>>> >>>> Hi folks, ac >>>> >>>> We've long been using CTDB and Samba for our NAS service, servicing >>>> ~500 users. We've been suffering from some problems with the CTDB >>>> performance over the last few weeks, likely triggered either by an >>>> upgrade of samba from 3.5 to 3.6 (and enabling of SMB2 as a >>>> result), >>>> or possibly by additional users coming on with a new workload. >>>> >>>> We run CTDB 1.0.114.4-1 (from sernet) and samba3-3.6.12-44 (again, >>>> from sernet). Before we roll back, we'd like to make sure we can't >>>> fix the problem and stick with Samba 3.6 (and we don't even know >>>> that a roll back would fix the issue). >>>> >>>> The symptoms are a complete freeze of the service for CIFS users >>>> for >>>> 10-60 seconds, and on the servers a corresponding spawning of large >>>> numbers of CTDB processes, which seem to be created in a "big >>>> bang", >>>> and then do what they do and exit in the subsequent 10-60 seconds. >>>> >>>> We also serve up NFS from the same ctdb-managed frontends, and GPFS >>>> from the cluster - and these are both fine throughout. >>>> >>>> This was happening 5-10 times per hour, not at exact intervals >>>> though. When we added a third node to the CTDB cluster, it "got >>>> worse", and when we dropped the CTDB cluster down to a single node >>>> and everything started behaving fine - which is where we are now. >>>> >>>> So, I've got a bunch of questions! >>>> >>>> - does anyone know why ctdb would be spawning these processes, >>>> and >>>> if there's anything we can do to stop it needing to do it? >>>> - has anyone done any more general performance / config >>>> optimisation of CTDB? >>>> >>>> And - more generally - does anyone else actually use >>>> ctdb/samba/gpfs >>>> on the scale of ~500 users or higher? If so - how do you find it? 
>>>> >>>> >>>> -- >>>> -- >>>> Dr Orlando Richards >>>> Information Services >>>> IT Infrastructure Division >>>> Unix Section >>>> Tel: 0131 650 4994 >>>> >>>> The University of Edinburgh is a charitable body, registered in >>>> Scotland, with registration number SC005336. >>>> _________________________________________________ >>>> gpfsug-discuss mailing list >>>> gpfsug-discuss at gpfsug.org >>>> http://gpfsug.org/mailman/__listinfo/gpfsug-discuss >>>> >>>> >>>> >>>> >>>> >>>> -- >>>> >>>> Bob Cregan >>>> >>>> Senior Storage Systems Administrator >>>> >>>> ACRC >>>> >>>> Bristol University >>>> >>>> Tel: +44 (0) 117 331 4406 >>>> >>>> skype: bobcregan >>>> >>>> Mobile: +44 (0) 7712388129 >>>> >>> >>> >>> -- >>> -- >>> Dr Orlando Richards >>> Information Services >>> IT Infrastructure Division >>> Unix Section >>> Tel: 0131 650 4994 >>> >>> The University of Edinburgh is a charitable body, registered in >>> Scotland, with registration number SC005336. >>> _______________________________________________ >>> gpfsug-discuss mailing list >>> gpfsug-discuss at gpfsug.org >>> http://gpfsug.org/mailman/listinfo/gpfsug-discuss >> >> _______________________________________________ >> gpfsug-discuss mailing list >> gpfsug-discuss at gpfsug.org >> http://gpfsug.org/mailman/listinfo/gpfsug-discuss >> > > -- -- Dr Orlando Richards Information Services IT Infrastructure Division Unix Section Tel: 0131 650 4994 The University of Edinburgh is a charitable body, registered in Scotland, with registration number SC005336. From orlando.richards at ed.ac.uk Mon Apr 22 15:52:55 2013 From: orlando.richards at ed.ac.uk (Orlando Richards) Date: Mon, 22 Apr 2013 15:52:55 +0100 Subject: [gpfsug-discuss] CTDB woes In-Reply-To: <516E79C8.8090603@ed.ac.uk> References: <51680020.4040509@ed.ac.uk> <51682BB0.7010507@ed.ac.uk> <271DA6EE-D64D-4DBC-9DFE-4335E55102D4@gmail.com> <516BCE5F.8010309@ed.ac.uk> <516E79C8.8090603@ed.ac.uk> Message-ID: <51754EC7.8000600@ed.ac.uk> On 17/04/13 11:30, Orlando Richards wrote: > Hi All - an update to this, > > After re-initialising the databases on Monday, things did seem to be > running better, but ultimately we got back to suffering from spikes in > ctdb processes and corresponding "pauses" in service. We fell back to a > single node again for Tuesday (and things were stable once again), and > this morning rolled out CTDB 1.2.61 (plus a 3.6.12 samba which was > rebuilt against CTDB 1.2.61 headers). > > Things seem to be stable for now - more so than on Monday. > > For the record - one metric I'm watching is the number of ctdb processes > running (this would spike to > 1000 under the failure conditions). It's > currently sitting consistently at 3 processes, with occasional blips of > 5-7 processes. > Hi all, Looks like things have been running fine since we upgraded ctdb last Wednesday, so I think it's safe to say that we've found a fix for our problem in CTDB 1.2.61. Thanks for all the input! If anyone wants more info, feel free to get in touch. -- Orlando > -- > Orlando > > > > > > On 15/04/13 10:54, Orlando Richards wrote: >> On 12/04/13 19:44, Vic Cornell wrote: >>> Have you tried putting the ctdb files onto a separate gpfs filesystem? >> >> No - but considered it. However, the only "live" CTDB file that sits on >> GPFS is the reclock file, which - I think - is only used as the >> heartbeat between nodes and for the recovery process. Now, there's >> mileage in insulating that, certainly, but I don't think that's what >> we're suffering from here. 
>> >> On a positive note - we took the steps this morning to re-initialise the >> ctdb databases from current data, and things seem to be stable today so >> far. >> >> Basically - shut down ctdb on all but one node. On all but that node, do: >> mv /var/ctdb/ /var/ctdb.save.date >> >> then start up ctdb on those nodes. Once they've come up, shut down ctdb >> on the last node, move /var/ctdb out the way, and restart. That brings >> them all up with freshly compacted databases. >> >> Also, from the samba-technical mailing list came the advice to use a >> more recent ctdb - specifically, 1.2.61. I've got that built and ready >> to go (and a rebuilt samba compiled against it too), but if things prove >> to be stable after today's compacting, then we will probably leave it at >> that and not deploy this. >> >> Interesting that 2.0 wasn't suggested for "stable", and that the current >> "dev" version is 2.1. >> >> For reference, here's the start of the thread: >> https://lists.samba.org/archive/samba-technical/2013-April/091525.html >> >> -- >> Orlando. >> >> >> >>> >>> On 12 Apr 2013, at 16:43, Orlando Richards >>> wrote: >>> >>>> On 12/04/13 15:43, Bob Cregan wrote: >>>>> Hi Orlando, >>>>> We use ctdb/samba for CIFS, and CNFS for NFS >>>>> (GPFS version 3.4.0-13) . Current versions are >>>>> >>>>> ctdb - 1.0.99 >>>>> samba 3.5.15 >>>>> >>>>> Both compiled from source. We have about 300+ users normally. >>>>> >>>> >>>> We have suspicions that 3.6 has put additional "chatter" into the >>>> ctdb database stream, which has pushed us over the edge. Barry Evans >>>> has found that the clustered locking databases, in particular, prove >>>> to be a scalability/usability limit for ctdb. >>>> >>>> >>>>> We have had no issues with this setup apart from CNFS which had 2 or 3 >>>>> bad moments over the last year . These have gone away since we have >>>>> fixed a bug with our 10G NIC drivers (emulex cards , kernel module >>>>> be2net) which lead to occasional dropped packets for jumbo frames. >>>>> There >>>>> have been no issues with samba/ctdb >>>>> >>>>> The only comment I can make is that during initial investigations into >>>>> an upgrade of samba to 3.6.x we discovered that the 3.6 code would not >>>>> compile against ctdb 1.0.99 (compilation requires tthe ctdb source ) >>>>> with error messages like: >>>>> >>>>> configure: checking whether cluster support is available >>>>> checking for ctdb.h... yes >>>>> checking for ctdb_private.h... yes >>>>> checking for CTDB_CONTROL_TRANS3_COMMIT declaration... yes >>>>> checking for CTDB_CONTROL_SCHEDULE_FOR_DELETION declaration... no >>>>> configure: error: "cluster support not available: support for >>>>> SCHEDULE_FOR_DELETION control missing" >>>>> >>>>> >>>>> What occurs to me is that this message seems to indicate that it is >>>>> possible to run a ctdb version that is incompatible with samba 3.6. >>>>> That would imply that an upgrade to a higher version of ctdb might >>>>> help, of course it might not and make backing out harder. >>>> >>>> Certainly 1.0.114 builds fine - I've not tried 2.0, I'm too scared! >>>> The versioning in CTDB has proved hard for me to fathom... >>>> >>>>> >>>>> A compile against ctdb 2.0 works fine. We will soon be running in this >>>>> upgrade, but I'm waiting to see what the samba people say at the UG >>>>> meeting first! >>>>> >>>> >>>> It has to be said - the timing is good! 
>>>> Cheers, >>>> Orlando >>>> >>>>> >>>>> Thanks >>>>> >>>>> Bob >>>>> >>>>> >>>>> On 12 April 2013 13:37, Orlando Richards >>>> > wrote: >>>>> >>>>> Hi folks, ac >>>>> >>>>> We've long been using CTDB and Samba for our NAS service, >>>>> servicing >>>>> ~500 users. We've been suffering from some problems with the CTDB >>>>> performance over the last few weeks, likely triggered either by an >>>>> upgrade of samba from 3.5 to 3.6 (and enabling of SMB2 as a >>>>> result), >>>>> or possibly by additional users coming on with a new workload. >>>>> >>>>> We run CTDB 1.0.114.4-1 (from sernet) and samba3-3.6.12-44 (again, >>>>> from sernet). Before we roll back, we'd like to make sure we can't >>>>> fix the problem and stick with Samba 3.6 (and we don't even know >>>>> that a roll back would fix the issue). >>>>> >>>>> The symptoms are a complete freeze of the service for CIFS users >>>>> for >>>>> 10-60 seconds, and on the servers a corresponding spawning of >>>>> large >>>>> numbers of CTDB processes, which seem to be created in a "big >>>>> bang", >>>>> and then do what they do and exit in the subsequent 10-60 seconds. >>>>> >>>>> We also serve up NFS from the same ctdb-managed frontends, and >>>>> GPFS >>>>> from the cluster - and these are both fine throughout. >>>>> >>>>> This was happening 5-10 times per hour, not at exact intervals >>>>> though. When we added a third node to the CTDB cluster, it "got >>>>> worse", and when we dropped the CTDB cluster down to a single node >>>>> and everything started behaving fine - which is where we are now. >>>>> >>>>> So, I've got a bunch of questions! >>>>> >>>>> - does anyone know why ctdb would be spawning these processes, >>>>> and >>>>> if there's anything we can do to stop it needing to do it? >>>>> - has anyone done any more general performance / config >>>>> optimisation of CTDB? >>>>> >>>>> And - more generally - does anyone else actually use >>>>> ctdb/samba/gpfs >>>>> on the scale of ~500 users or higher? If so - how do you find it? >>>>> >>>>> >>>>> -- >>>>> -- >>>>> Dr Orlando Richards >>>>> Information Services >>>>> IT Infrastructure Division >>>>> Unix Section >>>>> Tel: 0131 650 4994 >>>>> >>>>> The University of Edinburgh is a charitable body, registered in >>>>> Scotland, with registration number SC005336. >>>>> _________________________________________________ >>>>> gpfsug-discuss mailing list >>>>> gpfsug-discuss at gpfsug.org >>>>> http://gpfsug.org/mailman/__listinfo/gpfsug-discuss >>>>> >>>>> >>>>> >>>>> >>>>> >>>>> -- >>>>> >>>>> Bob Cregan >>>>> >>>>> Senior Storage Systems Administrator >>>>> >>>>> ACRC >>>>> >>>>> Bristol University >>>>> >>>>> Tel: +44 (0) 117 331 4406 >>>>> >>>>> skype: bobcregan >>>>> >>>>> Mobile: +44 (0) 7712388129 >>>>> >>>> >>>> >>>> -- >>>> -- >>>> Dr Orlando Richards >>>> Information Services >>>> IT Infrastructure Division >>>> Unix Section >>>> Tel: 0131 650 4994 >>>> >>>> The University of Edinburgh is a charitable body, registered in >>>> Scotland, with registration number SC005336. 
>>>> _______________________________________________ >>>> gpfsug-discuss mailing list >>>> gpfsug-discuss at gpfsug.org >>>> http://gpfsug.org/mailman/listinfo/gpfsug-discuss >>> >>> _______________________________________________ >>> gpfsug-discuss mailing list >>> gpfsug-discuss at gpfsug.org >>> http://gpfsug.org/mailman/listinfo/gpfsug-discuss >>> >> >> > > -- -- Dr Orlando Richards Information Services IT Infrastructure Division Unix Section Tel: 0131 650 4994 The University of Edinburgh is a charitable body, registered in Scotland, with registration number SC005336. From pete at realisestudio.com Thu Apr 25 10:38:07 2013 From: pete at realisestudio.com (Pete Smith) Date: Thu, 25 Apr 2013 10:38:07 +0100 Subject: [gpfsug-discuss] Test cluster - some questions Message-ID: Hi all Good to see lots of you at the user group meeting yesterday. Great work, Jez! We're setting up a test cluster here at Realise, with a view to moving our main storage over from Gluster. We're running the test cluster on Isilon hardware ... a couple of 1920 nodes that we were using for home dirs. Each node has dual gigabit ethernet ports, and dual infiniband ports. Single dual-core Xeon proc and and 4GB RAM. All good stuff and should make a nice test rig. I have a few questions! 1. We're on centos6.4.x86_64. What's the easiest way to go from 3.3.blah to 3.5? 2. I'm having trouble assigning NSDs. I have a descfile which looks like: #DiskName:PrimaryServer:BackupServer:DiskUsage:FailureGroup:DesiredName:StoragePool /dev/sdc1:gpfs001.realisestudio.com::dataAndMetadata:1 but the command "mmcrnsd -F /tmp/descfile -v no" just craps out with mmcrnsd: Processing disk sdc1 mmcrnsd: Node gpfs001.realisestudio.com does not have a GPFS server license designation. mmcrnsd: Error found while checking disk descriptor /dev/sdc1:gpfs001.realisestudio.com::dataAndMetadata:1 mmcrnsd: Command failed. Examine previous error messages to determine cause. Any help pointing me gently in the right direction would be much appreciated. :-) TIA -- Pete Smith DevOp/System Administrator Realise Studio 12/13 Poland Street, London W1F 8QB T. +44 (0)20 7165 9644 realisestudio.com -------------- next part -------------- An HTML attachment was scrubbed... URL: From orlando.richards at ed.ac.uk Thu Apr 25 10:48:30 2013 From: orlando.richards at ed.ac.uk (Orlando Richards) Date: Thu, 25 Apr 2013 10:48:30 +0100 Subject: [gpfsug-discuss] Test cluster - some questions In-Reply-To: References: Message-ID: <5178FBEE.4070200@ed.ac.uk> On 25/04/13 10:38, Pete Smith wrote: > Hi all > > Good to see lots of you at the user group meeting yesterday. Great work, > Jez! > > We're setting up a test cluster here at Realise, with a view to moving > our main storage over from Gluster. > > We're running the test cluster on Isilon hardware ... a couple of 1920 > nodes that we were using for home dirs. Each node has dual gigabit > ethernet ports, and dual infiniband ports. Single dual-core Xeon proc > and and 4GB RAM. All good stuff and should make a nice test rig. > > I have a few questions! > > 1. We're on centos6.4.x86_64. What's the easiest way to go from > 3.3.blah to 3.5? > 2. I'm having trouble assigning NSDs. 
I have a descfile which looks like: > > #DiskName:PrimaryServer:BackupServer:DiskUsage:FailureGroup:DesiredName:StoragePool > /dev/sdc1:gpfs001.realisestudio.com::dataAndMetadata:1 > > but the command > > "mmcrnsd -F /tmp/descfile -v no" > > just craps out with > > mmcrnsd: Processing disk sdc1 > mmcrnsd: Node gpfs001.realisestudio.com > does not have a GPFS server license > designation. > mmcrnsd: Error found while checking disk descriptor > /dev/sdc1:gpfs001.realisestudio.com::dataAndMetadata:1 > mmcrnsd: Command failed. Examine previous error messages to determine > cause. > mmchlicense server -N gpfs001.realisestudio.com should sort that one out. > Any help pointing me gently in the right direction would be much > appreciated. :-) > > TIA > > -- > Pete Smith > DevOp/System Administrator > Realise Studio > 12/13 Poland Street, London W1F 8QB > T. +44 (0)20 7165 9644 > > realisestudio.com > > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at gpfsug.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > -- -- Dr Orlando Richards Information Services IT Infrastructure Division Unix Section Tel: 0131 650 4994 The University of Edinburgh is a charitable body, registered in Scotland, with registration number SC005336. From pete at realisestudio.com Thu Apr 25 11:05:36 2013 From: pete at realisestudio.com (Pete Smith) Date: Thu, 25 Apr 2013 11:05:36 +0100 Subject: [gpfsug-discuss] Test cluster - some questions In-Reply-To: <5178FBEE.4070200@ed.ac.uk> References: <5178FBEE.4070200@ed.ac.uk> Message-ID: Thanks Orlando. Much appreciated. On 25 April 2013 10:48, Orlando Richards wrote: > On 25/04/13 10:38, Pete Smith wrote: > >> Hi all >> >> Good to see lots of you at the user group meeting yesterday. Great work, >> Jez! >> >> We're setting up a test cluster here at Realise, with a view to moving >> our main storage over from Gluster. >> >> We're running the test cluster on Isilon hardware ... a couple of 1920 >> nodes that we were using for home dirs. Each node has dual gigabit >> ethernet ports, and dual infiniband ports. Single dual-core Xeon proc >> and and 4GB RAM. All good stuff and should make a nice test rig. >> >> I have a few questions! >> >> 1. We're on centos6.4.x86_64. What's the easiest way to go from >> 3.3.blah to 3.5? >> 2. I'm having trouble assigning NSDs. I have a descfile which looks like: >> >> #DiskName:PrimaryServer:**BackupServer:DiskUsage:** >> FailureGroup:DesiredName:**StoragePool >> /dev/sdc1:gpfs001.**realisestudio.com::**dataAndMetadata:1 >> >> but the command >> >> "mmcrnsd -F /tmp/descfile -v no" >> >> just craps out with >> >> mmcrnsd: Processing disk sdc1 >> mmcrnsd: Node gpfs001.realisestudio.com >> > >> does not have a GPFS server license >> designation. >> mmcrnsd: Error found while checking disk descriptor >> /dev/sdc1:gpfs001.**realisestudio.com::**dataAndMetadata:1 >> mmcrnsd: Command failed. Examine previous error messages to determine >> cause. >> >> > mmchlicense server -N gpfs001.realisestudio.com should sort that one out. > > > Any help pointing me gently in the right direction would be much >> appreciated. :-) >> >> TIA >> >> -- >> Pete Smith >> DevOp/System Administrator >> Realise Studio >> 12/13 Poland Street, London W1F 8QB >> T. 
+44 (0)20 7165 9644 >> >> realisestudio.com >> >> >> _______________________________________________ >> gpfsug-discuss mailing list >> gpfsug-discuss at gpfsug.org >> http://gpfsug.org/mailman/listinfo/gpfsug-discuss >> >> > > -- > -- > Dr Orlando Richards > Information Services > IT Infrastructure Division > Unix Section > Tel: 0131 650 4994 > > The University of Edinburgh is a charitable body, registered in Scotland, > with registration number SC005336. > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at gpfsug.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > -- Pete Smith DevOp/System Administrator Realise Studio 12/13 Poland Street, London W1F 8QB T. +44 (0)20 7165 9644 realisestudio.com -------------- next part -------------- An HTML attachment was scrubbed... URL:
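For anyone who hits the same licence error that came up in the test cluster thread above, the fix Orlando pointed at boils down to something like the following. Treat it as a sketch rather than a checked procedure - the --accept flag and the verification commands should be confirmed against the documentation for your GPFS level:

    # designate the node as a GPFS server so it can act as an NSD server
    mmchlicense server --accept -N gpfs001.realisestudio.com

    # then re-run the NSD creation against the same descriptor file
    mmcrnsd -F /tmp/descfile -v no

    # and confirm the designation and the new NSD
    mmlslicense -L
    mmlsnsd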
Regards; Ahmad Y Hussein Note: This is an automated response to your message "gpfsug-discuss Digest, Vol 16, Issue 1" sent on 03/04/2013 15:00:02. This is the only notification you will receive while this person is away. From chris_stone at uk.ibm.com Wed Apr 3 16:08:39 2013 From: chris_stone at uk.ibm.com (Chris Stone) Date: Wed, 3 Apr 2013 16:08:39 +0100 Subject: [gpfsug-discuss] AUTO: Chris Stone/UK/IBM is out of the office until 16/08/2004. (returning 11/04/2013) Message-ID: I am out of the office until 11/04/2013. In an emergency please contact my manager Aniket Patel on :+44 (0) 7736 017 418 Note: This is an automated response to your message "[gpfsug-discuss] mmbackup and management classes" sent on 03/04/2013 10:57:05. This is the only notification you will receive while this person is away. From ANDREWD at uk.ibm.com Wed Apr 3 16:10:26 2013 From: ANDREWD at uk.ibm.com (Andrew Downes1) Date: Wed, 3 Apr 2013 16:10:26 +0100 Subject: [gpfsug-discuss] AUTO: Andrew Downes is out of the office (returning 08/04/2013) Message-ID: I am out of the office until 08/04/2013. If anything is too urgent to wait for my return please contact Matt Ayres mailto:m_ayres at uk.ibm.com 44-7710-981527 In case of urgency, please contact our manager Dave Shave-Wall mailto:dave_shavewall at uk.ibm.com 44-7740-921623 Note: This is an automated response to your message "gpfsug-discuss Digest, Vol 16, Issue 1" sent on 03/04/2013 12:00:02. This is the only notification you will receive while this person is away. From ashish.thandavan at cs.ox.ac.uk Thu Apr 11 10:58:41 2013 From: ashish.thandavan at cs.ox.ac.uk (Ashish Thandavan) Date: Thu, 11 Apr 2013 10:58:41 +0100 Subject: [gpfsug-discuss] Register now: Spring GPFS User Group arranged In-Reply-To: References: Message-ID: <51668951.7040506@cs.ox.ac.uk> Dear Claire, I trust you are well! If there are any spaces left, could you please register me for the event? Thank you! Regards, Ash On 25/03/13 14:38, Claire Robson wrote: > > Dear All, > > The next meeting date is set for *Wednesday 24^th April* and will be > taking place at the fantastic Dolby Studios in London (Dolby Europe > Limited, 4--6 Soho Square, London W1D 3PZ). > > *Getting to Dolby Europe Limited, Soho Square, London* > > Leave the Tottenham Court Road tube station by the South Oxford Street > exit [Exit 1]. > > Turn left onto Oxford Street. > > After about 50m turn left into Soho Street. > > Turn right into Soho Square. > > 4-6 Soho Square is directly in front of you. > > Our tentative agenda is as follows: > > 10:30 Arrivals and refreshments > > 11:00 Introductions and committee updates > > Jez Tucker, Group Chair & Claire Robson, Group Secretary > > 11:05 GPFS OpenStack Integration > > Prasenhit Sarkar, IBM Almaden Research Labs > > GPFS FPO > > Dinesh Subhraveti, IBM Almaden Research Labs > > 11:45 SAMBA 4.0 & CTDB 2.0 > > Michael Adams, SAMBA Development Team > > 12:15 SAMBA & GPFS Integration > > Volker Lendecke, SAMBA Development Team > > 13:00 Lunch (Buffet provided) > > 14:00 GPFS Native RAID & LTFS > > Jim Roche, IBM > > 14:45 User Stories > > 15:45 Group discussion: Challenges, experiences and questions & > Committee matters > > Led by Jez Tucker, Group Chairperson > > 16:00 Close > > We will be starting at 11:00am and concluding at 4pm but some of the > speaker timings may alter slightly. I will be posting further details > on what the presentations cover over the coming week or so. > > We hope you can make it for what will be a really interesting day of > GPFS discussions. 
*Please register with me if you would like to > attend* -- registrations are based on a first come first served basis. > > Best regards, > > *Claire Robson* > > GPFS User Group Secreatry > > Tel: 0114 257 2200 > > Mob: 07508 033896 > > Fax: 0114 257 0022 > > Web: _www.gpfsug.org _ > > > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at gpfsug.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss -- ------------------------- Ashish Thandavan UNIX Support Computing Officer Department of Computer Science University of Oxford Wolfson Building Parks Road Oxford OX1 3QD Phone: 01865 610733 Email: ashish.thandavan at cs.ox.ac.uk -------------- next part -------------- An HTML attachment was scrubbed... URL: From orlando.richards at ed.ac.uk Fri Apr 12 13:37:52 2013 From: orlando.richards at ed.ac.uk (Orlando Richards) Date: Fri, 12 Apr 2013 13:37:52 +0100 Subject: [gpfsug-discuss] CTDB woes Message-ID: <51680020.4040509@ed.ac.uk> Hi folks, We've long been using CTDB and Samba for our NAS service, servicing ~500 users. We've been suffering from some problems with the CTDB performance over the last few weeks, likely triggered either by an upgrade of samba from 3.5 to 3.6 (and enabling of SMB2 as a result), or possibly by additional users coming on with a new workload. We run CTDB 1.0.114.4-1 (from sernet) and samba3-3.6.12-44 (again, from sernet). Before we roll back, we'd like to make sure we can't fix the problem and stick with Samba 3.6 (and we don't even know that a roll back would fix the issue). The symptoms are a complete freeze of the service for CIFS users for 10-60 seconds, and on the servers a corresponding spawning of large numbers of CTDB processes, which seem to be created in a "big bang", and then do what they do and exit in the subsequent 10-60 seconds. We also serve up NFS from the same ctdb-managed frontends, and GPFS from the cluster - and these are both fine throughout. This was happening 5-10 times per hour, not at exact intervals though. When we added a third node to the CTDB cluster, it "got worse", and when we dropped the CTDB cluster down to a single node and everything started behaving fine - which is where we are now. So, I've got a bunch of questions! - does anyone know why ctdb would be spawning these processes, and if there's anything we can do to stop it needing to do it? - has anyone done any more general performance / config optimisation of CTDB? And - more generally - does anyone else actually use ctdb/samba/gpfs on the scale of ~500 users or higher? If so - how do you find it? -- -- Dr Orlando Richards Information Services IT Infrastructure Division Unix Section Tel: 0131 650 4994 The University of Edinburgh is a charitable body, registered in Scotland, with registration number SC005336. From Tobias.Kuebler at sva.de Fri Apr 12 14:03:58 2013 From: Tobias.Kuebler at sva.de (Tobias.Kuebler at sva.de) Date: Fri, 12 Apr 2013 15:03:58 +0200 Subject: [gpfsug-discuss] =?iso-8859-1?q?AUTO=3A_Tobias_Kuebler_ist_au=DFe?= =?iso-8859-1?q?r_Haus_=28R=FCckkehr_am_Mo=2C_04/15/2013=29?= Message-ID: Ich bin von Do, 04/11/2013 bis Mo, 04/15/2013 abwesend. Vielen Dank f?r Ihre Nachricht. Ankommende E-Mails werden w?hrend meiner Abwesenheit nicht weitergeleitet, ich versuche Sie jedoch m?glichst rasch nach meiner R?ckkehr zu beantworten. In dringenden F?llen wenden Sie sich bitte an Ihren zust?ndigen Vertriebsbeauftragten. 
Note: This is an automated response to your message "[gpfsug-discuss] CTDB woes" sent on 12.04.2013 14:37:52. This is the only notification you will receive while this person is away. -------------- next part -------------- An HTML attachment was scrubbed... URL: From orlando.richards at ed.ac.uk Fri Apr 12 16:43:44 2013 From: orlando.richards at ed.ac.uk (Orlando Richards) Date: Fri, 12 Apr 2013 16:43:44 +0100 Subject: [gpfsug-discuss] CTDB woes In-Reply-To: References: <51680020.4040509@ed.ac.uk> Message-ID: <51682BB0.7010507@ed.ac.uk> On 12/04/13 15:43, Bob Cregan wrote: > Hi Orlando, > We use ctdb/samba for CIFS, and CNFS for NFS > (GPFS version 3.4.0-13). Current versions are > > ctdb - 1.0.99 > samba 3.5.15 > > Both compiled from source. We have about 300+ users normally. > We have suspicions that 3.6 has put additional "chatter" into the ctdb database stream, which has pushed us over the edge. Barry Evans has found that the clustered locking databases, in particular, prove to be a scalability/usability limit for ctdb. > We have had no issues with this setup apart from CNFS which had 2 or 3 > bad moments over the last year. These have gone away since we have > fixed a bug with our 10G NIC drivers (emulex cards, kernel module > be2net) which lead to occasional dropped packets for jumbo frames. There > have been no issues with samba/ctdb > > The only comment I can make is that during initial investigations into > an upgrade of samba to 3.6.x we discovered that the 3.6 code would not > compile against ctdb 1.0.99 (compilation requires the ctdb source) > with error messages like: > > configure: checking whether cluster support is available > checking for ctdb.h... yes > checking for ctdb_private.h... yes > checking for CTDB_CONTROL_TRANS3_COMMIT declaration... yes > checking for CTDB_CONTROL_SCHEDULE_FOR_DELETION declaration... no > configure: error: "cluster support not available: support for > SCHEDULE_FOR_DELETION control missing" > > > What occurs to me is that this message seems to indicate that it is > possible to run a ctdb version that is incompatible with samba 3.6. > That would imply that an upgrade to a higher version of ctdb might > help, of course it might not and make backing out harder. Certainly 1.0.114 builds fine - I've not tried 2.0, I'm too scared! The versioning in CTDB has proved hard for me to fathom... > > A compile against ctdb 2.0 works fine. We will soon be running in this > upgrade, but I'm waiting to see what the samba people say at the UG > meeting first! > It has to be said - the timing is good! Cheers, Orlando > > Thanks > > Bob > > > On 12 April 2013 13:37, Orlando Richards > wrote: > > Hi folks, > > We've long been using CTDB and Samba for our NAS service, servicing > ~500 users. We've been suffering from some problems with the CTDB > performance over the last few weeks, likely triggered either by an > upgrade of samba from 3.5 to 3.6 (and enabling of SMB2 as a result), > or possibly by additional users coming on with a new workload. > > We run CTDB 1.0.114.4-1 (from sernet) and samba3-3.6.12-44 (again, > from sernet). Before we roll back, we'd like to make sure we can't > fix the problem and stick with Samba 3.6 (and we don't even know > that a roll back would fix the issue).
> > The symptoms are a complete freeze of the service for CIFS users for > 10-60 seconds, and on the servers a corresponding spawning of large > numbers of CTDB processes, which seem to be created in a "big bang", > and then do what they do and exit in the subsequent 10-60 seconds. > > We also serve up NFS from the same ctdb-managed frontends, and GPFS > from the cluster - and these are both fine throughout. > > This was happening 5-10 times per hour, not at exact intervals > though. When we added a third node to the CTDB cluster, it "got > worse", and when we dropped the CTDB cluster down to a single node > and everything started behaving fine - which is where we are now. > > So, I've got a bunch of questions! > > - does anyone know why ctdb would be spawning these processes, and > if there's anything we can do to stop it needing to do it? > - has anyone done any more general performance / config > optimisation of CTDB? > > And - more generally - does anyone else actually use ctdb/samba/gpfs > on the scale of ~500 users or higher? If so - how do you find it? > > > -- > -- > Dr Orlando Richards > Information Services > IT Infrastructure Division > Unix Section > Tel: 0131 650 4994 > > The University of Edinburgh is a charitable body, registered in > Scotland, with registration number SC005336. > _________________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at gpfsug.org > http://gpfsug.org/mailman/__listinfo/gpfsug-discuss > > > > > > -- > > Bob Cregan > > Senior Storage Systems Administrator > > ACRC > > Bristol University > > Tel: +44 (0) 117 331 4406 > > skype: bobcregan > > Mobile: +44 (0) 7712388129 > -- -- Dr Orlando Richards Information Services IT Infrastructure Division Unix Section Tel: 0131 650 4994 The University of Edinburgh is a charitable body, registered in Scotland, with registration number SC005336. From viccornell at gmail.com Fri Apr 12 19:44:16 2013 From: viccornell at gmail.com (Vic Cornell) Date: Fri, 12 Apr 2013 19:44:16 +0100 Subject: [gpfsug-discuss] CTDB woes In-Reply-To: <51682BB0.7010507@ed.ac.uk> References: <51680020.4040509@ed.ac.uk> <51682BB0.7010507@ed.ac.uk> Message-ID: <271DA6EE-D64D-4DBC-9DFE-4335E55102D4@gmail.com> Have you tried putting the ctdb files onto a separate gpfs filesystem? Vic Cornell viccornell at gmail.com On 12 Apr 2013, at 16:43, Orlando Richards wrote: > On 12/04/13 15:43, Bob Cregan wrote: >> Hi Orlando, >> We use ctdb/samba for CIFS, and CNFS for NFS >> (GPFS version 3.4.0-13) . Current versions are >> >> ctdb - 1.0.99 >> samba 3.5.15 >> >> Both compiled from source. We have about 300+ users normally. >> > > We have suspicions that 3.6 has put additional "chatter" into the ctdb database stream, which has pushed us over the edge. Barry Evans has found that the clustered locking databases, in particular, prove to be a scalability/usability limit for ctdb. > > >> We have had no issues with this setup apart from CNFS which had 2 or 3 >> bad moments over the last year . These have gone away since we have >> fixed a bug with our 10G NIC drivers (emulex cards , kernel module >> be2net) which lead to occasional dropped packets for jumbo frames. 
There >> have been no issues with samba/ctdb >> >> The only comment I can make is that during initial investigations into >> an upgrade of samba to 3.6.x we discovered that the 3.6 code would not >> compile against ctdb 1.0.99 (compilation requires tthe ctdb source ) >> with error messages like: >> >> configure: checking whether cluster support is available >> checking for ctdb.h... yes >> checking for ctdb_private.h... yes >> checking for CTDB_CONTROL_TRANS3_COMMIT declaration... yes >> checking for CTDB_CONTROL_SCHEDULE_FOR_DELETION declaration... no >> configure: error: "cluster support not available: support for >> SCHEDULE_FOR_DELETION control missing" >> >> >> What occurs to me is that this message seems to indicate that it is >> possible to run a ctdb version that is incompatible with samba 3.6. >> That would imply that an upgrade to a higher version of ctdb might >> help, of course it might not and make backing out harder. > > Certainly 1.0.114 builds fine - I've not tried 2.0, I'm too scared! The versioning in CTDB has proved hard for me to fathom... > >> >> A compile against ctdb 2.0 works fine. We will soon be running in this >> upgrade, but I'm waiting to see what the samba people say at the UG >> meeting first! >> > > It has to be said - the timing is good! > Cheers, > Orlando > >> >> Thanks >> >> Bob >> >> >> On 12 April 2013 13:37, Orlando Richards > > wrote: >> >> Hi folks, ac >> >> We've long been using CTDB and Samba for our NAS service, servicing >> ~500 users. We've been suffering from some problems with the CTDB >> performance over the last few weeks, likely triggered either by an >> upgrade of samba from 3.5 to 3.6 (and enabling of SMB2 as a result), >> or possibly by additional users coming on with a new workload. >> >> We run CTDB 1.0.114.4-1 (from sernet) and samba3-3.6.12-44 (again, >> from sernet). Before we roll back, we'd like to make sure we can't >> fix the problem and stick with Samba 3.6 (and we don't even know >> that a roll back would fix the issue). >> >> The symptoms are a complete freeze of the service for CIFS users for >> 10-60 seconds, and on the servers a corresponding spawning of large >> numbers of CTDB processes, which seem to be created in a "big bang", >> and then do what they do and exit in the subsequent 10-60 seconds. >> >> We also serve up NFS from the same ctdb-managed frontends, and GPFS >> from the cluster - and these are both fine throughout. >> >> This was happening 5-10 times per hour, not at exact intervals >> though. When we added a third node to the CTDB cluster, it "got >> worse", and when we dropped the CTDB cluster down to a single node >> and everything started behaving fine - which is where we are now. >> >> So, I've got a bunch of questions! >> >> - does anyone know why ctdb would be spawning these processes, and >> if there's anything we can do to stop it needing to do it? >> - has anyone done any more general performance / config >> optimisation of CTDB? >> >> And - more generally - does anyone else actually use ctdb/samba/gpfs >> on the scale of ~500 users or higher? If so - how do you find it? >> >> >> -- >> -- >> Dr Orlando Richards >> Information Services >> IT Infrastructure Division >> Unix Section >> Tel: 0131 650 4994 >> >> The University of Edinburgh is a charitable body, registered in >> Scotland, with registration number SC005336. 
>> _________________________________________________ >> gpfsug-discuss mailing list >> gpfsug-discuss at gpfsug.org >> http://gpfsug.org/mailman/__listinfo/gpfsug-discuss >> >> >> >> >> >> -- >> >> Bob Cregan >> >> Senior Storage Systems Administrator >> >> ACRC >> >> Bristol University >> >> Tel: +44 (0) 117 331 4406 >> >> skype: bobcregan >> >> Mobile: +44 (0) 7712388129 >> > > > -- > -- > Dr Orlando Richards > Information Services > IT Infrastructure Division > Unix Section > Tel: 0131 650 4994 > > The University of Edinburgh is a charitable body, registered in Scotland, with registration number SC005336. > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at gpfsug.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss From orlando.richards at ed.ac.uk Mon Apr 15 10:54:39 2013 From: orlando.richards at ed.ac.uk (Orlando Richards) Date: Mon, 15 Apr 2013 10:54:39 +0100 Subject: [gpfsug-discuss] CTDB woes In-Reply-To: <271DA6EE-D64D-4DBC-9DFE-4335E55102D4@gmail.com> References: <51680020.4040509@ed.ac.uk> <51682BB0.7010507@ed.ac.uk> <271DA6EE-D64D-4DBC-9DFE-4335E55102D4@gmail.com> Message-ID: <516BCE5F.8010309@ed.ac.uk> On 12/04/13 19:44, Vic Cornell wrote: > Have you tried putting the ctdb files onto a separate gpfs filesystem? No - but considered it. However, the only "live" CTDB file that sits on GPFS is the reclock file, which - I think - is only used as the heartbeat between nodes and for the recovery process. Now, there's mileage in insulating that, certainly, but I don't think that's what we're suffering from here. On a positive note - we took the steps this morning to re-initialise the ctdb databases from current data, and things seem to be stable today so far. Basically - shut down ctdb on all but one node. On all but that node, do: mv /var/ctdb/ /var/ctdb.save.date then start up ctdb on those nodes. Once they've come up, shut down ctdb on the last node, move /var/ctdb out the way, and restart. That brings them all up with freshly compacted databases. Also, from the samba-technical mailing list came the advice to use a more recent ctdb - specifically, 1.2.61. I've got that built and ready to go (and a rebuilt samba compiled against it too), but if things prove to be stable after today's compacting, then we will probably leave it at that and not deploy this. Interesting that 2.0 wasn't suggested for "stable", and that the current "dev" version is 2.1. For reference, here's the start of the thread: https://lists.samba.org/archive/samba-technical/2013-April/091525.html -- Orlando. > > On 12 Apr 2013, at 16:43, Orlando Richards wrote: > >> On 12/04/13 15:43, Bob Cregan wrote: >>> Hi Orlando, >>> We use ctdb/samba for CIFS, and CNFS for NFS >>> (GPFS version 3.4.0-13) . Current versions are >>> >>> ctdb - 1.0.99 >>> samba 3.5.15 >>> >>> Both compiled from source. We have about 300+ users normally. >>> >> >> We have suspicions that 3.6 has put additional "chatter" into the ctdb database stream, which has pushed us over the edge. Barry Evans has found that the clustered locking databases, in particular, prove to be a scalability/usability limit for ctdb. >> >> >>> We have had no issues with this setup apart from CNFS which had 2 or 3 >>> bad moments over the last year . These have gone away since we have >>> fixed a bug with our 10G NIC drivers (emulex cards , kernel module >>> be2net) which lead to occasional dropped packets for jumbo frames. 
There >>> have been no issues with samba/ctdb >>> >>> The only comment I can make is that during initial investigations into >>> an upgrade of samba to 3.6.x we discovered that the 3.6 code would not >>> compile against ctdb 1.0.99 (compilation requires tthe ctdb source ) >>> with error messages like: >>> >>> configure: checking whether cluster support is available >>> checking for ctdb.h... yes >>> checking for ctdb_private.h... yes >>> checking for CTDB_CONTROL_TRANS3_COMMIT declaration... yes >>> checking for CTDB_CONTROL_SCHEDULE_FOR_DELETION declaration... no >>> configure: error: "cluster support not available: support for >>> SCHEDULE_FOR_DELETION control missing" >>> >>> >>> What occurs to me is that this message seems to indicate that it is >>> possible to run a ctdb version that is incompatible with samba 3.6. >>> That would imply that an upgrade to a higher version of ctdb might >>> help, of course it might not and make backing out harder. >> >> Certainly 1.0.114 builds fine - I've not tried 2.0, I'm too scared! The versioning in CTDB has proved hard for me to fathom... >> >>> >>> A compile against ctdb 2.0 works fine. We will soon be running in this >>> upgrade, but I'm waiting to see what the samba people say at the UG >>> meeting first! >>> >> >> It has to be said - the timing is good! >> Cheers, >> Orlando >> >>> >>> Thanks >>> >>> Bob >>> >>> >>> On 12 April 2013 13:37, Orlando Richards >> > wrote: >>> >>> Hi folks, ac >>> >>> We've long been using CTDB and Samba for our NAS service, servicing >>> ~500 users. We've been suffering from some problems with the CTDB >>> performance over the last few weeks, likely triggered either by an >>> upgrade of samba from 3.5 to 3.6 (and enabling of SMB2 as a result), >>> or possibly by additional users coming on with a new workload. >>> >>> We run CTDB 1.0.114.4-1 (from sernet) and samba3-3.6.12-44 (again, >>> from sernet). Before we roll back, we'd like to make sure we can't >>> fix the problem and stick with Samba 3.6 (and we don't even know >>> that a roll back would fix the issue). >>> >>> The symptoms are a complete freeze of the service for CIFS users for >>> 10-60 seconds, and on the servers a corresponding spawning of large >>> numbers of CTDB processes, which seem to be created in a "big bang", >>> and then do what they do and exit in the subsequent 10-60 seconds. >>> >>> We also serve up NFS from the same ctdb-managed frontends, and GPFS >>> from the cluster - and these are both fine throughout. >>> >>> This was happening 5-10 times per hour, not at exact intervals >>> though. When we added a third node to the CTDB cluster, it "got >>> worse", and when we dropped the CTDB cluster down to a single node >>> and everything started behaving fine - which is where we are now. >>> >>> So, I've got a bunch of questions! >>> >>> - does anyone know why ctdb would be spawning these processes, and >>> if there's anything we can do to stop it needing to do it? >>> - has anyone done any more general performance / config >>> optimisation of CTDB? >>> >>> And - more generally - does anyone else actually use ctdb/samba/gpfs >>> on the scale of ~500 users or higher? If so - how do you find it? >>> >>> >>> -- >>> -- >>> Dr Orlando Richards >>> Information Services >>> IT Infrastructure Division >>> Unix Section >>> Tel: 0131 650 4994 >>> >>> The University of Edinburgh is a charitable body, registered in >>> Scotland, with registration number SC005336. 
>>> _________________________________________________ >>> gpfsug-discuss mailing list >>> gpfsug-discuss at gpfsug.org >>> http://gpfsug.org/mailman/listinfo/gpfsug-discuss >>> >>> -- >>> Bob Cregan >>> Senior Storage Systems Administrator >>> ACRC >>> Bristol University >>> Tel: +44 (0) 117 331 4406 >>> skype: bobcregan >>> Mobile: +44 (0) 7712388129 >> >> -- >> -- >> Dr Orlando Richards >> Information Services >> IT Infrastructure Division >> Unix Section >> Tel: 0131 650 4994 >> >> The University of Edinburgh is a charitable body, registered in Scotland, with registration number SC005336. >> _______________________________________________ >> gpfsug-discuss mailing list >> gpfsug-discuss at gpfsug.org >> http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at gpfsug.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > -- -- Dr Orlando Richards Information Services IT Infrastructure Division Unix Section Tel: 0131 650 4994 The University of Edinburgh is a charitable body, registered in Scotland, with registration number SC005336. From crobson at ocf.co.uk Mon Apr 15 15:04:38 2013 From: crobson at ocf.co.uk (Claire Robson) Date: Mon, 15 Apr 2013 15:04:38 +0100 Subject: [gpfsug-discuss] Latest agenda and places still available Message-ID: Dear All, Thank you to those who have expressed an interest in next Wednesday's GPFS user group meeting in London and registered a place. There are a few places still available, please register with me if you would like to attend. This is the latest agenda for the day:
10:30 Arrivals and refreshments
11:00 Introductions and committee updates
Jez Tucker, Group Chair & Claire Robson, Group Secretary
11:05 GPFS FPO
Dinesh Subhraveti, IBM Almaden Research Labs
12:00 SAMBA 4.0 & CTDB 2.0
Michael Adams, SAMBA Development Team
13:00 Lunch (Buffet provided)
13:45 GPFS OpenStack Integration
Dinesh Subhraveti, IBM Almaden Research Labs
14:15 SAMBA & GPFS Integration
Volker Lendecke, SAMBA Development Team
15:15 Refreshments break
15:30 GPFS Native RAID & LTFS
Jim Roche, IBM
16:00 Group discussion: Questions & Committee matters
Led by Jez Tucker, Group Chairperson
16:05 Close
I look forward to seeing many of you next week. Kind regards, Claire Robson GPFS user group Secretary Tel: 0114 257 2200 Mob: 07508 033896 Web: www.gpfsug.org -------------- next part -------------- An HTML attachment was scrubbed... URL: From AHMADYH at sa.ibm.com Tue Apr 16 13:08:58 2013 From: AHMADYH at sa.ibm.com (Ahmad Y Hussein) Date: Tue, 16 Apr 2013 16:08:58 +0400 Subject: [gpfsug-discuss] AUTO: Ahmad Y Hussein is out of the office (returning 04/29/2013) Message-ID: I am out of the office until 04/29/2013. Dear Sender; I am in a customer engagement with extremely limited email access, I will respond to your emails as soon as I can. For urgent cases please call me on my mobile (+966542001289). Thank you for understanding. Regards; Ahmad Y Hussein Note: This is an automated response to your message "gpfsug-discuss Digest, Vol 16, Issue 6" sent on 16/04/2013 15:00:02. This is the only notification you will receive while this person is away.
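A minimal shell sketch of the database re-initialisation Orlando describes in his 15 April message follows. The "service ctdb" init script, the /var/ctdb path and the node names are assumptions based on the sernet packaging mentioned in the thread rather than anything confirmed here, so treat it as an outline to adapt and test, not a drop-in script:

  #!/bin/bash
  # Sketch of the ctdb database re-initialisation described in the thread.
  # Assumptions (not confirmed in the thread): sernet packages providing
  # "service ctdb", databases under /var/ctdb, passwordless ssh between nodes.
  KEEP_NODE="nas1"              # node that stays up throughout (hypothetical name)
  OTHER_NODES="nas2 nas3"       # the remaining ctdb nodes (hypothetical names)
  STAMP=$(date +%Y%m%d)

  # 1. Stop ctdb everywhere except KEEP_NODE and move the old databases aside.
  for n in $OTHER_NODES; do
      ssh "$n" "service ctdb stop && mv /var/ctdb /var/ctdb.save.$STAMP"
  done

  # 2. Start those nodes again so they rejoin the cluster with rebuilt databases.
  for n in $OTHER_NODES; do
      ssh "$n" "service ctdb start"
  done

  # 3. Once they are healthy (check with "ctdb status"), repeat on KEEP_NODE.
  ssh "$KEEP_NODE" "service ctdb stop && mv /var/ctdb /var/ctdb.save.$STAMP && service ctdb start"

The intent is the same as the manual steps above: each node's local databases are moved aside and rebuilt while one node keeps serving, which is what produced the freshly compacted databases Orlando reports.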
From orlando.richards at ed.ac.uk Wed Apr 17 11:30:32 2013 From: orlando.richards at ed.ac.uk (Orlando Richards) Date: Wed, 17 Apr 2013 11:30:32 +0100 Subject: [gpfsug-discuss] CTDB woes In-Reply-To: <516BCE5F.8010309@ed.ac.uk> References: <51680020.4040509@ed.ac.uk> <51682BB0.7010507@ed.ac.uk> <271DA6EE-D64D-4DBC-9DFE-4335E55102D4@gmail.com> <516BCE5F.8010309@ed.ac.uk> Message-ID: <516E79C8.8090603@ed.ac.uk> Hi All - an update to this, After re-initialising the databases on Monday, things did seem to be running better, but ultimately we got back to suffering from spikes in ctdb processes and corresponding "pauses" in service. We fell back to a single node again for Tuesday (and things were stable once again), and this morning rolled out CTDB 1.2.61 (plus a 3.6.12 samba which was rebuilt against CTDB 1.2.61 headers). Things seem to be stable for now - more so than on Monday. For the record - one metric I'm watching is the number of ctdb processes running (this would spike to > 1000 under the failure conditions). It's currently sitting consistently at 3 processes, with occasional blips of 5-7 processes. -- Orlando On 15/04/13 10:54, Orlando Richards wrote: > On 12/04/13 19:44, Vic Cornell wrote: >> Have you tried putting the ctdb files onto a separate gpfs filesystem? > > No - but considered it. However, the only "live" CTDB file that sits on > GPFS is the reclock file, which - I think - is only used as the > heartbeat between nodes and for the recovery process. Now, there's > mileage in insulating that, certainly, but I don't think that's what > we're suffering from here. > > On a positive note - we took the steps this morning to re-initialise the > ctdb databases from current data, and things seem to be stable today so > far. > > Basically - shut down ctdb on all but one node. On all but that node, do: > mv /var/ctdb/ /var/ctdb.save.date > > then start up ctdb on those nodes. Once they've come up, shut down ctdb > on the last node, move /var/ctdb out the way, and restart. That brings > them all up with freshly compacted databases. > > Also, from the samba-technical mailing list came the advice to use a > more recent ctdb - specifically, 1.2.61. I've got that built and ready > to go (and a rebuilt samba compiled against it too), but if things prove > to be stable after today's compacting, then we will probably leave it at > that and not deploy this. > > Interesting that 2.0 wasn't suggested for "stable", and that the current > "dev" version is 2.1. > > For reference, here's the start of the thread: > https://lists.samba.org/archive/samba-technical/2013-April/091525.html > > -- > Orlando. > > > >> >> On 12 Apr 2013, at 16:43, Orlando Richards >> wrote: >> >>> On 12/04/13 15:43, Bob Cregan wrote: >>>> Hi Orlando, >>>> We use ctdb/samba for CIFS, and CNFS for NFS >>>> (GPFS version 3.4.0-13) . Current versions are >>>> >>>> ctdb - 1.0.99 >>>> samba 3.5.15 >>>> >>>> Both compiled from source. We have about 300+ users normally. >>>> >>> >>> We have suspicions that 3.6 has put additional "chatter" into the >>> ctdb database stream, which has pushed us over the edge. Barry Evans >>> has found that the clustered locking databases, in particular, prove >>> to be a scalability/usability limit for ctdb. >>> >>> >>>> We have had no issues with this setup apart from CNFS which had 2 or 3 >>>> bad moments over the last year . These have gone away since we have >>>> fixed a bug with our 10G NIC drivers (emulex cards , kernel module >>>> be2net) which lead to occasional dropped packets for jumbo frames. 
>>>> There >>>> have been no issues with samba/ctdb >>>> >>>> The only comment I can make is that during initial investigations into >>>> an upgrade of samba to 3.6.x we discovered that the 3.6 code would not >>>> compile against ctdb 1.0.99 (compilation requires tthe ctdb source ) >>>> with error messages like: >>>> >>>> configure: checking whether cluster support is available >>>> checking for ctdb.h... yes >>>> checking for ctdb_private.h... yes >>>> checking for CTDB_CONTROL_TRANS3_COMMIT declaration... yes >>>> checking for CTDB_CONTROL_SCHEDULE_FOR_DELETION declaration... no >>>> configure: error: "cluster support not available: support for >>>> SCHEDULE_FOR_DELETION control missing" >>>> >>>> >>>> What occurs to me is that this message seems to indicate that it is >>>> possible to run a ctdb version that is incompatible with samba 3.6. >>>> That would imply that an upgrade to a higher version of ctdb might >>>> help, of course it might not and make backing out harder. >>> >>> Certainly 1.0.114 builds fine - I've not tried 2.0, I'm too scared! >>> The versioning in CTDB has proved hard for me to fathom... >>> >>>> >>>> A compile against ctdb 2.0 works fine. We will soon be running in this >>>> upgrade, but I'm waiting to see what the samba people say at the UG >>>> meeting first! >>>> >>> >>> It has to be said - the timing is good! >>> Cheers, >>> Orlando >>> >>>> >>>> Thanks >>>> >>>> Bob >>>> >>>> >>>> On 12 April 2013 13:37, Orlando Richards >>> > wrote: >>>> >>>> Hi folks, ac >>>> >>>> We've long been using CTDB and Samba for our NAS service, servicing >>>> ~500 users. We've been suffering from some problems with the CTDB >>>> performance over the last few weeks, likely triggered either by an >>>> upgrade of samba from 3.5 to 3.6 (and enabling of SMB2 as a >>>> result), >>>> or possibly by additional users coming on with a new workload. >>>> >>>> We run CTDB 1.0.114.4-1 (from sernet) and samba3-3.6.12-44 (again, >>>> from sernet). Before we roll back, we'd like to make sure we can't >>>> fix the problem and stick with Samba 3.6 (and we don't even know >>>> that a roll back would fix the issue). >>>> >>>> The symptoms are a complete freeze of the service for CIFS users >>>> for >>>> 10-60 seconds, and on the servers a corresponding spawning of large >>>> numbers of CTDB processes, which seem to be created in a "big >>>> bang", >>>> and then do what they do and exit in the subsequent 10-60 seconds. >>>> >>>> We also serve up NFS from the same ctdb-managed frontends, and GPFS >>>> from the cluster - and these are both fine throughout. >>>> >>>> This was happening 5-10 times per hour, not at exact intervals >>>> though. When we added a third node to the CTDB cluster, it "got >>>> worse", and when we dropped the CTDB cluster down to a single node >>>> and everything started behaving fine - which is where we are now. >>>> >>>> So, I've got a bunch of questions! >>>> >>>> - does anyone know why ctdb would be spawning these processes, >>>> and >>>> if there's anything we can do to stop it needing to do it? >>>> - has anyone done any more general performance / config >>>> optimisation of CTDB? >>>> >>>> And - more generally - does anyone else actually use >>>> ctdb/samba/gpfs >>>> on the scale of ~500 users or higher? If so - how do you find it? 
>>>> >>>> >>>> -- >>>> -- >>>> Dr Orlando Richards >>>> Information Services >>>> IT Infrastructure Division >>>> Unix Section >>>> Tel: 0131 650 4994 >>>> >>>> The University of Edinburgh is a charitable body, registered in >>>> Scotland, with registration number SC005336. >>>> _________________________________________________ >>>> gpfsug-discuss mailing list >>>> gpfsug-discuss at gpfsug.org >>>> http://gpfsug.org/mailman/__listinfo/gpfsug-discuss >>>> >>>> >>>> >>>> >>>> >>>> -- >>>> >>>> Bob Cregan >>>> >>>> Senior Storage Systems Administrator >>>> >>>> ACRC >>>> >>>> Bristol University >>>> >>>> Tel: +44 (0) 117 331 4406 >>>> >>>> skype: bobcregan >>>> >>>> Mobile: +44 (0) 7712388129 >>>> >>> >>> >>> -- >>> -- >>> Dr Orlando Richards >>> Information Services >>> IT Infrastructure Division >>> Unix Section >>> Tel: 0131 650 4994 >>> >>> The University of Edinburgh is a charitable body, registered in >>> Scotland, with registration number SC005336. >>> _______________________________________________ >>> gpfsug-discuss mailing list >>> gpfsug-discuss at gpfsug.org >>> http://gpfsug.org/mailman/listinfo/gpfsug-discuss >> >> _______________________________________________ >> gpfsug-discuss mailing list >> gpfsug-discuss at gpfsug.org >> http://gpfsug.org/mailman/listinfo/gpfsug-discuss >> > > -- -- Dr Orlando Richards Information Services IT Infrastructure Division Unix Section Tel: 0131 650 4994 The University of Edinburgh is a charitable body, registered in Scotland, with registration number SC005336. From orlando.richards at ed.ac.uk Mon Apr 22 15:52:55 2013 From: orlando.richards at ed.ac.uk (Orlando Richards) Date: Mon, 22 Apr 2013 15:52:55 +0100 Subject: [gpfsug-discuss] CTDB woes In-Reply-To: <516E79C8.8090603@ed.ac.uk> References: <51680020.4040509@ed.ac.uk> <51682BB0.7010507@ed.ac.uk> <271DA6EE-D64D-4DBC-9DFE-4335E55102D4@gmail.com> <516BCE5F.8010309@ed.ac.uk> <516E79C8.8090603@ed.ac.uk> Message-ID: <51754EC7.8000600@ed.ac.uk> On 17/04/13 11:30, Orlando Richards wrote: > Hi All - an update to this, > > After re-initialising the databases on Monday, things did seem to be > running better, but ultimately we got back to suffering from spikes in > ctdb processes and corresponding "pauses" in service. We fell back to a > single node again for Tuesday (and things were stable once again), and > this morning rolled out CTDB 1.2.61 (plus a 3.6.12 samba which was > rebuilt against CTDB 1.2.61 headers). > > Things seem to be stable for now - more so than on Monday. > > For the record - one metric I'm watching is the number of ctdb processes > running (this would spike to > 1000 under the failure conditions). It's > currently sitting consistently at 3 processes, with occasional blips of > 5-7 processes. > Hi all, Looks like things have been running fine since we upgraded ctdb last Wednesday, so I think it's safe to say that we've found a fix for our problem in CTDB 1.2.61. Thanks for all the input! If anyone wants more info, feel free to get in touch. -- Orlando > -- > Orlando > > > > > > On 15/04/13 10:54, Orlando Richards wrote: >> On 12/04/13 19:44, Vic Cornell wrote: >>> Have you tried putting the ctdb files onto a separate gpfs filesystem? >> >> No - but considered it. However, the only "live" CTDB file that sits on >> GPFS is the reclock file, which - I think - is only used as the >> heartbeat between nodes and for the recovery process. Now, there's >> mileage in insulating that, certainly, but I don't think that's what >> we're suffering from here. 
>> >> On a positive note - we took the steps this morning to re-initialise the >> ctdb databases from current data, and things seem to be stable today so >> far. >> >> Basically - shut down ctdb on all but one node. On all but that node, do: >> mv /var/ctdb/ /var/ctdb.save.date >> >> then start up ctdb on those nodes. Once they've come up, shut down ctdb >> on the last node, move /var/ctdb out the way, and restart. That brings >> them all up with freshly compacted databases. >> >> Also, from the samba-technical mailing list came the advice to use a >> more recent ctdb - specifically, 1.2.61. I've got that built and ready >> to go (and a rebuilt samba compiled against it too), but if things prove >> to be stable after today's compacting, then we will probably leave it at >> that and not deploy this. >> >> Interesting that 2.0 wasn't suggested for "stable", and that the current >> "dev" version is 2.1. >> >> For reference, here's the start of the thread: >> https://lists.samba.org/archive/samba-technical/2013-April/091525.html >> >> -- >> Orlando. >> >> >> >>> >>> On 12 Apr 2013, at 16:43, Orlando Richards >>> wrote: >>> >>>> On 12/04/13 15:43, Bob Cregan wrote: >>>>> Hi Orlando, >>>>> We use ctdb/samba for CIFS, and CNFS for NFS >>>>> (GPFS version 3.4.0-13) . Current versions are >>>>> >>>>> ctdb - 1.0.99 >>>>> samba 3.5.15 >>>>> >>>>> Both compiled from source. We have about 300+ users normally. >>>>> >>>> >>>> We have suspicions that 3.6 has put additional "chatter" into the >>>> ctdb database stream, which has pushed us over the edge. Barry Evans >>>> has found that the clustered locking databases, in particular, prove >>>> to be a scalability/usability limit for ctdb. >>>> >>>> >>>>> We have had no issues with this setup apart from CNFS which had 2 or 3 >>>>> bad moments over the last year . These have gone away since we have >>>>> fixed a bug with our 10G NIC drivers (emulex cards , kernel module >>>>> be2net) which lead to occasional dropped packets for jumbo frames. >>>>> There >>>>> have been no issues with samba/ctdb >>>>> >>>>> The only comment I can make is that during initial investigations into >>>>> an upgrade of samba to 3.6.x we discovered that the 3.6 code would not >>>>> compile against ctdb 1.0.99 (compilation requires tthe ctdb source ) >>>>> with error messages like: >>>>> >>>>> configure: checking whether cluster support is available >>>>> checking for ctdb.h... yes >>>>> checking for ctdb_private.h... yes >>>>> checking for CTDB_CONTROL_TRANS3_COMMIT declaration... yes >>>>> checking for CTDB_CONTROL_SCHEDULE_FOR_DELETION declaration... no >>>>> configure: error: "cluster support not available: support for >>>>> SCHEDULE_FOR_DELETION control missing" >>>>> >>>>> >>>>> What occurs to me is that this message seems to indicate that it is >>>>> possible to run a ctdb version that is incompatible with samba 3.6. >>>>> That would imply that an upgrade to a higher version of ctdb might >>>>> help, of course it might not and make backing out harder. >>>> >>>> Certainly 1.0.114 builds fine - I've not tried 2.0, I'm too scared! >>>> The versioning in CTDB has proved hard for me to fathom... >>>> >>>>> >>>>> A compile against ctdb 2.0 works fine. We will soon be running in this >>>>> upgrade, but I'm waiting to see what the samba people say at the UG >>>>> meeting first! >>>>> >>>> >>>> It has to be said - the timing is good! 
>>>> Cheers, >>>> Orlando >>>> >>>>> >>>>> Thanks >>>>> >>>>> Bob >>>>> >>>>> >>>>> On 12 April 2013 13:37, Orlando Richards >>>> > wrote: >>>>> >>>>> Hi folks, ac >>>>> >>>>> We've long been using CTDB and Samba for our NAS service, >>>>> servicing >>>>> ~500 users. We've been suffering from some problems with the CTDB >>>>> performance over the last few weeks, likely triggered either by an >>>>> upgrade of samba from 3.5 to 3.6 (and enabling of SMB2 as a >>>>> result), >>>>> or possibly by additional users coming on with a new workload. >>>>> >>>>> We run CTDB 1.0.114.4-1 (from sernet) and samba3-3.6.12-44 (again, >>>>> from sernet). Before we roll back, we'd like to make sure we can't >>>>> fix the problem and stick with Samba 3.6 (and we don't even know >>>>> that a roll back would fix the issue). >>>>> >>>>> The symptoms are a complete freeze of the service for CIFS users >>>>> for >>>>> 10-60 seconds, and on the servers a corresponding spawning of >>>>> large >>>>> numbers of CTDB processes, which seem to be created in a "big >>>>> bang", >>>>> and then do what they do and exit in the subsequent 10-60 seconds. >>>>> >>>>> We also serve up NFS from the same ctdb-managed frontends, and >>>>> GPFS >>>>> from the cluster - and these are both fine throughout. >>>>> >>>>> This was happening 5-10 times per hour, not at exact intervals >>>>> though. When we added a third node to the CTDB cluster, it "got >>>>> worse", and when we dropped the CTDB cluster down to a single node >>>>> and everything started behaving fine - which is where we are now. >>>>> >>>>> So, I've got a bunch of questions! >>>>> >>>>> - does anyone know why ctdb would be spawning these processes, >>>>> and >>>>> if there's anything we can do to stop it needing to do it? >>>>> - has anyone done any more general performance / config >>>>> optimisation of CTDB? >>>>> >>>>> And - more generally - does anyone else actually use >>>>> ctdb/samba/gpfs >>>>> on the scale of ~500 users or higher? If so - how do you find it? >>>>> >>>>> >>>>> -- >>>>> -- >>>>> Dr Orlando Richards >>>>> Information Services >>>>> IT Infrastructure Division >>>>> Unix Section >>>>> Tel: 0131 650 4994 >>>>> >>>>> The University of Edinburgh is a charitable body, registered in >>>>> Scotland, with registration number SC005336. >>>>> _________________________________________________ >>>>> gpfsug-discuss mailing list >>>>> gpfsug-discuss at gpfsug.org >>>>> http://gpfsug.org/mailman/__listinfo/gpfsug-discuss >>>>> >>>>> >>>>> >>>>> >>>>> >>>>> -- >>>>> >>>>> Bob Cregan >>>>> >>>>> Senior Storage Systems Administrator >>>>> >>>>> ACRC >>>>> >>>>> Bristol University >>>>> >>>>> Tel: +44 (0) 117 331 4406 >>>>> >>>>> skype: bobcregan >>>>> >>>>> Mobile: +44 (0) 7712388129 >>>>> >>>> >>>> >>>> -- >>>> -- >>>> Dr Orlando Richards >>>> Information Services >>>> IT Infrastructure Division >>>> Unix Section >>>> Tel: 0131 650 4994 >>>> >>>> The University of Edinburgh is a charitable body, registered in >>>> Scotland, with registration number SC005336. 
>>>> _______________________________________________ >>>> gpfsug-discuss mailing list >>>> gpfsug-discuss at gpfsug.org >>>> http://gpfsug.org/mailman/listinfo/gpfsug-discuss >>> >>> _______________________________________________ >>> gpfsug-discuss mailing list >>> gpfsug-discuss at gpfsug.org >>> http://gpfsug.org/mailman/listinfo/gpfsug-discuss >>> >> >> > > -- -- Dr Orlando Richards Information Services IT Infrastructure Division Unix Section Tel: 0131 650 4994 The University of Edinburgh is a charitable body, registered in Scotland, with registration number SC005336. From pete at realisestudio.com Thu Apr 25 10:38:07 2013 From: pete at realisestudio.com (Pete Smith) Date: Thu, 25 Apr 2013 10:38:07 +0100 Subject: [gpfsug-discuss] Test cluster - some questions Message-ID: Hi all Good to see lots of you at the user group meeting yesterday. Great work, Jez! We're setting up a test cluster here at Realise, with a view to moving our main storage over from Gluster. We're running the test cluster on Isilon hardware ... a couple of 1920 nodes that we were using for home dirs. Each node has dual gigabit ethernet ports, and dual infiniband ports. Single dual-core Xeon proc and and 4GB RAM. All good stuff and should make a nice test rig. I have a few questions! 1. We're on centos6.4.x86_64. What's the easiest way to go from 3.3.blah to 3.5? 2. I'm having trouble assigning NSDs. I have a descfile which looks like: #DiskName:PrimaryServer:BackupServer:DiskUsage:FailureGroup:DesiredName:StoragePool /dev/sdc1:gpfs001.realisestudio.com::dataAndMetadata:1 but the command "mmcrnsd -F /tmp/descfile -v no" just craps out with mmcrnsd: Processing disk sdc1 mmcrnsd: Node gpfs001.realisestudio.com does not have a GPFS server license designation. mmcrnsd: Error found while checking disk descriptor /dev/sdc1:gpfs001.realisestudio.com::dataAndMetadata:1 mmcrnsd: Command failed. Examine previous error messages to determine cause. Any help pointing me gently in the right direction would be much appreciated. :-) TIA -- Pete Smith DevOp/System Administrator Realise Studio 12/13 Poland Street, London W1F 8QB T. +44 (0)20 7165 9644 realisestudio.com -------------- next part -------------- An HTML attachment was scrubbed... URL: From orlando.richards at ed.ac.uk Thu Apr 25 10:48:30 2013 From: orlando.richards at ed.ac.uk (Orlando Richards) Date: Thu, 25 Apr 2013 10:48:30 +0100 Subject: [gpfsug-discuss] Test cluster - some questions In-Reply-To: References: Message-ID: <5178FBEE.4070200@ed.ac.uk> On 25/04/13 10:38, Pete Smith wrote: > Hi all > > Good to see lots of you at the user group meeting yesterday. Great work, > Jez! > > We're setting up a test cluster here at Realise, with a view to moving > our main storage over from Gluster. > > We're running the test cluster on Isilon hardware ... a couple of 1920 > nodes that we were using for home dirs. Each node has dual gigabit > ethernet ports, and dual infiniband ports. Single dual-core Xeon proc > and and 4GB RAM. All good stuff and should make a nice test rig. > > I have a few questions! > > 1. We're on centos6.4.x86_64. What's the easiest way to go from > 3.3.blah to 3.5? > 2. I'm having trouble assigning NSDs. 
I have a descfile which looks like: > > #DiskName:PrimaryServer:BackupServer:DiskUsage:FailureGroup:DesiredName:StoragePool > /dev/sdc1:gpfs001.realisestudio.com::dataAndMetadata:1 > > but the command > > "mmcrnsd -F /tmp/descfile -v no" > > just craps out with > > mmcrnsd: Processing disk sdc1 > mmcrnsd: Node gpfs001.realisestudio.com > does not have a GPFS server license > designation. > mmcrnsd: Error found while checking disk descriptor > /dev/sdc1:gpfs001.realisestudio.com::dataAndMetadata:1 > mmcrnsd: Command failed. Examine previous error messages to determine > cause. > mmchlicense server -N gpfs001.realisestudio.com should sort that one out. > Any help pointing me gently in the right direction would be much > appreciated. :-) > > TIA > > -- > Pete Smith > DevOp/System Administrator > Realise Studio > 12/13 Poland Street, London W1F 8QB > T. +44 (0)20 7165 9644 > > realisestudio.com > > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at gpfsug.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > -- -- Dr Orlando Richards Information Services IT Infrastructure Division Unix Section Tel: 0131 650 4994 The University of Edinburgh is a charitable body, registered in Scotland, with registration number SC005336. From pete at realisestudio.com Thu Apr 25 11:05:36 2013 From: pete at realisestudio.com (Pete Smith) Date: Thu, 25 Apr 2013 11:05:36 +0100 Subject: [gpfsug-discuss] Test cluster - some questions In-Reply-To: <5178FBEE.4070200@ed.ac.uk> References: <5178FBEE.4070200@ed.ac.uk> Message-ID: Thanks Orlando. Much appreciated. On 25 April 2013 10:48, Orlando Richards wrote: > On 25/04/13 10:38, Pete Smith wrote: > >> Hi all >> >> Good to see lots of you at the user group meeting yesterday. Great work, >> Jez! >> >> We're setting up a test cluster here at Realise, with a view to moving >> our main storage over from Gluster. >> >> We're running the test cluster on Isilon hardware ... a couple of 1920 >> nodes that we were using for home dirs. Each node has dual gigabit >> ethernet ports, and dual infiniband ports. Single dual-core Xeon proc >> and and 4GB RAM. All good stuff and should make a nice test rig. >> >> I have a few questions! >> >> 1. We're on centos6.4.x86_64. What's the easiest way to go from >> 3.3.blah to 3.5? >> 2. I'm having trouble assigning NSDs. I have a descfile which looks like: >> >> #DiskName:PrimaryServer:**BackupServer:DiskUsage:** >> FailureGroup:DesiredName:**StoragePool >> /dev/sdc1:gpfs001.**realisestudio.com::**dataAndMetadata:1 >> >> but the command >> >> "mmcrnsd -F /tmp/descfile -v no" >> >> just craps out with >> >> mmcrnsd: Processing disk sdc1 >> mmcrnsd: Node gpfs001.realisestudio.com >> > >> does not have a GPFS server license >> designation. >> mmcrnsd: Error found while checking disk descriptor >> /dev/sdc1:gpfs001.**realisestudio.com::**dataAndMetadata:1 >> mmcrnsd: Command failed. Examine previous error messages to determine >> cause. >> >> > mmchlicense server -N gpfs001.realisestudio.com should sort that one out. > > > Any help pointing me gently in the right direction would be much >> appreciated. :-) >> >> TIA >> >> -- >> Pete Smith >> DevOp/System Administrator >> Realise Studio >> 12/13 Poland Street, London W1F 8QB >> T. 
+44 (0)20 7165 9644 >> >> realisestudio.com >> >> >> _______________________________________________ >> gpfsug-discuss mailing list >> gpfsug-discuss at gpfsug.org >> http://gpfsug.org/mailman/listinfo/gpfsug-discuss >> >> > > -- > -- > Dr Orlando Richards > Information Services > IT Infrastructure Division > Unix Section > Tel: 0131 650 4994 > > The University of Edinburgh is a charitable body, registered in Scotland, > with registration number SC005336. > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at gpfsug.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > -- Pete Smith DevOp/System Administrator Realise Studio 12/13 Poland Street, London W1F 8QB T. +44 (0)20 7165 9644 realisestudio.com -------------- next part -------------- An HTML attachment was scrubbed... URL: From pete at realisestudio.com Fri Apr 26 16:06:38 2013 From: pete at realisestudio.com (Pete Smith) Date: Fri, 26 Apr 2013 16:06:38 +0100 Subject: [gpfsug-discuss] GPFS Native RAID on linux? Message-ID: Hi I thought from the presentation that this was available on linux ... but documentation seems to indicate IBM GSS only? -- Pete Smith DevOp/System Administrator Realise Studio 12/13 Poland Street, London W1F 8QB T. +44 (0)20 7165 9644 realisestudio.com -------------- next part -------------- An HTML attachment was scrubbed... URL: From stuartb at 4gh.net Tue Apr 30 21:50:38 2013 From: stuartb at 4gh.net (Stuart Barkley) Date: Tue, 30 Apr 2013 16:50:38 -0400 (EDT) Subject: [gpfsug-discuss] Test cluster - some questions In-Reply-To: References: Message-ID: On Thu, 25 Apr 2013 at 05:38 -0000, Pete Smith wrote: > 1. We're on centos6.4.x86_64. What's the easiest way to go from > 3.3.blah to 3.5? We are in transition to 3.5 on our original GPFS installation. Two of four servers are now at GPFS 3.4.XX/CentOS 6.4. Two servers are still at 3.3.YY/CentOS 5.4. The compute nodes are all at 3.4.XX/CentOS 6.4. The data center is remotely located and it is a pain to get physical access. Once we get the last two nodes upgraded, we expect to go to GPFS 3.5 fairly quickly (we already have 3.5 running on a newer GPFS installation). My understanding is that you need to step through 3.4 during a migration from 3.3 to 3.5. Stuart -- I've never been lost; I was once bewildered for three days, but never lost! -- Daniel Boone
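To tie the test-cluster thread together, a rough sketch of the working sequence on a GPFS 3.4/3.5-era cluster is below. The node name and the descriptor line come from Pete's message; the filesystem name, mount point and the --accept flag are assumptions to be checked against the mmchlicense and mmcrfs documentation for your release:

  # Give the node a server licence designation so it can serve NSDs
  # (this was the cause of "does not have a GPFS server license designation").
  mmchlicense server --accept -N gpfs001.realisestudio.com
  mmlslicense -L        # verify the designation took effect

  # Old-style colon-separated disk descriptor file, one line per disk:
  # DiskName:PrimaryServer:BackupServer:DiskUsage:FailureGroup:DesiredName:StoragePool
  cat > /tmp/descfile <<'EOF'
  /dev/sdc1:gpfs001.realisestudio.com::dataAndMetadata:1
  EOF

  # Create the NSD; mmcrnsd rewrites the file with the generated NSD names,
  # so the same file can be passed straight to mmcrfs afterwards.
  mmcrnsd -F /tmp/descfile -v no

  # Hypothetical filesystem name and mount point for the test rig.
  mmcrfs gpfstest -F /tmp/descfile -T /gpfs/test -A yes
  mmmount gpfstest -a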
>> _________________________________________________ >> gpfsug-discuss mailing list >> gpfsug-discuss at gpfsug.org >> http://gpfsug.org/mailman/__listinfo/gpfsug-discuss >> >> >> >> >> >> -- >> >> Bob Cregan >> >> Senior Storage Systems Administrator >> >> ACRC >> >> Bristol University >> >> Tel: +44 (0) 117 331 4406 >> >> skype: bobcregan >> >> Mobile: +44 (0) 7712388129 >> > > > -- > -- > Dr Orlando Richards > Information Services > IT Infrastructure Division > Unix Section > Tel: 0131 650 4994 > > The University of Edinburgh is a charitable body, registered in Scotland, with registration number SC005336. > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at gpfsug.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss From orlando.richards at ed.ac.uk Mon Apr 15 10:54:39 2013 From: orlando.richards at ed.ac.uk (Orlando Richards) Date: Mon, 15 Apr 2013 10:54:39 +0100 Subject: [gpfsug-discuss] CTDB woes In-Reply-To: <271DA6EE-D64D-4DBC-9DFE-4335E55102D4@gmail.com> References: <51680020.4040509@ed.ac.uk> <51682BB0.7010507@ed.ac.uk> <271DA6EE-D64D-4DBC-9DFE-4335E55102D4@gmail.com> Message-ID: <516BCE5F.8010309@ed.ac.uk> On 12/04/13 19:44, Vic Cornell wrote: > Have you tried putting the ctdb files onto a separate gpfs filesystem? No - but considered it. However, the only "live" CTDB file that sits on GPFS is the reclock file, which - I think - is only used as the heartbeat between nodes and for the recovery process. Now, there's mileage in insulating that, certainly, but I don't think that's what we're suffering from here. On a positive note - we took the steps this morning to re-initialise the ctdb databases from current data, and things seem to be stable today so far. Basically - shut down ctdb on all but one node. On all but that node, do: mv /var/ctdb/ /var/ctdb.save.date then start up ctdb on those nodes. Once they've come up, shut down ctdb on the last node, move /var/ctdb out the way, and restart. That brings them all up with freshly compacted databases. Also, from the samba-technical mailing list came the advice to use a more recent ctdb - specifically, 1.2.61. I've got that built and ready to go (and a rebuilt samba compiled against it too), but if things prove to be stable after today's compacting, then we will probably leave it at that and not deploy this. Interesting that 2.0 wasn't suggested for "stable", and that the current "dev" version is 2.1. For reference, here's the start of the thread: https://lists.samba.org/archive/samba-technical/2013-April/091525.html -- Orlando. > > On 12 Apr 2013, at 16:43, Orlando Richards wrote: > >> On 12/04/13 15:43, Bob Cregan wrote: >>> Hi Orlando, >>> We use ctdb/samba for CIFS, and CNFS for NFS >>> (GPFS version 3.4.0-13) . Current versions are >>> >>> ctdb - 1.0.99 >>> samba 3.5.15 >>> >>> Both compiled from source. We have about 300+ users normally. >>> >> >> We have suspicions that 3.6 has put additional "chatter" into the ctdb database stream, which has pushed us over the edge. Barry Evans has found that the clustered locking databases, in particular, prove to be a scalability/usability limit for ctdb. >> >> >>> We have had no issues with this setup apart from CNFS which had 2 or 3 >>> bad moments over the last year . These have gone away since we have >>> fixed a bug with our 10G NIC drivers (emulex cards , kernel module >>> be2net) which lead to occasional dropped packets for jumbo frames. 
There >>> have been no issues with samba/ctdb >>> >>> The only comment I can make is that during initial investigations into >>> an upgrade of samba to 3.6.x we discovered that the 3.6 code would not >>> compile against ctdb 1.0.99 (compilation requires tthe ctdb source ) >>> with error messages like: >>> >>> configure: checking whether cluster support is available >>> checking for ctdb.h... yes >>> checking for ctdb_private.h... yes >>> checking for CTDB_CONTROL_TRANS3_COMMIT declaration... yes >>> checking for CTDB_CONTROL_SCHEDULE_FOR_DELETION declaration... no >>> configure: error: "cluster support not available: support for >>> SCHEDULE_FOR_DELETION control missing" >>> >>> >>> What occurs to me is that this message seems to indicate that it is >>> possible to run a ctdb version that is incompatible with samba 3.6. >>> That would imply that an upgrade to a higher version of ctdb might >>> help, of course it might not and make backing out harder. >> >> Certainly 1.0.114 builds fine - I've not tried 2.0, I'm too scared! The versioning in CTDB has proved hard for me to fathom... >> >>> >>> A compile against ctdb 2.0 works fine. We will soon be running in this >>> upgrade, but I'm waiting to see what the samba people say at the UG >>> meeting first! >>> >> >> It has to be said - the timing is good! >> Cheers, >> Orlando >> >>> >>> Thanks >>> >>> Bob >>> >>> >>> On 12 April 2013 13:37, Orlando Richards >> > wrote: >>> >>> Hi folks, ac >>> >>> We've long been using CTDB and Samba for our NAS service, servicing >>> ~500 users. We've been suffering from some problems with the CTDB >>> performance over the last few weeks, likely triggered either by an >>> upgrade of samba from 3.5 to 3.6 (and enabling of SMB2 as a result), >>> or possibly by additional users coming on with a new workload. >>> >>> We run CTDB 1.0.114.4-1 (from sernet) and samba3-3.6.12-44 (again, >>> from sernet). Before we roll back, we'd like to make sure we can't >>> fix the problem and stick with Samba 3.6 (and we don't even know >>> that a roll back would fix the issue). >>> >>> The symptoms are a complete freeze of the service for CIFS users for >>> 10-60 seconds, and on the servers a corresponding spawning of large >>> numbers of CTDB processes, which seem to be created in a "big bang", >>> and then do what they do and exit in the subsequent 10-60 seconds. >>> >>> We also serve up NFS from the same ctdb-managed frontends, and GPFS >>> from the cluster - and these are both fine throughout. >>> >>> This was happening 5-10 times per hour, not at exact intervals >>> though. When we added a third node to the CTDB cluster, it "got >>> worse", and when we dropped the CTDB cluster down to a single node >>> and everything started behaving fine - which is where we are now. >>> >>> So, I've got a bunch of questions! >>> >>> - does anyone know why ctdb would be spawning these processes, and >>> if there's anything we can do to stop it needing to do it? >>> - has anyone done any more general performance / config >>> optimisation of CTDB? >>> >>> And - more generally - does anyone else actually use ctdb/samba/gpfs >>> on the scale of ~500 users or higher? If so - how do you find it? >>> >>> >>> -- >>> -- >>> Dr Orlando Richards >>> Information Services >>> IT Infrastructure Division >>> Unix Section >>> Tel: 0131 650 4994 >>> >>> The University of Edinburgh is a charitable body, registered in >>> Scotland, with registration number SC005336. 
>>> _________________________________________________ >>> gpfsug-discuss mailing list >>> gpfsug-discuss at gpfsug.org >>> http://gpfsug.org/mailman/__listinfo/gpfsug-discuss >>> >>> >>> >>> >>> >>> -- >>> >>> Bob Cregan >>> >>> Senior Storage Systems Administrator >>> >>> ACRC >>> >>> Bristol University >>> >>> Tel: +44 (0) 117 331 4406 >>> >>> skype: bobcregan >>> >>> Mobile: +44 (0) 7712388129 >>> >> >> >> -- >> -- >> Dr Orlando Richards >> Information Services >> IT Infrastructure Division >> Unix Section >> Tel: 0131 650 4994 >> >> The University of Edinburgh is a charitable body, registered in Scotland, with registration number SC005336. >> _______________________________________________ >> gpfsug-discuss mailing list >> gpfsug-discuss at gpfsug.org >> http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at gpfsug.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > -- -- Dr Orlando Richards Information Services IT Infrastructure Division Unix Section Tel: 0131 650 4994 The University of Edinburgh is a charitable body, registered in Scotland, with registration number SC005336. From crobson at ocf.co.uk Mon Apr 15 15:04:38 2013 From: crobson at ocf.co.uk (Claire Robson) Date: Mon, 15 Apr 2013 15:04:38 +0100 Subject: [gpfsug-discuss] Latest agenda and places still available Message-ID: Dear All, Thank you to those who have expressed an interest in next Wednesday's GPFS user group meeting in London and registered a place. There are a few places still available, please register with me if you would like to attend. This is the latest agenda for the day: 10:30 Arrivals and refreshments 11:00 Introductions and committee updates Jez Tucker, Group Chair & Claire Robson, Group Secretary 11:05 GPFS FPO Dinesh Subhraveti, IBM Almaden Research Labs 12:00 SAMBA 4.0 & CTDB 2.0 Michael Adams, SAMBA Development Team 13:00 Lunch (Buffet provided) 13:45 GPFS OpenStack Integration Dinesh Subhraveti, IBM Almaden Research Labs 14:15 SAMBA & GPFS Integration Volker Lendecke, SAMBA Development Team 15:15 Refreshments break 15:30 GPFS Native RAID & LTFS Jim Roche, IBM 16:00 Group discussion: Questions & Committee matters Led by Jez Tucker, Group Chairperson 16:05 Close I look forward to seeing many of you next week. Kind regards, Claire Robson GPFS user group Secetary Tel: 0114 257 2200 Mob: 07508 033896 Web: www.gpfsug.org -------------- next part -------------- An HTML attachment was scrubbed... URL: From AHMADYH at sa.ibm.com Tue Apr 16 13:08:58 2013 From: AHMADYH at sa.ibm.com (Ahmad Y Hussein) Date: Tue, 16 Apr 2013 16:08:58 +0400 Subject: [gpfsug-discuss] AUTO: Ahmad Y Hussein is out of the office (returning 04/29/2013) Message-ID: I am out of the office until 04/29/2013. Dear Sender; I am in a customer engagement with extremely limited email access, I will respond to your emails as soon as i can. For Urjent cases please call me on my mobile (+966542001289). Thank you for understanding. Regards; Ahmad Y Hussein Note: This is an automated response to your message "gpfsug-discuss Digest, Vol 16, Issue 6" sent on 16/04/2013 15:00:02. This is the only notification you will receive while this person is away. 
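A rough sketch of the database re-initialisation described in the 15 April mail above -- run on all but one node first, then on the remaining node once the others are back up. It assumes the default /var/ctdb database directory and an /etc/init.d/ctdb init script (the sernet packages ship one); the service name and paths may differ on other builds:

   # stop ctdb on this node
   service ctdb stop
   # move the old databases aside (keep them, in case a roll back is needed)
   mv /var/ctdb /var/ctdb.save.$(date +%Y%m%d)
   # restart ctdb so it comes back with freshly created, compacted databases
   service ctdb start
   # wait until the node reports healthy before moving on to the next one
   ctdb status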
From orlando.richards at ed.ac.uk Wed Apr 17 11:30:32 2013 From: orlando.richards at ed.ac.uk (Orlando Richards) Date: Wed, 17 Apr 2013 11:30:32 +0100 Subject: [gpfsug-discuss] CTDB woes In-Reply-To: <516BCE5F.8010309@ed.ac.uk> References: <51680020.4040509@ed.ac.uk> <51682BB0.7010507@ed.ac.uk> <271DA6EE-D64D-4DBC-9DFE-4335E55102D4@gmail.com> <516BCE5F.8010309@ed.ac.uk> Message-ID: <516E79C8.8090603@ed.ac.uk> Hi All - an update to this, After re-initialising the databases on Monday, things did seem to be running better, but ultimately we got back to suffering from spikes in ctdb processes and corresponding "pauses" in service. We fell back to a single node again for Tuesday (and things were stable once again), and this morning rolled out CTDB 1.2.61 (plus a 3.6.12 samba which was rebuilt against CTDB 1.2.61 headers). Things seem to be stable for now - more so than on Monday. For the record - one metric I'm watching is the number of ctdb processes running (this would spike to > 1000 under the failure conditions). It's currently sitting consistently at 3 processes, with occasional blips of 5-7 processes. -- Orlando On 15/04/13 10:54, Orlando Richards wrote: > On 12/04/13 19:44, Vic Cornell wrote: >> Have you tried putting the ctdb files onto a separate gpfs filesystem? > > No - but considered it. However, the only "live" CTDB file that sits on > GPFS is the reclock file, which - I think - is only used as the > heartbeat between nodes and for the recovery process. Now, there's > mileage in insulating that, certainly, but I don't think that's what > we're suffering from here. > > On a positive note - we took the steps this morning to re-initialise the > ctdb databases from current data, and things seem to be stable today so > far. > > Basically - shut down ctdb on all but one node. On all but that node, do: > mv /var/ctdb/ /var/ctdb.save.date > > then start up ctdb on those nodes. Once they've come up, shut down ctdb > on the last node, move /var/ctdb out the way, and restart. That brings > them all up with freshly compacted databases. > > Also, from the samba-technical mailing list came the advice to use a > more recent ctdb - specifically, 1.2.61. I've got that built and ready > to go (and a rebuilt samba compiled against it too), but if things prove > to be stable after today's compacting, then we will probably leave it at > that and not deploy this. > > Interesting that 2.0 wasn't suggested for "stable", and that the current > "dev" version is 2.1. > > For reference, here's the start of the thread: > https://lists.samba.org/archive/samba-technical/2013-April/091525.html > > -- > Orlando. > > > >> >> On 12 Apr 2013, at 16:43, Orlando Richards >> wrote: >> >>> On 12/04/13 15:43, Bob Cregan wrote: >>>> Hi Orlando, >>>> We use ctdb/samba for CIFS, and CNFS for NFS >>>> (GPFS version 3.4.0-13) . Current versions are >>>> >>>> ctdb - 1.0.99 >>>> samba 3.5.15 >>>> >>>> Both compiled from source. We have about 300+ users normally. >>>> >>> >>> We have suspicions that 3.6 has put additional "chatter" into the >>> ctdb database stream, which has pushed us over the edge. Barry Evans >>> has found that the clustered locking databases, in particular, prove >>> to be a scalability/usability limit for ctdb. >>> >>> >>>> We have had no issues with this setup apart from CNFS which had 2 or 3 >>>> bad moments over the last year . These have gone away since we have >>>> fixed a bug with our 10G NIC drivers (emulex cards , kernel module >>>> be2net) which lead to occasional dropped packets for jumbo frames. 
>>>> There >>>> have been no issues with samba/ctdb >>>> >>>> The only comment I can make is that during initial investigations into >>>> an upgrade of samba to 3.6.x we discovered that the 3.6 code would not >>>> compile against ctdb 1.0.99 (compilation requires tthe ctdb source ) >>>> with error messages like: >>>> >>>> configure: checking whether cluster support is available >>>> checking for ctdb.h... yes >>>> checking for ctdb_private.h... yes >>>> checking for CTDB_CONTROL_TRANS3_COMMIT declaration... yes >>>> checking for CTDB_CONTROL_SCHEDULE_FOR_DELETION declaration... no >>>> configure: error: "cluster support not available: support for >>>> SCHEDULE_FOR_DELETION control missing" >>>> >>>> >>>> What occurs to me is that this message seems to indicate that it is >>>> possible to run a ctdb version that is incompatible with samba 3.6. >>>> That would imply that an upgrade to a higher version of ctdb might >>>> help, of course it might not and make backing out harder. >>> >>> Certainly 1.0.114 builds fine - I've not tried 2.0, I'm too scared! >>> The versioning in CTDB has proved hard for me to fathom... >>> >>>> >>>> A compile against ctdb 2.0 works fine. We will soon be running in this >>>> upgrade, but I'm waiting to see what the samba people say at the UG >>>> meeting first! >>>> >>> >>> It has to be said - the timing is good! >>> Cheers, >>> Orlando >>> >>>> >>>> Thanks >>>> >>>> Bob >>>> >>>> >>>> On 12 April 2013 13:37, Orlando Richards >>> > wrote: >>>> >>>> Hi folks, ac >>>> >>>> We've long been using CTDB and Samba for our NAS service, servicing >>>> ~500 users. We've been suffering from some problems with the CTDB >>>> performance over the last few weeks, likely triggered either by an >>>> upgrade of samba from 3.5 to 3.6 (and enabling of SMB2 as a >>>> result), >>>> or possibly by additional users coming on with a new workload. >>>> >>>> We run CTDB 1.0.114.4-1 (from sernet) and samba3-3.6.12-44 (again, >>>> from sernet). Before we roll back, we'd like to make sure we can't >>>> fix the problem and stick with Samba 3.6 (and we don't even know >>>> that a roll back would fix the issue). >>>> >>>> The symptoms are a complete freeze of the service for CIFS users >>>> for >>>> 10-60 seconds, and on the servers a corresponding spawning of large >>>> numbers of CTDB processes, which seem to be created in a "big >>>> bang", >>>> and then do what they do and exit in the subsequent 10-60 seconds. >>>> >>>> We also serve up NFS from the same ctdb-managed frontends, and GPFS >>>> from the cluster - and these are both fine throughout. >>>> >>>> This was happening 5-10 times per hour, not at exact intervals >>>> though. When we added a third node to the CTDB cluster, it "got >>>> worse", and when we dropped the CTDB cluster down to a single node >>>> and everything started behaving fine - which is where we are now. >>>> >>>> So, I've got a bunch of questions! >>>> >>>> - does anyone know why ctdb would be spawning these processes, >>>> and >>>> if there's anything we can do to stop it needing to do it? >>>> - has anyone done any more general performance / config >>>> optimisation of CTDB? >>>> >>>> And - more generally - does anyone else actually use >>>> ctdb/samba/gpfs >>>> on the scale of ~500 users or higher? If so - how do you find it? 
>>>> >>>> >>>> -- >>>> -- >>>> Dr Orlando Richards >>>> Information Services >>>> IT Infrastructure Division >>>> Unix Section >>>> Tel: 0131 650 4994 >>>> >>>> The University of Edinburgh is a charitable body, registered in >>>> Scotland, with registration number SC005336. >>>> _________________________________________________ >>>> gpfsug-discuss mailing list >>>> gpfsug-discuss at gpfsug.org >>>> http://gpfsug.org/mailman/__listinfo/gpfsug-discuss >>>> >>>> >>>> >>>> >>>> >>>> -- >>>> >>>> Bob Cregan >>>> >>>> Senior Storage Systems Administrator >>>> >>>> ACRC >>>> >>>> Bristol University >>>> >>>> Tel: +44 (0) 117 331 4406 >>>> >>>> skype: bobcregan >>>> >>>> Mobile: +44 (0) 7712388129 >>>> >>> >>> >>> -- >>> -- >>> Dr Orlando Richards >>> Information Services >>> IT Infrastructure Division >>> Unix Section >>> Tel: 0131 650 4994 >>> >>> The University of Edinburgh is a charitable body, registered in >>> Scotland, with registration number SC005336. >>> _______________________________________________ >>> gpfsug-discuss mailing list >>> gpfsug-discuss at gpfsug.org >>> http://gpfsug.org/mailman/listinfo/gpfsug-discuss >> >> _______________________________________________ >> gpfsug-discuss mailing list >> gpfsug-discuss at gpfsug.org >> http://gpfsug.org/mailman/listinfo/gpfsug-discuss >> > > -- -- Dr Orlando Richards Information Services IT Infrastructure Division Unix Section Tel: 0131 650 4994 The University of Edinburgh is a charitable body, registered in Scotland, with registration number SC005336. From orlando.richards at ed.ac.uk Mon Apr 22 15:52:55 2013 From: orlando.richards at ed.ac.uk (Orlando Richards) Date: Mon, 22 Apr 2013 15:52:55 +0100 Subject: [gpfsug-discuss] CTDB woes In-Reply-To: <516E79C8.8090603@ed.ac.uk> References: <51680020.4040509@ed.ac.uk> <51682BB0.7010507@ed.ac.uk> <271DA6EE-D64D-4DBC-9DFE-4335E55102D4@gmail.com> <516BCE5F.8010309@ed.ac.uk> <516E79C8.8090603@ed.ac.uk> Message-ID: <51754EC7.8000600@ed.ac.uk> On 17/04/13 11:30, Orlando Richards wrote: > Hi All - an update to this, > > After re-initialising the databases on Monday, things did seem to be > running better, but ultimately we got back to suffering from spikes in > ctdb processes and corresponding "pauses" in service. We fell back to a > single node again for Tuesday (and things were stable once again), and > this morning rolled out CTDB 1.2.61 (plus a 3.6.12 samba which was > rebuilt against CTDB 1.2.61 headers). > > Things seem to be stable for now - more so than on Monday. > > For the record - one metric I'm watching is the number of ctdb processes > running (this would spike to > 1000 under the failure conditions). It's > currently sitting consistently at 3 processes, with occasional blips of > 5-7 processes. > Hi all, Looks like things have been running fine since we upgraded ctdb last Wednesday, so I think it's safe to say that we've found a fix for our problem in CTDB 1.2.61. Thanks for all the input! If anyone wants more info, feel free to get in touch. -- Orlando > -- > Orlando > > > > > > On 15/04/13 10:54, Orlando Richards wrote: >> On 12/04/13 19:44, Vic Cornell wrote: >>> Have you tried putting the ctdb files onto a separate gpfs filesystem? >> >> No - but considered it. However, the only "live" CTDB file that sits on >> GPFS is the reclock file, which - I think - is only used as the >> heartbeat between nodes and for the recovery process. Now, there's >> mileage in insulating that, certainly, but I don't think that's what >> we're suffering from here. 
>> >> On a positive note - we took the steps this morning to re-initialise the >> ctdb databases from current data, and things seem to be stable today so >> far. >> >> Basically - shut down ctdb on all but one node. On all but that node, do: >> mv /var/ctdb/ /var/ctdb.save.date >> >> then start up ctdb on those nodes. Once they've come up, shut down ctdb >> on the last node, move /var/ctdb out the way, and restart. That brings >> them all up with freshly compacted databases. >> >> Also, from the samba-technical mailing list came the advice to use a >> more recent ctdb - specifically, 1.2.61. I've got that built and ready >> to go (and a rebuilt samba compiled against it too), but if things prove >> to be stable after today's compacting, then we will probably leave it at >> that and not deploy this. >> >> Interesting that 2.0 wasn't suggested for "stable", and that the current >> "dev" version is 2.1. >> >> For reference, here's the start of the thread: >> https://lists.samba.org/archive/samba-technical/2013-April/091525.html >> >> -- >> Orlando. >> >> >> >>> >>> On 12 Apr 2013, at 16:43, Orlando Richards >>> wrote: >>> >>>> On 12/04/13 15:43, Bob Cregan wrote: >>>>> Hi Orlando, >>>>> We use ctdb/samba for CIFS, and CNFS for NFS >>>>> (GPFS version 3.4.0-13) . Current versions are >>>>> >>>>> ctdb - 1.0.99 >>>>> samba 3.5.15 >>>>> >>>>> Both compiled from source. We have about 300+ users normally. >>>>> >>>> >>>> We have suspicions that 3.6 has put additional "chatter" into the >>>> ctdb database stream, which has pushed us over the edge. Barry Evans >>>> has found that the clustered locking databases, in particular, prove >>>> to be a scalability/usability limit for ctdb. >>>> >>>> >>>>> We have had no issues with this setup apart from CNFS which had 2 or 3 >>>>> bad moments over the last year . These have gone away since we have >>>>> fixed a bug with our 10G NIC drivers (emulex cards , kernel module >>>>> be2net) which lead to occasional dropped packets for jumbo frames. >>>>> There >>>>> have been no issues with samba/ctdb >>>>> >>>>> The only comment I can make is that during initial investigations into >>>>> an upgrade of samba to 3.6.x we discovered that the 3.6 code would not >>>>> compile against ctdb 1.0.99 (compilation requires tthe ctdb source ) >>>>> with error messages like: >>>>> >>>>> configure: checking whether cluster support is available >>>>> checking for ctdb.h... yes >>>>> checking for ctdb_private.h... yes >>>>> checking for CTDB_CONTROL_TRANS3_COMMIT declaration... yes >>>>> checking for CTDB_CONTROL_SCHEDULE_FOR_DELETION declaration... no >>>>> configure: error: "cluster support not available: support for >>>>> SCHEDULE_FOR_DELETION control missing" >>>>> >>>>> >>>>> What occurs to me is that this message seems to indicate that it is >>>>> possible to run a ctdb version that is incompatible with samba 3.6. >>>>> That would imply that an upgrade to a higher version of ctdb might >>>>> help, of course it might not and make backing out harder. >>>> >>>> Certainly 1.0.114 builds fine - I've not tried 2.0, I'm too scared! >>>> The versioning in CTDB has proved hard for me to fathom... >>>> >>>>> >>>>> A compile against ctdb 2.0 works fine. We will soon be running in this >>>>> upgrade, but I'm waiting to see what the samba people say at the UG >>>>> meeting first! >>>>> >>>> >>>> It has to be said - the timing is good! 
>>>> Cheers, >>>> Orlando >>>> >>>>> >>>>> Thanks >>>>> >>>>> Bob >>>>> >>>>> >>>>> On 12 April 2013 13:37, Orlando Richards >>>> > wrote: >>>>> >>>>> Hi folks, ac >>>>> >>>>> We've long been using CTDB and Samba for our NAS service, >>>>> servicing >>>>> ~500 users. We've been suffering from some problems with the CTDB >>>>> performance over the last few weeks, likely triggered either by an >>>>> upgrade of samba from 3.5 to 3.6 (and enabling of SMB2 as a >>>>> result), >>>>> or possibly by additional users coming on with a new workload. >>>>> >>>>> We run CTDB 1.0.114.4-1 (from sernet) and samba3-3.6.12-44 (again, >>>>> from sernet). Before we roll back, we'd like to make sure we can't >>>>> fix the problem and stick with Samba 3.6 (and we don't even know >>>>> that a roll back would fix the issue). >>>>> >>>>> The symptoms are a complete freeze of the service for CIFS users >>>>> for >>>>> 10-60 seconds, and on the servers a corresponding spawning of >>>>> large >>>>> numbers of CTDB processes, which seem to be created in a "big >>>>> bang", >>>>> and then do what they do and exit in the subsequent 10-60 seconds. >>>>> >>>>> We also serve up NFS from the same ctdb-managed frontends, and >>>>> GPFS >>>>> from the cluster - and these are both fine throughout. >>>>> >>>>> This was happening 5-10 times per hour, not at exact intervals >>>>> though. When we added a third node to the CTDB cluster, it "got >>>>> worse", and when we dropped the CTDB cluster down to a single node >>>>> and everything started behaving fine - which is where we are now. >>>>> >>>>> So, I've got a bunch of questions! >>>>> >>>>> - does anyone know why ctdb would be spawning these processes, >>>>> and >>>>> if there's anything we can do to stop it needing to do it? >>>>> - has anyone done any more general performance / config >>>>> optimisation of CTDB? >>>>> >>>>> And - more generally - does anyone else actually use >>>>> ctdb/samba/gpfs >>>>> on the scale of ~500 users or higher? If so - how do you find it? >>>>> >>>>> >>>>> -- >>>>> -- >>>>> Dr Orlando Richards >>>>> Information Services >>>>> IT Infrastructure Division >>>>> Unix Section >>>>> Tel: 0131 650 4994 >>>>> >>>>> The University of Edinburgh is a charitable body, registered in >>>>> Scotland, with registration number SC005336. >>>>> _________________________________________________ >>>>> gpfsug-discuss mailing list >>>>> gpfsug-discuss at gpfsug.org >>>>> http://gpfsug.org/mailman/__listinfo/gpfsug-discuss >>>>> >>>>> >>>>> >>>>> >>>>> >>>>> -- >>>>> >>>>> Bob Cregan >>>>> >>>>> Senior Storage Systems Administrator >>>>> >>>>> ACRC >>>>> >>>>> Bristol University >>>>> >>>>> Tel: +44 (0) 117 331 4406 >>>>> >>>>> skype: bobcregan >>>>> >>>>> Mobile: +44 (0) 7712388129 >>>>> >>>> >>>> >>>> -- >>>> -- >>>> Dr Orlando Richards >>>> Information Services >>>> IT Infrastructure Division >>>> Unix Section >>>> Tel: 0131 650 4994 >>>> >>>> The University of Edinburgh is a charitable body, registered in >>>> Scotland, with registration number SC005336. 
>>>> _______________________________________________ >>>> gpfsug-discuss mailing list >>>> gpfsug-discuss at gpfsug.org >>>> http://gpfsug.org/mailman/listinfo/gpfsug-discuss >>> >>> _______________________________________________ >>> gpfsug-discuss mailing list >>> gpfsug-discuss at gpfsug.org >>> http://gpfsug.org/mailman/listinfo/gpfsug-discuss >>> >> >> > > -- -- Dr Orlando Richards Information Services IT Infrastructure Division Unix Section Tel: 0131 650 4994 The University of Edinburgh is a charitable body, registered in Scotland, with registration number SC005336. From pete at realisestudio.com Thu Apr 25 10:38:07 2013 From: pete at realisestudio.com (Pete Smith) Date: Thu, 25 Apr 2013 10:38:07 +0100 Subject: [gpfsug-discuss] Test cluster - some questions Message-ID: Hi all Good to see lots of you at the user group meeting yesterday. Great work, Jez! We're setting up a test cluster here at Realise, with a view to moving our main storage over from Gluster. We're running the test cluster on Isilon hardware ... a couple of 1920 nodes that we were using for home dirs. Each node has dual gigabit ethernet ports, and dual infiniband ports. Single dual-core Xeon proc and and 4GB RAM. All good stuff and should make a nice test rig. I have a few questions! 1. We're on centos6.4.x86_64. What's the easiest way to go from 3.3.blah to 3.5? 2. I'm having trouble assigning NSDs. I have a descfile which looks like: #DiskName:PrimaryServer:BackupServer:DiskUsage:FailureGroup:DesiredName:StoragePool /dev/sdc1:gpfs001.realisestudio.com::dataAndMetadata:1 but the command "mmcrnsd -F /tmp/descfile -v no" just craps out with mmcrnsd: Processing disk sdc1 mmcrnsd: Node gpfs001.realisestudio.com does not have a GPFS server license designation. mmcrnsd: Error found while checking disk descriptor /dev/sdc1:gpfs001.realisestudio.com::dataAndMetadata:1 mmcrnsd: Command failed. Examine previous error messages to determine cause. Any help pointing me gently in the right direction would be much appreciated. :-) TIA -- Pete Smith DevOp/System Administrator Realise Studio 12/13 Poland Street, London W1F 8QB T. +44 (0)20 7165 9644 realisestudio.com -------------- next part -------------- An HTML attachment was scrubbed... URL: From orlando.richards at ed.ac.uk Thu Apr 25 10:48:30 2013 From: orlando.richards at ed.ac.uk (Orlando Richards) Date: Thu, 25 Apr 2013 10:48:30 +0100 Subject: [gpfsug-discuss] Test cluster - some questions In-Reply-To: References: Message-ID: <5178FBEE.4070200@ed.ac.uk> On 25/04/13 10:38, Pete Smith wrote: > Hi all > > Good to see lots of you at the user group meeting yesterday. Great work, > Jez! > > We're setting up a test cluster here at Realise, with a view to moving > our main storage over from Gluster. > > We're running the test cluster on Isilon hardware ... a couple of 1920 > nodes that we were using for home dirs. Each node has dual gigabit > ethernet ports, and dual infiniband ports. Single dual-core Xeon proc > and and 4GB RAM. All good stuff and should make a nice test rig. > > I have a few questions! > > 1. We're on centos6.4.x86_64. What's the easiest way to go from > 3.3.blah to 3.5? > 2. I'm having trouble assigning NSDs. 
I have a descfile which looks like: > > #DiskName:PrimaryServer:BackupServer:DiskUsage:FailureGroup:DesiredName:StoragePool > /dev/sdc1:gpfs001.realisestudio.com::dataAndMetadata:1 > > but the command > > "mmcrnsd -F /tmp/descfile -v no" > > just craps out with > > mmcrnsd: Processing disk sdc1 > mmcrnsd: Node gpfs001.realisestudio.com > does not have a GPFS server license > designation. > mmcrnsd: Error found while checking disk descriptor > /dev/sdc1:gpfs001.realisestudio.com::dataAndMetadata:1 > mmcrnsd: Command failed. Examine previous error messages to determine > cause. > mmchlicense server -N gpfs001.realisestudio.com should sort that one out. > Any help pointing me gently in the right direction would be much > appreciated. :-) > > TIA > > -- > Pete Smith > DevOp/System Administrator > Realise Studio > 12/13 Poland Street, London W1F 8QB > T. +44 (0)20 7165 9644 > > realisestudio.com > > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at gpfsug.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > -- -- Dr Orlando Richards Information Services IT Infrastructure Division Unix Section Tel: 0131 650 4994 The University of Edinburgh is a charitable body, registered in Scotland, with registration number SC005336. From pete at realisestudio.com Thu Apr 25 11:05:36 2013 From: pete at realisestudio.com (Pete Smith) Date: Thu, 25 Apr 2013 11:05:36 +0100 Subject: [gpfsug-discuss] Test cluster - some questions In-Reply-To: <5178FBEE.4070200@ed.ac.uk> References: <5178FBEE.4070200@ed.ac.uk> Message-ID: Thanks Orlando. Much appreciated. On 25 April 2013 10:48, Orlando Richards wrote: > On 25/04/13 10:38, Pete Smith wrote: > >> Hi all >> >> Good to see lots of you at the user group meeting yesterday. Great work, >> Jez! >> >> We're setting up a test cluster here at Realise, with a view to moving >> our main storage over from Gluster. >> >> We're running the test cluster on Isilon hardware ... a couple of 1920 >> nodes that we were using for home dirs. Each node has dual gigabit >> ethernet ports, and dual infiniband ports. Single dual-core Xeon proc >> and and 4GB RAM. All good stuff and should make a nice test rig. >> >> I have a few questions! >> >> 1. We're on centos6.4.x86_64. What's the easiest way to go from >> 3.3.blah to 3.5? >> 2. I'm having trouble assigning NSDs. I have a descfile which looks like: >> >> #DiskName:PrimaryServer:**BackupServer:DiskUsage:** >> FailureGroup:DesiredName:**StoragePool >> /dev/sdc1:gpfs001.**realisestudio.com::**dataAndMetadata:1 >> >> but the command >> >> "mmcrnsd -F /tmp/descfile -v no" >> >> just craps out with >> >> mmcrnsd: Processing disk sdc1 >> mmcrnsd: Node gpfs001.realisestudio.com >> > >> does not have a GPFS server license >> designation. >> mmcrnsd: Error found while checking disk descriptor >> /dev/sdc1:gpfs001.**realisestudio.com::**dataAndMetadata:1 >> mmcrnsd: Command failed. Examine previous error messages to determine >> cause. >> >> > mmchlicense server -N gpfs001.realisestudio.com should sort that one out. > > > Any help pointing me gently in the right direction would be much >> appreciated. :-) >> >> TIA >> >> -- >> Pete Smith >> DevOp/System Administrator >> Realise Studio >> 12/13 Poland Street, London W1F 8QB >> T. 
+44 (0)20 7165 9644 >> >> realisestudio.com >> >> >> ______________________________**_________________ >> gpfsug-discuss mailing list >> gpfsug-discuss at gpfsug.org >> http://gpfsug.org/mailman/**listinfo/gpfsug-discuss >> >> > > -- > -- > Dr Orlando Richards > Information Services > IT Infrastructure Division > Unix Section > Tel: 0131 650 4994 > > The University of Edinburgh is a charitable body, registered in Scotland, > with registration number SC005336. > ______________________________**_________________ > gpfsug-discuss mailing list > gpfsug-discuss at gpfsug.org > http://gpfsug.org/mailman/**listinfo/gpfsug-discuss > -- Pete Smith DevOp/System Administrator Realise Studio 12/13 Poland Street, London W1F 8QB T. +44 (0)20 7165 9644 realisestudio.com -------------- next part -------------- An HTML attachment was scrubbed... URL: From pete at realisestudio.com Fri Apr 26 16:06:38 2013 From: pete at realisestudio.com (Pete Smith) Date: Fri, 26 Apr 2013 16:06:38 +0100 Subject: [gpfsug-discuss] GPS Native RAID on linux? Message-ID: Hi I thought from the presentation that this was available on linux ... but documentation seems to indicate IBM GSS only? -- Pete Smith DevOp/System Administrator Realise Studio 12/13 Poland Street, London W1F 8QB T. +44 (0)20 7165 9644 realisestudio.com -------------- next part -------------- An HTML attachment was scrubbed... URL: From stuartb at 4gh.net Tue Apr 30 21:50:38 2013 From: stuartb at 4gh.net (Stuart Barkley) Date: Tue, 30 Apr 2013 16:50:38 -0400 (EDT) Subject: [gpfsug-discuss] Test cluster - some questions In-Reply-To: References: Message-ID: On Thu, 25 Apr 2013 at 05:38 -0000, Pete Smith wrote: > 1. We're on centos6.4.x86_64. What's the easiest way to go from > 3.3.blah to 3.5? We are in transition to 3.5 on our original GPFS installation. Two of four servers are now at GPFS 3.4.XX/CentOS 6.4. Two servers are still at 3.3.YY/CentOS 5.4. The compute nodes are all to 3.4.XX/CentOS 6.4. The data center is remotely located and it is a pain to get physical access. Once we get the last two nodes upgraded, we expect to go to GPFS 3.5 fairly quickly (we already have 3.5 running on a newer GPFS installation). My understanding is that you need to step through 3.4 during a migration from 3.3 to 3.5. Stuart -- I've never been lost; I was once bewildered for three days, but never lost! -- Daniel Boone
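Picking up Stuart's point about stepping through 3.4: the per-hop procedure is roughly the same for 3.3 -> 3.4 and for 3.4 -> 3.5. The sketch below is untested here and only paraphrases the usual GPFS migration steps -- package names, the portability-layer build commands and the exact options should be checked against the Concepts, Planning and Installation Guide for the release being moved to:

   # on each node in turn (rolling upgrade), for each hop in the chain:
   mmshutdown                        # stop GPFS on this node
   rpm -Uvh gpfs.base-*.rpm gpfs.gpl-*.rpm gpfs.msg.en_US-*.rpm gpfs.docs-*.rpm
   cd /usr/lpp/mmfs/src              # rebuild the portability layer for the running kernel
   make Autoconfig && make World && make InstallImages
   mmstartup

   # only once every node in the cluster is running the new level:
   mmchconfig release=LATEST         # commit the cluster to the new release level
   mmchfs <filesystem> -V full       # enable the new on-disk format (not reversible)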