From bdeluca at gmail.com Wed Apr 3 10:57:05 2013 From: bdeluca at gmail.com (Ben De Luca) Date: Wed, 3 Apr 2013 10:57:05 +0100 Subject: [gpfsug-discuss] mmbackup and management classes Message-ID: Hi gpfsusers, My first post to the list, Hi! We use TSM for our backups of our GPFS filesystems, and we are looking at using the mmbackup script for launching our backups. From conversations with other people we hear that support for management classes may not be completely available in mmbackup? I wondered if anyone could comment on using mmbackup, and what is and is not supported. Any gotchas? -bd -------------- next part -------------- An HTML attachment was scrubbed... URL: From AHMADYH at sa.ibm.com Wed Apr 3 13:04:47 2013 From: AHMADYH at sa.ibm.com (Ahmad Y Hussein) Date: Wed, 3 Apr 2013 16:04:47 +0400 Subject: [gpfsug-discuss] AUTO: Ahmad Y Hussein is out of the office (returning 04/08/2013) Message-ID: I am out of the office until 04/08/2013. Dear Sender; I am in a customer engagement with extremely limited email access, I will respond to your emails as soon as I can. For urgent cases please call me on my mobile (+966542001289). Thank you for understanding. Regards; Ahmad Y Hussein Note: This is an automated response to your message "gpfsug-discuss Digest, Vol 16, Issue 1" sent on 03/04/2013 15:00:02. This is the only notification you will receive while this person is away. From chris_stone at uk.ibm.com Wed Apr 3 16:08:39 2013 From: chris_stone at uk.ibm.com (Chris Stone) Date: Wed, 3 Apr 2013 16:08:39 +0100 Subject: [gpfsug-discuss] AUTO: Chris Stone/UK/IBM is out of the office until 16/08/2004. (returning 11/04/2013) Message-ID: I am out of the office until 11/04/2013. In an emergency please contact my manager Aniket Patel on: +44 (0) 7736 017 418 Note: This is an automated response to your message "[gpfsug-discuss] mmbackup and management classes" sent on 03/04/2013 10:57:05. This is the only notification you will receive while this person is away. From ANDREWD at uk.ibm.com Wed Apr 3 16:10:26 2013 From: ANDREWD at uk.ibm.com (Andrew Downes1) Date: Wed, 3 Apr 2013 16:10:26 +0100 Subject: [gpfsug-discuss] AUTO: Andrew Downes is out of the office (returning 08/04/2013) Message-ID: I am out of the office until 08/04/2013. If anything is too urgent to wait for my return please contact Matt Ayres mailto:m_ayres at uk.ibm.com 44-7710-981527 In case of urgency, please contact our manager Dave Shave-Wall mailto:dave_shavewall at uk.ibm.com 44-7740-921623 Note: This is an automated response to your message "gpfsug-discuss Digest, Vol 16, Issue 1" sent on 03/04/2013 12:00:02. This is the only notification you will receive while this person is away. From ashish.thandavan at cs.ox.ac.uk Thu Apr 11 10:58:41 2013 From: ashish.thandavan at cs.ox.ac.uk (Ashish Thandavan) Date: Thu, 11 Apr 2013 10:58:41 +0100 Subject: [gpfsug-discuss] Register now: Spring GPFS User Group arranged In-Reply-To: References: Message-ID: <51668951.7040506@cs.ox.ac.uk> Dear Claire, I trust you are well! If there are any spaces left, could you please register me for the event? Thank you! Regards, Ash On 25/03/13 14:38, Claire Robson wrote: > > Dear All, > > The next meeting date is set for *Wednesday 24th April* and will be > taking place at the fantastic Dolby Studios in London (Dolby Europe > Limited, 4--6 Soho Square, London W1D 3PZ). > > *Getting to Dolby Europe Limited, Soho Square, London* > > Leave the Tottenham Court Road tube station by the South Oxford Street > exit [Exit 1]. > > Turn left onto Oxford Street.
> > After about 50m turn left into Soho Street. > > Turn right into Soho Square. > > 4-6 Soho Square is directly in front of you. > > Our tentative agenda is as follows: > > 10:30 Arrivals and refreshments > > 11:00 Introductions and committee updates > > Jez Tucker, Group Chair & Claire Robson, Group Secretary > > 11:05 GPFS OpenStack Integration > > Prasenhit Sarkar, IBM Almaden Research Labs > > GPFS FPO > > Dinesh Subhraveti, IBM Almaden Research Labs > > 11:45 SAMBA 4.0 & CTDB 2.0 > > Michael Adams, SAMBA Development Team > > 12:15 SAMBA & GPFS Integration > > Volker Lendecke, SAMBA Development Team > > 13:00 Lunch (Buffet provided) > > 14:00 GPFS Native RAID & LTFS > > Jim Roche, IBM > > 14:45 User Stories > > 15:45 Group discussion: Challenges, experiences and questions & > Committee matters > > Led by Jez Tucker, Group Chairperson > > 16:00 Close > > We will be starting at 11:00am and concluding at 4pm but some of the > speaker timings may alter slightly. I will be posting further details > on what the presentations cover over the coming week or so. > > We hope you can make it for what will be a really interesting day of > GPFS discussions. *Please register with me if you would like to > attend* -- registrations are based on a first come first served basis. > > Best regards, > > *Claire Robson* > > GPFS User Group Secreatry > > Tel: 0114 257 2200 > > Mob: 07508 033896 > > Fax: 0114 257 0022 > > Web: _www.gpfsug.org _ > > > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at gpfsug.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss -- ------------------------- Ashish Thandavan UNIX Support Computing Officer Department of Computer Science University of Oxford Wolfson Building Parks Road Oxford OX1 3QD Phone: 01865 610733 Email: ashish.thandavan at cs.ox.ac.uk -------------- next part -------------- An HTML attachment was scrubbed... URL: From orlando.richards at ed.ac.uk Fri Apr 12 13:37:52 2013 From: orlando.richards at ed.ac.uk (Orlando Richards) Date: Fri, 12 Apr 2013 13:37:52 +0100 Subject: [gpfsug-discuss] CTDB woes Message-ID: <51680020.4040509@ed.ac.uk> Hi folks, We've long been using CTDB and Samba for our NAS service, servicing ~500 users. We've been suffering from some problems with the CTDB performance over the last few weeks, likely triggered either by an upgrade of samba from 3.5 to 3.6 (and enabling of SMB2 as a result), or possibly by additional users coming on with a new workload. We run CTDB 1.0.114.4-1 (from sernet) and samba3-3.6.12-44 (again, from sernet). Before we roll back, we'd like to make sure we can't fix the problem and stick with Samba 3.6 (and we don't even know that a roll back would fix the issue). The symptoms are a complete freeze of the service for CIFS users for 10-60 seconds, and on the servers a corresponding spawning of large numbers of CTDB processes, which seem to be created in a "big bang", and then do what they do and exit in the subsequent 10-60 seconds. We also serve up NFS from the same ctdb-managed frontends, and GPFS from the cluster - and these are both fine throughout. This was happening 5-10 times per hour, not at exact intervals though. When we added a third node to the CTDB cluster, it "got worse", and when we dropped the CTDB cluster down to a single node and everything started behaving fine - which is where we are now. So, I've got a bunch of questions! - does anyone know why ctdb would be spawning these processes, and if there's anything we can do to stop it needing to do it? 
- has anyone done any more general performance / config optimisation of CTDB? And - more generally - does anyone else actually use ctdb/samba/gpfs on the scale of ~500 users or higher? If so - how do you find it? -- -- Dr Orlando Richards Information Services IT Infrastructure Division Unix Section Tel: 0131 650 4994 The University of Edinburgh is a charitable body, registered in Scotland, with registration number SC005336. From Tobias.Kuebler at sva.de Fri Apr 12 14:03:58 2013 From: Tobias.Kuebler at sva.de (Tobias.Kuebler at sva.de) Date: Fri, 12 Apr 2013 15:03:58 +0200 Subject: [gpfsug-discuss] AUTO: Tobias Kuebler is out of the office (returning Mon, 04/15/2013) Message-ID: I am out of the office from Thu, 04/11/2013 until Mon, 04/15/2013. Thank you for your message. Incoming e-mails will not be forwarded during my absence, but I will try to answer them as soon as possible after my return. For urgent matters, please contact your responsible sales representative. Note: This is an automated response to your message "[gpfsug-discuss] CTDB woes" sent on 12.04.2013 14:37:52. This is the only notification you will receive while this person is away. -------------- next part -------------- An HTML attachment was scrubbed... URL: From orlando.richards at ed.ac.uk Fri Apr 12 16:43:44 2013 From: orlando.richards at ed.ac.uk (Orlando Richards) Date: Fri, 12 Apr 2013 16:43:44 +0100 Subject: [gpfsug-discuss] CTDB woes In-Reply-To: References: <51680020.4040509@ed.ac.uk> Message-ID: <51682BB0.7010507@ed.ac.uk> On 12/04/13 15:43, Bob Cregan wrote: > Hi Orlando, > We use ctdb/samba for CIFS, and CNFS for NFS > (GPFS version 3.4.0-13) . Current versions are > > ctdb - 1.0.99 > samba 3.5.15 > > Both compiled from source. We have about 300+ users normally. > We have suspicions that 3.6 has put additional "chatter" into the ctdb database stream, which has pushed us over the edge. Barry Evans has found that the clustered locking databases, in particular, prove to be a scalability/usability limit for ctdb. > We have had no issues with this setup apart from CNFS which had 2 or 3 > bad moments over the last year . These have gone away since we have > fixed a bug with our 10G NIC drivers (emulex cards , kernel module > be2net) which lead to occasional dropped packets for jumbo frames. There > have been no issues with samba/ctdb > > The only comment I can make is that during initial investigations into > an upgrade of samba to 3.6.x we discovered that the 3.6 code would not > compile against ctdb 1.0.99 (compilation requires the ctdb source ) > with error messages like: > > configure: checking whether cluster support is available > checking for ctdb.h... yes > checking for ctdb_private.h... yes > checking for CTDB_CONTROL_TRANS3_COMMIT declaration... yes > checking for CTDB_CONTROL_SCHEDULE_FOR_DELETION declaration... no > configure: error: "cluster support not available: support for > SCHEDULE_FOR_DELETION control missing" > > > What occurs to me is that this message seems to indicate that it is > possible to run a ctdb version that is incompatible with samba 3.6. > That would imply that an upgrade to a higher version of ctdb might > help, of course it might not and make backing out harder. Certainly 1.0.114 builds fine - I've not tried 2.0, I'm too scared! The versioning in CTDB has proved hard for me to fathom... > > A compile against ctdb 2.0 works fine.
We will soon be running in this > upgrade, but I'm waiting to see what the samba people say at the UG > meeting first! > It has to be said - the timing is good! Cheers, Orlando > > Thanks > > Bob > > > On 12 April 2013 13:37, Orlando Richards > wrote: > > Hi folks, ac > > We've long been using CTDB and Samba for our NAS service, servicing > ~500 users. We've been suffering from some problems with the CTDB > performance over the last few weeks, likely triggered either by an > upgrade of samba from 3.5 to 3.6 (and enabling of SMB2 as a result), > or possibly by additional users coming on with a new workload. > > We run CTDB 1.0.114.4-1 (from sernet) and samba3-3.6.12-44 (again, > from sernet). Before we roll back, we'd like to make sure we can't > fix the problem and stick with Samba 3.6 (and we don't even know > that a roll back would fix the issue). > > The symptoms are a complete freeze of the service for CIFS users for > 10-60 seconds, and on the servers a corresponding spawning of large > numbers of CTDB processes, which seem to be created in a "big bang", > and then do what they do and exit in the subsequent 10-60 seconds. > > We also serve up NFS from the same ctdb-managed frontends, and GPFS > from the cluster - and these are both fine throughout. > > This was happening 5-10 times per hour, not at exact intervals > though. When we added a third node to the CTDB cluster, it "got > worse", and when we dropped the CTDB cluster down to a single node > and everything started behaving fine - which is where we are now. > > So, I've got a bunch of questions! > > - does anyone know why ctdb would be spawning these processes, and > if there's anything we can do to stop it needing to do it? > - has anyone done any more general performance / config > optimisation of CTDB? > > And - more generally - does anyone else actually use ctdb/samba/gpfs > on the scale of ~500 users or higher? If so - how do you find it? > > > -- > -- > Dr Orlando Richards > Information Services > IT Infrastructure Division > Unix Section > Tel: 0131 650 4994 > > The University of Edinburgh is a charitable body, registered in > Scotland, with registration number SC005336. > _________________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at gpfsug.org > http://gpfsug.org/mailman/__listinfo/gpfsug-discuss > > > > > > -- > > Bob Cregan > > Senior Storage Systems Administrator > > ACRC > > Bristol University > > Tel: +44 (0) 117 331 4406 > > skype: bobcregan > > Mobile: +44 (0) 7712388129 > -- -- Dr Orlando Richards Information Services IT Infrastructure Division Unix Section Tel: 0131 650 4994 The University of Edinburgh is a charitable body, registered in Scotland, with registration number SC005336. From viccornell at gmail.com Fri Apr 12 19:44:16 2013 From: viccornell at gmail.com (Vic Cornell) Date: Fri, 12 Apr 2013 19:44:16 +0100 Subject: [gpfsug-discuss] CTDB woes In-Reply-To: <51682BB0.7010507@ed.ac.uk> References: <51680020.4040509@ed.ac.uk> <51682BB0.7010507@ed.ac.uk> Message-ID: <271DA6EE-D64D-4DBC-9DFE-4335E55102D4@gmail.com> Have you tried putting the ctdb files onto a separate gpfs filesystem? Vic Cornell viccornell at gmail.com On 12 Apr 2013, at 16:43, Orlando Richards wrote: > On 12/04/13 15:43, Bob Cregan wrote: >> Hi Orlando, >> We use ctdb/samba for CIFS, and CNFS for NFS >> (GPFS version 3.4.0-13) . Current versions are >> >> ctdb - 1.0.99 >> samba 3.5.15 >> >> Both compiled from source. We have about 300+ users normally. 
>> > > We have suspicions that 3.6 has put additional "chatter" into the ctdb database stream, which has pushed us over the edge. Barry Evans has found that the clustered locking databases, in particular, prove to be a scalability/usability limit for ctdb. > > >> We have had no issues with this setup apart from CNFS which had 2 or 3 >> bad moments over the last year . These have gone away since we have >> fixed a bug with our 10G NIC drivers (emulex cards , kernel module >> be2net) which lead to occasional dropped packets for jumbo frames. There >> have been no issues with samba/ctdb >> >> The only comment I can make is that during initial investigations into >> an upgrade of samba to 3.6.x we discovered that the 3.6 code would not >> compile against ctdb 1.0.99 (compilation requires tthe ctdb source ) >> with error messages like: >> >> configure: checking whether cluster support is available >> checking for ctdb.h... yes >> checking for ctdb_private.h... yes >> checking for CTDB_CONTROL_TRANS3_COMMIT declaration... yes >> checking for CTDB_CONTROL_SCHEDULE_FOR_DELETION declaration... no >> configure: error: "cluster support not available: support for >> SCHEDULE_FOR_DELETION control missing" >> >> >> What occurs to me is that this message seems to indicate that it is >> possible to run a ctdb version that is incompatible with samba 3.6. >> That would imply that an upgrade to a higher version of ctdb might >> help, of course it might not and make backing out harder. > > Certainly 1.0.114 builds fine - I've not tried 2.0, I'm too scared! The versioning in CTDB has proved hard for me to fathom... > >> >> A compile against ctdb 2.0 works fine. We will soon be running in this >> upgrade, but I'm waiting to see what the samba people say at the UG >> meeting first! >> > > It has to be said - the timing is good! > Cheers, > Orlando > >> >> Thanks >> >> Bob >> >> >> On 12 April 2013 13:37, Orlando Richards > > wrote: >> >> Hi folks, ac >> >> We've long been using CTDB and Samba for our NAS service, servicing >> ~500 users. We've been suffering from some problems with the CTDB >> performance over the last few weeks, likely triggered either by an >> upgrade of samba from 3.5 to 3.6 (and enabling of SMB2 as a result), >> or possibly by additional users coming on with a new workload. >> >> We run CTDB 1.0.114.4-1 (from sernet) and samba3-3.6.12-44 (again, >> from sernet). Before we roll back, we'd like to make sure we can't >> fix the problem and stick with Samba 3.6 (and we don't even know >> that a roll back would fix the issue). >> >> The symptoms are a complete freeze of the service for CIFS users for >> 10-60 seconds, and on the servers a corresponding spawning of large >> numbers of CTDB processes, which seem to be created in a "big bang", >> and then do what they do and exit in the subsequent 10-60 seconds. >> >> We also serve up NFS from the same ctdb-managed frontends, and GPFS >> from the cluster - and these are both fine throughout. >> >> This was happening 5-10 times per hour, not at exact intervals >> though. When we added a third node to the CTDB cluster, it "got >> worse", and when we dropped the CTDB cluster down to a single node >> and everything started behaving fine - which is where we are now. >> >> So, I've got a bunch of questions! >> >> - does anyone know why ctdb would be spawning these processes, and >> if there's anything we can do to stop it needing to do it? >> - has anyone done any more general performance / config >> optimisation of CTDB? 
>> >> And - more generally - does anyone else actually use ctdb/samba/gpfs >> on the scale of ~500 users or higher? If so - how do you find it? >> >> >> -- >> -- >> Dr Orlando Richards >> Information Services >> IT Infrastructure Division >> Unix Section >> Tel: 0131 650 4994 >> >> The University of Edinburgh is a charitable body, registered in >> Scotland, with registration number SC005336. >> _________________________________________________ >> gpfsug-discuss mailing list >> gpfsug-discuss at gpfsug.org >> http://gpfsug.org/mailman/__listinfo/gpfsug-discuss >> >> >> >> >> >> -- >> >> Bob Cregan >> >> Senior Storage Systems Administrator >> >> ACRC >> >> Bristol University >> >> Tel: +44 (0) 117 331 4406 >> >> skype: bobcregan >> >> Mobile: +44 (0) 7712388129 >> > > > -- > -- > Dr Orlando Richards > Information Services > IT Infrastructure Division > Unix Section > Tel: 0131 650 4994 > > The University of Edinburgh is a charitable body, registered in Scotland, with registration number SC005336. > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at gpfsug.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss From orlando.richards at ed.ac.uk Mon Apr 15 10:54:39 2013 From: orlando.richards at ed.ac.uk (Orlando Richards) Date: Mon, 15 Apr 2013 10:54:39 +0100 Subject: [gpfsug-discuss] CTDB woes In-Reply-To: <271DA6EE-D64D-4DBC-9DFE-4335E55102D4@gmail.com> References: <51680020.4040509@ed.ac.uk> <51682BB0.7010507@ed.ac.uk> <271DA6EE-D64D-4DBC-9DFE-4335E55102D4@gmail.com> Message-ID: <516BCE5F.8010309@ed.ac.uk> On 12/04/13 19:44, Vic Cornell wrote: > Have you tried putting the ctdb files onto a separate gpfs filesystem? No - but considered it. However, the only "live" CTDB file that sits on GPFS is the reclock file, which - I think - is only used as the heartbeat between nodes and for the recovery process. Now, there's mileage in insulating that, certainly, but I don't think that's what we're suffering from here. On a positive note - we took the steps this morning to re-initialise the ctdb databases from current data, and things seem to be stable today so far. Basically - shut down ctdb on all but one node. On all but that node, do: mv /var/ctdb/ /var/ctdb.save.date then start up ctdb on those nodes. Once they've come up, shut down ctdb on the last node, move /var/ctdb out the way, and restart. That brings them all up with freshly compacted databases. Also, from the samba-technical mailing list came the advice to use a more recent ctdb - specifically, 1.2.61. I've got that built and ready to go (and a rebuilt samba compiled against it too), but if things prove to be stable after today's compacting, then we will probably leave it at that and not deploy this. Interesting that 2.0 wasn't suggested for "stable", and that the current "dev" version is 2.1. For reference, here's the start of the thread: https://lists.samba.org/archive/samba-technical/2013-April/091525.html -- Orlando. > > On 12 Apr 2013, at 16:43, Orlando Richards wrote: > >> On 12/04/13 15:43, Bob Cregan wrote: >>> Hi Orlando, >>> We use ctdb/samba for CIFS, and CNFS for NFS >>> (GPFS version 3.4.0-13) . Current versions are >>> >>> ctdb - 1.0.99 >>> samba 3.5.15 >>> >>> Both compiled from source. We have about 300+ users normally. >>> >> >> We have suspicions that 3.6 has put additional "chatter" into the ctdb database stream, which has pushed us over the edge. Barry Evans has found that the clustered locking databases, in particular, prove to be a scalability/usability limit for ctdb. 
>> >> >>> We have had no issues with this setup apart from CNFS which had 2 or 3 >>> bad moments over the last year . These have gone away since we have >>> fixed a bug with our 10G NIC drivers (emulex cards , kernel module >>> be2net) which lead to occasional dropped packets for jumbo frames. There >>> have been no issues with samba/ctdb >>> >>> The only comment I can make is that during initial investigations into >>> an upgrade of samba to 3.6.x we discovered that the 3.6 code would not >>> compile against ctdb 1.0.99 (compilation requires tthe ctdb source ) >>> with error messages like: >>> >>> configure: checking whether cluster support is available >>> checking for ctdb.h... yes >>> checking for ctdb_private.h... yes >>> checking for CTDB_CONTROL_TRANS3_COMMIT declaration... yes >>> checking for CTDB_CONTROL_SCHEDULE_FOR_DELETION declaration... no >>> configure: error: "cluster support not available: support for >>> SCHEDULE_FOR_DELETION control missing" >>> >>> >>> What occurs to me is that this message seems to indicate that it is >>> possible to run a ctdb version that is incompatible with samba 3.6. >>> That would imply that an upgrade to a higher version of ctdb might >>> help, of course it might not and make backing out harder. >> >> Certainly 1.0.114 builds fine - I've not tried 2.0, I'm too scared! The versioning in CTDB has proved hard for me to fathom... >> >>> >>> A compile against ctdb 2.0 works fine. We will soon be running in this >>> upgrade, but I'm waiting to see what the samba people say at the UG >>> meeting first! >>> >> >> It has to be said - the timing is good! >> Cheers, >> Orlando >> >>> >>> Thanks >>> >>> Bob >>> >>> >>> On 12 April 2013 13:37, Orlando Richards >> > wrote: >>> >>> Hi folks, ac >>> >>> We've long been using CTDB and Samba for our NAS service, servicing >>> ~500 users. We've been suffering from some problems with the CTDB >>> performance over the last few weeks, likely triggered either by an >>> upgrade of samba from 3.5 to 3.6 (and enabling of SMB2 as a result), >>> or possibly by additional users coming on with a new workload. >>> >>> We run CTDB 1.0.114.4-1 (from sernet) and samba3-3.6.12-44 (again, >>> from sernet). Before we roll back, we'd like to make sure we can't >>> fix the problem and stick with Samba 3.6 (and we don't even know >>> that a roll back would fix the issue). >>> >>> The symptoms are a complete freeze of the service for CIFS users for >>> 10-60 seconds, and on the servers a corresponding spawning of large >>> numbers of CTDB processes, which seem to be created in a "big bang", >>> and then do what they do and exit in the subsequent 10-60 seconds. >>> >>> We also serve up NFS from the same ctdb-managed frontends, and GPFS >>> from the cluster - and these are both fine throughout. >>> >>> This was happening 5-10 times per hour, not at exact intervals >>> though. When we added a third node to the CTDB cluster, it "got >>> worse", and when we dropped the CTDB cluster down to a single node >>> and everything started behaving fine - which is where we are now. >>> >>> So, I've got a bunch of questions! >>> >>> - does anyone know why ctdb would be spawning these processes, and >>> if there's anything we can do to stop it needing to do it? >>> - has anyone done any more general performance / config >>> optimisation of CTDB? >>> >>> And - more generally - does anyone else actually use ctdb/samba/gpfs >>> on the scale of ~500 users or higher? If so - how do you find it? 
>>> >>> >>> -- >>> -- >>> Dr Orlando Richards >>> Information Services >>> IT Infrastructure Division >>> Unix Section >>> Tel: 0131 650 4994 >>> >>> The University of Edinburgh is a charitable body, registered in >>> Scotland, with registration number SC005336. >>> _________________________________________________ >>> gpfsug-discuss mailing list >>> gpfsug-discuss at gpfsug.org >>> http://gpfsug.org/mailman/__listinfo/gpfsug-discuss >>> >>> >>> >>> >>> >>> -- >>> >>> Bob Cregan >>> >>> Senior Storage Systems Administrator >>> >>> ACRC >>> >>> Bristol University >>> >>> Tel: +44 (0) 117 331 4406 >>> >>> skype: bobcregan >>> >>> Mobile: +44 (0) 7712388129 >>> >> >> >> -- >> -- >> Dr Orlando Richards >> Information Services >> IT Infrastructure Division >> Unix Section >> Tel: 0131 650 4994 >> >> The University of Edinburgh is a charitable body, registered in Scotland, with registration number SC005336. >> _______________________________________________ >> gpfsug-discuss mailing list >> gpfsug-discuss at gpfsug.org >> http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at gpfsug.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > -- -- Dr Orlando Richards Information Services IT Infrastructure Division Unix Section Tel: 0131 650 4994 The University of Edinburgh is a charitable body, registered in Scotland, with registration number SC005336. From crobson at ocf.co.uk Mon Apr 15 15:04:38 2013 From: crobson at ocf.co.uk (Claire Robson) Date: Mon, 15 Apr 2013 15:04:38 +0100 Subject: [gpfsug-discuss] Latest agenda and places still available Message-ID: Dear All, Thank you to those who have expressed an interest in next Wednesday's GPFS user group meeting in London and registered a place. There are a few places still available, please register with me if you would like to attend. This is the latest agenda for the day: 10:30 Arrivals and refreshments 11:00 Introductions and committee updates Jez Tucker, Group Chair & Claire Robson, Group Secretary 11:05 GPFS FPO Dinesh Subhraveti, IBM Almaden Research Labs 12:00 SAMBA 4.0 & CTDB 2.0 Michael Adams, SAMBA Development Team 13:00 Lunch (Buffet provided) 13:45 GPFS OpenStack Integration Dinesh Subhraveti, IBM Almaden Research Labs 14:15 SAMBA & GPFS Integration Volker Lendecke, SAMBA Development Team 15:15 Refreshments break 15:30 GPFS Native RAID & LTFS Jim Roche, IBM 16:00 Group discussion: Questions & Committee matters Led by Jez Tucker, Group Chairperson 16:05 Close I look forward to seeing many of you next week. Kind regards, Claire Robson GPFS user group Secetary Tel: 0114 257 2200 Mob: 07508 033896 Web: www.gpfsug.org -------------- next part -------------- An HTML attachment was scrubbed... URL: From AHMADYH at sa.ibm.com Tue Apr 16 13:08:58 2013 From: AHMADYH at sa.ibm.com (Ahmad Y Hussein) Date: Tue, 16 Apr 2013 16:08:58 +0400 Subject: [gpfsug-discuss] AUTO: Ahmad Y Hussein is out of the office (returning 04/29/2013) Message-ID: I am out of the office until 04/29/2013. Dear Sender; I am in a customer engagement with extremely limited email access, I will respond to your emails as soon as i can. For Urjent cases please call me on my mobile (+966542001289). Thank you for understanding. Regards; Ahmad Y Hussein Note: This is an automated response to your message "gpfsug-discuss Digest, Vol 16, Issue 6" sent on 16/04/2013 15:00:02. This is the only notification you will receive while this person is away. 
From orlando.richards at ed.ac.uk Wed Apr 17 11:30:32 2013 From: orlando.richards at ed.ac.uk (Orlando Richards) Date: Wed, 17 Apr 2013 11:30:32 +0100 Subject: [gpfsug-discuss] CTDB woes In-Reply-To: <516BCE5F.8010309@ed.ac.uk> References: <51680020.4040509@ed.ac.uk> <51682BB0.7010507@ed.ac.uk> <271DA6EE-D64D-4DBC-9DFE-4335E55102D4@gmail.com> <516BCE5F.8010309@ed.ac.uk> Message-ID: <516E79C8.8090603@ed.ac.uk> Hi All - an update to this, After re-initialising the databases on Monday, things did seem to be running better, but ultimately we got back to suffering from spikes in ctdb processes and corresponding "pauses" in service. We fell back to a single node again for Tuesday (and things were stable once again), and this morning rolled out CTDB 1.2.61 (plus a 3.6.12 samba which was rebuilt against CTDB 1.2.61 headers). Things seem to be stable for now - more so than on Monday. For the record - one metric I'm watching is the number of ctdb processes running (this would spike to > 1000 under the failure conditions). It's currently sitting consistently at 3 processes, with occasional blips of 5-7 processes. -- Orlando On 15/04/13 10:54, Orlando Richards wrote: > On 12/04/13 19:44, Vic Cornell wrote: >> Have you tried putting the ctdb files onto a separate gpfs filesystem? > > No - but considered it. However, the only "live" CTDB file that sits on > GPFS is the reclock file, which - I think - is only used as the > heartbeat between nodes and for the recovery process. Now, there's > mileage in insulating that, certainly, but I don't think that's what > we're suffering from here. > > On a positive note - we took the steps this morning to re-initialise the > ctdb databases from current data, and things seem to be stable today so > far. > > Basically - shut down ctdb on all but one node. On all but that node, do: > mv /var/ctdb/ /var/ctdb.save.date > > then start up ctdb on those nodes. Once they've come up, shut down ctdb > on the last node, move /var/ctdb out the way, and restart. That brings > them all up with freshly compacted databases. > > Also, from the samba-technical mailing list came the advice to use a > more recent ctdb - specifically, 1.2.61. I've got that built and ready > to go (and a rebuilt samba compiled against it too), but if things prove > to be stable after today's compacting, then we will probably leave it at > that and not deploy this. > > Interesting that 2.0 wasn't suggested for "stable", and that the current > "dev" version is 2.1. > > For reference, here's the start of the thread: > https://lists.samba.org/archive/samba-technical/2013-April/091525.html > > -- > Orlando. > > > >> >> On 12 Apr 2013, at 16:43, Orlando Richards >> wrote: >> >>> On 12/04/13 15:43, Bob Cregan wrote: >>>> Hi Orlando, >>>> We use ctdb/samba for CIFS, and CNFS for NFS >>>> (GPFS version 3.4.0-13) . Current versions are >>>> >>>> ctdb - 1.0.99 >>>> samba 3.5.15 >>>> >>>> Both compiled from source. We have about 300+ users normally. >>>> >>> >>> We have suspicions that 3.6 has put additional "chatter" into the >>> ctdb database stream, which has pushed us over the edge. Barry Evans >>> has found that the clustered locking databases, in particular, prove >>> to be a scalability/usability limit for ctdb. >>> >>> >>>> We have had no issues with this setup apart from CNFS which had 2 or 3 >>>> bad moments over the last year . These have gone away since we have >>>> fixed a bug with our 10G NIC drivers (emulex cards , kernel module >>>> be2net) which lead to occasional dropped packets for jumbo frames. 
>>>> There >>>> have been no issues with samba/ctdb >>>> >>>> The only comment I can make is that during initial investigations into >>>> an upgrade of samba to 3.6.x we discovered that the 3.6 code would not >>>> compile against ctdb 1.0.99 (compilation requires tthe ctdb source ) >>>> with error messages like: >>>> >>>> configure: checking whether cluster support is available >>>> checking for ctdb.h... yes >>>> checking for ctdb_private.h... yes >>>> checking for CTDB_CONTROL_TRANS3_COMMIT declaration... yes >>>> checking for CTDB_CONTROL_SCHEDULE_FOR_DELETION declaration... no >>>> configure: error: "cluster support not available: support for >>>> SCHEDULE_FOR_DELETION control missing" >>>> >>>> >>>> What occurs to me is that this message seems to indicate that it is >>>> possible to run a ctdb version that is incompatible with samba 3.6. >>>> That would imply that an upgrade to a higher version of ctdb might >>>> help, of course it might not and make backing out harder. >>> >>> Certainly 1.0.114 builds fine - I've not tried 2.0, I'm too scared! >>> The versioning in CTDB has proved hard for me to fathom... >>> >>>> >>>> A compile against ctdb 2.0 works fine. We will soon be running in this >>>> upgrade, but I'm waiting to see what the samba people say at the UG >>>> meeting first! >>>> >>> >>> It has to be said - the timing is good! >>> Cheers, >>> Orlando >>> >>>> >>>> Thanks >>>> >>>> Bob >>>> >>>> >>>> On 12 April 2013 13:37, Orlando Richards >>> > wrote: >>>> >>>> Hi folks, ac >>>> >>>> We've long been using CTDB and Samba for our NAS service, servicing >>>> ~500 users. We've been suffering from some problems with the CTDB >>>> performance over the last few weeks, likely triggered either by an >>>> upgrade of samba from 3.5 to 3.6 (and enabling of SMB2 as a >>>> result), >>>> or possibly by additional users coming on with a new workload. >>>> >>>> We run CTDB 1.0.114.4-1 (from sernet) and samba3-3.6.12-44 (again, >>>> from sernet). Before we roll back, we'd like to make sure we can't >>>> fix the problem and stick with Samba 3.6 (and we don't even know >>>> that a roll back would fix the issue). >>>> >>>> The symptoms are a complete freeze of the service for CIFS users >>>> for >>>> 10-60 seconds, and on the servers a corresponding spawning of large >>>> numbers of CTDB processes, which seem to be created in a "big >>>> bang", >>>> and then do what they do and exit in the subsequent 10-60 seconds. >>>> >>>> We also serve up NFS from the same ctdb-managed frontends, and GPFS >>>> from the cluster - and these are both fine throughout. >>>> >>>> This was happening 5-10 times per hour, not at exact intervals >>>> though. When we added a third node to the CTDB cluster, it "got >>>> worse", and when we dropped the CTDB cluster down to a single node >>>> and everything started behaving fine - which is where we are now. >>>> >>>> So, I've got a bunch of questions! >>>> >>>> - does anyone know why ctdb would be spawning these processes, >>>> and >>>> if there's anything we can do to stop it needing to do it? >>>> - has anyone done any more general performance / config >>>> optimisation of CTDB? >>>> >>>> And - more generally - does anyone else actually use >>>> ctdb/samba/gpfs >>>> on the scale of ~500 users or higher? If so - how do you find it? 
>>>> >>>> >>>> -- >>>> -- >>>> Dr Orlando Richards >>>> Information Services >>>> IT Infrastructure Division >>>> Unix Section >>>> Tel: 0131 650 4994 >>>> >>>> The University of Edinburgh is a charitable body, registered in >>>> Scotland, with registration number SC005336. >>>> _________________________________________________ >>>> gpfsug-discuss mailing list >>>> gpfsug-discuss at gpfsug.org >>>> http://gpfsug.org/mailman/__listinfo/gpfsug-discuss >>>> >>>> >>>> >>>> >>>> >>>> -- >>>> >>>> Bob Cregan >>>> >>>> Senior Storage Systems Administrator >>>> >>>> ACRC >>>> >>>> Bristol University >>>> >>>> Tel: +44 (0) 117 331 4406 >>>> >>>> skype: bobcregan >>>> >>>> Mobile: +44 (0) 7712388129 >>>> >>> >>> >>> -- >>> -- >>> Dr Orlando Richards >>> Information Services >>> IT Infrastructure Division >>> Unix Section >>> Tel: 0131 650 4994 >>> >>> The University of Edinburgh is a charitable body, registered in >>> Scotland, with registration number SC005336. >>> _______________________________________________ >>> gpfsug-discuss mailing list >>> gpfsug-discuss at gpfsug.org >>> http://gpfsug.org/mailman/listinfo/gpfsug-discuss >> >> _______________________________________________ >> gpfsug-discuss mailing list >> gpfsug-discuss at gpfsug.org >> http://gpfsug.org/mailman/listinfo/gpfsug-discuss >> > > -- -- Dr Orlando Richards Information Services IT Infrastructure Division Unix Section Tel: 0131 650 4994 The University of Edinburgh is a charitable body, registered in Scotland, with registration number SC005336. From orlando.richards at ed.ac.uk Mon Apr 22 15:52:55 2013 From: orlando.richards at ed.ac.uk (Orlando Richards) Date: Mon, 22 Apr 2013 15:52:55 +0100 Subject: [gpfsug-discuss] CTDB woes In-Reply-To: <516E79C8.8090603@ed.ac.uk> References: <51680020.4040509@ed.ac.uk> <51682BB0.7010507@ed.ac.uk> <271DA6EE-D64D-4DBC-9DFE-4335E55102D4@gmail.com> <516BCE5F.8010309@ed.ac.uk> <516E79C8.8090603@ed.ac.uk> Message-ID: <51754EC7.8000600@ed.ac.uk> On 17/04/13 11:30, Orlando Richards wrote: > Hi All - an update to this, > > After re-initialising the databases on Monday, things did seem to be > running better, but ultimately we got back to suffering from spikes in > ctdb processes and corresponding "pauses" in service. We fell back to a > single node again for Tuesday (and things were stable once again), and > this morning rolled out CTDB 1.2.61 (plus a 3.6.12 samba which was > rebuilt against CTDB 1.2.61 headers). > > Things seem to be stable for now - more so than on Monday. > > For the record - one metric I'm watching is the number of ctdb processes > running (this would spike to > 1000 under the failure conditions). It's > currently sitting consistently at 3 processes, with occasional blips of > 5-7 processes. > Hi all, Looks like things have been running fine since we upgraded ctdb last Wednesday, so I think it's safe to say that we've found a fix for our problem in CTDB 1.2.61. Thanks for all the input! If anyone wants more info, feel free to get in touch. -- Orlando > -- > Orlando > > > > > > On 15/04/13 10:54, Orlando Richards wrote: >> On 12/04/13 19:44, Vic Cornell wrote: >>> Have you tried putting the ctdb files onto a separate gpfs filesystem? >> >> No - but considered it. However, the only "live" CTDB file that sits on >> GPFS is the reclock file, which - I think - is only used as the >> heartbeat between nodes and for the recovery process. Now, there's >> mileage in insulating that, certainly, but I don't think that's what >> we're suffering from here. 
>> >> On a positive note - we took the steps this morning to re-initialise the >> ctdb databases from current data, and things seem to be stable today so >> far. >> >> Basically - shut down ctdb on all but one node. On all but that node, do: >> mv /var/ctdb/ /var/ctdb.save.date >> >> then start up ctdb on those nodes. Once they've come up, shut down ctdb >> on the last node, move /var/ctdb out the way, and restart. That brings >> them all up with freshly compacted databases. >> >> Also, from the samba-technical mailing list came the advice to use a >> more recent ctdb - specifically, 1.2.61. I've got that built and ready >> to go (and a rebuilt samba compiled against it too), but if things prove >> to be stable after today's compacting, then we will probably leave it at >> that and not deploy this. >> >> Interesting that 2.0 wasn't suggested for "stable", and that the current >> "dev" version is 2.1. >> >> For reference, here's the start of the thread: >> https://lists.samba.org/archive/samba-technical/2013-April/091525.html >> >> -- >> Orlando. >> >> >> >>> >>> On 12 Apr 2013, at 16:43, Orlando Richards >>> wrote: >>> >>>> On 12/04/13 15:43, Bob Cregan wrote: >>>>> Hi Orlando, >>>>> We use ctdb/samba for CIFS, and CNFS for NFS >>>>> (GPFS version 3.4.0-13) . Current versions are >>>>> >>>>> ctdb - 1.0.99 >>>>> samba 3.5.15 >>>>> >>>>> Both compiled from source. We have about 300+ users normally. >>>>> >>>> >>>> We have suspicions that 3.6 has put additional "chatter" into the >>>> ctdb database stream, which has pushed us over the edge. Barry Evans >>>> has found that the clustered locking databases, in particular, prove >>>> to be a scalability/usability limit for ctdb. >>>> >>>> >>>>> We have had no issues with this setup apart from CNFS which had 2 or 3 >>>>> bad moments over the last year . These have gone away since we have >>>>> fixed a bug with our 10G NIC drivers (emulex cards , kernel module >>>>> be2net) which lead to occasional dropped packets for jumbo frames. >>>>> There >>>>> have been no issues with samba/ctdb >>>>> >>>>> The only comment I can make is that during initial investigations into >>>>> an upgrade of samba to 3.6.x we discovered that the 3.6 code would not >>>>> compile against ctdb 1.0.99 (compilation requires tthe ctdb source ) >>>>> with error messages like: >>>>> >>>>> configure: checking whether cluster support is available >>>>> checking for ctdb.h... yes >>>>> checking for ctdb_private.h... yes >>>>> checking for CTDB_CONTROL_TRANS3_COMMIT declaration... yes >>>>> checking for CTDB_CONTROL_SCHEDULE_FOR_DELETION declaration... no >>>>> configure: error: "cluster support not available: support for >>>>> SCHEDULE_FOR_DELETION control missing" >>>>> >>>>> >>>>> What occurs to me is that this message seems to indicate that it is >>>>> possible to run a ctdb version that is incompatible with samba 3.6. >>>>> That would imply that an upgrade to a higher version of ctdb might >>>>> help, of course it might not and make backing out harder. >>>> >>>> Certainly 1.0.114 builds fine - I've not tried 2.0, I'm too scared! >>>> The versioning in CTDB has proved hard for me to fathom... >>>> >>>>> >>>>> A compile against ctdb 2.0 works fine. We will soon be running in this >>>>> upgrade, but I'm waiting to see what the samba people say at the UG >>>>> meeting first! >>>>> >>>> >>>> It has to be said - the timing is good! 
>>>> Cheers, >>>> Orlando >>>> >>>>> >>>>> Thanks >>>>> >>>>> Bob >>>>> >>>>> >>>>> On 12 April 2013 13:37, Orlando Richards >>>> > wrote: >>>>> >>>>> Hi folks, ac >>>>> >>>>> We've long been using CTDB and Samba for our NAS service, >>>>> servicing >>>>> ~500 users. We've been suffering from some problems with the CTDB >>>>> performance over the last few weeks, likely triggered either by an >>>>> upgrade of samba from 3.5 to 3.6 (and enabling of SMB2 as a >>>>> result), >>>>> or possibly by additional users coming on with a new workload. >>>>> >>>>> We run CTDB 1.0.114.4-1 (from sernet) and samba3-3.6.12-44 (again, >>>>> from sernet). Before we roll back, we'd like to make sure we can't >>>>> fix the problem and stick with Samba 3.6 (and we don't even know >>>>> that a roll back would fix the issue). >>>>> >>>>> The symptoms are a complete freeze of the service for CIFS users >>>>> for >>>>> 10-60 seconds, and on the servers a corresponding spawning of >>>>> large >>>>> numbers of CTDB processes, which seem to be created in a "big >>>>> bang", >>>>> and then do what they do and exit in the subsequent 10-60 seconds. >>>>> >>>>> We also serve up NFS from the same ctdb-managed frontends, and >>>>> GPFS >>>>> from the cluster - and these are both fine throughout. >>>>> >>>>> This was happening 5-10 times per hour, not at exact intervals >>>>> though. When we added a third node to the CTDB cluster, it "got >>>>> worse", and when we dropped the CTDB cluster down to a single node >>>>> and everything started behaving fine - which is where we are now. >>>>> >>>>> So, I've got a bunch of questions! >>>>> >>>>> - does anyone know why ctdb would be spawning these processes, >>>>> and >>>>> if there's anything we can do to stop it needing to do it? >>>>> - has anyone done any more general performance / config >>>>> optimisation of CTDB? >>>>> >>>>> And - more generally - does anyone else actually use >>>>> ctdb/samba/gpfs >>>>> on the scale of ~500 users or higher? If so - how do you find it? >>>>> >>>>> >>>>> -- >>>>> -- >>>>> Dr Orlando Richards >>>>> Information Services >>>>> IT Infrastructure Division >>>>> Unix Section >>>>> Tel: 0131 650 4994 >>>>> >>>>> The University of Edinburgh is a charitable body, registered in >>>>> Scotland, with registration number SC005336. >>>>> _________________________________________________ >>>>> gpfsug-discuss mailing list >>>>> gpfsug-discuss at gpfsug.org >>>>> http://gpfsug.org/mailman/__listinfo/gpfsug-discuss >>>>> >>>>> >>>>> >>>>> >>>>> >>>>> -- >>>>> >>>>> Bob Cregan >>>>> >>>>> Senior Storage Systems Administrator >>>>> >>>>> ACRC >>>>> >>>>> Bristol University >>>>> >>>>> Tel: +44 (0) 117 331 4406 >>>>> >>>>> skype: bobcregan >>>>> >>>>> Mobile: +44 (0) 7712388129 >>>>> >>>> >>>> >>>> -- >>>> -- >>>> Dr Orlando Richards >>>> Information Services >>>> IT Infrastructure Division >>>> Unix Section >>>> Tel: 0131 650 4994 >>>> >>>> The University of Edinburgh is a charitable body, registered in >>>> Scotland, with registration number SC005336. 
>>>> _______________________________________________ >>>> gpfsug-discuss mailing list >>>> gpfsug-discuss at gpfsug.org >>>> http://gpfsug.org/mailman/listinfo/gpfsug-discuss >>> >>> _______________________________________________ >>> gpfsug-discuss mailing list >>> gpfsug-discuss at gpfsug.org >>> http://gpfsug.org/mailman/listinfo/gpfsug-discuss >>> >> >> > > -- -- Dr Orlando Richards Information Services IT Infrastructure Division Unix Section Tel: 0131 650 4994 The University of Edinburgh is a charitable body, registered in Scotland, with registration number SC005336. From pete at realisestudio.com Thu Apr 25 10:38:07 2013 From: pete at realisestudio.com (Pete Smith) Date: Thu, 25 Apr 2013 10:38:07 +0100 Subject: [gpfsug-discuss] Test cluster - some questions Message-ID: Hi all Good to see lots of you at the user group meeting yesterday. Great work, Jez! We're setting up a test cluster here at Realise, with a view to moving our main storage over from Gluster. We're running the test cluster on Isilon hardware ... a couple of 1920 nodes that we were using for home dirs. Each node has dual gigabit ethernet ports, and dual infiniband ports. Single dual-core Xeon proc and and 4GB RAM. All good stuff and should make a nice test rig. I have a few questions! 1. We're on centos6.4.x86_64. What's the easiest way to go from 3.3.blah to 3.5? 2. I'm having trouble assigning NSDs. I have a descfile which looks like: #DiskName:PrimaryServer:BackupServer:DiskUsage:FailureGroup:DesiredName:StoragePool /dev/sdc1:gpfs001.realisestudio.com::dataAndMetadata:1 but the command "mmcrnsd -F /tmp/descfile -v no" just craps out with mmcrnsd: Processing disk sdc1 mmcrnsd: Node gpfs001.realisestudio.com does not have a GPFS server license designation. mmcrnsd: Error found while checking disk descriptor /dev/sdc1:gpfs001.realisestudio.com::dataAndMetadata:1 mmcrnsd: Command failed. Examine previous error messages to determine cause. Any help pointing me gently in the right direction would be much appreciated. :-) TIA -- Pete Smith DevOp/System Administrator Realise Studio 12/13 Poland Street, London W1F 8QB T. +44 (0)20 7165 9644 realisestudio.com -------------- next part -------------- An HTML attachment was scrubbed... URL: From orlando.richards at ed.ac.uk Thu Apr 25 10:48:30 2013 From: orlando.richards at ed.ac.uk (Orlando Richards) Date: Thu, 25 Apr 2013 10:48:30 +0100 Subject: [gpfsug-discuss] Test cluster - some questions In-Reply-To: References: Message-ID: <5178FBEE.4070200@ed.ac.uk> On 25/04/13 10:38, Pete Smith wrote: > Hi all > > Good to see lots of you at the user group meeting yesterday. Great work, > Jez! > > We're setting up a test cluster here at Realise, with a view to moving > our main storage over from Gluster. > > We're running the test cluster on Isilon hardware ... a couple of 1920 > nodes that we were using for home dirs. Each node has dual gigabit > ethernet ports, and dual infiniband ports. Single dual-core Xeon proc > and and 4GB RAM. All good stuff and should make a nice test rig. > > I have a few questions! > > 1. We're on centos6.4.x86_64. What's the easiest way to go from > 3.3.blah to 3.5? > 2. I'm having trouble assigning NSDs. 
I have a descfile which looks like: > > #DiskName:PrimaryServer:BackupServer:DiskUsage:FailureGroup:DesiredName:StoragePool > /dev/sdc1:gpfs001.realisestudio.com::dataAndMetadata:1 > > but the command > > "mmcrnsd -F /tmp/descfile -v no" > > just craps out with > > mmcrnsd: Processing disk sdc1 > mmcrnsd: Node gpfs001.realisestudio.com > does not have a GPFS server license > designation. > mmcrnsd: Error found while checking disk descriptor > /dev/sdc1:gpfs001.realisestudio.com::dataAndMetadata:1 > mmcrnsd: Command failed. Examine previous error messages to determine > cause. > mmchlicense server -N gpfs001.realisestudio.com should sort that one out. > Any help pointing me gently in the right direction would be much > appreciated. :-) > > TIA > > -- > Pete Smith > DevOp/System Administrator > Realise Studio > 12/13 Poland Street, London W1F 8QB > T. +44 (0)20 7165 9644 > > realisestudio.com > > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at gpfsug.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > -- -- Dr Orlando Richards Information Services IT Infrastructure Division Unix Section Tel: 0131 650 4994 The University of Edinburgh is a charitable body, registered in Scotland, with registration number SC005336. From pete at realisestudio.com Thu Apr 25 11:05:36 2013 From: pete at realisestudio.com (Pete Smith) Date: Thu, 25 Apr 2013 11:05:36 +0100 Subject: [gpfsug-discuss] Test cluster - some questions In-Reply-To: <5178FBEE.4070200@ed.ac.uk> References: <5178FBEE.4070200@ed.ac.uk> Message-ID: Thanks Orlando. Much appreciated. On 25 April 2013 10:48, Orlando Richards wrote: > On 25/04/13 10:38, Pete Smith wrote: > >> Hi all >> >> Good to see lots of you at the user group meeting yesterday. Great work, >> Jez! >> >> We're setting up a test cluster here at Realise, with a view to moving >> our main storage over from Gluster. >> >> We're running the test cluster on Isilon hardware ... a couple of 1920 >> nodes that we were using for home dirs. Each node has dual gigabit >> ethernet ports, and dual infiniband ports. Single dual-core Xeon proc >> and and 4GB RAM. All good stuff and should make a nice test rig. >> >> I have a few questions! >> >> 1. We're on centos6.4.x86_64. What's the easiest way to go from >> 3.3.blah to 3.5? >> 2. I'm having trouble assigning NSDs. I have a descfile which looks like: >> >> #DiskName:PrimaryServer:**BackupServer:DiskUsage:** >> FailureGroup:DesiredName:**StoragePool >> /dev/sdc1:gpfs001.**realisestudio.com::**dataAndMetadata:1 >> >> but the command >> >> "mmcrnsd -F /tmp/descfile -v no" >> >> just craps out with >> >> mmcrnsd: Processing disk sdc1 >> mmcrnsd: Node gpfs001.realisestudio.com >> > >> does not have a GPFS server license >> designation. >> mmcrnsd: Error found while checking disk descriptor >> /dev/sdc1:gpfs001.**realisestudio.com::**dataAndMetadata:1 >> mmcrnsd: Command failed. Examine previous error messages to determine >> cause. >> >> > mmchlicense server -N gpfs001.realisestudio.com should sort that one out. > > > Any help pointing me gently in the right direction would be much >> appreciated. :-) >> >> TIA >> >> -- >> Pete Smith >> DevOp/System Administrator >> Realise Studio >> 12/13 Poland Street, London W1F 8QB >> T. 
+44 (0)20 7165 9644 >> >> realisestudio.com >> >> >> ______________________________**_________________ >> gpfsug-discuss mailing list >> gpfsug-discuss at gpfsug.org >> http://gpfsug.org/mailman/**listinfo/gpfsug-discuss >> >> > > -- > -- > Dr Orlando Richards > Information Services > IT Infrastructure Division > Unix Section > Tel: 0131 650 4994 > > The University of Edinburgh is a charitable body, registered in Scotland, > with registration number SC005336. > ______________________________**_________________ > gpfsug-discuss mailing list > gpfsug-discuss at gpfsug.org > http://gpfsug.org/mailman/**listinfo/gpfsug-discuss > -- Pete Smith DevOp/System Administrator Realise Studio 12/13 Poland Street, London W1F 8QB T. +44 (0)20 7165 9644 realisestudio.com -------------- next part -------------- An HTML attachment was scrubbed... URL: From pete at realisestudio.com Fri Apr 26 16:06:38 2013 From: pete at realisestudio.com (Pete Smith) Date: Fri, 26 Apr 2013 16:06:38 +0100 Subject: [gpfsug-discuss] GPS Native RAID on linux? Message-ID: Hi I thought from the presentation that this was available on linux ... but documentation seems to indicate IBM GSS only? -- Pete Smith DevOp/System Administrator Realise Studio 12/13 Poland Street, London W1F 8QB T. +44 (0)20 7165 9644 realisestudio.com -------------- next part -------------- An HTML attachment was scrubbed... URL: From stuartb at 4gh.net Tue Apr 30 21:50:38 2013 From: stuartb at 4gh.net (Stuart Barkley) Date: Tue, 30 Apr 2013 16:50:38 -0400 (EDT) Subject: [gpfsug-discuss] Test cluster - some questions In-Reply-To: References: Message-ID: On Thu, 25 Apr 2013 at 05:38 -0000, Pete Smith wrote: > 1. We're on centos6.4.x86_64. What's the easiest way to go from > 3.3.blah to 3.5? We are in transition to 3.5 on our original GPFS installation. Two of four servers are now at GPFS 3.4.XX/CentOS 6.4. Two servers are still at 3.3.YY/CentOS 5.4. The compute nodes are all to 3.4.XX/CentOS 6.4. The data center is remotely located and it is a pain to get physical access. Once we get the last two nodes upgraded, we expect to go to GPFS 3.5 fairly quickly (we already have 3.5 running on a newer GPFS installation). My understanding is that you need to step through 3.4 during a migration from 3.3 to 3.5. Stuart -- I've never been lost; I was once bewildered for three days, but never lost! -- Daniel Boone
Regards; Ahmad Y Hussein Note: This is an automated response to your message "gpfsug-discuss Digest, Vol 16, Issue 1" sent on 03/04/2013 15:00:02. This is the only notification you will receive while this person is away. From chris_stone at uk.ibm.com Wed Apr 3 16:08:39 2013 From: chris_stone at uk.ibm.com (Chris Stone) Date: Wed, 3 Apr 2013 16:08:39 +0100 Subject: [gpfsug-discuss] AUTO: Chris Stone/UK/IBM is out of the office until 16/08/2004. (returning 11/04/2013) Message-ID: I am out of the office until 11/04/2013. In an emergency please contact my manager Aniket Patel on :+44 (0) 7736 017 418 Note: This is an automated response to your message "[gpfsug-discuss] mmbackup and management classes" sent on 03/04/2013 10:57:05. This is the only notification you will receive while this person is away. From ANDREWD at uk.ibm.com Wed Apr 3 16:10:26 2013 From: ANDREWD at uk.ibm.com (Andrew Downes1) Date: Wed, 3 Apr 2013 16:10:26 +0100 Subject: [gpfsug-discuss] AUTO: Andrew Downes is out of the office (returning 08/04/2013) Message-ID: I am out of the office until 08/04/2013. If anything is too urgent to wait for my return please contact Matt Ayres mailto:m_ayres at uk.ibm.com 44-7710-981527 In case of urgency, please contact our manager Dave Shave-Wall mailto:dave_shavewall at uk.ibm.com 44-7740-921623 Note: This is an automated response to your message "gpfsug-discuss Digest, Vol 16, Issue 1" sent on 03/04/2013 12:00:02. This is the only notification you will receive while this person is away. From ashish.thandavan at cs.ox.ac.uk Thu Apr 11 10:58:41 2013 From: ashish.thandavan at cs.ox.ac.uk (Ashish Thandavan) Date: Thu, 11 Apr 2013 10:58:41 +0100 Subject: [gpfsug-discuss] Register now: Spring GPFS User Group arranged In-Reply-To: References: Message-ID: <51668951.7040506@cs.ox.ac.uk> Dear Claire, I trust you are well! If there are any spaces left, could you please register me for the event? Thank you! Regards, Ash On 25/03/13 14:38, Claire Robson wrote: > > Dear All, > > The next meeting date is set for *Wednesday 24^th April* and will be > taking place at the fantastic Dolby Studios in London (Dolby Europe > Limited, 4--6 Soho Square, London W1D 3PZ). > > *Getting to Dolby Europe Limited, Soho Square, London* > > Leave the Tottenham Court Road tube station by the South Oxford Street > exit [Exit 1]. > > Turn left onto Oxford Street. > > After about 50m turn left into Soho Street. > > Turn right into Soho Square. > > 4-6 Soho Square is directly in front of you. > > Our tentative agenda is as follows: > > 10:30 Arrivals and refreshments > > 11:00 Introductions and committee updates > > Jez Tucker, Group Chair & Claire Robson, Group Secretary > > 11:05 GPFS OpenStack Integration > > Prasenhit Sarkar, IBM Almaden Research Labs > > GPFS FPO > > Dinesh Subhraveti, IBM Almaden Research Labs > > 11:45 SAMBA 4.0 & CTDB 2.0 > > Michael Adams, SAMBA Development Team > > 12:15 SAMBA & GPFS Integration > > Volker Lendecke, SAMBA Development Team > > 13:00 Lunch (Buffet provided) > > 14:00 GPFS Native RAID & LTFS > > Jim Roche, IBM > > 14:45 User Stories > > 15:45 Group discussion: Challenges, experiences and questions & > Committee matters > > Led by Jez Tucker, Group Chairperson > > 16:00 Close > > We will be starting at 11:00am and concluding at 4pm but some of the > speaker timings may alter slightly. I will be posting further details > on what the presentations cover over the coming week or so. > > We hope you can make it for what will be a really interesting day of > GPFS discussions. 
From orlando.richards at ed.ac.uk Fri Apr 12 13:37:52 2013 From: orlando.richards at ed.ac.uk (Orlando Richards) Date: Fri, 12 Apr 2013 13:37:52 +0100 Subject: [gpfsug-discuss] CTDB woes Message-ID: <51680020.4040509@ed.ac.uk> Hi folks, We've long been using CTDB and Samba for our NAS service, servicing ~500 users. We've been suffering from some problems with the CTDB performance over the last few weeks, likely triggered either by an upgrade of samba from 3.5 to 3.6 (and enabling of SMB2 as a result), or possibly by additional users coming on with a new workload. We run CTDB 1.0.114.4-1 (from sernet) and samba3-3.6.12-44 (again, from sernet). Before we roll back, we'd like to make sure we can't fix the problem and stick with Samba 3.6 (and we don't even know that a roll back would fix the issue). The symptoms are a complete freeze of the service for CIFS users for 10-60 seconds, and on the servers a corresponding spawning of large numbers of CTDB processes, which seem to be created in a "big bang", and then do what they do and exit in the subsequent 10-60 seconds. We also serve up NFS from the same ctdb-managed frontends, and GPFS from the cluster - and these are both fine throughout. This was happening 5-10 times per hour, not at exact intervals though. When we added a third node to the CTDB cluster, it "got worse", and when we dropped the CTDB cluster down to a single node, everything started behaving fine - which is where we are now. So, I've got a bunch of questions! - does anyone know why ctdb would be spawning these processes, and if there's anything we can do to stop it needing to do it? - has anyone done any more general performance / config optimisation of CTDB? And - more generally - does anyone else actually use ctdb/samba/gpfs on the scale of ~500 users or higher? If so - how do you find it? -- -- Dr Orlando Richards Information Services IT Infrastructure Division Unix Section Tel: 0131 650 4994 The University of Edinburgh is a charitable body, registered in Scotland, with registration number SC005336.
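One rough way to put numbers on the process storms described above is simply to log how many ctdb processes each frontend has over time and see whether the spikes line up with the client freezes. A minimal sketch, assuming the spawned helpers all show up with "ctdb" in their process name and picking an arbitrary log location:

    # sample the ctdb process count every 5 seconds with a timestamp
    while true; do
        echo "$(date '+%Y-%m-%d %H:%M:%S') $(pgrep -c ctdb)"
        sleep 5
    done >> /var/tmp/ctdb-proc-count.log

A jump from a handful of processes to several hundred in a single sample is the sort of event being described in this thread.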
From Tobias.Kuebler at sva.de Fri Apr 12 14:03:58 2013 From: Tobias.Kuebler at sva.de (Tobias.Kuebler at sva.de) Date: Fri, 12 Apr 2013 15:03:58 +0200 Subject: [gpfsug-discuss] AUTO: Tobias Kuebler is out of the office (returning Mon, 04/15/2013) Message-ID: I am out of the office from Thu, 04/11/2013 until Mon, 04/15/2013. Thank you for your message. Incoming e-mails will not be forwarded while I am away, but I will try to answer them as soon as possible after my return. In urgent cases please contact your responsible sales representative. Note: This is an automated response to your message "[gpfsug-discuss] CTDB woes" sent on 12.04.2013 14:37:52. This is the only notification you will receive while this person is away. -------------- next part -------------- An HTML attachment was scrubbed... URL:
From orlando.richards at ed.ac.uk Fri Apr 12 16:43:44 2013 From: orlando.richards at ed.ac.uk (Orlando Richards) Date: Fri, 12 Apr 2013 16:43:44 +0100 Subject: [gpfsug-discuss] CTDB woes In-Reply-To: References: <51680020.4040509@ed.ac.uk> Message-ID: <51682BB0.7010507@ed.ac.uk> On 12/04/13 15:43, Bob Cregan wrote: > Hi Orlando, > We use ctdb/samba for CIFS, and CNFS for NFS > (GPFS version 3.4.0-13). Current versions are > > ctdb - 1.0.99 > samba 3.5.15 > > Both compiled from source. We have about 300+ users normally. > We have suspicions that 3.6 has put additional "chatter" into the ctdb database stream, which has pushed us over the edge. Barry Evans has found that the clustered locking databases, in particular, prove to be a scalability/usability limit for ctdb. > We have had no issues with this setup apart from CNFS which had 2 or 3 > bad moments over the last year. These have gone away since we have > fixed a bug with our 10G NIC drivers (emulex cards, kernel module > be2net) which lead to occasional dropped packets for jumbo frames. There > have been no issues with samba/ctdb > > The only comment I can make is that during initial investigations into > an upgrade of samba to 3.6.x we discovered that the 3.6 code would not > compile against ctdb 1.0.99 (compilation requires the ctdb source) > with error messages like: > > configure: checking whether cluster support is available > checking for ctdb.h... yes > checking for ctdb_private.h... yes > checking for CTDB_CONTROL_TRANS3_COMMIT declaration... yes > checking for CTDB_CONTROL_SCHEDULE_FOR_DELETION declaration... no > configure: error: "cluster support not available: support for > SCHEDULE_FOR_DELETION control missing" > > > What occurs to me is that this message seems to indicate that it is > possible to run a ctdb version that is incompatible with samba 3.6. > That would imply that an upgrade to a higher version of ctdb might > help, of course it might not and make backing out harder. Certainly 1.0.114 builds fine - I've not tried 2.0, I'm too scared! The versioning in CTDB has proved hard for me to fathom... > > A compile against ctdb 2.0 works fine. We will soon be running in this > upgrade, but I'm waiting to see what the samba people say at the UG > meeting first! > It has to be said - the timing is good! Cheers, Orlando > > Thanks > > Bob > > > On 12 April 2013 13:37, Orlando Richards > wrote: > > Hi folks, > > We've long been using CTDB and Samba for our NAS service, servicing > ~500 users. We've been suffering from some problems with the CTDB > performance over the last few weeks, likely triggered either by an > upgrade of samba from 3.5 to 3.6 (and enabling of SMB2 as a result), > or possibly by additional users coming on with a new workload. > > We run CTDB 1.0.114.4-1 (from sernet) and samba3-3.6.12-44 (again, > from sernet). Before we roll back, we'd like to make sure we can't > fix the problem and stick with Samba 3.6 (and we don't even know > that a roll back would fix the issue).
> > The symptoms are a complete freeze of the service for CIFS users for > 10-60 seconds, and on the servers a corresponding spawning of large > numbers of CTDB processes, which seem to be created in a "big bang", > and then do what they do and exit in the subsequent 10-60 seconds. > > We also serve up NFS from the same ctdb-managed frontends, and GPFS > from the cluster - and these are both fine throughout. > > This was happening 5-10 times per hour, not at exact intervals > though. When we added a third node to the CTDB cluster, it "got > worse", and when we dropped the CTDB cluster down to a single node > and everything started behaving fine - which is where we are now. > > So, I've got a bunch of questions! > > - does anyone know why ctdb would be spawning these processes, and > if there's anything we can do to stop it needing to do it? > - has anyone done any more general performance / config > optimisation of CTDB? > > And - more generally - does anyone else actually use ctdb/samba/gpfs > on the scale of ~500 users or higher? If so - how do you find it? > > > -- > -- > Dr Orlando Richards > Information Services > IT Infrastructure Division > Unix Section > Tel: 0131 650 4994 > > The University of Edinburgh is a charitable body, registered in > Scotland, with registration number SC005336. > _________________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at gpfsug.org > http://gpfsug.org/mailman/__listinfo/gpfsug-discuss > > > > > > -- > > Bob Cregan > > Senior Storage Systems Administrator > > ACRC > > Bristol University > > Tel: +44 (0) 117 331 4406 > > skype: bobcregan > > Mobile: +44 (0) 7712388129 > -- -- Dr Orlando Richards Information Services IT Infrastructure Division Unix Section Tel: 0131 650 4994 The University of Edinburgh is a charitable body, registered in Scotland, with registration number SC005336. From viccornell at gmail.com Fri Apr 12 19:44:16 2013 From: viccornell at gmail.com (Vic Cornell) Date: Fri, 12 Apr 2013 19:44:16 +0100 Subject: [gpfsug-discuss] CTDB woes In-Reply-To: <51682BB0.7010507@ed.ac.uk> References: <51680020.4040509@ed.ac.uk> <51682BB0.7010507@ed.ac.uk> Message-ID: <271DA6EE-D64D-4DBC-9DFE-4335E55102D4@gmail.com> Have you tried putting the ctdb files onto a separate gpfs filesystem? Vic Cornell viccornell at gmail.com On 12 Apr 2013, at 16:43, Orlando Richards wrote: > On 12/04/13 15:43, Bob Cregan wrote: >> Hi Orlando, >> We use ctdb/samba for CIFS, and CNFS for NFS >> (GPFS version 3.4.0-13) . Current versions are >> >> ctdb - 1.0.99 >> samba 3.5.15 >> >> Both compiled from source. We have about 300+ users normally. >> > > We have suspicions that 3.6 has put additional "chatter" into the ctdb database stream, which has pushed us over the edge. Barry Evans has found that the clustered locking databases, in particular, prove to be a scalability/usability limit for ctdb. > > >> We have had no issues with this setup apart from CNFS which had 2 or 3 >> bad moments over the last year . These have gone away since we have >> fixed a bug with our 10G NIC drivers (emulex cards , kernel module >> be2net) which lead to occasional dropped packets for jumbo frames. 
There >> have been no issues with samba/ctdb >> >> The only comment I can make is that during initial investigations into >> an upgrade of samba to 3.6.x we discovered that the 3.6 code would not >> compile against ctdb 1.0.99 (compilation requires tthe ctdb source ) >> with error messages like: >> >> configure: checking whether cluster support is available >> checking for ctdb.h... yes >> checking for ctdb_private.h... yes >> checking for CTDB_CONTROL_TRANS3_COMMIT declaration... yes >> checking for CTDB_CONTROL_SCHEDULE_FOR_DELETION declaration... no >> configure: error: "cluster support not available: support for >> SCHEDULE_FOR_DELETION control missing" >> >> >> What occurs to me is that this message seems to indicate that it is >> possible to run a ctdb version that is incompatible with samba 3.6. >> That would imply that an upgrade to a higher version of ctdb might >> help, of course it might not and make backing out harder. > > Certainly 1.0.114 builds fine - I've not tried 2.0, I'm too scared! The versioning in CTDB has proved hard for me to fathom... > >> >> A compile against ctdb 2.0 works fine. We will soon be running in this >> upgrade, but I'm waiting to see what the samba people say at the UG >> meeting first! >> > > It has to be said - the timing is good! > Cheers, > Orlando > >> >> Thanks >> >> Bob >> >> >> On 12 April 2013 13:37, Orlando Richards > > wrote: >> >> Hi folks, ac >> >> We've long been using CTDB and Samba for our NAS service, servicing >> ~500 users. We've been suffering from some problems with the CTDB >> performance over the last few weeks, likely triggered either by an >> upgrade of samba from 3.5 to 3.6 (and enabling of SMB2 as a result), >> or possibly by additional users coming on with a new workload. >> >> We run CTDB 1.0.114.4-1 (from sernet) and samba3-3.6.12-44 (again, >> from sernet). Before we roll back, we'd like to make sure we can't >> fix the problem and stick with Samba 3.6 (and we don't even know >> that a roll back would fix the issue). >> >> The symptoms are a complete freeze of the service for CIFS users for >> 10-60 seconds, and on the servers a corresponding spawning of large >> numbers of CTDB processes, which seem to be created in a "big bang", >> and then do what they do and exit in the subsequent 10-60 seconds. >> >> We also serve up NFS from the same ctdb-managed frontends, and GPFS >> from the cluster - and these are both fine throughout. >> >> This was happening 5-10 times per hour, not at exact intervals >> though. When we added a third node to the CTDB cluster, it "got >> worse", and when we dropped the CTDB cluster down to a single node >> and everything started behaving fine - which is where we are now. >> >> So, I've got a bunch of questions! >> >> - does anyone know why ctdb would be spawning these processes, and >> if there's anything we can do to stop it needing to do it? >> - has anyone done any more general performance / config >> optimisation of CTDB? >> >> And - more generally - does anyone else actually use ctdb/samba/gpfs >> on the scale of ~500 users or higher? If so - how do you find it? >> >> >> -- >> -- >> Dr Orlando Richards >> Information Services >> IT Infrastructure Division >> Unix Section >> Tel: 0131 650 4994 >> >> The University of Edinburgh is a charitable body, registered in >> Scotland, with registration number SC005336. 
>> _________________________________________________ >> gpfsug-discuss mailing list >> gpfsug-discuss at gpfsug.org >> http://gpfsug.org/mailman/__listinfo/gpfsug-discuss >> >> >> >> >> >> -- >> >> Bob Cregan >> >> Senior Storage Systems Administrator >> >> ACRC >> >> Bristol University >> >> Tel: +44 (0) 117 331 4406 >> >> skype: bobcregan >> >> Mobile: +44 (0) 7712388129 >> > > > -- > -- > Dr Orlando Richards > Information Services > IT Infrastructure Division > Unix Section > Tel: 0131 650 4994 > > The University of Edinburgh is a charitable body, registered in Scotland, with registration number SC005336. > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at gpfsug.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss From orlando.richards at ed.ac.uk Mon Apr 15 10:54:39 2013 From: orlando.richards at ed.ac.uk (Orlando Richards) Date: Mon, 15 Apr 2013 10:54:39 +0100 Subject: [gpfsug-discuss] CTDB woes In-Reply-To: <271DA6EE-D64D-4DBC-9DFE-4335E55102D4@gmail.com> References: <51680020.4040509@ed.ac.uk> <51682BB0.7010507@ed.ac.uk> <271DA6EE-D64D-4DBC-9DFE-4335E55102D4@gmail.com> Message-ID: <516BCE5F.8010309@ed.ac.uk> On 12/04/13 19:44, Vic Cornell wrote: > Have you tried putting the ctdb files onto a separate gpfs filesystem? No - but considered it. However, the only "live" CTDB file that sits on GPFS is the reclock file, which - I think - is only used as the heartbeat between nodes and for the recovery process. Now, there's mileage in insulating that, certainly, but I don't think that's what we're suffering from here. On a positive note - we took the steps this morning to re-initialise the ctdb databases from current data, and things seem to be stable today so far. Basically - shut down ctdb on all but one node. On all but that node, do: mv /var/ctdb/ /var/ctdb.save.date then start up ctdb on those nodes. Once they've come up, shut down ctdb on the last node, move /var/ctdb out the way, and restart. That brings them all up with freshly compacted databases. Also, from the samba-technical mailing list came the advice to use a more recent ctdb - specifically, 1.2.61. I've got that built and ready to go (and a rebuilt samba compiled against it too), but if things prove to be stable after today's compacting, then we will probably leave it at that and not deploy this. Interesting that 2.0 wasn't suggested for "stable", and that the current "dev" version is 2.1. For reference, here's the start of the thread: https://lists.samba.org/archive/samba-technical/2013-April/091525.html -- Orlando. > > On 12 Apr 2013, at 16:43, Orlando Richards wrote: > >> On 12/04/13 15:43, Bob Cregan wrote: >>> Hi Orlando, >>> We use ctdb/samba for CIFS, and CNFS for NFS >>> (GPFS version 3.4.0-13) . Current versions are >>> >>> ctdb - 1.0.99 >>> samba 3.5.15 >>> >>> Both compiled from source. We have about 300+ users normally. >>> >> >> We have suspicions that 3.6 has put additional "chatter" into the ctdb database stream, which has pushed us over the edge. Barry Evans has found that the clustered locking databases, in particular, prove to be a scalability/usability limit for ctdb. >> >> >>> We have had no issues with this setup apart from CNFS which had 2 or 3 >>> bad moments over the last year . These have gone away since we have >>> fixed a bug with our 10G NIC drivers (emulex cards , kernel module >>> be2net) which lead to occasional dropped packets for jumbo frames. 
There >>> have been no issues with samba/ctdb >>> >>> The only comment I can make is that during initial investigations into >>> an upgrade of samba to 3.6.x we discovered that the 3.6 code would not >>> compile against ctdb 1.0.99 (compilation requires tthe ctdb source ) >>> with error messages like: >>> >>> configure: checking whether cluster support is available >>> checking for ctdb.h... yes >>> checking for ctdb_private.h... yes >>> checking for CTDB_CONTROL_TRANS3_COMMIT declaration... yes >>> checking for CTDB_CONTROL_SCHEDULE_FOR_DELETION declaration... no >>> configure: error: "cluster support not available: support for >>> SCHEDULE_FOR_DELETION control missing" >>> >>> >>> What occurs to me is that this message seems to indicate that it is >>> possible to run a ctdb version that is incompatible with samba 3.6. >>> That would imply that an upgrade to a higher version of ctdb might >>> help, of course it might not and make backing out harder. >> >> Certainly 1.0.114 builds fine - I've not tried 2.0, I'm too scared! The versioning in CTDB has proved hard for me to fathom... >> >>> >>> A compile against ctdb 2.0 works fine. We will soon be running in this >>> upgrade, but I'm waiting to see what the samba people say at the UG >>> meeting first! >>> >> >> It has to be said - the timing is good! >> Cheers, >> Orlando >> >>> >>> Thanks >>> >>> Bob >>> >>> >>> On 12 April 2013 13:37, Orlando Richards >> > wrote: >>> >>> Hi folks, ac >>> >>> We've long been using CTDB and Samba for our NAS service, servicing >>> ~500 users. We've been suffering from some problems with the CTDB >>> performance over the last few weeks, likely triggered either by an >>> upgrade of samba from 3.5 to 3.6 (and enabling of SMB2 as a result), >>> or possibly by additional users coming on with a new workload. >>> >>> We run CTDB 1.0.114.4-1 (from sernet) and samba3-3.6.12-44 (again, >>> from sernet). Before we roll back, we'd like to make sure we can't >>> fix the problem and stick with Samba 3.6 (and we don't even know >>> that a roll back would fix the issue). >>> >>> The symptoms are a complete freeze of the service for CIFS users for >>> 10-60 seconds, and on the servers a corresponding spawning of large >>> numbers of CTDB processes, which seem to be created in a "big bang", >>> and then do what they do and exit in the subsequent 10-60 seconds. >>> >>> We also serve up NFS from the same ctdb-managed frontends, and GPFS >>> from the cluster - and these are both fine throughout. >>> >>> This was happening 5-10 times per hour, not at exact intervals >>> though. When we added a third node to the CTDB cluster, it "got >>> worse", and when we dropped the CTDB cluster down to a single node >>> and everything started behaving fine - which is where we are now. >>> >>> So, I've got a bunch of questions! >>> >>> - does anyone know why ctdb would be spawning these processes, and >>> if there's anything we can do to stop it needing to do it? >>> - has anyone done any more general performance / config >>> optimisation of CTDB? >>> >>> And - more generally - does anyone else actually use ctdb/samba/gpfs >>> on the scale of ~500 users or higher? If so - how do you find it? >>> >>> >>> -- >>> -- >>> Dr Orlando Richards >>> Information Services >>> IT Infrastructure Division >>> Unix Section >>> Tel: 0131 650 4994 >>> >>> The University of Edinburgh is a charitable body, registered in >>> Scotland, with registration number SC005336. 
>>> _______________________________________________ >>> gpfsug-discuss mailing list >>> gpfsug-discuss at gpfsug.org >>> http://gpfsug.org/mailman/listinfo/gpfsug-discuss >>> >>> >>> >>> >>> -- >>> >>> Bob Cregan >>> >>> Senior Storage Systems Administrator >>> >>> ACRC >>> >>> Bristol University >>> >>> Tel: +44 (0) 117 331 4406 >>> >>> skype: bobcregan >>> >>> Mobile: +44 (0) 7712388129 >>> >> >> >> -- >> -- >> Dr Orlando Richards >> Information Services >> IT Infrastructure Division >> Unix Section >> Tel: 0131 650 4994 >> >> The University of Edinburgh is a charitable body, registered in Scotland, with registration number SC005336. >> _______________________________________________ >> gpfsug-discuss mailing list >> gpfsug-discuss at gpfsug.org >> http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at gpfsug.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > -- -- Dr Orlando Richards Information Services IT Infrastructure Division Unix Section Tel: 0131 650 4994 The University of Edinburgh is a charitable body, registered in Scotland, with registration number SC005336.
From crobson at ocf.co.uk Mon Apr 15 15:04:38 2013 From: crobson at ocf.co.uk (Claire Robson) Date: Mon, 15 Apr 2013 15:04:38 +0100 Subject: [gpfsug-discuss] Latest agenda and places still available Message-ID: Dear All, Thank you to those who have expressed an interest in next Wednesday's GPFS user group meeting in London and registered a place. There are a few places still available, please register with me if you would like to attend. This is the latest agenda for the day: 10:30 Arrivals and refreshments 11:00 Introductions and committee updates Jez Tucker, Group Chair & Claire Robson, Group Secretary 11:05 GPFS FPO Dinesh Subhraveti, IBM Almaden Research Labs 12:00 SAMBA 4.0 & CTDB 2.0 Michael Adams, SAMBA Development Team 13:00 Lunch (Buffet provided) 13:45 GPFS OpenStack Integration Dinesh Subhraveti, IBM Almaden Research Labs 14:15 SAMBA & GPFS Integration Volker Lendecke, SAMBA Development Team 15:15 Refreshments break 15:30 GPFS Native RAID & LTFS Jim Roche, IBM 16:00 Group discussion: Questions & Committee matters Led by Jez Tucker, Group Chairperson 16:05 Close I look forward to seeing many of you next week. Kind regards, Claire Robson GPFS user group Secretary Tel: 0114 257 2200 Mob: 07508 033896 Web: www.gpfsug.org -------------- next part -------------- An HTML attachment was scrubbed... URL:
From AHMADYH at sa.ibm.com Tue Apr 16 13:08:58 2013 From: AHMADYH at sa.ibm.com (Ahmad Y Hussein) Date: Tue, 16 Apr 2013 16:08:58 +0400 Subject: [gpfsug-discuss] AUTO: Ahmad Y Hussein is out of the office (returning 04/29/2013) Message-ID: I am out of the office until 04/29/2013. Dear Sender; I am in a customer engagement with extremely limited email access, I will respond to your emails as soon as I can. For urgent cases please call me on my mobile (+966542001289). Thank you for understanding. Regards; Ahmad Y Hussein Note: This is an automated response to your message "gpfsug-discuss Digest, Vol 16, Issue 6" sent on 16/04/2013 15:00:02. This is the only notification you will receive while this person is away.
From orlando.richards at ed.ac.uk Wed Apr 17 11:30:32 2013 From: orlando.richards at ed.ac.uk (Orlando Richards) Date: Wed, 17 Apr 2013 11:30:32 +0100 Subject: [gpfsug-discuss] CTDB woes In-Reply-To: <516BCE5F.8010309@ed.ac.uk> References: <51680020.4040509@ed.ac.uk> <51682BB0.7010507@ed.ac.uk> <271DA6EE-D64D-4DBC-9DFE-4335E55102D4@gmail.com> <516BCE5F.8010309@ed.ac.uk> Message-ID: <516E79C8.8090603@ed.ac.uk> Hi All - an update to this, After re-initialising the databases on Monday, things did seem to be running better, but ultimately we got back to suffering from spikes in ctdb processes and corresponding "pauses" in service. We fell back to a single node again for Tuesday (and things were stable once again), and this morning rolled out CTDB 1.2.61 (plus a 3.6.12 samba which was rebuilt against CTDB 1.2.61 headers). Things seem to be stable for now - more so than on Monday. For the record - one metric I'm watching is the number of ctdb processes running (this would spike to > 1000 under the failure conditions). It's currently sitting consistently at 3 processes, with occasional blips of 5-7 processes. -- Orlando On 15/04/13 10:54, Orlando Richards wrote: > On 12/04/13 19:44, Vic Cornell wrote: >> Have you tried putting the ctdb files onto a separate gpfs filesystem? > > No - but considered it. However, the only "live" CTDB file that sits on > GPFS is the reclock file, which - I think - is only used as the > heartbeat between nodes and for the recovery process. Now, there's > mileage in insulating that, certainly, but I don't think that's what > we're suffering from here. > > On a positive note - we took the steps this morning to re-initialise the > ctdb databases from current data, and things seem to be stable today so > far. > > Basically - shut down ctdb on all but one node. On all but that node, do: > mv /var/ctdb/ /var/ctdb.save.date > > then start up ctdb on those nodes. Once they've come up, shut down ctdb > on the last node, move /var/ctdb out the way, and restart. That brings > them all up with freshly compacted databases. > > Also, from the samba-technical mailing list came the advice to use a > more recent ctdb - specifically, 1.2.61. I've got that built and ready > to go (and a rebuilt samba compiled against it too), but if things prove > to be stable after today's compacting, then we will probably leave it at > that and not deploy this. > > Interesting that 2.0 wasn't suggested for "stable", and that the current > "dev" version is 2.1. > > For reference, here's the start of the thread: > https://lists.samba.org/archive/samba-technical/2013-April/091525.html > > -- > Orlando. > > > >> >> On 12 Apr 2013, at 16:43, Orlando Richards >> wrote: >> >>> On 12/04/13 15:43, Bob Cregan wrote: >>>> Hi Orlando, >>>> We use ctdb/samba for CIFS, and CNFS for NFS >>>> (GPFS version 3.4.0-13) . Current versions are >>>> >>>> ctdb - 1.0.99 >>>> samba 3.5.15 >>>> >>>> Both compiled from source. We have about 300+ users normally. >>>> >>> >>> We have suspicions that 3.6 has put additional "chatter" into the >>> ctdb database stream, which has pushed us over the edge. Barry Evans >>> has found that the clustered locking databases, in particular, prove >>> to be a scalability/usability limit for ctdb. >>> >>> >>>> We have had no issues with this setup apart from CNFS which had 2 or 3 >>>> bad moments over the last year . These have gone away since we have >>>> fixed a bug with our 10G NIC drivers (emulex cards , kernel module >>>> be2net) which lead to occasional dropped packets for jumbo frames. 
>>>> There >>>> have been no issues with samba/ctdb >>>> >>>> The only comment I can make is that during initial investigations into >>>> an upgrade of samba to 3.6.x we discovered that the 3.6 code would not >>>> compile against ctdb 1.0.99 (compilation requires tthe ctdb source ) >>>> with error messages like: >>>> >>>> configure: checking whether cluster support is available >>>> checking for ctdb.h... yes >>>> checking for ctdb_private.h... yes >>>> checking for CTDB_CONTROL_TRANS3_COMMIT declaration... yes >>>> checking for CTDB_CONTROL_SCHEDULE_FOR_DELETION declaration... no >>>> configure: error: "cluster support not available: support for >>>> SCHEDULE_FOR_DELETION control missing" >>>> >>>> >>>> What occurs to me is that this message seems to indicate that it is >>>> possible to run a ctdb version that is incompatible with samba 3.6. >>>> That would imply that an upgrade to a higher version of ctdb might >>>> help, of course it might not and make backing out harder. >>> >>> Certainly 1.0.114 builds fine - I've not tried 2.0, I'm too scared! >>> The versioning in CTDB has proved hard for me to fathom... >>> >>>> >>>> A compile against ctdb 2.0 works fine. We will soon be running in this >>>> upgrade, but I'm waiting to see what the samba people say at the UG >>>> meeting first! >>>> >>> >>> It has to be said - the timing is good! >>> Cheers, >>> Orlando >>> >>>> >>>> Thanks >>>> >>>> Bob >>>> >>>> >>>> On 12 April 2013 13:37, Orlando Richards >>> > wrote: >>>> >>>> Hi folks, ac >>>> >>>> We've long been using CTDB and Samba for our NAS service, servicing >>>> ~500 users. We've been suffering from some problems with the CTDB >>>> performance over the last few weeks, likely triggered either by an >>>> upgrade of samba from 3.5 to 3.6 (and enabling of SMB2 as a >>>> result), >>>> or possibly by additional users coming on with a new workload. >>>> >>>> We run CTDB 1.0.114.4-1 (from sernet) and samba3-3.6.12-44 (again, >>>> from sernet). Before we roll back, we'd like to make sure we can't >>>> fix the problem and stick with Samba 3.6 (and we don't even know >>>> that a roll back would fix the issue). >>>> >>>> The symptoms are a complete freeze of the service for CIFS users >>>> for >>>> 10-60 seconds, and on the servers a corresponding spawning of large >>>> numbers of CTDB processes, which seem to be created in a "big >>>> bang", >>>> and then do what they do and exit in the subsequent 10-60 seconds. >>>> >>>> We also serve up NFS from the same ctdb-managed frontends, and GPFS >>>> from the cluster - and these are both fine throughout. >>>> >>>> This was happening 5-10 times per hour, not at exact intervals >>>> though. When we added a third node to the CTDB cluster, it "got >>>> worse", and when we dropped the CTDB cluster down to a single node >>>> and everything started behaving fine - which is where we are now. >>>> >>>> So, I've got a bunch of questions! >>>> >>>> - does anyone know why ctdb would be spawning these processes, >>>> and >>>> if there's anything we can do to stop it needing to do it? >>>> - has anyone done any more general performance / config >>>> optimisation of CTDB? >>>> >>>> And - more generally - does anyone else actually use >>>> ctdb/samba/gpfs >>>> on the scale of ~500 users or higher? If so - how do you find it? 
>>>> >>>> >>>> -- >>>> -- >>>> Dr Orlando Richards >>>> Information Services >>>> IT Infrastructure Division >>>> Unix Section >>>> Tel: 0131 650 4994 >>>> >>>> The University of Edinburgh is a charitable body, registered in >>>> Scotland, with registration number SC005336. >>>> _________________________________________________ >>>> gpfsug-discuss mailing list >>>> gpfsug-discuss at gpfsug.org >>>> http://gpfsug.org/mailman/__listinfo/gpfsug-discuss >>>> >>>> >>>> >>>> >>>> >>>> -- >>>> >>>> Bob Cregan >>>> >>>> Senior Storage Systems Administrator >>>> >>>> ACRC >>>> >>>> Bristol University >>>> >>>> Tel: +44 (0) 117 331 4406 >>>> >>>> skype: bobcregan >>>> >>>> Mobile: +44 (0) 7712388129 >>>> >>> >>> >>> -- >>> -- >>> Dr Orlando Richards >>> Information Services >>> IT Infrastructure Division >>> Unix Section >>> Tel: 0131 650 4994 >>> >>> The University of Edinburgh is a charitable body, registered in >>> Scotland, with registration number SC005336. >>> _______________________________________________ >>> gpfsug-discuss mailing list >>> gpfsug-discuss at gpfsug.org >>> http://gpfsug.org/mailman/listinfo/gpfsug-discuss >> >> _______________________________________________ >> gpfsug-discuss mailing list >> gpfsug-discuss at gpfsug.org >> http://gpfsug.org/mailman/listinfo/gpfsug-discuss >> > > -- -- Dr Orlando Richards Information Services IT Infrastructure Division Unix Section Tel: 0131 650 4994 The University of Edinburgh is a charitable body, registered in Scotland, with registration number SC005336. From orlando.richards at ed.ac.uk Mon Apr 22 15:52:55 2013 From: orlando.richards at ed.ac.uk (Orlando Richards) Date: Mon, 22 Apr 2013 15:52:55 +0100 Subject: [gpfsug-discuss] CTDB woes In-Reply-To: <516E79C8.8090603@ed.ac.uk> References: <51680020.4040509@ed.ac.uk> <51682BB0.7010507@ed.ac.uk> <271DA6EE-D64D-4DBC-9DFE-4335E55102D4@gmail.com> <516BCE5F.8010309@ed.ac.uk> <516E79C8.8090603@ed.ac.uk> Message-ID: <51754EC7.8000600@ed.ac.uk> On 17/04/13 11:30, Orlando Richards wrote: > Hi All - an update to this, > > After re-initialising the databases on Monday, things did seem to be > running better, but ultimately we got back to suffering from spikes in > ctdb processes and corresponding "pauses" in service. We fell back to a > single node again for Tuesday (and things were stable once again), and > this morning rolled out CTDB 1.2.61 (plus a 3.6.12 samba which was > rebuilt against CTDB 1.2.61 headers). > > Things seem to be stable for now - more so than on Monday. > > For the record - one metric I'm watching is the number of ctdb processes > running (this would spike to > 1000 under the failure conditions). It's > currently sitting consistently at 3 processes, with occasional blips of > 5-7 processes. > Hi all, Looks like things have been running fine since we upgraded ctdb last Wednesday, so I think it's safe to say that we've found a fix for our problem in CTDB 1.2.61. Thanks for all the input! If anyone wants more info, feel free to get in touch. -- Orlando > -- > Orlando > > > > > > On 15/04/13 10:54, Orlando Richards wrote: >> On 12/04/13 19:44, Vic Cornell wrote: >>> Have you tried putting the ctdb files onto a separate gpfs filesystem? >> >> No - but considered it. However, the only "live" CTDB file that sits on >> GPFS is the reclock file, which - I think - is only used as the >> heartbeat between nodes and for the recovery process. Now, there's >> mileage in insulating that, certainly, but I don't think that's what >> we're suffering from here. 
>> >> On a positive note - we took the steps this morning to re-initialise the >> ctdb databases from current data, and things seem to be stable today so >> far. >> >> Basically - shut down ctdb on all but one node. On all but that node, do: >> mv /var/ctdb/ /var/ctdb.save.date >> >> then start up ctdb on those nodes. Once they've come up, shut down ctdb >> on the last node, move /var/ctdb out the way, and restart. That brings >> them all up with freshly compacted databases. >> >> Also, from the samba-technical mailing list came the advice to use a >> more recent ctdb - specifically, 1.2.61. I've got that built and ready >> to go (and a rebuilt samba compiled against it too), but if things prove >> to be stable after today's compacting, then we will probably leave it at >> that and not deploy this. >> >> Interesting that 2.0 wasn't suggested for "stable", and that the current >> "dev" version is 2.1. >> >> For reference, here's the start of the thread: >> https://lists.samba.org/archive/samba-technical/2013-April/091525.html >> >> -- >> Orlando. >> >> >> >>> >>> On 12 Apr 2013, at 16:43, Orlando Richards >>> wrote: >>> >>>> On 12/04/13 15:43, Bob Cregan wrote: >>>>> Hi Orlando, >>>>> We use ctdb/samba for CIFS, and CNFS for NFS >>>>> (GPFS version 3.4.0-13) . Current versions are >>>>> >>>>> ctdb - 1.0.99 >>>>> samba 3.5.15 >>>>> >>>>> Both compiled from source. We have about 300+ users normally. >>>>> >>>> >>>> We have suspicions that 3.6 has put additional "chatter" into the >>>> ctdb database stream, which has pushed us over the edge. Barry Evans >>>> has found that the clustered locking databases, in particular, prove >>>> to be a scalability/usability limit for ctdb. >>>> >>>> >>>>> We have had no issues with this setup apart from CNFS which had 2 or 3 >>>>> bad moments over the last year . These have gone away since we have >>>>> fixed a bug with our 10G NIC drivers (emulex cards , kernel module >>>>> be2net) which lead to occasional dropped packets for jumbo frames. >>>>> There >>>>> have been no issues with samba/ctdb >>>>> >>>>> The only comment I can make is that during initial investigations into >>>>> an upgrade of samba to 3.6.x we discovered that the 3.6 code would not >>>>> compile against ctdb 1.0.99 (compilation requires tthe ctdb source ) >>>>> with error messages like: >>>>> >>>>> configure: checking whether cluster support is available >>>>> checking for ctdb.h... yes >>>>> checking for ctdb_private.h... yes >>>>> checking for CTDB_CONTROL_TRANS3_COMMIT declaration... yes >>>>> checking for CTDB_CONTROL_SCHEDULE_FOR_DELETION declaration... no >>>>> configure: error: "cluster support not available: support for >>>>> SCHEDULE_FOR_DELETION control missing" >>>>> >>>>> >>>>> What occurs to me is that this message seems to indicate that it is >>>>> possible to run a ctdb version that is incompatible with samba 3.6. >>>>> That would imply that an upgrade to a higher version of ctdb might >>>>> help, of course it might not and make backing out harder. >>>> >>>> Certainly 1.0.114 builds fine - I've not tried 2.0, I'm too scared! >>>> The versioning in CTDB has proved hard for me to fathom... >>>> >>>>> >>>>> A compile against ctdb 2.0 works fine. We will soon be running in this >>>>> upgrade, but I'm waiting to see what the samba people say at the UG >>>>> meeting first! >>>>> >>>> >>>> It has to be said - the timing is good! 
>>>> Cheers, >>>> Orlando >>>> >>>>> >>>>> Thanks >>>>> >>>>> Bob >>>>> >>>>> >>>>> On 12 April 2013 13:37, Orlando Richards >>>> > wrote: >>>>> >>>>> Hi folks, ac >>>>> >>>>> We've long been using CTDB and Samba for our NAS service, >>>>> servicing >>>>> ~500 users. We've been suffering from some problems with the CTDB >>>>> performance over the last few weeks, likely triggered either by an >>>>> upgrade of samba from 3.5 to 3.6 (and enabling of SMB2 as a >>>>> result), >>>>> or possibly by additional users coming on with a new workload. >>>>> >>>>> We run CTDB 1.0.114.4-1 (from sernet) and samba3-3.6.12-44 (again, >>>>> from sernet). Before we roll back, we'd like to make sure we can't >>>>> fix the problem and stick with Samba 3.6 (and we don't even know >>>>> that a roll back would fix the issue). >>>>> >>>>> The symptoms are a complete freeze of the service for CIFS users >>>>> for >>>>> 10-60 seconds, and on the servers a corresponding spawning of >>>>> large >>>>> numbers of CTDB processes, which seem to be created in a "big >>>>> bang", >>>>> and then do what they do and exit in the subsequent 10-60 seconds. >>>>> >>>>> We also serve up NFS from the same ctdb-managed frontends, and >>>>> GPFS >>>>> from the cluster - and these are both fine throughout. >>>>> >>>>> This was happening 5-10 times per hour, not at exact intervals >>>>> though. When we added a third node to the CTDB cluster, it "got >>>>> worse", and when we dropped the CTDB cluster down to a single node >>>>> and everything started behaving fine - which is where we are now. >>>>> >>>>> So, I've got a bunch of questions! >>>>> >>>>> - does anyone know why ctdb would be spawning these processes, >>>>> and >>>>> if there's anything we can do to stop it needing to do it? >>>>> - has anyone done any more general performance / config >>>>> optimisation of CTDB? >>>>> >>>>> And - more generally - does anyone else actually use >>>>> ctdb/samba/gpfs >>>>> on the scale of ~500 users or higher? If so - how do you find it? >>>>> >>>>> >>>>> -- >>>>> -- >>>>> Dr Orlando Richards >>>>> Information Services >>>>> IT Infrastructure Division >>>>> Unix Section >>>>> Tel: 0131 650 4994 >>>>> >>>>> The University of Edinburgh is a charitable body, registered in >>>>> Scotland, with registration number SC005336. >>>>> _________________________________________________ >>>>> gpfsug-discuss mailing list >>>>> gpfsug-discuss at gpfsug.org >>>>> http://gpfsug.org/mailman/__listinfo/gpfsug-discuss >>>>> >>>>> >>>>> >>>>> >>>>> >>>>> -- >>>>> >>>>> Bob Cregan >>>>> >>>>> Senior Storage Systems Administrator >>>>> >>>>> ACRC >>>>> >>>>> Bristol University >>>>> >>>>> Tel: +44 (0) 117 331 4406 >>>>> >>>>> skype: bobcregan >>>>> >>>>> Mobile: +44 (0) 7712388129 >>>>> >>>> >>>> >>>> -- >>>> -- >>>> Dr Orlando Richards >>>> Information Services >>>> IT Infrastructure Division >>>> Unix Section >>>> Tel: 0131 650 4994 >>>> >>>> The University of Edinburgh is a charitable body, registered in >>>> Scotland, with registration number SC005336. 
>>>> _______________________________________________ >>>> gpfsug-discuss mailing list >>>> gpfsug-discuss at gpfsug.org >>>> http://gpfsug.org/mailman/listinfo/gpfsug-discuss >>> >>> _______________________________________________ >>> gpfsug-discuss mailing list >>> gpfsug-discuss at gpfsug.org >>> http://gpfsug.org/mailman/listinfo/gpfsug-discuss >>> >> >> > > -- -- Dr Orlando Richards Information Services IT Infrastructure Division Unix Section Tel: 0131 650 4994 The University of Edinburgh is a charitable body, registered in Scotland, with registration number SC005336. From pete at realisestudio.com Thu Apr 25 10:38:07 2013 From: pete at realisestudio.com (Pete Smith) Date: Thu, 25 Apr 2013 10:38:07 +0100 Subject: [gpfsug-discuss] Test cluster - some questions Message-ID: Hi all Good to see lots of you at the user group meeting yesterday. Great work, Jez! We're setting up a test cluster here at Realise, with a view to moving our main storage over from Gluster. We're running the test cluster on Isilon hardware ... a couple of 1920 nodes that we were using for home dirs. Each node has dual gigabit ethernet ports, and dual infiniband ports. Single dual-core Xeon proc and and 4GB RAM. All good stuff and should make a nice test rig. I have a few questions! 1. We're on centos6.4.x86_64. What's the easiest way to go from 3.3.blah to 3.5? 2. I'm having trouble assigning NSDs. I have a descfile which looks like: #DiskName:PrimaryServer:BackupServer:DiskUsage:FailureGroup:DesiredName:StoragePool /dev/sdc1:gpfs001.realisestudio.com::dataAndMetadata:1 but the command "mmcrnsd -F /tmp/descfile -v no" just craps out with mmcrnsd: Processing disk sdc1 mmcrnsd: Node gpfs001.realisestudio.com does not have a GPFS server license designation. mmcrnsd: Error found while checking disk descriptor /dev/sdc1:gpfs001.realisestudio.com::dataAndMetadata:1 mmcrnsd: Command failed. Examine previous error messages to determine cause. Any help pointing me gently in the right direction would be much appreciated. :-) TIA -- Pete Smith DevOp/System Administrator Realise Studio 12/13 Poland Street, London W1F 8QB T. +44 (0)20 7165 9644 realisestudio.com -------------- next part -------------- An HTML attachment was scrubbed... URL: From orlando.richards at ed.ac.uk Thu Apr 25 10:48:30 2013 From: orlando.richards at ed.ac.uk (Orlando Richards) Date: Thu, 25 Apr 2013 10:48:30 +0100 Subject: [gpfsug-discuss] Test cluster - some questions In-Reply-To: References: Message-ID: <5178FBEE.4070200@ed.ac.uk> On 25/04/13 10:38, Pete Smith wrote: > Hi all > > Good to see lots of you at the user group meeting yesterday. Great work, > Jez! > > We're setting up a test cluster here at Realise, with a view to moving > our main storage over from Gluster. > > We're running the test cluster on Isilon hardware ... a couple of 1920 > nodes that we were using for home dirs. Each node has dual gigabit > ethernet ports, and dual infiniband ports. Single dual-core Xeon proc > and and 4GB RAM. All good stuff and should make a nice test rig. > > I have a few questions! > > 1. We're on centos6.4.x86_64. What's the easiest way to go from > 3.3.blah to 3.5? > 2. I'm having trouble assigning NSDs. 
I have a descfile which looks like: > > #DiskName:PrimaryServer:BackupServer:DiskUsage:FailureGroup:DesiredName:StoragePool > /dev/sdc1:gpfs001.realisestudio.com::dataAndMetadata:1 > > but the command > > "mmcrnsd -F /tmp/descfile -v no" > > just craps out with > > mmcrnsd: Processing disk sdc1 > mmcrnsd: Node gpfs001.realisestudio.com > does not have a GPFS server license > designation. > mmcrnsd: Error found while checking disk descriptor > /dev/sdc1:gpfs001.realisestudio.com::dataAndMetadata:1 > mmcrnsd: Command failed. Examine previous error messages to determine > cause. > mmchlicense server -N gpfs001.realisestudio.com should sort that one out. > Any help pointing me gently in the right direction would be much > appreciated. :-) > > TIA > > -- > Pete Smith > DevOp/System Administrator > Realise Studio > 12/13 Poland Street, London W1F 8QB > T. +44 (0)20 7165 9644 > > realisestudio.com > > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at gpfsug.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > -- -- Dr Orlando Richards Information Services IT Infrastructure Division Unix Section Tel: 0131 650 4994 The University of Edinburgh is a charitable body, registered in Scotland, with registration number SC005336. From pete at realisestudio.com Thu Apr 25 11:05:36 2013 From: pete at realisestudio.com (Pete Smith) Date: Thu, 25 Apr 2013 11:05:36 +0100 Subject: [gpfsug-discuss] Test cluster - some questions In-Reply-To: <5178FBEE.4070200@ed.ac.uk> References: <5178FBEE.4070200@ed.ac.uk> Message-ID: Thanks Orlando. Much appreciated. On 25 April 2013 10:48, Orlando Richards wrote: > On 25/04/13 10:38, Pete Smith wrote: > >> Hi all >> >> Good to see lots of you at the user group meeting yesterday. Great work, >> Jez! >> >> We're setting up a test cluster here at Realise, with a view to moving >> our main storage over from Gluster. >> >> We're running the test cluster on Isilon hardware ... a couple of 1920 >> nodes that we were using for home dirs. Each node has dual gigabit >> ethernet ports, and dual infiniband ports. Single dual-core Xeon proc >> and and 4GB RAM. All good stuff and should make a nice test rig. >> >> I have a few questions! >> >> 1. We're on centos6.4.x86_64. What's the easiest way to go from >> 3.3.blah to 3.5? >> 2. I'm having trouble assigning NSDs. I have a descfile which looks like: >> >> #DiskName:PrimaryServer:**BackupServer:DiskUsage:** >> FailureGroup:DesiredName:**StoragePool >> /dev/sdc1:gpfs001.**realisestudio.com::**dataAndMetadata:1 >> >> but the command >> >> "mmcrnsd -F /tmp/descfile -v no" >> >> just craps out with >> >> mmcrnsd: Processing disk sdc1 >> mmcrnsd: Node gpfs001.realisestudio.com >> > >> does not have a GPFS server license >> designation. >> mmcrnsd: Error found while checking disk descriptor >> /dev/sdc1:gpfs001.**realisestudio.com::**dataAndMetadata:1 >> mmcrnsd: Command failed. Examine previous error messages to determine >> cause. >> >> > mmchlicense server -N gpfs001.realisestudio.com should sort that one out. > > > Any help pointing me gently in the right direction would be much >> appreciated. :-) >> >> TIA >> >> -- >> Pete Smith >> DevOp/System Administrator >> Realise Studio >> 12/13 Poland Street, London W1F 8QB >> T. 
+44 (0)20 7165 9644 >> >> realisestudio.com >> >> >> _______________________________________________ >> gpfsug-discuss mailing list >> gpfsug-discuss at gpfsug.org >> http://gpfsug.org/mailman/listinfo/gpfsug-discuss >> >> > > -- > -- > Dr Orlando Richards > Information Services > IT Infrastructure Division > Unix Section > Tel: 0131 650 4994 > > The University of Edinburgh is a charitable body, registered in Scotland, > with registration number SC005336. > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at gpfsug.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > -- Pete Smith DevOp/System Administrator Realise Studio 12/13 Poland Street, London W1F 8QB T. +44 (0)20 7165 9644 realisestudio.com -------------- next part -------------- An HTML attachment was scrubbed... URL:
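For anyone who hits the same licence error that came up in the test cluster thread above, the fix Orlando pointed at boils down to something like the following. Treat it as a sketch rather than a checked procedure - the --accept flag and the verification commands should be confirmed against the documentation for your GPFS level:

    # designate the node as a GPFS server so it can act as an NSD server
    mmchlicense server --accept -N gpfs001.realisestudio.com

    # then re-run the NSD creation against the same descriptor file
    mmcrnsd -F /tmp/descfile -v no

    # and confirm the designation and the new NSD
    mmlslicense -L
    mmlsnsd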
Regards; Ahmad Y Hussein Note: This is an automated response to your message "gpfsug-discuss Digest, Vol 16, Issue 1" sent on 03/04/2013 15:00:02. This is the only notification you will receive while this person is away. From chris_stone at uk.ibm.com Wed Apr 3 16:08:39 2013 From: chris_stone at uk.ibm.com (Chris Stone) Date: Wed, 3 Apr 2013 16:08:39 +0100 Subject: [gpfsug-discuss] AUTO: Chris Stone/UK/IBM is out of the office until 16/08/2004. (returning 11/04/2013) Message-ID: I am out of the office until 11/04/2013. In an emergency please contact my manager Aniket Patel on :+44 (0) 7736 017 418 Note: This is an automated response to your message "[gpfsug-discuss] mmbackup and management classes" sent on 03/04/2013 10:57:05. This is the only notification you will receive while this person is away. From ANDREWD at uk.ibm.com Wed Apr 3 16:10:26 2013 From: ANDREWD at uk.ibm.com (Andrew Downes1) Date: Wed, 3 Apr 2013 16:10:26 +0100 Subject: [gpfsug-discuss] AUTO: Andrew Downes is out of the office (returning 08/04/2013) Message-ID: I am out of the office until 08/04/2013. If anything is too urgent to wait for my return please contact Matt Ayres mailto:m_ayres at uk.ibm.com 44-7710-981527 In case of urgency, please contact our manager Dave Shave-Wall mailto:dave_shavewall at uk.ibm.com 44-7740-921623 Note: This is an automated response to your message "gpfsug-discuss Digest, Vol 16, Issue 1" sent on 03/04/2013 12:00:02. This is the only notification you will receive while this person is away. From ashish.thandavan at cs.ox.ac.uk Thu Apr 11 10:58:41 2013 From: ashish.thandavan at cs.ox.ac.uk (Ashish Thandavan) Date: Thu, 11 Apr 2013 10:58:41 +0100 Subject: [gpfsug-discuss] Register now: Spring GPFS User Group arranged In-Reply-To: References: Message-ID: <51668951.7040506@cs.ox.ac.uk> Dear Claire, I trust you are well! If there are any spaces left, could you please register me for the event? Thank you! Regards, Ash On 25/03/13 14:38, Claire Robson wrote: > > Dear All, > > The next meeting date is set for *Wednesday 24^th April* and will be > taking place at the fantastic Dolby Studios in London (Dolby Europe > Limited, 4--6 Soho Square, London W1D 3PZ). > > *Getting to Dolby Europe Limited, Soho Square, London* > > Leave the Tottenham Court Road tube station by the South Oxford Street > exit [Exit 1]. > > Turn left onto Oxford Street. > > After about 50m turn left into Soho Street. > > Turn right into Soho Square. > > 4-6 Soho Square is directly in front of you. > > Our tentative agenda is as follows: > > 10:30 Arrivals and refreshments > > 11:00 Introductions and committee updates > > Jez Tucker, Group Chair & Claire Robson, Group Secretary > > 11:05 GPFS OpenStack Integration > > Prasenhit Sarkar, IBM Almaden Research Labs > > GPFS FPO > > Dinesh Subhraveti, IBM Almaden Research Labs > > 11:45 SAMBA 4.0 & CTDB 2.0 > > Michael Adams, SAMBA Development Team > > 12:15 SAMBA & GPFS Integration > > Volker Lendecke, SAMBA Development Team > > 13:00 Lunch (Buffet provided) > > 14:00 GPFS Native RAID & LTFS > > Jim Roche, IBM > > 14:45 User Stories > > 15:45 Group discussion: Challenges, experiences and questions & > Committee matters > > Led by Jez Tucker, Group Chairperson > > 16:00 Close > > We will be starting at 11:00am and concluding at 4pm but some of the > speaker timings may alter slightly. I will be posting further details > on what the presentations cover over the coming week or so. > > We hope you can make it for what will be a really interesting day of > GPFS discussions. 
*Please register with me if you would like to > attend* -- registrations are based on a first come first served basis. > > Best regards, > > *Claire Robson* > > GPFS User Group Secreatry > > Tel: 0114 257 2200 > > Mob: 07508 033896 > > Fax: 0114 257 0022 > > Web: _www.gpfsug.org _ > > > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at gpfsug.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss -- ------------------------- Ashish Thandavan UNIX Support Computing Officer Department of Computer Science University of Oxford Wolfson Building Parks Road Oxford OX1 3QD Phone: 01865 610733 Email: ashish.thandavan at cs.ox.ac.uk -------------- next part -------------- An HTML attachment was scrubbed... URL: From orlando.richards at ed.ac.uk Fri Apr 12 13:37:52 2013 From: orlando.richards at ed.ac.uk (Orlando Richards) Date: Fri, 12 Apr 2013 13:37:52 +0100 Subject: [gpfsug-discuss] CTDB woes Message-ID: <51680020.4040509@ed.ac.uk> Hi folks, We've long been using CTDB and Samba for our NAS service, servicing ~500 users. We've been suffering from some problems with the CTDB performance over the last few weeks, likely triggered either by an upgrade of samba from 3.5 to 3.6 (and enabling of SMB2 as a result), or possibly by additional users coming on with a new workload. We run CTDB 1.0.114.4-1 (from sernet) and samba3-3.6.12-44 (again, from sernet). Before we roll back, we'd like to make sure we can't fix the problem and stick with Samba 3.6 (and we don't even know that a roll back would fix the issue). The symptoms are a complete freeze of the service for CIFS users for 10-60 seconds, and on the servers a corresponding spawning of large numbers of CTDB processes, which seem to be created in a "big bang", and then do what they do and exit in the subsequent 10-60 seconds. We also serve up NFS from the same ctdb-managed frontends, and GPFS from the cluster - and these are both fine throughout. This was happening 5-10 times per hour, not at exact intervals though. When we added a third node to the CTDB cluster, it "got worse", and when we dropped the CTDB cluster down to a single node and everything started behaving fine - which is where we are now. So, I've got a bunch of questions! - does anyone know why ctdb would be spawning these processes, and if there's anything we can do to stop it needing to do it? - has anyone done any more general performance / config optimisation of CTDB? And - more generally - does anyone else actually use ctdb/samba/gpfs on the scale of ~500 users or higher? If so - how do you find it? -- -- Dr Orlando Richards Information Services IT Infrastructure Division Unix Section Tel: 0131 650 4994 The University of Edinburgh is a charitable body, registered in Scotland, with registration number SC005336. From Tobias.Kuebler at sva.de Fri Apr 12 14:03:58 2013 From: Tobias.Kuebler at sva.de (Tobias.Kuebler at sva.de) Date: Fri, 12 Apr 2013 15:03:58 +0200 Subject: [gpfsug-discuss] =?iso-8859-1?q?AUTO=3A_Tobias_Kuebler_ist_au=DFe?= =?iso-8859-1?q?r_Haus_=28R=FCckkehr_am_Mo=2C_04/15/2013=29?= Message-ID: Ich bin von Do, 04/11/2013 bis Mo, 04/15/2013 abwesend. Vielen Dank f?r Ihre Nachricht. Ankommende E-Mails werden w?hrend meiner Abwesenheit nicht weitergeleitet, ich versuche Sie jedoch m?glichst rasch nach meiner R?ckkehr zu beantworten. In dringenden F?llen wenden Sie sich bitte an Ihren zust?ndigen Vertriebsbeauftragten. 
Note: This is an automated response to your message "[gpfsug-discuss] CTDB woes" sent on 12.04.2013 14:37:52. This is the only notification you will receive while this person is away. -------------- next part -------------- An HTML attachment was scrubbed... URL: From orlando.richards at ed.ac.uk Fri Apr 12 16:43:44 2013 From: orlando.richards at ed.ac.uk (Orlando Richards) Date: Fri, 12 Apr 2013 16:43:44 +0100 Subject: [gpfsug-discuss] CTDB woes In-Reply-To: References: <51680020.4040509@ed.ac.uk> Message-ID: <51682BB0.7010507@ed.ac.uk> On 12/04/13 15:43, Bob Cregan wrote: > Hi Orlando, > We use ctdb/samba for CIFS, and CNFS for NFS > (GPFS version 3.4.0-13). Current versions are > > ctdb - 1.0.99 > samba 3.5.15 > > Both compiled from source. We have about 300+ users normally. > We have suspicions that 3.6 has put additional "chatter" into the ctdb database stream, which has pushed us over the edge. Barry Evans has found that the clustered locking databases, in particular, prove to be a scalability/usability limit for ctdb. > We have had no issues with this setup apart from CNFS which had 2 or 3 > bad moments over the last year. These have gone away since we have > fixed a bug with our 10G NIC drivers (emulex cards, kernel module > be2net) which lead to occasional dropped packets for jumbo frames. There > have been no issues with samba/ctdb > > The only comment I can make is that during initial investigations into > an upgrade of samba to 3.6.x we discovered that the 3.6 code would not > compile against ctdb 1.0.99 (compilation requires the ctdb source) > with error messages like: > > configure: checking whether cluster support is available > checking for ctdb.h... yes > checking for ctdb_private.h... yes > checking for CTDB_CONTROL_TRANS3_COMMIT declaration... yes > checking for CTDB_CONTROL_SCHEDULE_FOR_DELETION declaration... no > configure: error: "cluster support not available: support for > SCHEDULE_FOR_DELETION control missing" > > > What occurs to me is that this message seems to indicate that it is > possible to run a ctdb version that is incompatible with samba 3.6. > That would imply that an upgrade to a higher version of ctdb might > help, of course it might not and make backing out harder. Certainly 1.0.114 builds fine - I've not tried 2.0, I'm too scared! The versioning in CTDB has proved hard for me to fathom... > > A compile against ctdb 2.0 works fine. We will soon be running in this > upgrade, but I'm waiting to see what the samba people say at the UG > meeting first! > It has to be said - the timing is good! Cheers, Orlando > > Thanks > > Bob > > > On 12 April 2013 13:37, Orlando Richards > wrote: > > Hi folks, > > We've long been using CTDB and Samba for our NAS service, servicing > ~500 users. We've been suffering from some problems with the CTDB > performance over the last few weeks, likely triggered either by an > upgrade of samba from 3.5 to 3.6 (and enabling of SMB2 as a result), > or possibly by additional users coming on with a new workload. > > We run CTDB 1.0.114.4-1 (from sernet) and samba3-3.6.12-44 (again, > from sernet). Before we roll back, we'd like to make sure we can't > fix the problem and stick with Samba 3.6 (and we don't even know > that a roll back would fix the issue).
> > The symptoms are a complete freeze of the service for CIFS users for > 10-60 seconds, and on the servers a corresponding spawning of large > numbers of CTDB processes, which seem to be created in a "big bang", > and then do what they do and exit in the subsequent 10-60 seconds. > > We also serve up NFS from the same ctdb-managed frontends, and GPFS > from the cluster - and these are both fine throughout. > > This was happening 5-10 times per hour, not at exact intervals > though. When we added a third node to the CTDB cluster, it "got > worse", and when we dropped the CTDB cluster down to a single node > and everything started behaving fine - which is where we are now. > > So, I've got a bunch of questions! > > - does anyone know why ctdb would be spawning these processes, and > if there's anything we can do to stop it needing to do it? > - has anyone done any more general performance / config > optimisation of CTDB? > > And - more generally - does anyone else actually use ctdb/samba/gpfs > on the scale of ~500 users or higher? If so - how do you find it? > > > -- > -- > Dr Orlando Richards > Information Services > IT Infrastructure Division > Unix Section > Tel: 0131 650 4994 > > The University of Edinburgh is a charitable body, registered in > Scotland, with registration number SC005336. > _________________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at gpfsug.org > http://gpfsug.org/mailman/__listinfo/gpfsug-discuss > > > > > > -- > > Bob Cregan > > Senior Storage Systems Administrator > > ACRC > > Bristol University > > Tel: +44 (0) 117 331 4406 > > skype: bobcregan > > Mobile: +44 (0) 7712388129 > -- -- Dr Orlando Richards Information Services IT Infrastructure Division Unix Section Tel: 0131 650 4994 The University of Edinburgh is a charitable body, registered in Scotland, with registration number SC005336. From viccornell at gmail.com Fri Apr 12 19:44:16 2013 From: viccornell at gmail.com (Vic Cornell) Date: Fri, 12 Apr 2013 19:44:16 +0100 Subject: [gpfsug-discuss] CTDB woes In-Reply-To: <51682BB0.7010507@ed.ac.uk> References: <51680020.4040509@ed.ac.uk> <51682BB0.7010507@ed.ac.uk> Message-ID: <271DA6EE-D64D-4DBC-9DFE-4335E55102D4@gmail.com> Have you tried putting the ctdb files onto a separate gpfs filesystem? Vic Cornell viccornell at gmail.com On 12 Apr 2013, at 16:43, Orlando Richards wrote: > On 12/04/13 15:43, Bob Cregan wrote: >> Hi Orlando, >> We use ctdb/samba for CIFS, and CNFS for NFS >> (GPFS version 3.4.0-13) . Current versions are >> >> ctdb - 1.0.99 >> samba 3.5.15 >> >> Both compiled from source. We have about 300+ users normally. >> > > We have suspicions that 3.6 has put additional "chatter" into the ctdb database stream, which has pushed us over the edge. Barry Evans has found that the clustered locking databases, in particular, prove to be a scalability/usability limit for ctdb. > > >> We have had no issues with this setup apart from CNFS which had 2 or 3 >> bad moments over the last year . These have gone away since we have >> fixed a bug with our 10G NIC drivers (emulex cards , kernel module >> be2net) which lead to occasional dropped packets for jumbo frames. 
There >> have been no issues with samba/ctdb >> >> The only comment I can make is that during initial investigations into >> an upgrade of samba to 3.6.x we discovered that the 3.6 code would not >> compile against ctdb 1.0.99 (compilation requires tthe ctdb source ) >> with error messages like: >> >> configure: checking whether cluster support is available >> checking for ctdb.h... yes >> checking for ctdb_private.h... yes >> checking for CTDB_CONTROL_TRANS3_COMMIT declaration... yes >> checking for CTDB_CONTROL_SCHEDULE_FOR_DELETION declaration... no >> configure: error: "cluster support not available: support for >> SCHEDULE_FOR_DELETION control missing" >> >> >> What occurs to me is that this message seems to indicate that it is >> possible to run a ctdb version that is incompatible with samba 3.6. >> That would imply that an upgrade to a higher version of ctdb might >> help, of course it might not and make backing out harder. > > Certainly 1.0.114 builds fine - I've not tried 2.0, I'm too scared! The versioning in CTDB has proved hard for me to fathom... > >> >> A compile against ctdb 2.0 works fine. We will soon be running in this >> upgrade, but I'm waiting to see what the samba people say at the UG >> meeting first! >> > > It has to be said - the timing is good! > Cheers, > Orlando > >> >> Thanks >> >> Bob >> >> >> On 12 April 2013 13:37, Orlando Richards > > wrote: >> >> Hi folks, ac >> >> We've long been using CTDB and Samba for our NAS service, servicing >> ~500 users. We've been suffering from some problems with the CTDB >> performance over the last few weeks, likely triggered either by an >> upgrade of samba from 3.5 to 3.6 (and enabling of SMB2 as a result), >> or possibly by additional users coming on with a new workload. >> >> We run CTDB 1.0.114.4-1 (from sernet) and samba3-3.6.12-44 (again, >> from sernet). Before we roll back, we'd like to make sure we can't >> fix the problem and stick with Samba 3.6 (and we don't even know >> that a roll back would fix the issue). >> >> The symptoms are a complete freeze of the service for CIFS users for >> 10-60 seconds, and on the servers a corresponding spawning of large >> numbers of CTDB processes, which seem to be created in a "big bang", >> and then do what they do and exit in the subsequent 10-60 seconds. >> >> We also serve up NFS from the same ctdb-managed frontends, and GPFS >> from the cluster - and these are both fine throughout. >> >> This was happening 5-10 times per hour, not at exact intervals >> though. When we added a third node to the CTDB cluster, it "got >> worse", and when we dropped the CTDB cluster down to a single node >> and everything started behaving fine - which is where we are now. >> >> So, I've got a bunch of questions! >> >> - does anyone know why ctdb would be spawning these processes, and >> if there's anything we can do to stop it needing to do it? >> - has anyone done any more general performance / config >> optimisation of CTDB? >> >> And - more generally - does anyone else actually use ctdb/samba/gpfs >> on the scale of ~500 users or higher? If so - how do you find it? >> >> >> -- >> -- >> Dr Orlando Richards >> Information Services >> IT Infrastructure Division >> Unix Section >> Tel: 0131 650 4994 >> >> The University of Edinburgh is a charitable body, registered in >> Scotland, with registration number SC005336. 
>> _________________________________________________ >> gpfsug-discuss mailing list >> gpfsug-discuss at gpfsug.org >> http://gpfsug.org/mailman/__listinfo/gpfsug-discuss >> >> >> >> >> >> -- >> >> Bob Cregan >> >> Senior Storage Systems Administrator >> >> ACRC >> >> Bristol University >> >> Tel: +44 (0) 117 331 4406 >> >> skype: bobcregan >> >> Mobile: +44 (0) 7712388129 >> > > > -- > -- > Dr Orlando Richards > Information Services > IT Infrastructure Division > Unix Section > Tel: 0131 650 4994 > > The University of Edinburgh is a charitable body, registered in Scotland, with registration number SC005336. > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at gpfsug.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss From orlando.richards at ed.ac.uk Mon Apr 15 10:54:39 2013 From: orlando.richards at ed.ac.uk (Orlando Richards) Date: Mon, 15 Apr 2013 10:54:39 +0100 Subject: [gpfsug-discuss] CTDB woes In-Reply-To: <271DA6EE-D64D-4DBC-9DFE-4335E55102D4@gmail.com> References: <51680020.4040509@ed.ac.uk> <51682BB0.7010507@ed.ac.uk> <271DA6EE-D64D-4DBC-9DFE-4335E55102D4@gmail.com> Message-ID: <516BCE5F.8010309@ed.ac.uk> On 12/04/13 19:44, Vic Cornell wrote: > Have you tried putting the ctdb files onto a separate gpfs filesystem? No - but considered it. However, the only "live" CTDB file that sits on GPFS is the reclock file, which - I think - is only used as the heartbeat between nodes and for the recovery process. Now, there's mileage in insulating that, certainly, but I don't think that's what we're suffering from here. On a positive note - we took the steps this morning to re-initialise the ctdb databases from current data, and things seem to be stable today so far. Basically - shut down ctdb on all but one node. On all but that node, do: mv /var/ctdb/ /var/ctdb.save.date then start up ctdb on those nodes. Once they've come up, shut down ctdb on the last node, move /var/ctdb out the way, and restart. That brings them all up with freshly compacted databases. Also, from the samba-technical mailing list came the advice to use a more recent ctdb - specifically, 1.2.61. I've got that built and ready to go (and a rebuilt samba compiled against it too), but if things prove to be stable after today's compacting, then we will probably leave it at that and not deploy this. Interesting that 2.0 wasn't suggested for "stable", and that the current "dev" version is 2.1. For reference, here's the start of the thread: https://lists.samba.org/archive/samba-technical/2013-April/091525.html -- Orlando. > > On 12 Apr 2013, at 16:43, Orlando Richards wrote: > >> On 12/04/13 15:43, Bob Cregan wrote: >>> Hi Orlando, >>> We use ctdb/samba for CIFS, and CNFS for NFS >>> (GPFS version 3.4.0-13) . Current versions are >>> >>> ctdb - 1.0.99 >>> samba 3.5.15 >>> >>> Both compiled from source. We have about 300+ users normally. >>> >> >> We have suspicions that 3.6 has put additional "chatter" into the ctdb database stream, which has pushed us over the edge. Barry Evans has found that the clustered locking databases, in particular, prove to be a scalability/usability limit for ctdb. >> >> >>> We have had no issues with this setup apart from CNFS which had 2 or 3 >>> bad moments over the last year . These have gone away since we have >>> fixed a bug with our 10G NIC drivers (emulex cards , kernel module >>> be2net) which lead to occasional dropped packets for jumbo frames. 
There >>> have been no issues with samba/ctdb >>> >>> The only comment I can make is that during initial investigations into >>> an upgrade of samba to 3.6.x we discovered that the 3.6 code would not >>> compile against ctdb 1.0.99 (compilation requires tthe ctdb source ) >>> with error messages like: >>> >>> configure: checking whether cluster support is available >>> checking for ctdb.h... yes >>> checking for ctdb_private.h... yes >>> checking for CTDB_CONTROL_TRANS3_COMMIT declaration... yes >>> checking for CTDB_CONTROL_SCHEDULE_FOR_DELETION declaration... no >>> configure: error: "cluster support not available: support for >>> SCHEDULE_FOR_DELETION control missing" >>> >>> >>> What occurs to me is that this message seems to indicate that it is >>> possible to run a ctdb version that is incompatible with samba 3.6. >>> That would imply that an upgrade to a higher version of ctdb might >>> help, of course it might not and make backing out harder. >> >> Certainly 1.0.114 builds fine - I've not tried 2.0, I'm too scared! The versioning in CTDB has proved hard for me to fathom... >> >>> >>> A compile against ctdb 2.0 works fine. We will soon be running in this >>> upgrade, but I'm waiting to see what the samba people say at the UG >>> meeting first! >>> >> >> It has to be said - the timing is good! >> Cheers, >> Orlando >> >>> >>> Thanks >>> >>> Bob >>> >>> >>> On 12 April 2013 13:37, Orlando Richards >> > wrote: >>> >>> Hi folks, ac >>> >>> We've long been using CTDB and Samba for our NAS service, servicing >>> ~500 users. We've been suffering from some problems with the CTDB >>> performance over the last few weeks, likely triggered either by an >>> upgrade of samba from 3.5 to 3.6 (and enabling of SMB2 as a result), >>> or possibly by additional users coming on with a new workload. >>> >>> We run CTDB 1.0.114.4-1 (from sernet) and samba3-3.6.12-44 (again, >>> from sernet). Before we roll back, we'd like to make sure we can't >>> fix the problem and stick with Samba 3.6 (and we don't even know >>> that a roll back would fix the issue). >>> >>> The symptoms are a complete freeze of the service for CIFS users for >>> 10-60 seconds, and on the servers a corresponding spawning of large >>> numbers of CTDB processes, which seem to be created in a "big bang", >>> and then do what they do and exit in the subsequent 10-60 seconds. >>> >>> We also serve up NFS from the same ctdb-managed frontends, and GPFS >>> from the cluster - and these are both fine throughout. >>> >>> This was happening 5-10 times per hour, not at exact intervals >>> though. When we added a third node to the CTDB cluster, it "got >>> worse", and when we dropped the CTDB cluster down to a single node >>> and everything started behaving fine - which is where we are now. >>> >>> So, I've got a bunch of questions! >>> >>> - does anyone know why ctdb would be spawning these processes, and >>> if there's anything we can do to stop it needing to do it? >>> - has anyone done any more general performance / config >>> optimisation of CTDB? >>> >>> And - more generally - does anyone else actually use ctdb/samba/gpfs >>> on the scale of ~500 users or higher? If so - how do you find it? >>> >>> >>> -- >>> -- >>> Dr Orlando Richards >>> Information Services >>> IT Infrastructure Division >>> Unix Section >>> Tel: 0131 650 4994 >>> >>> The University of Edinburgh is a charitable body, registered in >>> Scotland, with registration number SC005336. 
>>> _________________________________________________ >>> gpfsug-discuss mailing list >>> gpfsug-discuss at gpfsug.org >>> http://gpfsug.org/mailman/listinfo/gpfsug-discuss >>> >>> -- >>> Bob Cregan >>> Senior Storage Systems Administrator >>> ACRC >>> Bristol University >>> Tel: +44 (0) 117 331 4406 >>> skype: bobcregan >>> Mobile: +44 (0) 7712388129 >> >> -- >> -- >> Dr Orlando Richards >> Information Services >> IT Infrastructure Division >> Unix Section >> Tel: 0131 650 4994 >> >> The University of Edinburgh is a charitable body, registered in Scotland, with registration number SC005336. >> _______________________________________________ >> gpfsug-discuss mailing list >> gpfsug-discuss at gpfsug.org >> http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at gpfsug.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > -- -- Dr Orlando Richards Information Services IT Infrastructure Division Unix Section Tel: 0131 650 4994 The University of Edinburgh is a charitable body, registered in Scotland, with registration number SC005336. From crobson at ocf.co.uk Mon Apr 15 15:04:38 2013 From: crobson at ocf.co.uk (Claire Robson) Date: Mon, 15 Apr 2013 15:04:38 +0100 Subject: [gpfsug-discuss] Latest agenda and places still available Message-ID: Dear All, Thank you to those who have expressed an interest in next Wednesday's GPFS user group meeting in London and registered a place. There are a few places still available, please register with me if you would like to attend. This is the latest agenda for the day:
10:30 Arrivals and refreshments
11:00 Introductions and committee updates
Jez Tucker, Group Chair & Claire Robson, Group Secretary
11:05 GPFS FPO
Dinesh Subhraveti, IBM Almaden Research Labs
12:00 SAMBA 4.0 & CTDB 2.0
Michael Adams, SAMBA Development Team
13:00 Lunch (Buffet provided)
13:45 GPFS OpenStack Integration
Dinesh Subhraveti, IBM Almaden Research Labs
14:15 SAMBA & GPFS Integration
Volker Lendecke, SAMBA Development Team
15:15 Refreshments break
15:30 GPFS Native RAID & LTFS
Jim Roche, IBM
16:00 Group discussion: Questions & Committee matters
Led by Jez Tucker, Group Chairperson
16:05 Close
I look forward to seeing many of you next week. Kind regards, Claire Robson GPFS user group Secretary Tel: 0114 257 2200 Mob: 07508 033896 Web: www.gpfsug.org -------------- next part -------------- An HTML attachment was scrubbed... URL: From AHMADYH at sa.ibm.com Tue Apr 16 13:08:58 2013 From: AHMADYH at sa.ibm.com (Ahmad Y Hussein) Date: Tue, 16 Apr 2013 16:08:58 +0400 Subject: [gpfsug-discuss] AUTO: Ahmad Y Hussein is out of the office (returning 04/29/2013) Message-ID: I am out of the office until 04/29/2013. Dear Sender; I am in a customer engagement with extremely limited email access, I will respond to your emails as soon as I can. For urgent cases please call me on my mobile (+966542001289). Thank you for understanding. Regards; Ahmad Y Hussein Note: This is an automated response to your message "gpfsug-discuss Digest, Vol 16, Issue 6" sent on 16/04/2013 15:00:02. This is the only notification you will receive while this person is away.
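A minimal shell sketch of the database re-initialisation Orlando describes in his 15 April message follows. The "service ctdb" init script, the /var/ctdb path and the node names are assumptions based on the sernet packaging mentioned in the thread rather than anything confirmed here, so treat it as an outline to adapt and test, not a drop-in script:

  #!/bin/bash
  # Sketch of the ctdb database re-initialisation described in the thread.
  # Assumptions (not confirmed in the thread): sernet packages providing
  # "service ctdb", databases under /var/ctdb, passwordless ssh between nodes.
  KEEP_NODE="nas1"              # node that stays up throughout (hypothetical name)
  OTHER_NODES="nas2 nas3"       # the remaining ctdb nodes (hypothetical names)
  STAMP=$(date +%Y%m%d)

  # 1. Stop ctdb everywhere except KEEP_NODE and move the old databases aside.
  for n in $OTHER_NODES; do
      ssh "$n" "service ctdb stop && mv /var/ctdb /var/ctdb.save.$STAMP"
  done

  # 2. Start those nodes again so they rejoin the cluster with rebuilt databases.
  for n in $OTHER_NODES; do
      ssh "$n" "service ctdb start"
  done

  # 3. Once they are healthy (check with "ctdb status"), repeat on KEEP_NODE.
  ssh "$KEEP_NODE" "service ctdb stop && mv /var/ctdb /var/ctdb.save.$STAMP && service ctdb start"

The intent is the same as the manual steps above: each node's local databases are moved aside and rebuilt while one node keeps serving, which is what produced the freshly compacted databases Orlando reports.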
From orlando.richards at ed.ac.uk Wed Apr 17 11:30:32 2013 From: orlando.richards at ed.ac.uk (Orlando Richards) Date: Wed, 17 Apr 2013 11:30:32 +0100 Subject: [gpfsug-discuss] CTDB woes In-Reply-To: <516BCE5F.8010309@ed.ac.uk> References: <51680020.4040509@ed.ac.uk> <51682BB0.7010507@ed.ac.uk> <271DA6EE-D64D-4DBC-9DFE-4335E55102D4@gmail.com> <516BCE5F.8010309@ed.ac.uk> Message-ID: <516E79C8.8090603@ed.ac.uk> Hi All - an update to this, After re-initialising the databases on Monday, things did seem to be running better, but ultimately we got back to suffering from spikes in ctdb processes and corresponding "pauses" in service. We fell back to a single node again for Tuesday (and things were stable once again), and this morning rolled out CTDB 1.2.61 (plus a 3.6.12 samba which was rebuilt against CTDB 1.2.61 headers). Things seem to be stable for now - more so than on Monday. For the record - one metric I'm watching is the number of ctdb processes running (this would spike to > 1000 under the failure conditions). It's currently sitting consistently at 3 processes, with occasional blips of 5-7 processes. -- Orlando On 15/04/13 10:54, Orlando Richards wrote: > On 12/04/13 19:44, Vic Cornell wrote: >> Have you tried putting the ctdb files onto a separate gpfs filesystem? > > No - but considered it. However, the only "live" CTDB file that sits on > GPFS is the reclock file, which - I think - is only used as the > heartbeat between nodes and for the recovery process. Now, there's > mileage in insulating that, certainly, but I don't think that's what > we're suffering from here. > > On a positive note - we took the steps this morning to re-initialise the > ctdb databases from current data, and things seem to be stable today so > far. > > Basically - shut down ctdb on all but one node. On all but that node, do: > mv /var/ctdb/ /var/ctdb.save.date > > then start up ctdb on those nodes. Once they've come up, shut down ctdb > on the last node, move /var/ctdb out the way, and restart. That brings > them all up with freshly compacted databases. > > Also, from the samba-technical mailing list came the advice to use a > more recent ctdb - specifically, 1.2.61. I've got that built and ready > to go (and a rebuilt samba compiled against it too), but if things prove > to be stable after today's compacting, then we will probably leave it at > that and not deploy this. > > Interesting that 2.0 wasn't suggested for "stable", and that the current > "dev" version is 2.1. > > For reference, here's the start of the thread: > https://lists.samba.org/archive/samba-technical/2013-April/091525.html > > -- > Orlando. > > > >> >> On 12 Apr 2013, at 16:43, Orlando Richards >> wrote: >> >>> On 12/04/13 15:43, Bob Cregan wrote: >>>> Hi Orlando, >>>> We use ctdb/samba for CIFS, and CNFS for NFS >>>> (GPFS version 3.4.0-13) . Current versions are >>>> >>>> ctdb - 1.0.99 >>>> samba 3.5.15 >>>> >>>> Both compiled from source. We have about 300+ users normally. >>>> >>> >>> We have suspicions that 3.6 has put additional "chatter" into the >>> ctdb database stream, which has pushed us over the edge. Barry Evans >>> has found that the clustered locking databases, in particular, prove >>> to be a scalability/usability limit for ctdb. >>> >>> >>>> We have had no issues with this setup apart from CNFS which had 2 or 3 >>>> bad moments over the last year . These have gone away since we have >>>> fixed a bug with our 10G NIC drivers (emulex cards , kernel module >>>> be2net) which lead to occasional dropped packets for jumbo frames. 
>>>> There >>>> have been no issues with samba/ctdb >>>> >>>> The only comment I can make is that during initial investigations into >>>> an upgrade of samba to 3.6.x we discovered that the 3.6 code would not >>>> compile against ctdb 1.0.99 (compilation requires tthe ctdb source ) >>>> with error messages like: >>>> >>>> configure: checking whether cluster support is available >>>> checking for ctdb.h... yes >>>> checking for ctdb_private.h... yes >>>> checking for CTDB_CONTROL_TRANS3_COMMIT declaration... yes >>>> checking for CTDB_CONTROL_SCHEDULE_FOR_DELETION declaration... no >>>> configure: error: "cluster support not available: support for >>>> SCHEDULE_FOR_DELETION control missing" >>>> >>>> >>>> What occurs to me is that this message seems to indicate that it is >>>> possible to run a ctdb version that is incompatible with samba 3.6. >>>> That would imply that an upgrade to a higher version of ctdb might >>>> help, of course it might not and make backing out harder. >>> >>> Certainly 1.0.114 builds fine - I've not tried 2.0, I'm too scared! >>> The versioning in CTDB has proved hard for me to fathom... >>> >>>> >>>> A compile against ctdb 2.0 works fine. We will soon be running in this >>>> upgrade, but I'm waiting to see what the samba people say at the UG >>>> meeting first! >>>> >>> >>> It has to be said - the timing is good! >>> Cheers, >>> Orlando >>> >>>> >>>> Thanks >>>> >>>> Bob >>>> >>>> >>>> On 12 April 2013 13:37, Orlando Richards >>> > wrote: >>>> >>>> Hi folks, ac >>>> >>>> We've long been using CTDB and Samba for our NAS service, servicing >>>> ~500 users. We've been suffering from some problems with the CTDB >>>> performance over the last few weeks, likely triggered either by an >>>> upgrade of samba from 3.5 to 3.6 (and enabling of SMB2 as a >>>> result), >>>> or possibly by additional users coming on with a new workload. >>>> >>>> We run CTDB 1.0.114.4-1 (from sernet) and samba3-3.6.12-44 (again, >>>> from sernet). Before we roll back, we'd like to make sure we can't >>>> fix the problem and stick with Samba 3.6 (and we don't even know >>>> that a roll back would fix the issue). >>>> >>>> The symptoms are a complete freeze of the service for CIFS users >>>> for >>>> 10-60 seconds, and on the servers a corresponding spawning of large >>>> numbers of CTDB processes, which seem to be created in a "big >>>> bang", >>>> and then do what they do and exit in the subsequent 10-60 seconds. >>>> >>>> We also serve up NFS from the same ctdb-managed frontends, and GPFS >>>> from the cluster - and these are both fine throughout. >>>> >>>> This was happening 5-10 times per hour, not at exact intervals >>>> though. When we added a third node to the CTDB cluster, it "got >>>> worse", and when we dropped the CTDB cluster down to a single node >>>> and everything started behaving fine - which is where we are now. >>>> >>>> So, I've got a bunch of questions! >>>> >>>> - does anyone know why ctdb would be spawning these processes, >>>> and >>>> if there's anything we can do to stop it needing to do it? >>>> - has anyone done any more general performance / config >>>> optimisation of CTDB? >>>> >>>> And - more generally - does anyone else actually use >>>> ctdb/samba/gpfs >>>> on the scale of ~500 users or higher? If so - how do you find it? 
>>>> >>>> >>>> -- >>>> -- >>>> Dr Orlando Richards >>>> Information Services >>>> IT Infrastructure Division >>>> Unix Section >>>> Tel: 0131 650 4994 >>>> >>>> The University of Edinburgh is a charitable body, registered in >>>> Scotland, with registration number SC005336. >>>> _________________________________________________ >>>> gpfsug-discuss mailing list >>>> gpfsug-discuss at gpfsug.org >>>> http://gpfsug.org/mailman/__listinfo/gpfsug-discuss >>>> >>>> >>>> >>>> >>>> >>>> -- >>>> >>>> Bob Cregan >>>> >>>> Senior Storage Systems Administrator >>>> >>>> ACRC >>>> >>>> Bristol University >>>> >>>> Tel: +44 (0) 117 331 4406 >>>> >>>> skype: bobcregan >>>> >>>> Mobile: +44 (0) 7712388129 >>>> >>> >>> >>> -- >>> -- >>> Dr Orlando Richards >>> Information Services >>> IT Infrastructure Division >>> Unix Section >>> Tel: 0131 650 4994 >>> >>> The University of Edinburgh is a charitable body, registered in >>> Scotland, with registration number SC005336. >>> _______________________________________________ >>> gpfsug-discuss mailing list >>> gpfsug-discuss at gpfsug.org >>> http://gpfsug.org/mailman/listinfo/gpfsug-discuss >> >> _______________________________________________ >> gpfsug-discuss mailing list >> gpfsug-discuss at gpfsug.org >> http://gpfsug.org/mailman/listinfo/gpfsug-discuss >> > > -- -- Dr Orlando Richards Information Services IT Infrastructure Division Unix Section Tel: 0131 650 4994 The University of Edinburgh is a charitable body, registered in Scotland, with registration number SC005336. From orlando.richards at ed.ac.uk Mon Apr 22 15:52:55 2013 From: orlando.richards at ed.ac.uk (Orlando Richards) Date: Mon, 22 Apr 2013 15:52:55 +0100 Subject: [gpfsug-discuss] CTDB woes In-Reply-To: <516E79C8.8090603@ed.ac.uk> References: <51680020.4040509@ed.ac.uk> <51682BB0.7010507@ed.ac.uk> <271DA6EE-D64D-4DBC-9DFE-4335E55102D4@gmail.com> <516BCE5F.8010309@ed.ac.uk> <516E79C8.8090603@ed.ac.uk> Message-ID: <51754EC7.8000600@ed.ac.uk> On 17/04/13 11:30, Orlando Richards wrote: > Hi All - an update to this, > > After re-initialising the databases on Monday, things did seem to be > running better, but ultimately we got back to suffering from spikes in > ctdb processes and corresponding "pauses" in service. We fell back to a > single node again for Tuesday (and things were stable once again), and > this morning rolled out CTDB 1.2.61 (plus a 3.6.12 samba which was > rebuilt against CTDB 1.2.61 headers). > > Things seem to be stable for now - more so than on Monday. > > For the record - one metric I'm watching is the number of ctdb processes > running (this would spike to > 1000 under the failure conditions). It's > currently sitting consistently at 3 processes, with occasional blips of > 5-7 processes. > Hi all, Looks like things have been running fine since we upgraded ctdb last Wednesday, so I think it's safe to say that we've found a fix for our problem in CTDB 1.2.61. Thanks for all the input! If anyone wants more info, feel free to get in touch. -- Orlando > -- > Orlando > > > > > > On 15/04/13 10:54, Orlando Richards wrote: >> On 12/04/13 19:44, Vic Cornell wrote: >>> Have you tried putting the ctdb files onto a separate gpfs filesystem? >> >> No - but considered it. However, the only "live" CTDB file that sits on >> GPFS is the reclock file, which - I think - is only used as the >> heartbeat between nodes and for the recovery process. Now, there's >> mileage in insulating that, certainly, but I don't think that's what >> we're suffering from here. 
>> >> On a positive note - we took the steps this morning to re-initialise the >> ctdb databases from current data, and things seem to be stable today so >> far. >> >> Basically - shut down ctdb on all but one node. On all but that node, do: >> mv /var/ctdb/ /var/ctdb.save.date >> >> then start up ctdb on those nodes. Once they've come up, shut down ctdb >> on the last node, move /var/ctdb out the way, and restart. That brings >> them all up with freshly compacted databases. >> >> Also, from the samba-technical mailing list came the advice to use a >> more recent ctdb - specifically, 1.2.61. I've got that built and ready >> to go (and a rebuilt samba compiled against it too), but if things prove >> to be stable after today's compacting, then we will probably leave it at >> that and not deploy this. >> >> Interesting that 2.0 wasn't suggested for "stable", and that the current >> "dev" version is 2.1. >> >> For reference, here's the start of the thread: >> https://lists.samba.org/archive/samba-technical/2013-April/091525.html >> >> -- >> Orlando. >> >> >> >>> >>> On 12 Apr 2013, at 16:43, Orlando Richards >>> wrote: >>> >>>> On 12/04/13 15:43, Bob Cregan wrote: >>>>> Hi Orlando, >>>>> We use ctdb/samba for CIFS, and CNFS for NFS >>>>> (GPFS version 3.4.0-13) . Current versions are >>>>> >>>>> ctdb - 1.0.99 >>>>> samba 3.5.15 >>>>> >>>>> Both compiled from source. We have about 300+ users normally. >>>>> >>>> >>>> We have suspicions that 3.6 has put additional "chatter" into the >>>> ctdb database stream, which has pushed us over the edge. Barry Evans >>>> has found that the clustered locking databases, in particular, prove >>>> to be a scalability/usability limit for ctdb. >>>> >>>> >>>>> We have had no issues with this setup apart from CNFS which had 2 or 3 >>>>> bad moments over the last year . These have gone away since we have >>>>> fixed a bug with our 10G NIC drivers (emulex cards , kernel module >>>>> be2net) which lead to occasional dropped packets for jumbo frames. >>>>> There >>>>> have been no issues with samba/ctdb >>>>> >>>>> The only comment I can make is that during initial investigations into >>>>> an upgrade of samba to 3.6.x we discovered that the 3.6 code would not >>>>> compile against ctdb 1.0.99 (compilation requires tthe ctdb source ) >>>>> with error messages like: >>>>> >>>>> configure: checking whether cluster support is available >>>>> checking for ctdb.h... yes >>>>> checking for ctdb_private.h... yes >>>>> checking for CTDB_CONTROL_TRANS3_COMMIT declaration... yes >>>>> checking for CTDB_CONTROL_SCHEDULE_FOR_DELETION declaration... no >>>>> configure: error: "cluster support not available: support for >>>>> SCHEDULE_FOR_DELETION control missing" >>>>> >>>>> >>>>> What occurs to me is that this message seems to indicate that it is >>>>> possible to run a ctdb version that is incompatible with samba 3.6. >>>>> That would imply that an upgrade to a higher version of ctdb might >>>>> help, of course it might not and make backing out harder. >>>> >>>> Certainly 1.0.114 builds fine - I've not tried 2.0, I'm too scared! >>>> The versioning in CTDB has proved hard for me to fathom... >>>> >>>>> >>>>> A compile against ctdb 2.0 works fine. We will soon be running in this >>>>> upgrade, but I'm waiting to see what the samba people say at the UG >>>>> meeting first! >>>>> >>>> >>>> It has to be said - the timing is good! 
>>>> Cheers, >>>> Orlando >>>> >>>>> >>>>> Thanks >>>>> >>>>> Bob >>>>> >>>>> >>>>> On 12 April 2013 13:37, Orlando Richards >>>> > wrote: >>>>> >>>>> Hi folks, ac >>>>> >>>>> We've long been using CTDB and Samba for our NAS service, >>>>> servicing >>>>> ~500 users. We've been suffering from some problems with the CTDB >>>>> performance over the last few weeks, likely triggered either by an >>>>> upgrade of samba from 3.5 to 3.6 (and enabling of SMB2 as a >>>>> result), >>>>> or possibly by additional users coming on with a new workload. >>>>> >>>>> We run CTDB 1.0.114.4-1 (from sernet) and samba3-3.6.12-44 (again, >>>>> from sernet). Before we roll back, we'd like to make sure we can't >>>>> fix the problem and stick with Samba 3.6 (and we don't even know >>>>> that a roll back would fix the issue). >>>>> >>>>> The symptoms are a complete freeze of the service for CIFS users >>>>> for >>>>> 10-60 seconds, and on the servers a corresponding spawning of >>>>> large >>>>> numbers of CTDB processes, which seem to be created in a "big >>>>> bang", >>>>> and then do what they do and exit in the subsequent 10-60 seconds. >>>>> >>>>> We also serve up NFS from the same ctdb-managed frontends, and >>>>> GPFS >>>>> from the cluster - and these are both fine throughout. >>>>> >>>>> This was happening 5-10 times per hour, not at exact intervals >>>>> though. When we added a third node to the CTDB cluster, it "got >>>>> worse", and when we dropped the CTDB cluster down to a single node >>>>> and everything started behaving fine - which is where we are now. >>>>> >>>>> So, I've got a bunch of questions! >>>>> >>>>> - does anyone know why ctdb would be spawning these processes, >>>>> and >>>>> if there's anything we can do to stop it needing to do it? >>>>> - has anyone done any more general performance / config >>>>> optimisation of CTDB? >>>>> >>>>> And - more generally - does anyone else actually use >>>>> ctdb/samba/gpfs >>>>> on the scale of ~500 users or higher? If so - how do you find it? >>>>> >>>>> >>>>> -- >>>>> -- >>>>> Dr Orlando Richards >>>>> Information Services >>>>> IT Infrastructure Division >>>>> Unix Section >>>>> Tel: 0131 650 4994 >>>>> >>>>> The University of Edinburgh is a charitable body, registered in >>>>> Scotland, with registration number SC005336. >>>>> _________________________________________________ >>>>> gpfsug-discuss mailing list >>>>> gpfsug-discuss at gpfsug.org >>>>> http://gpfsug.org/mailman/__listinfo/gpfsug-discuss >>>>> >>>>> >>>>> >>>>> >>>>> >>>>> -- >>>>> >>>>> Bob Cregan >>>>> >>>>> Senior Storage Systems Administrator >>>>> >>>>> ACRC >>>>> >>>>> Bristol University >>>>> >>>>> Tel: +44 (0) 117 331 4406 >>>>> >>>>> skype: bobcregan >>>>> >>>>> Mobile: +44 (0) 7712388129 >>>>> >>>> >>>> >>>> -- >>>> -- >>>> Dr Orlando Richards >>>> Information Services >>>> IT Infrastructure Division >>>> Unix Section >>>> Tel: 0131 650 4994 >>>> >>>> The University of Edinburgh is a charitable body, registered in >>>> Scotland, with registration number SC005336. 
>>>> _______________________________________________ >>>> gpfsug-discuss mailing list >>>> gpfsug-discuss at gpfsug.org >>>> http://gpfsug.org/mailman/listinfo/gpfsug-discuss >>> >>> _______________________________________________ >>> gpfsug-discuss mailing list >>> gpfsug-discuss at gpfsug.org >>> http://gpfsug.org/mailman/listinfo/gpfsug-discuss >>> >> >> > > -- -- Dr Orlando Richards Information Services IT Infrastructure Division Unix Section Tel: 0131 650 4994 The University of Edinburgh is a charitable body, registered in Scotland, with registration number SC005336. From pete at realisestudio.com Thu Apr 25 10:38:07 2013 From: pete at realisestudio.com (Pete Smith) Date: Thu, 25 Apr 2013 10:38:07 +0100 Subject: [gpfsug-discuss] Test cluster - some questions Message-ID: Hi all Good to see lots of you at the user group meeting yesterday. Great work, Jez! We're setting up a test cluster here at Realise, with a view to moving our main storage over from Gluster. We're running the test cluster on Isilon hardware ... a couple of 1920 nodes that we were using for home dirs. Each node has dual gigabit ethernet ports, and dual infiniband ports. Single dual-core Xeon proc and and 4GB RAM. All good stuff and should make a nice test rig. I have a few questions! 1. We're on centos6.4.x86_64. What's the easiest way to go from 3.3.blah to 3.5? 2. I'm having trouble assigning NSDs. I have a descfile which looks like: #DiskName:PrimaryServer:BackupServer:DiskUsage:FailureGroup:DesiredName:StoragePool /dev/sdc1:gpfs001.realisestudio.com::dataAndMetadata:1 but the command "mmcrnsd -F /tmp/descfile -v no" just craps out with mmcrnsd: Processing disk sdc1 mmcrnsd: Node gpfs001.realisestudio.com does not have a GPFS server license designation. mmcrnsd: Error found while checking disk descriptor /dev/sdc1:gpfs001.realisestudio.com::dataAndMetadata:1 mmcrnsd: Command failed. Examine previous error messages to determine cause. Any help pointing me gently in the right direction would be much appreciated. :-) TIA -- Pete Smith DevOp/System Administrator Realise Studio 12/13 Poland Street, London W1F 8QB T. +44 (0)20 7165 9644 realisestudio.com -------------- next part -------------- An HTML attachment was scrubbed... URL: From orlando.richards at ed.ac.uk Thu Apr 25 10:48:30 2013 From: orlando.richards at ed.ac.uk (Orlando Richards) Date: Thu, 25 Apr 2013 10:48:30 +0100 Subject: [gpfsug-discuss] Test cluster - some questions In-Reply-To: References: Message-ID: <5178FBEE.4070200@ed.ac.uk> On 25/04/13 10:38, Pete Smith wrote: > Hi all > > Good to see lots of you at the user group meeting yesterday. Great work, > Jez! > > We're setting up a test cluster here at Realise, with a view to moving > our main storage over from Gluster. > > We're running the test cluster on Isilon hardware ... a couple of 1920 > nodes that we were using for home dirs. Each node has dual gigabit > ethernet ports, and dual infiniband ports. Single dual-core Xeon proc > and and 4GB RAM. All good stuff and should make a nice test rig. > > I have a few questions! > > 1. We're on centos6.4.x86_64. What's the easiest way to go from > 3.3.blah to 3.5? > 2. I'm having trouble assigning NSDs. 
I have a descfile which looks like: > > #DiskName:PrimaryServer:BackupServer:DiskUsage:FailureGroup:DesiredName:StoragePool > /dev/sdc1:gpfs001.realisestudio.com::dataAndMetadata:1 > > but the command > > "mmcrnsd -F /tmp/descfile -v no" > > just craps out with > > mmcrnsd: Processing disk sdc1 > mmcrnsd: Node gpfs001.realisestudio.com > does not have a GPFS server license > designation. > mmcrnsd: Error found while checking disk descriptor > /dev/sdc1:gpfs001.realisestudio.com::dataAndMetadata:1 > mmcrnsd: Command failed. Examine previous error messages to determine > cause. > mmchlicense server -N gpfs001.realisestudio.com should sort that one out. > Any help pointing me gently in the right direction would be much > appreciated. :-) > > TIA > > -- > Pete Smith > DevOp/System Administrator > Realise Studio > 12/13 Poland Street, London W1F 8QB > T. +44 (0)20 7165 9644 > > realisestudio.com > > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at gpfsug.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > -- -- Dr Orlando Richards Information Services IT Infrastructure Division Unix Section Tel: 0131 650 4994 The University of Edinburgh is a charitable body, registered in Scotland, with registration number SC005336. From pete at realisestudio.com Thu Apr 25 11:05:36 2013 From: pete at realisestudio.com (Pete Smith) Date: Thu, 25 Apr 2013 11:05:36 +0100 Subject: [gpfsug-discuss] Test cluster - some questions In-Reply-To: <5178FBEE.4070200@ed.ac.uk> References: <5178FBEE.4070200@ed.ac.uk> Message-ID: Thanks Orlando. Much appreciated. On 25 April 2013 10:48, Orlando Richards wrote: > On 25/04/13 10:38, Pete Smith wrote: > >> Hi all >> >> Good to see lots of you at the user group meeting yesterday. Great work, >> Jez! >> >> We're setting up a test cluster here at Realise, with a view to moving >> our main storage over from Gluster. >> >> We're running the test cluster on Isilon hardware ... a couple of 1920 >> nodes that we were using for home dirs. Each node has dual gigabit >> ethernet ports, and dual infiniband ports. Single dual-core Xeon proc >> and and 4GB RAM. All good stuff and should make a nice test rig. >> >> I have a few questions! >> >> 1. We're on centos6.4.x86_64. What's the easiest way to go from >> 3.3.blah to 3.5? >> 2. I'm having trouble assigning NSDs. I have a descfile which looks like: >> >> #DiskName:PrimaryServer:**BackupServer:DiskUsage:** >> FailureGroup:DesiredName:**StoragePool >> /dev/sdc1:gpfs001.**realisestudio.com::**dataAndMetadata:1 >> >> but the command >> >> "mmcrnsd -F /tmp/descfile -v no" >> >> just craps out with >> >> mmcrnsd: Processing disk sdc1 >> mmcrnsd: Node gpfs001.realisestudio.com >> > >> does not have a GPFS server license >> designation. >> mmcrnsd: Error found while checking disk descriptor >> /dev/sdc1:gpfs001.**realisestudio.com::**dataAndMetadata:1 >> mmcrnsd: Command failed. Examine previous error messages to determine >> cause. >> >> > mmchlicense server -N gpfs001.realisestudio.com should sort that one out. > > > Any help pointing me gently in the right direction would be much >> appreciated. :-) >> >> TIA >> >> -- >> Pete Smith >> DevOp/System Administrator >> Realise Studio >> 12/13 Poland Street, London W1F 8QB >> T. 
+44 (0)20 7165 9644 >> >> realisestudio.com >> >> >> _______________________________________________ >> gpfsug-discuss mailing list >> gpfsug-discuss at gpfsug.org >> http://gpfsug.org/mailman/listinfo/gpfsug-discuss >> >> > > -- > -- > Dr Orlando Richards > Information Services > IT Infrastructure Division > Unix Section > Tel: 0131 650 4994 > > The University of Edinburgh is a charitable body, registered in Scotland, > with registration number SC005336. > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at gpfsug.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > -- Pete Smith DevOp/System Administrator Realise Studio 12/13 Poland Street, London W1F 8QB T. +44 (0)20 7165 9644 realisestudio.com -------------- next part -------------- An HTML attachment was scrubbed... URL: From pete at realisestudio.com Fri Apr 26 16:06:38 2013 From: pete at realisestudio.com (Pete Smith) Date: Fri, 26 Apr 2013 16:06:38 +0100 Subject: [gpfsug-discuss] GPFS Native RAID on linux? Message-ID: Hi I thought from the presentation that this was available on linux ... but documentation seems to indicate IBM GSS only? -- Pete Smith DevOp/System Administrator Realise Studio 12/13 Poland Street, London W1F 8QB T. +44 (0)20 7165 9644 realisestudio.com -------------- next part -------------- An HTML attachment was scrubbed... URL: From stuartb at 4gh.net Tue Apr 30 21:50:38 2013 From: stuartb at 4gh.net (Stuart Barkley) Date: Tue, 30 Apr 2013 16:50:38 -0400 (EDT) Subject: [gpfsug-discuss] Test cluster - some questions In-Reply-To: References: Message-ID: On Thu, 25 Apr 2013 at 05:38 -0000, Pete Smith wrote: > 1. We're on centos6.4.x86_64. What's the easiest way to go from > 3.3.blah to 3.5? We are in transition to 3.5 on our original GPFS installation. Two of four servers are now at GPFS 3.4.XX/CentOS 6.4. Two servers are still at 3.3.YY/CentOS 5.4. The compute nodes are all at 3.4.XX/CentOS 6.4. The data center is remotely located and it is a pain to get physical access. Once we get the last two nodes upgraded, we expect to go to GPFS 3.5 fairly quickly (we already have 3.5 running on a newer GPFS installation). My understanding is that you need to step through 3.4 during a migration from 3.3 to 3.5. Stuart -- I've never been lost; I was once bewildered for three days, but never lost! -- Daniel Boone
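To tie the test-cluster thread together, a rough sketch of the working sequence on a GPFS 3.4/3.5-era cluster is below. The node name and the descriptor line come from Pete's message; the filesystem name, mount point and the --accept flag are assumptions to be checked against the mmchlicense and mmcrfs documentation for your release:

  # Give the node a server licence designation so it can serve NSDs
  # (this was the cause of "does not have a GPFS server license designation").
  mmchlicense server --accept -N gpfs001.realisestudio.com
  mmlslicense -L        # verify the designation took effect

  # Old-style colon-separated disk descriptor file, one line per disk:
  # DiskName:PrimaryServer:BackupServer:DiskUsage:FailureGroup:DesiredName:StoragePool
  cat > /tmp/descfile <<'EOF'
  /dev/sdc1:gpfs001.realisestudio.com::dataAndMetadata:1
  EOF

  # Create the NSD; mmcrnsd rewrites the file with the generated NSD names,
  # so the same file can be passed straight to mmcrfs afterwards.
  mmcrnsd -F /tmp/descfile -v no

  # Hypothetical filesystem name and mount point for the test rig.
  mmcrfs gpfstest -F /tmp/descfile -T /gpfs/test -A yes
  mmmount gpfstest -a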
>> _________________________________________________ >> gpfsug-discuss mailing list >> gpfsug-discuss at gpfsug.org >> http://gpfsug.org/mailman/__listinfo/gpfsug-discuss >> >> >> >> >> >> -- >> >> Bob Cregan >> >> Senior Storage Systems Administrator >> >> ACRC >> >> Bristol University >> >> Tel: +44 (0) 117 331 4406 >> >> skype: bobcregan >> >> Mobile: +44 (0) 7712388129 >> > > > -- > -- > Dr Orlando Richards > Information Services > IT Infrastructure Division > Unix Section > Tel: 0131 650 4994 > > The University of Edinburgh is a charitable body, registered in Scotland, with registration number SC005336. > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at gpfsug.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss From orlando.richards at ed.ac.uk Mon Apr 15 10:54:39 2013 From: orlando.richards at ed.ac.uk (Orlando Richards) Date: Mon, 15 Apr 2013 10:54:39 +0100 Subject: [gpfsug-discuss] CTDB woes In-Reply-To: <271DA6EE-D64D-4DBC-9DFE-4335E55102D4@gmail.com> References: <51680020.4040509@ed.ac.uk> <51682BB0.7010507@ed.ac.uk> <271DA6EE-D64D-4DBC-9DFE-4335E55102D4@gmail.com> Message-ID: <516BCE5F.8010309@ed.ac.uk> On 12/04/13 19:44, Vic Cornell wrote: > Have you tried putting the ctdb files onto a separate gpfs filesystem? No - but considered it. However, the only "live" CTDB file that sits on GPFS is the reclock file, which - I think - is only used as the heartbeat between nodes and for the recovery process. Now, there's mileage in insulating that, certainly, but I don't think that's what we're suffering from here. On a positive note - we took the steps this morning to re-initialise the ctdb databases from current data, and things seem to be stable today so far. Basically - shut down ctdb on all but one node. On all but that node, do: mv /var/ctdb/ /var/ctdb.save.date then start up ctdb on those nodes. Once they've come up, shut down ctdb on the last node, move /var/ctdb out the way, and restart. That brings them all up with freshly compacted databases. Also, from the samba-technical mailing list came the advice to use a more recent ctdb - specifically, 1.2.61. I've got that built and ready to go (and a rebuilt samba compiled against it too), but if things prove to be stable after today's compacting, then we will probably leave it at that and not deploy this. Interesting that 2.0 wasn't suggested for "stable", and that the current "dev" version is 2.1. For reference, here's the start of the thread: https://lists.samba.org/archive/samba-technical/2013-April/091525.html -- Orlando. > > On 12 Apr 2013, at 16:43, Orlando Richards wrote: > >> On 12/04/13 15:43, Bob Cregan wrote: >>> Hi Orlando, >>> We use ctdb/samba for CIFS, and CNFS for NFS >>> (GPFS version 3.4.0-13) . Current versions are >>> >>> ctdb - 1.0.99 >>> samba 3.5.15 >>> >>> Both compiled from source. We have about 300+ users normally. >>> >> >> We have suspicions that 3.6 has put additional "chatter" into the ctdb database stream, which has pushed us over the edge. Barry Evans has found that the clustered locking databases, in particular, prove to be a scalability/usability limit for ctdb. >> >> >>> We have had no issues with this setup apart from CNFS which had 2 or 3 >>> bad moments over the last year . These have gone away since we have >>> fixed a bug with our 10G NIC drivers (emulex cards , kernel module >>> be2net) which lead to occasional dropped packets for jumbo frames. 
There >>> have been no issues with samba/ctdb >>> >>> The only comment I can make is that during initial investigations into >>> an upgrade of samba to 3.6.x we discovered that the 3.6 code would not >>> compile against ctdb 1.0.99 (compilation requires tthe ctdb source ) >>> with error messages like: >>> >>> configure: checking whether cluster support is available >>> checking for ctdb.h... yes >>> checking for ctdb_private.h... yes >>> checking for CTDB_CONTROL_TRANS3_COMMIT declaration... yes >>> checking for CTDB_CONTROL_SCHEDULE_FOR_DELETION declaration... no >>> configure: error: "cluster support not available: support for >>> SCHEDULE_FOR_DELETION control missing" >>> >>> >>> What occurs to me is that this message seems to indicate that it is >>> possible to run a ctdb version that is incompatible with samba 3.6. >>> That would imply that an upgrade to a higher version of ctdb might >>> help, of course it might not and make backing out harder. >> >> Certainly 1.0.114 builds fine - I've not tried 2.0, I'm too scared! The versioning in CTDB has proved hard for me to fathom... >> >>> >>> A compile against ctdb 2.0 works fine. We will soon be running in this >>> upgrade, but I'm waiting to see what the samba people say at the UG >>> meeting first! >>> >> >> It has to be said - the timing is good! >> Cheers, >> Orlando >> >>> >>> Thanks >>> >>> Bob >>> >>> >>> On 12 April 2013 13:37, Orlando Richards >> > wrote: >>> >>> Hi folks, ac >>> >>> We've long been using CTDB and Samba for our NAS service, servicing >>> ~500 users. We've been suffering from some problems with the CTDB >>> performance over the last few weeks, likely triggered either by an >>> upgrade of samba from 3.5 to 3.6 (and enabling of SMB2 as a result), >>> or possibly by additional users coming on with a new workload. >>> >>> We run CTDB 1.0.114.4-1 (from sernet) and samba3-3.6.12-44 (again, >>> from sernet). Before we roll back, we'd like to make sure we can't >>> fix the problem and stick with Samba 3.6 (and we don't even know >>> that a roll back would fix the issue). >>> >>> The symptoms are a complete freeze of the service for CIFS users for >>> 10-60 seconds, and on the servers a corresponding spawning of large >>> numbers of CTDB processes, which seem to be created in a "big bang", >>> and then do what they do and exit in the subsequent 10-60 seconds. >>> >>> We also serve up NFS from the same ctdb-managed frontends, and GPFS >>> from the cluster - and these are both fine throughout. >>> >>> This was happening 5-10 times per hour, not at exact intervals >>> though. When we added a third node to the CTDB cluster, it "got >>> worse", and when we dropped the CTDB cluster down to a single node >>> and everything started behaving fine - which is where we are now. >>> >>> So, I've got a bunch of questions! >>> >>> - does anyone know why ctdb would be spawning these processes, and >>> if there's anything we can do to stop it needing to do it? >>> - has anyone done any more general performance / config >>> optimisation of CTDB? >>> >>> And - more generally - does anyone else actually use ctdb/samba/gpfs >>> on the scale of ~500 users or higher? If so - how do you find it? >>> >>> >>> -- >>> -- >>> Dr Orlando Richards >>> Information Services >>> IT Infrastructure Division >>> Unix Section >>> Tel: 0131 650 4994 >>> >>> The University of Edinburgh is a charitable body, registered in >>> Scotland, with registration number SC005336. 
>>> _________________________________________________ >>> gpfsug-discuss mailing list >>> gpfsug-discuss at gpfsug.org >>> http://gpfsug.org/mailman/__listinfo/gpfsug-discuss >>> >>> >>> >>> >>> >>> -- >>> >>> Bob Cregan >>> >>> Senior Storage Systems Administrator >>> >>> ACRC >>> >>> Bristol University >>> >>> Tel: +44 (0) 117 331 4406 >>> >>> skype: bobcregan >>> >>> Mobile: +44 (0) 7712388129 >>> >> >> >> -- >> -- >> Dr Orlando Richards >> Information Services >> IT Infrastructure Division >> Unix Section >> Tel: 0131 650 4994 >> >> The University of Edinburgh is a charitable body, registered in Scotland, with registration number SC005336. >> _______________________________________________ >> gpfsug-discuss mailing list >> gpfsug-discuss at gpfsug.org >> http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at gpfsug.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > -- -- Dr Orlando Richards Information Services IT Infrastructure Division Unix Section Tel: 0131 650 4994 The University of Edinburgh is a charitable body, registered in Scotland, with registration number SC005336. From crobson at ocf.co.uk Mon Apr 15 15:04:38 2013 From: crobson at ocf.co.uk (Claire Robson) Date: Mon, 15 Apr 2013 15:04:38 +0100 Subject: [gpfsug-discuss] Latest agenda and places still available Message-ID: Dear All, Thank you to those who have expressed an interest in next Wednesday's GPFS user group meeting in London and registered a place. There are a few places still available, please register with me if you would like to attend. This is the latest agenda for the day: 10:30 Arrivals and refreshments 11:00 Introductions and committee updates Jez Tucker, Group Chair & Claire Robson, Group Secretary 11:05 GPFS FPO Dinesh Subhraveti, IBM Almaden Research Labs 12:00 SAMBA 4.0 & CTDB 2.0 Michael Adams, SAMBA Development Team 13:00 Lunch (Buffet provided) 13:45 GPFS OpenStack Integration Dinesh Subhraveti, IBM Almaden Research Labs 14:15 SAMBA & GPFS Integration Volker Lendecke, SAMBA Development Team 15:15 Refreshments break 15:30 GPFS Native RAID & LTFS Jim Roche, IBM 16:00 Group discussion: Questions & Committee matters Led by Jez Tucker, Group Chairperson 16:05 Close I look forward to seeing many of you next week. Kind regards, Claire Robson GPFS user group Secetary Tel: 0114 257 2200 Mob: 07508 033896 Web: www.gpfsug.org -------------- next part -------------- An HTML attachment was scrubbed... URL: From AHMADYH at sa.ibm.com Tue Apr 16 13:08:58 2013 From: AHMADYH at sa.ibm.com (Ahmad Y Hussein) Date: Tue, 16 Apr 2013 16:08:58 +0400 Subject: [gpfsug-discuss] AUTO: Ahmad Y Hussein is out of the office (returning 04/29/2013) Message-ID: I am out of the office until 04/29/2013. Dear Sender; I am in a customer engagement with extremely limited email access, I will respond to your emails as soon as i can. For Urjent cases please call me on my mobile (+966542001289). Thank you for understanding. Regards; Ahmad Y Hussein Note: This is an automated response to your message "gpfsug-discuss Digest, Vol 16, Issue 6" sent on 16/04/2013 15:00:02. This is the only notification you will receive while this person is away. 
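A rough sketch of the database re-initialisation described in the 15 April mail above -- run on all but one node first, then on the remaining node once the others are back up. It assumes the default /var/ctdb database directory and an /etc/init.d/ctdb init script (the sernet packages ship one); the service name and paths may differ on other builds:

   # stop ctdb on this node
   service ctdb stop
   # move the old databases aside (keep them, in case a roll back is needed)
   mv /var/ctdb /var/ctdb.save.$(date +%Y%m%d)
   # restart ctdb so it comes back with freshly created, compacted databases
   service ctdb start
   # wait until the node reports healthy before moving on to the next one
   ctdb status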
From orlando.richards at ed.ac.uk Wed Apr 17 11:30:32 2013 From: orlando.richards at ed.ac.uk (Orlando Richards) Date: Wed, 17 Apr 2013 11:30:32 +0100 Subject: [gpfsug-discuss] CTDB woes In-Reply-To: <516BCE5F.8010309@ed.ac.uk> References: <51680020.4040509@ed.ac.uk> <51682BB0.7010507@ed.ac.uk> <271DA6EE-D64D-4DBC-9DFE-4335E55102D4@gmail.com> <516BCE5F.8010309@ed.ac.uk> Message-ID: <516E79C8.8090603@ed.ac.uk> Hi All - an update to this, After re-initialising the databases on Monday, things did seem to be running better, but ultimately we got back to suffering from spikes in ctdb processes and corresponding "pauses" in service. We fell back to a single node again for Tuesday (and things were stable once again), and this morning rolled out CTDB 1.2.61 (plus a 3.6.12 samba which was rebuilt against CTDB 1.2.61 headers). Things seem to be stable for now - more so than on Monday. For the record - one metric I'm watching is the number of ctdb processes running (this would spike to > 1000 under the failure conditions). It's currently sitting consistently at 3 processes, with occasional blips of 5-7 processes. -- Orlando On 15/04/13 10:54, Orlando Richards wrote: > On 12/04/13 19:44, Vic Cornell wrote: >> Have you tried putting the ctdb files onto a separate gpfs filesystem? > > No - but considered it. However, the only "live" CTDB file that sits on > GPFS is the reclock file, which - I think - is only used as the > heartbeat between nodes and for the recovery process. Now, there's > mileage in insulating that, certainly, but I don't think that's what > we're suffering from here. > > On a positive note - we took the steps this morning to re-initialise the > ctdb databases from current data, and things seem to be stable today so > far. > > Basically - shut down ctdb on all but one node. On all but that node, do: > mv /var/ctdb/ /var/ctdb.save.date > > then start up ctdb on those nodes. Once they've come up, shut down ctdb > on the last node, move /var/ctdb out the way, and restart. That brings > them all up with freshly compacted databases. > > Also, from the samba-technical mailing list came the advice to use a > more recent ctdb - specifically, 1.2.61. I've got that built and ready > to go (and a rebuilt samba compiled against it too), but if things prove > to be stable after today's compacting, then we will probably leave it at > that and not deploy this. > > Interesting that 2.0 wasn't suggested for "stable", and that the current > "dev" version is 2.1. > > For reference, here's the start of the thread: > https://lists.samba.org/archive/samba-technical/2013-April/091525.html > > -- > Orlando. > > > >> >> On 12 Apr 2013, at 16:43, Orlando Richards >> wrote: >> >>> On 12/04/13 15:43, Bob Cregan wrote: >>>> Hi Orlando, >>>> We use ctdb/samba for CIFS, and CNFS for NFS >>>> (GPFS version 3.4.0-13) . Current versions are >>>> >>>> ctdb - 1.0.99 >>>> samba 3.5.15 >>>> >>>> Both compiled from source. We have about 300+ users normally. >>>> >>> >>> We have suspicions that 3.6 has put additional "chatter" into the >>> ctdb database stream, which has pushed us over the edge. Barry Evans >>> has found that the clustered locking databases, in particular, prove >>> to be a scalability/usability limit for ctdb. >>> >>> >>>> We have had no issues with this setup apart from CNFS which had 2 or 3 >>>> bad moments over the last year . These have gone away since we have >>>> fixed a bug with our 10G NIC drivers (emulex cards , kernel module >>>> be2net) which lead to occasional dropped packets for jumbo frames. 
>>>> There >>>> have been no issues with samba/ctdb >>>> >>>> The only comment I can make is that during initial investigations into >>>> an upgrade of samba to 3.6.x we discovered that the 3.6 code would not >>>> compile against ctdb 1.0.99 (compilation requires tthe ctdb source ) >>>> with error messages like: >>>> >>>> configure: checking whether cluster support is available >>>> checking for ctdb.h... yes >>>> checking for ctdb_private.h... yes >>>> checking for CTDB_CONTROL_TRANS3_COMMIT declaration... yes >>>> checking for CTDB_CONTROL_SCHEDULE_FOR_DELETION declaration... no >>>> configure: error: "cluster support not available: support for >>>> SCHEDULE_FOR_DELETION control missing" >>>> >>>> >>>> What occurs to me is that this message seems to indicate that it is >>>> possible to run a ctdb version that is incompatible with samba 3.6. >>>> That would imply that an upgrade to a higher version of ctdb might >>>> help, of course it might not and make backing out harder. >>> >>> Certainly 1.0.114 builds fine - I've not tried 2.0, I'm too scared! >>> The versioning in CTDB has proved hard for me to fathom... >>> >>>> >>>> A compile against ctdb 2.0 works fine. We will soon be running in this >>>> upgrade, but I'm waiting to see what the samba people say at the UG >>>> meeting first! >>>> >>> >>> It has to be said - the timing is good! >>> Cheers, >>> Orlando >>> >>>> >>>> Thanks >>>> >>>> Bob >>>> >>>> >>>> On 12 April 2013 13:37, Orlando Richards >>> > wrote: >>>> >>>> Hi folks, ac >>>> >>>> We've long been using CTDB and Samba for our NAS service, servicing >>>> ~500 users. We've been suffering from some problems with the CTDB >>>> performance over the last few weeks, likely triggered either by an >>>> upgrade of samba from 3.5 to 3.6 (and enabling of SMB2 as a >>>> result), >>>> or possibly by additional users coming on with a new workload. >>>> >>>> We run CTDB 1.0.114.4-1 (from sernet) and samba3-3.6.12-44 (again, >>>> from sernet). Before we roll back, we'd like to make sure we can't >>>> fix the problem and stick with Samba 3.6 (and we don't even know >>>> that a roll back would fix the issue). >>>> >>>> The symptoms are a complete freeze of the service for CIFS users >>>> for >>>> 10-60 seconds, and on the servers a corresponding spawning of large >>>> numbers of CTDB processes, which seem to be created in a "big >>>> bang", >>>> and then do what they do and exit in the subsequent 10-60 seconds. >>>> >>>> We also serve up NFS from the same ctdb-managed frontends, and GPFS >>>> from the cluster - and these are both fine throughout. >>>> >>>> This was happening 5-10 times per hour, not at exact intervals >>>> though. When we added a third node to the CTDB cluster, it "got >>>> worse", and when we dropped the CTDB cluster down to a single node >>>> and everything started behaving fine - which is where we are now. >>>> >>>> So, I've got a bunch of questions! >>>> >>>> - does anyone know why ctdb would be spawning these processes, >>>> and >>>> if there's anything we can do to stop it needing to do it? >>>> - has anyone done any more general performance / config >>>> optimisation of CTDB? >>>> >>>> And - more generally - does anyone else actually use >>>> ctdb/samba/gpfs >>>> on the scale of ~500 users or higher? If so - how do you find it? 
>>>> >>>> >>>> -- >>>> -- >>>> Dr Orlando Richards >>>> Information Services >>>> IT Infrastructure Division >>>> Unix Section >>>> Tel: 0131 650 4994 >>>> >>>> The University of Edinburgh is a charitable body, registered in >>>> Scotland, with registration number SC005336. >>>> _________________________________________________ >>>> gpfsug-discuss mailing list >>>> gpfsug-discuss at gpfsug.org >>>> http://gpfsug.org/mailman/__listinfo/gpfsug-discuss >>>> >>>> >>>> >>>> >>>> >>>> -- >>>> >>>> Bob Cregan >>>> >>>> Senior Storage Systems Administrator >>>> >>>> ACRC >>>> >>>> Bristol University >>>> >>>> Tel: +44 (0) 117 331 4406 >>>> >>>> skype: bobcregan >>>> >>>> Mobile: +44 (0) 7712388129 >>>> >>> >>> >>> -- >>> -- >>> Dr Orlando Richards >>> Information Services >>> IT Infrastructure Division >>> Unix Section >>> Tel: 0131 650 4994 >>> >>> The University of Edinburgh is a charitable body, registered in >>> Scotland, with registration number SC005336. >>> _______________________________________________ >>> gpfsug-discuss mailing list >>> gpfsug-discuss at gpfsug.org >>> http://gpfsug.org/mailman/listinfo/gpfsug-discuss >> >> _______________________________________________ >> gpfsug-discuss mailing list >> gpfsug-discuss at gpfsug.org >> http://gpfsug.org/mailman/listinfo/gpfsug-discuss >> > > -- -- Dr Orlando Richards Information Services IT Infrastructure Division Unix Section Tel: 0131 650 4994 The University of Edinburgh is a charitable body, registered in Scotland, with registration number SC005336. From orlando.richards at ed.ac.uk Mon Apr 22 15:52:55 2013 From: orlando.richards at ed.ac.uk (Orlando Richards) Date: Mon, 22 Apr 2013 15:52:55 +0100 Subject: [gpfsug-discuss] CTDB woes In-Reply-To: <516E79C8.8090603@ed.ac.uk> References: <51680020.4040509@ed.ac.uk> <51682BB0.7010507@ed.ac.uk> <271DA6EE-D64D-4DBC-9DFE-4335E55102D4@gmail.com> <516BCE5F.8010309@ed.ac.uk> <516E79C8.8090603@ed.ac.uk> Message-ID: <51754EC7.8000600@ed.ac.uk> On 17/04/13 11:30, Orlando Richards wrote: > Hi All - an update to this, > > After re-initialising the databases on Monday, things did seem to be > running better, but ultimately we got back to suffering from spikes in > ctdb processes and corresponding "pauses" in service. We fell back to a > single node again for Tuesday (and things were stable once again), and > this morning rolled out CTDB 1.2.61 (plus a 3.6.12 samba which was > rebuilt against CTDB 1.2.61 headers). > > Things seem to be stable for now - more so than on Monday. > > For the record - one metric I'm watching is the number of ctdb processes > running (this would spike to > 1000 under the failure conditions). It's > currently sitting consistently at 3 processes, with occasional blips of > 5-7 processes. > Hi all, Looks like things have been running fine since we upgraded ctdb last Wednesday, so I think it's safe to say that we've found a fix for our problem in CTDB 1.2.61. Thanks for all the input! If anyone wants more info, feel free to get in touch. -- Orlando > -- > Orlando > > > > > > On 15/04/13 10:54, Orlando Richards wrote: >> On 12/04/13 19:44, Vic Cornell wrote: >>> Have you tried putting the ctdb files onto a separate gpfs filesystem? >> >> No - but considered it. However, the only "live" CTDB file that sits on >> GPFS is the reclock file, which - I think - is only used as the >> heartbeat between nodes and for the recovery process. Now, there's >> mileage in insulating that, certainly, but I don't think that's what >> we're suffering from here. 
>> >> On a positive note - we took the steps this morning to re-initialise the >> ctdb databases from current data, and things seem to be stable today so >> far. >> >> Basically - shut down ctdb on all but one node. On all but that node, do: >> mv /var/ctdb/ /var/ctdb.save.date >> >> then start up ctdb on those nodes. Once they've come up, shut down ctdb >> on the last node, move /var/ctdb out the way, and restart. That brings >> them all up with freshly compacted databases. >> >> Also, from the samba-technical mailing list came the advice to use a >> more recent ctdb - specifically, 1.2.61. I've got that built and ready >> to go (and a rebuilt samba compiled against it too), but if things prove >> to be stable after today's compacting, then we will probably leave it at >> that and not deploy this. >> >> Interesting that 2.0 wasn't suggested for "stable", and that the current >> "dev" version is 2.1. >> >> For reference, here's the start of the thread: >> https://lists.samba.org/archive/samba-technical/2013-April/091525.html >> >> -- >> Orlando. >> >> >> >>> >>> On 12 Apr 2013, at 16:43, Orlando Richards >>> wrote: >>> >>>> On 12/04/13 15:43, Bob Cregan wrote: >>>>> Hi Orlando, >>>>> We use ctdb/samba for CIFS, and CNFS for NFS >>>>> (GPFS version 3.4.0-13) . Current versions are >>>>> >>>>> ctdb - 1.0.99 >>>>> samba 3.5.15 >>>>> >>>>> Both compiled from source. We have about 300+ users normally. >>>>> >>>> >>>> We have suspicions that 3.6 has put additional "chatter" into the >>>> ctdb database stream, which has pushed us over the edge. Barry Evans >>>> has found that the clustered locking databases, in particular, prove >>>> to be a scalability/usability limit for ctdb. >>>> >>>> >>>>> We have had no issues with this setup apart from CNFS which had 2 or 3 >>>>> bad moments over the last year . These have gone away since we have >>>>> fixed a bug with our 10G NIC drivers (emulex cards , kernel module >>>>> be2net) which lead to occasional dropped packets for jumbo frames. >>>>> There >>>>> have been no issues with samba/ctdb >>>>> >>>>> The only comment I can make is that during initial investigations into >>>>> an upgrade of samba to 3.6.x we discovered that the 3.6 code would not >>>>> compile against ctdb 1.0.99 (compilation requires tthe ctdb source ) >>>>> with error messages like: >>>>> >>>>> configure: checking whether cluster support is available >>>>> checking for ctdb.h... yes >>>>> checking for ctdb_private.h... yes >>>>> checking for CTDB_CONTROL_TRANS3_COMMIT declaration... yes >>>>> checking for CTDB_CONTROL_SCHEDULE_FOR_DELETION declaration... no >>>>> configure: error: "cluster support not available: support for >>>>> SCHEDULE_FOR_DELETION control missing" >>>>> >>>>> >>>>> What occurs to me is that this message seems to indicate that it is >>>>> possible to run a ctdb version that is incompatible with samba 3.6. >>>>> That would imply that an upgrade to a higher version of ctdb might >>>>> help, of course it might not and make backing out harder. >>>> >>>> Certainly 1.0.114 builds fine - I've not tried 2.0, I'm too scared! >>>> The versioning in CTDB has proved hard for me to fathom... >>>> >>>>> >>>>> A compile against ctdb 2.0 works fine. We will soon be running in this >>>>> upgrade, but I'm waiting to see what the samba people say at the UG >>>>> meeting first! >>>>> >>>> >>>> It has to be said - the timing is good! 
>>>> Cheers, >>>> Orlando >>>> >>>>> >>>>> Thanks >>>>> >>>>> Bob >>>>> >>>>> >>>>> On 12 April 2013 13:37, Orlando Richards >>>> > wrote: >>>>> >>>>> Hi folks, ac >>>>> >>>>> We've long been using CTDB and Samba for our NAS service, >>>>> servicing >>>>> ~500 users. We've been suffering from some problems with the CTDB >>>>> performance over the last few weeks, likely triggered either by an >>>>> upgrade of samba from 3.5 to 3.6 (and enabling of SMB2 as a >>>>> result), >>>>> or possibly by additional users coming on with a new workload. >>>>> >>>>> We run CTDB 1.0.114.4-1 (from sernet) and samba3-3.6.12-44 (again, >>>>> from sernet). Before we roll back, we'd like to make sure we can't >>>>> fix the problem and stick with Samba 3.6 (and we don't even know >>>>> that a roll back would fix the issue). >>>>> >>>>> The symptoms are a complete freeze of the service for CIFS users >>>>> for >>>>> 10-60 seconds, and on the servers a corresponding spawning of >>>>> large >>>>> numbers of CTDB processes, which seem to be created in a "big >>>>> bang", >>>>> and then do what they do and exit in the subsequent 10-60 seconds. >>>>> >>>>> We also serve up NFS from the same ctdb-managed frontends, and >>>>> GPFS >>>>> from the cluster - and these are both fine throughout. >>>>> >>>>> This was happening 5-10 times per hour, not at exact intervals >>>>> though. When we added a third node to the CTDB cluster, it "got >>>>> worse", and when we dropped the CTDB cluster down to a single node >>>>> and everything started behaving fine - which is where we are now. >>>>> >>>>> So, I've got a bunch of questions! >>>>> >>>>> - does anyone know why ctdb would be spawning these processes, >>>>> and >>>>> if there's anything we can do to stop it needing to do it? >>>>> - has anyone done any more general performance / config >>>>> optimisation of CTDB? >>>>> >>>>> And - more generally - does anyone else actually use >>>>> ctdb/samba/gpfs >>>>> on the scale of ~500 users or higher? If so - how do you find it? >>>>> >>>>> >>>>> -- >>>>> -- >>>>> Dr Orlando Richards >>>>> Information Services >>>>> IT Infrastructure Division >>>>> Unix Section >>>>> Tel: 0131 650 4994 >>>>> >>>>> The University of Edinburgh is a charitable body, registered in >>>>> Scotland, with registration number SC005336. >>>>> _________________________________________________ >>>>> gpfsug-discuss mailing list >>>>> gpfsug-discuss at gpfsug.org >>>>> http://gpfsug.org/mailman/__listinfo/gpfsug-discuss >>>>> >>>>> >>>>> >>>>> >>>>> >>>>> -- >>>>> >>>>> Bob Cregan >>>>> >>>>> Senior Storage Systems Administrator >>>>> >>>>> ACRC >>>>> >>>>> Bristol University >>>>> >>>>> Tel: +44 (0) 117 331 4406 >>>>> >>>>> skype: bobcregan >>>>> >>>>> Mobile: +44 (0) 7712388129 >>>>> >>>> >>>> >>>> -- >>>> -- >>>> Dr Orlando Richards >>>> Information Services >>>> IT Infrastructure Division >>>> Unix Section >>>> Tel: 0131 650 4994 >>>> >>>> The University of Edinburgh is a charitable body, registered in >>>> Scotland, with registration number SC005336. 
>>>> _______________________________________________ >>>> gpfsug-discuss mailing list >>>> gpfsug-discuss at gpfsug.org >>>> http://gpfsug.org/mailman/listinfo/gpfsug-discuss >>> >>> _______________________________________________ >>> gpfsug-discuss mailing list >>> gpfsug-discuss at gpfsug.org >>> http://gpfsug.org/mailman/listinfo/gpfsug-discuss >>> >> >> > > -- -- Dr Orlando Richards Information Services IT Infrastructure Division Unix Section Tel: 0131 650 4994 The University of Edinburgh is a charitable body, registered in Scotland, with registration number SC005336. From pete at realisestudio.com Thu Apr 25 10:38:07 2013 From: pete at realisestudio.com (Pete Smith) Date: Thu, 25 Apr 2013 10:38:07 +0100 Subject: [gpfsug-discuss] Test cluster - some questions Message-ID: Hi all Good to see lots of you at the user group meeting yesterday. Great work, Jez! We're setting up a test cluster here at Realise, with a view to moving our main storage over from Gluster. We're running the test cluster on Isilon hardware ... a couple of 1920 nodes that we were using for home dirs. Each node has dual gigabit ethernet ports, and dual infiniband ports. Single dual-core Xeon proc and and 4GB RAM. All good stuff and should make a nice test rig. I have a few questions! 1. We're on centos6.4.x86_64. What's the easiest way to go from 3.3.blah to 3.5? 2. I'm having trouble assigning NSDs. I have a descfile which looks like: #DiskName:PrimaryServer:BackupServer:DiskUsage:FailureGroup:DesiredName:StoragePool /dev/sdc1:gpfs001.realisestudio.com::dataAndMetadata:1 but the command "mmcrnsd -F /tmp/descfile -v no" just craps out with mmcrnsd: Processing disk sdc1 mmcrnsd: Node gpfs001.realisestudio.com does not have a GPFS server license designation. mmcrnsd: Error found while checking disk descriptor /dev/sdc1:gpfs001.realisestudio.com::dataAndMetadata:1 mmcrnsd: Command failed. Examine previous error messages to determine cause. Any help pointing me gently in the right direction would be much appreciated. :-) TIA -- Pete Smith DevOp/System Administrator Realise Studio 12/13 Poland Street, London W1F 8QB T. +44 (0)20 7165 9644 realisestudio.com -------------- next part -------------- An HTML attachment was scrubbed... URL: From orlando.richards at ed.ac.uk Thu Apr 25 10:48:30 2013 From: orlando.richards at ed.ac.uk (Orlando Richards) Date: Thu, 25 Apr 2013 10:48:30 +0100 Subject: [gpfsug-discuss] Test cluster - some questions In-Reply-To: References: Message-ID: <5178FBEE.4070200@ed.ac.uk> On 25/04/13 10:38, Pete Smith wrote: > Hi all > > Good to see lots of you at the user group meeting yesterday. Great work, > Jez! > > We're setting up a test cluster here at Realise, with a view to moving > our main storage over from Gluster. > > We're running the test cluster on Isilon hardware ... a couple of 1920 > nodes that we were using for home dirs. Each node has dual gigabit > ethernet ports, and dual infiniband ports. Single dual-core Xeon proc > and and 4GB RAM. All good stuff and should make a nice test rig. > > I have a few questions! > > 1. We're on centos6.4.x86_64. What's the easiest way to go from > 3.3.blah to 3.5? > 2. I'm having trouble assigning NSDs. 
I have a descfile which looks like: > > #DiskName:PrimaryServer:BackupServer:DiskUsage:FailureGroup:DesiredName:StoragePool > /dev/sdc1:gpfs001.realisestudio.com::dataAndMetadata:1 > > but the command > > "mmcrnsd -F /tmp/descfile -v no" > > just craps out with > > mmcrnsd: Processing disk sdc1 > mmcrnsd: Node gpfs001.realisestudio.com > does not have a GPFS server license > designation. > mmcrnsd: Error found while checking disk descriptor > /dev/sdc1:gpfs001.realisestudio.com::dataAndMetadata:1 > mmcrnsd: Command failed. Examine previous error messages to determine > cause. > mmchlicense server -N gpfs001.realisestudio.com should sort that one out. > Any help pointing me gently in the right direction would be much > appreciated. :-) > > TIA > > -- > Pete Smith > DevOp/System Administrator > Realise Studio > 12/13 Poland Street, London W1F 8QB > T. +44 (0)20 7165 9644 > > realisestudio.com > > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at gpfsug.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > -- -- Dr Orlando Richards Information Services IT Infrastructure Division Unix Section Tel: 0131 650 4994 The University of Edinburgh is a charitable body, registered in Scotland, with registration number SC005336. From pete at realisestudio.com Thu Apr 25 11:05:36 2013 From: pete at realisestudio.com (Pete Smith) Date: Thu, 25 Apr 2013 11:05:36 +0100 Subject: [gpfsug-discuss] Test cluster - some questions In-Reply-To: <5178FBEE.4070200@ed.ac.uk> References: <5178FBEE.4070200@ed.ac.uk> Message-ID: Thanks Orlando. Much appreciated. On 25 April 2013 10:48, Orlando Richards wrote: > On 25/04/13 10:38, Pete Smith wrote: > >> Hi all >> >> Good to see lots of you at the user group meeting yesterday. Great work, >> Jez! >> >> We're setting up a test cluster here at Realise, with a view to moving >> our main storage over from Gluster. >> >> We're running the test cluster on Isilon hardware ... a couple of 1920 >> nodes that we were using for home dirs. Each node has dual gigabit >> ethernet ports, and dual infiniband ports. Single dual-core Xeon proc >> and and 4GB RAM. All good stuff and should make a nice test rig. >> >> I have a few questions! >> >> 1. We're on centos6.4.x86_64. What's the easiest way to go from >> 3.3.blah to 3.5? >> 2. I'm having trouble assigning NSDs. I have a descfile which looks like: >> >> #DiskName:PrimaryServer:**BackupServer:DiskUsage:** >> FailureGroup:DesiredName:**StoragePool >> /dev/sdc1:gpfs001.**realisestudio.com::**dataAndMetadata:1 >> >> but the command >> >> "mmcrnsd -F /tmp/descfile -v no" >> >> just craps out with >> >> mmcrnsd: Processing disk sdc1 >> mmcrnsd: Node gpfs001.realisestudio.com >> > >> does not have a GPFS server license >> designation. >> mmcrnsd: Error found while checking disk descriptor >> /dev/sdc1:gpfs001.**realisestudio.com::**dataAndMetadata:1 >> mmcrnsd: Command failed. Examine previous error messages to determine >> cause. >> >> > mmchlicense server -N gpfs001.realisestudio.com should sort that one out. > > > Any help pointing me gently in the right direction would be much >> appreciated. :-) >> >> TIA >> >> -- >> Pete Smith >> DevOp/System Administrator >> Realise Studio >> 12/13 Poland Street, London W1F 8QB >> T. 
+44 (0)20 7165 9644 >> >> realisestudio.com >> >> >> ______________________________**_________________ >> gpfsug-discuss mailing list >> gpfsug-discuss at gpfsug.org >> http://gpfsug.org/mailman/**listinfo/gpfsug-discuss >> >> > > -- > -- > Dr Orlando Richards > Information Services > IT Infrastructure Division > Unix Section > Tel: 0131 650 4994 > > The University of Edinburgh is a charitable body, registered in Scotland, > with registration number SC005336. > ______________________________**_________________ > gpfsug-discuss mailing list > gpfsug-discuss at gpfsug.org > http://gpfsug.org/mailman/**listinfo/gpfsug-discuss > -- Pete Smith DevOp/System Administrator Realise Studio 12/13 Poland Street, London W1F 8QB T. +44 (0)20 7165 9644 realisestudio.com -------------- next part -------------- An HTML attachment was scrubbed... URL: From pete at realisestudio.com Fri Apr 26 16:06:38 2013 From: pete at realisestudio.com (Pete Smith) Date: Fri, 26 Apr 2013 16:06:38 +0100 Subject: [gpfsug-discuss] GPS Native RAID on linux? Message-ID: Hi I thought from the presentation that this was available on linux ... but documentation seems to indicate IBM GSS only? -- Pete Smith DevOp/System Administrator Realise Studio 12/13 Poland Street, London W1F 8QB T. +44 (0)20 7165 9644 realisestudio.com -------------- next part -------------- An HTML attachment was scrubbed... URL: From stuartb at 4gh.net Tue Apr 30 21:50:38 2013 From: stuartb at 4gh.net (Stuart Barkley) Date: Tue, 30 Apr 2013 16:50:38 -0400 (EDT) Subject: [gpfsug-discuss] Test cluster - some questions In-Reply-To: References: Message-ID: On Thu, 25 Apr 2013 at 05:38 -0000, Pete Smith wrote: > 1. We're on centos6.4.x86_64. What's the easiest way to go from > 3.3.blah to 3.5? We are in transition to 3.5 on our original GPFS installation. Two of four servers are now at GPFS 3.4.XX/CentOS 6.4. Two servers are still at 3.3.YY/CentOS 5.4. The compute nodes are all to 3.4.XX/CentOS 6.4. The data center is remotely located and it is a pain to get physical access. Once we get the last two nodes upgraded, we expect to go to GPFS 3.5 fairly quickly (we already have 3.5 running on a newer GPFS installation). My understanding is that you need to step through 3.4 during a migration from 3.3 to 3.5. Stuart -- I've never been lost; I was once bewildered for three days, but never lost! -- Daniel Boone
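Picking up Stuart's point about stepping through 3.4: the per-hop procedure is roughly the same for 3.3 -> 3.4 and for 3.4 -> 3.5. The sketch below is untested here and only paraphrases the usual GPFS migration steps -- package names, the portability-layer build commands and the exact options should be checked against the Concepts, Planning and Installation Guide for the release being moved to:

   # on each node in turn (rolling upgrade), for each hop in the chain:
   mmshutdown                        # stop GPFS on this node
   rpm -Uvh gpfs.base-*.rpm gpfs.gpl-*.rpm gpfs.msg.en_US-*.rpm gpfs.docs-*.rpm
   cd /usr/lpp/mmfs/src              # rebuild the portability layer for the running kernel
   make Autoconfig && make World && make InstallImages
   mmstartup

   # only once every node in the cluster is running the new level:
   mmchconfig release=LATEST         # commit the cluster to the new release level
   mmchfs <filesystem> -V full       # enable the new on-disk format (not reversible)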