[gpfsug-discuss] CTDB woes

Orlando Richards orlando.richards at ed.ac.uk
Wed Apr 17 11:30:32 BST 2013


Hi All - an update to this,

After re-initialising the databases on Monday, things did seem to be 
running better, but ultimately we got back to suffering from spikes in 
ctdb processes and corresponding "pauses" in service. We fell back to a 
single node again for Tuesday (and things were stable once again), and 
this morning rolled out CTDB 1.2.61 (plus a 3.6.12 samba which was 
rebuilt against CTDB 1.2.61 headers).
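
For reference, a quick way to confirm that each node is actually
running the new bits after a rollout like this (a sketch - "ctdb" and
"samba3" are the sernet RPM names as installed here, adjust to taste):

   rpm -q ctdb samba3    # installed package versions on this node
   smbd -V               # version of the samba binary actually in use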

Things seem to be stable for now - more so than on Monday.

For the record - one metric I'm watching is the number of ctdb processes 
running (this would spike to > 1000 under the failure conditions). It's 
currently sitting consistently at 3 processes, with occasional blips of 
5-7 processes.
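
If anyone wants to track the same metric, something like the following
does the job - a minimal sketch, assuming all the relevant processes
match "ctdb" by name:

   # log a timestamped count of ctdb processes every 5 seconds
   while true; do
       echo "$(date '+%F %T') $(pgrep -c ctdb)"
       sleep 5
   done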

--
Orlando





On 15/04/13 10:54, Orlando Richards wrote:
> On 12/04/13 19:44, Vic Cornell wrote:
>> Have you tried putting the ctdb files onto a separate gpfs filesystem?
>
> No - but considered it. However, the only "live" CTDB file that sits on
> GPFS is the reclock file, which - I think - is only used as the
> heartbeat between nodes and for the recovery process. Now, there's
> mileage in insulating that, certainly, but I don't think that's what
> we're suffering from here.
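>
> If we did want to insulate it, that would just be a case of pointing
> the recovery lock at a path on a separate filesystem and restarting
> ctdb across the cluster - something like this in the ctdb sysconfig
> (a sketch; the path here is made up):
>
>    # /etc/sysconfig/ctdb
>    CTDB_RECOVERY_LOCK=/gpfs_reclock/.ctdb/reclock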
>
> On a positive note - we took the steps this morning to re-initialise the
> ctdb databases from current data, and things seem to be stable today so
> far.
>
> Basically: shut down ctdb on all but one node. On each of the stopped
> nodes, do:
> mv /var/ctdb/ /var/ctdb.save.date
>
> then start up ctdb on those nodes. Once they've come up, shut down ctdb
> on the last node, move /var/ctdb out of the way, and restart. That
> brings them all up with freshly compacted databases.
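>
> As a script, the rolling re-initialisation looks roughly like this
> (a sketch - it assumes the usual ctdb init script, and the backup
> suffix is just a timestamp):
>
>    # on each node except the last, in turn:
>    service ctdb stop
>    mv /var/ctdb /var/ctdb.save.$(date +%Y%m%d)
>    service ctdb start
>
>    # then, once the others are back up and healthy, repeat the
>    # same three steps on the remaining node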
>
> Also, from the samba-technical mailing list came the advice to use a
> more recent ctdb - specifically, 1.2.61. I've got that built and ready
> to go (and a rebuilt samba compiled against it too), but if things prove
> to be stable after today's compacting, then we will probably leave it at
> that and not deploy this.
>
> Interesting that 2.0 wasn't suggested for "stable", and that the current
> "dev" version is 2.1.
>
> For reference, here's the start of the thread:
> https://lists.samba.org/archive/samba-technical/2013-April/091525.html
>
> --
> Orlando.
>
>
>
>>
>> On 12 Apr 2013, at 16:43, Orlando Richards <orlando.richards at ed.ac.uk>
>> wrote:
>>
>>> On 12/04/13 15:43, Bob Cregan wrote:
>>>> Hi Orlando,
>>>> We use ctdb/samba for CIFS, and CNFS for NFS
>>>> (GPFS version 3.4.0-13). Current versions are:
>>>>
>>>> ctdb - 1.0.99
>>>> samba 3.5.15
>>>>
>>>> Both compiled from source. We have about 300+ users normally.
>>>>
>>>
>>> We have suspicions that 3.6 has put additional "chatter" into the
>>> ctdb database stream, which has pushed us over the edge. Barry Evans
>>> has found that the clustered locking databases, in particular, prove
>>> to be a scalability/usability limit for ctdb.
>>>
>>>
>>>> We have had no issues with this setup apart from CNFS, which had 2 or
>>>> 3 bad moments over the last year. These have gone away since we fixed
>>>> a bug with our 10G NIC drivers (Emulex cards, kernel module be2net)
>>>> which led to occasional dropped packets for jumbo frames. There have
>>>> been no issues with samba/ctdb.
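>>>>
>>>> (For anyone chasing similar symptoms, the per-interface drop
>>>> counters are worth watching - e.g. something along the lines of
>>>>    ethtool -S eth0 | grep -i drop
>>>> on each NIC, looking for counters that climb under load.)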
>>>>
>>>> The only comment I can make is that during initial investigations into
>>>> an upgrade of samba to 3.6.x we discovered that the 3.6 code would not
>>>> compile against ctdb 1.0.99 (compilation requires the ctdb source),
>>>> with error messages like:
>>>>
>>>>   configure: checking whether cluster support is available
>>>> checking for ctdb.h... yes
>>>> checking for ctdb_private.h... yes
>>>> checking for CTDB_CONTROL_TRANS3_COMMIT declaration... yes
>>>> checking for CTDB_CONTROL_SCHEDULE_FOR_DELETION declaration... no
>>>> configure: error: "cluster support not available: support for
>>>> SCHEDULE_FOR_DELETION control missing"
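>>>>
>>>> (For reference, that check fires when configuring samba 3.6 with
>>>> cluster support - along the lines of
>>>>    ./configure --with-cluster-support --with-ctdb=/path/to/ctdb
>>>> - with the exact flags depending on your build.)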
>>>>
>>>>
>>>> What occurs to me is that this message seems to indicate that it is
>>>> possible to run a ctdb version that is incompatible with samba 3.6.
>>>> That would imply that an upgrade to a higher version of ctdb might
>>>> help - though of course it might not, and it would make backing out
>>>> harder.
>>>
>>> Certainly 1.0.114 builds fine - I've not tried 2.0, I'm too scared!
>>> The versioning in CTDB has proved hard for me to fathom...
>>>
>>>>
>>>> A compile against ctdb 2.0 works fine. We will soon be rolling out
>>>> this upgrade, but I'm waiting to see what the samba people say at the
>>>> UG meeting first!
>>>>
>>>
>>> It has to be said - the timing is good!
>>> Cheers,
>>> Orlando
>>>
>>>>
>>>> Thanks
>>>>
>>>> Bob
>>>>
>>>>
>>>> On 12 April 2013 13:37, Orlando Richards <orlando.richards at ed.ac.uk
>>>> <mailto:orlando.richards at ed.ac.uk>> wrote:
>>>>
>>>>     Hi folks,
>>>>
>>>>     We've long been using CTDB and Samba for our NAS service, servicing
>>>>     ~500 users. We've been suffering from some problems with CTDB
>>>>     performance over the last few weeks, likely triggered either by
>>>>     an upgrade of samba from 3.5 to 3.6 (and the enabling of SMB2 as
>>>>     a result), or possibly by additional users coming on with a new
>>>>     workload.
>>>>
>>>>     We run CTDB 1.0.114.4-1 (from sernet) and samba3-3.6.12-44 (again,
>>>>     from sernet). Before we roll back, we'd like to make sure we can't
>>>>     fix the problem and stick with Samba 3.6 (and we don't even know
>>>>     that a roll back would fix the issue).
>>>>
>>>>     The symptoms are a complete freeze of the service for CIFS users
>>>>     for 10-60 seconds, and on the servers a corresponding spawning of
>>>>     large numbers of CTDB processes, which seem to be created in a
>>>>     "big bang", and then do what they do and exit in the subsequent
>>>>     10-60 seconds.
>>>>
>>>>     We also serve up NFS from the same ctdb-managed frontends, and GPFS
>>>>     from the cluster - and these are both fine throughout.
>>>>
>>>>     This was happening 5-10 times per hour, though not at exact
>>>>     intervals. When we added a third node to the CTDB cluster, it
>>>>     "got worse", and when we dropped the CTDB cluster down to a
>>>>     single node, everything started behaving fine - which is where
>>>>     we are now.
>>>>
>>>>     So, I've got a bunch of questions!
>>>>
>>>>       - does anyone know why ctdb would be spawning these processes,
>>>>     and if there's anything we can do to stop it needing to do it?
>>>>       - has anyone done any more general performance / config
>>>>     optimisation of CTDB?
>>>>
>>>>     And - more generally - does anyone else actually use
>>>>     ctdb/samba/gpfs on the scale of ~500 users or higher? If so -
>>>>     how do you find it?
>>>>
>>>>
>>>>
>>>>
>>>> --
>>>>
>>>> Bob Cregan
>>>>
>>>> Senior Storage Systems Administrator
>>>>
>>>> ACRC
>>>>
>>>> Bristol University
>>>>
>>>> Tel:     +44 (0) 117 331 4406
>>>>
>>>> skype:  bobcregan
>>>>
>>>> Mobile: +44 (0) 7712388129
>>>>
>>>
>>>
>
>


-- 
             --
    Dr Orlando Richards
   Information Services
IT Infrastructure Division
        Unix Section
     Tel: 0131 650 4994

The University of Edinburgh is a charitable body, registered in 
Scotland, with registration number SC005336.


