[gpfsug-discuss] data integrity documentation

Stijn De Weirdt stijn.deweirdt at ugent.be
Wed Aug 2 22:14:45 BST 2017


hi sven,

> before i answer the rest of your questions, can you share what version of
> GPFS exactly you are on? mmfsadm dump version would be the best source for
> that.
it returns
Build branch "4.2.3.3 ".
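fwiw, this is roughly how i pulled it; mmfsadm is the unsupported diagnostic
tool you mention, and mmdiag --version should report similar build info
through a supported command:

  # exact build level from the daemon's internal dump
  mmfsadm dump version | grep -i build
  # supported alternative with similar information
  mmdiag --version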

> if you have 2 inodes and you know the exact address of where they are
> stored on disk, one could 'dd' them off the disk and compare if they are
> really equal.
ok, i can try that later. are you suggesting that the "tsdbfs comp" might
give wrong results? because we ran that and got e.g.

> # tsdbfs somefs comp 7:5137408 25:221785088 1024
> Comparing 1024 sectors at 7:5137408 = 0x7:4E6400 and 25:221785088 = 0x19:D382C00:
>   All sectors identical

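for the dd comparison, i'd roughly try the sketch below; the device paths are
placeholders (mmlsnsd -m should give the real NSD-to-device mapping per
node), and i'm assuming the tsdbfs addresses are 512-byte sectors:

  # map nsd ids 7 and 25 to their local block devices first
  mmlsnsd -m
  # read the 1024 sectors of each replica and compare them byte for byte
  dd if=/dev/mapper/nsd07 bs=512 skip=5137408 count=1024 of=/tmp/replica_a
  dd if=/dev/mapper/nsd25 bs=512 skip=221785088 count=1024 of=/tmp/replica_b
  cmp /tmp/replica_a /tmp/replica_b && echo "replicas identical"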

> we only support checksums when you use GNR based systems; they cover the
> network as well as the disk side.
> the nsdchecksum code you refer to is the one i mentioned above; that's only
> supported with GNR. at least i am not aware that we ever claimed it to be
> supported outside of it, but i can check that.
ok, maybe i'm a bit confused. we have a GNR too, but it's not this one,
and they are not in the same gpfs cluster.

i thought the GNR extended the checksumming to disk, and that it was
already there for the network part. thanks for clearing this up. but
that is worse than i thought...
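(to double check which cluster actually has the GNR bits, i'll rely on
mmlsrecoverygroup; as far as i understand it only lists recovery groups on
GNR/ESS building blocks, so on the plain NSD cluster it should come back
empty:)

  # recovery groups only exist on GNR/ESS servers
  mmlsrecoverygroup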

stijn

> 
> sven
> 
> On Wed, Aug 2, 2017 at 12:20 PM Stijn De Weirdt <stijn.deweirdt at ugent.be>
> wrote:
> 
>> hi sven,
>>
>> the data is not corrupted. mmfsck compares 2 inodes, says they don't
>> match, but checking the data with tsdbfs reveals they are equal.
>> (one replica has to be fetched over the network; the nsds cannot access
>> all disks)
>>
>> with some nsdChksum... settings we get, during this mmfsck, a lot of
>> "Encountered XYZ checksum errors on network I/O to NSD Client disk"
>>
>> ibm support says these are hardware issues, but that wrt mmfsck they are
>> false positives.
>>
>> anyway, our current question is: if these are hardware issues, is there
>> anything in the gpfs client->nsd path (on the network side) that would
>> detect such errors? i.e. can we trust the data (and metadata)?
>> i was under the impression that client to disk is not covered, but i
>> assumed that at least client to nsd (the network part) was checksummed.
>>
>> stijn
>>
>>
>> On 08/02/2017 09:10 PM, Sven Oehme wrote:
>>> ok, i think i understand now, the data was already corrupted. the config
>>> change i proposed only prevents a known potential future on-the-wire
>>> corruption; it will not fix something that already made it to disk.
>>>
>>> Sven
>>>
>>>
>>>
>>> On Wed, Aug 2, 2017 at 11:53 AM Stijn De Weirdt <stijn.deweirdt at ugent.be>
>>> wrote:
>>>
>>>> yes ;)
>>>>
>>>> the system is in preproduction, so nothing that can't be stopped/started
>>>> in a few minutes (the current setup has only 4 nsds, and no clients).
>>>> mmfsck triggers the errors very early, during the inode replica compare.
>>>>
>>>>
>>>> stijn
>>>>
>>>> On 08/02/2017 08:47 PM, Sven Oehme wrote:
>>>>> How can you reproduce this so quickly?
>>>>> Did you restart all daemons after that?
>>>>>
>>>>> On Wed, Aug 2, 2017, 11:43 AM Stijn De Weirdt <stijn.deweirdt at ugent.be>
>>>>> wrote:
>>>>>
>>>>>> hi sven,
>>>>>>
>>>>>>
>>>>>>> the very first thing you should check is if you have this setting
>>>>>>> set:
>>>>>> maybe the very first thing to check should be the faq/wiki that has
>>>>>> this documented?
>>>>>>
>>>>>>>
>>>>>>> mmlsconfig envVar
>>>>>>>
>>>>>>> envVar MLX4_POST_SEND_PREFER_BF 0 MLX4_USE_MUTEX 1 MLX5_SHUT_UP_BF 1
>>>>>>> MLX5_USE_MUTEX 1
>>>>>>>
>>>>>>> if that doesn't come back the way shown above, you need to set it:
>>>>>>>
>>>>>>> mmchconfig envVar="MLX4_POST_SEND_PREFER_BF=0 MLX5_SHUT_UP_BF=1
>>>>>>> MLX5_USE_MUTEX=1 MLX4_USE_MUTEX=1"
>>>>>> i just set this (wasn't set before), but problem is still present.
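(side note while re-reading this: the envVar values are only passed to mmfsd
when it starts, so after the mmchconfig the daemons need a restart to pick
them up; roughly as below, where "somenodes" is just a placeholder for the
affected nodes or node class:)

  # check that the values are now set
  mmlsconfig envVar
  # restart gpfs so mmfsd picks up the new environment
  mmshutdown -N somenodes && mmstartup -N somenodes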
>>>>>>
>>>>>>>
>>>>>>> there was a problem in the Mellanox FW in various versions that was
>>>>>>> never completely addressed (bugs were found and fixed, but it was
>>>>>>> never fully proven to be addressed). the above environment variables
>>>>>>> turn on code in the mellanox driver that prevents this potential code
>>>>>>> path from being used to begin with.
>>>>>>>
>>>>>>> in Spectrum Scale 4.2.4 (not yet released) we added a workaround in
>>>>>>> Scale so that even if you don't set these variables the problem can't
>>>>>>> happen anymore. until then the only choice you have is the envVar
>>>>>>> above (which btw ships as default on all ESS systems).
>>>>>>>
>>>>>>> you also should be on the latest available Mellanox FW & drivers, as
>>>>>>> not all versions even have the code that is activated by the
>>>>>>> environment variables above. i think at a minimum you need to be at
>>>>>>> 3.4, but i don't remember the exact version. There have been multiple
>>>>>>> defects opened around this area; the last one i remember was:
>>>>>> we run mlnx ofed 4.1. the fw is not the latest: we have edr cards from
>>>>>> dell, and their fw is a bit behind. i'm trying to convince dell to make
>>>>>> a new one. mellanox used to allow you to make your own, but they don't
>>>>>> anymore.
>>>>>>
>>>>>>>
>>>>>>> 00154843 : ESS ConnectX-3 performance issue - spinning on
>>>>>>> pthread_spin_lock
>>>>>>>
>>>>>>> you may ask your mellanox representative if they can get you access
>>>>>>> to this defect. while it was found on ESS, i.e. on PPC64 and with
>>>>>>> ConnectX-3 cards, it is a general issue that affects all cards, on
>>>>>>> intel as well as Power.
>>>>>> ok, thanks for this. maybe such a reference is enough for dell to
>>>>>> update their firmware.
>>>>>>
>>>>>> stijn
>>>>>>
>>>>>>>
>>>>>>> On Wed, Aug 2, 2017 at 8:58 AM Stijn De Weirdt <stijn.deweirdt at ugent.be>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> hi all,
>>>>>>>>
>>>>>>>> is there any documentation wrt data integrity in spectrum scale:
>>>>>>>> assuming a crappy network, does gpfs guarantee somehow that data
>>>>>>>> written by a client ends up safe in the nsd gpfs daemon; and
>>>>>>>> similarly from the nsd gpfs daemon to disk?
>>>>>>>>
>>>>>>>> and wrt a crappy network, what about rdma on a crappy network? is it
>>>>>>>> the same?
>>>>>>>>
>>>>>>>> (we are hunting down a crappy infiniband issue; ibm support says it's
>>>>>>>> a network issue; and we see no errors anywhere...)
>>>>>>>>
>>>>>>>> thanks a lot,
>>>>>>>>
>>>>>>>> stijn
> _______________________________________________
> gpfsug-discuss mailing list
> gpfsug-discuss at spectrumscale.org
> http://gpfsug.org/mailman/listinfo/gpfsug-discuss
> 



More information about the gpfsug-discuss mailing list