[gpfsug-discuss] data integrity documentation

Stijn De Weirdt stijn.deweirdt at ugent.be
Wed Aug 2 22:47:36 BST 2017


hi sven,

> ok, you can't be any newer than that. i just wonder why you have 512b
> inodes if this is a new system?
because we rsynced 100M files to it ;) it's supposed to replace another
system.

> are these raw disks in this setup, or raid controllers?

raid (DDP on MD3460)
> what's the disk sector size?
euhm, you mean the luns?
for metadata disks (SSD in raid 1):
> # parted /dev/mapper/f1v01e0g0_Dm01o0
> GNU Parted 3.1
> Using /dev/mapper/f1v01e0g0_Dm01o0
> Welcome to GNU Parted! Type 'help' to view a list of commands.
> (parted) p                                                                
> Model: Linux device-mapper (multipath) (dm)
> Disk /dev/mapper/f1v01e0g0_Dm01o0: 219GB
> Sector size (logical/physical): 512B/512B
> Partition Table: gpt
> Disk Flags: 
> 
> Number  Start   End    Size   File system  Name   Flags
>  1      24.6kB  219GB  219GB               GPFS:  hidden

for data disks (DDP):
> [root at nsd01 ~]# parted /dev/mapper/f1v01e0p0_S17o0
> GNU Parted 3.1
> Using /dev/mapper/f1v01e0p0_S17o0
> Welcome to GNU Parted! Type 'help' to view a list of commands.
> (parted) p                                                                
> Model: Linux device-mapper (multipath) (dm)
> Disk /dev/mapper/f1v01e0p0_S17o0: 35.2TB
> Sector size (logical/physical): 512B/4096B
> Partition Table: gpt
> Disk Flags: 
> 
> Number  Start   End     Size    File system  Name   Flags
>  1      24.6kB  35.2TB  35.2TB               GPFS:  hidden
> 
> (parted) q


> and how was the filesystem created (mmlsfs FSNAME would show
> answer to the last question)


> # mmlsfs somefilesystem
> flag                value                    description
> ------------------- ------------------------ -----------------------------------
>  -f                 16384                    Minimum fragment size in bytes (system pool)
>                     262144                   Minimum fragment size in bytes (other pools)
>  -i                 4096                     Inode size in bytes
>  -I                 32768                    Indirect block size in bytes
>  -m                 2                        Default number of metadata replicas
>  -M                 2                        Maximum number of metadata replicas
>  -r                 1                        Default number of data replicas
>  -R                 2                        Maximum number of data replicas
>  -j                 scatter                  Block allocation type
>  -D                 nfs4                     File locking semantics in effect
>  -k                 all                      ACL semantics in effect
>  -n                 850                      Estimated number of nodes that will mount file system
>  -B                 524288                   Block size (system pool)
>                     8388608                  Block size (other pools)
>  -Q                 user;group;fileset       Quotas accounting enabled
>                     user;group;fileset       Quotas enforced
>                     none                     Default quotas enabled
>  --perfileset-quota Yes                      Per-fileset quota enforcement
>  --filesetdf        Yes                      Fileset df enabled?
>  -V                 17.00 (4.2.3.0)          File system version
>  --create-time      Wed May 31 12:54:00 2017 File system creation time
>  -z                 No                       Is DMAPI enabled?
>  -L                 4194304                  Logfile size
>  -E                 No                       Exact mtime mount option
>  -S                 No                       Suppress atime mount option
>  -K                 whenpossible             Strict replica allocation option
>  --fastea           Yes                      Fast external attributes enabled?
>  --encryption       No                       Encryption enabled?
>  --inode-limit      313524224                Maximum number of inodes in all inode spaces
>  --log-replicas     0                        Number of log replicas
>  --is4KAligned      Yes                      is4KAligned?
>  --rapid-repair     Yes                      rapidRepair enabled?
>  --write-cache-threshold 0                   HAWC Threshold (max 65536)
>  --subblocks-per-full-block 32               Number of subblocks per full block
>  -P                 system;MD3260            Disk storage pools in file system
>  -d                 f0v00e0g0_Sm00o0;f0v00e0p0_S00o0;f1v01e0g0_Sm01o0;f1v01e0p0_S01o0;f0v02e0g0_Sm02o0;f0v02e0p0_S02o0;f1v03e0g0_Sm03o0;f1v03e0p0_S03o0;f0v04e0g0_Sm04o0;f0v04e0p0_S04o0;
>  -d                 f1v05e0g0_Sm05o0;f1v05e0p0_S05o0;f0v06e0g0_Sm06o0;f0v06e0p0_S06o0;f1v07e0g0_Sm07o0;f1v07e0p0_S07o0;f0v00e0g0_Sm08o1;f0v00e0p0_S08o1;f1v01e0g0_Sm09o1;f1v01e0p0_S09o1;
>  -d                 f0v02e0g0_Sm10o1;f0v02e0p0_S10o1;f1v03e0g0_Sm11o1;f1v03e0p0_S11o1;f0v04e0g0_Sm12o1;f0v04e0p0_S12o1;f1v05e0g0_Sm13o1;f1v05e0p0_S13o1;f0v06e0g0_Sm14o1;f0v06e0p0_S14o1;
>  -d                 f1v07e0g0_Sm15o1;f1v07e0p0_S15o1;f0v00e0p0_S16o0;f1v01e0p0_S17o0;f0v02e0p0_S18o0;f1v03e0p0_S19o0;f0v04e0p0_S20o0;f1v05e0p0_S21o0;f0v06e0p0_S22o0;f1v07e0p0_S23o0;
>  -d                 f0v00e0p0_S24o1;f1v01e0p0_S25o1;f0v02e0p0_S26o1;f1v03e0p0_S27o1;f0v04e0p0_S28o1;f1v05e0p0_S29o1;f0v06e0p0_S30o1;f1v07e0p0_S31o1  Disks in file system
>  -A                 no                       Automatic mount option
>  -o                 none                     Additional mount options
>  -T                 /scratch          Default mount point
>  --mount-priority   0   



> 
> on the tsdbfs i am not sure if it gave wrong results, but it would be worth
> a test to see what's actually on the disk.
ok. i'll try this tomorrow.

> 
> you are correct that GNR extends this to the disk, but the network part is
> covered by the nsdchecksums you turned on.
> when you enable the not to be named checksum parameter, do you actually
> still get an error from fsck?
hah, no, we don't. mmfsck says the filesystem is clean. we found this
odd, so we already asked ibm support about this but no answer yet.

stijn

> 
> sven
> 
> 
> On Wed, Aug 2, 2017 at 2:14 PM Stijn De Weirdt <stijn.deweirdt at ugent.be>
> wrote:
> 
>> hi sven,
>>
>>> before i answer the rest of your questions, can you share what version of
>>> GPFS exactly you are on? mmfsadm dump version would be the best source for
>>> that.
>> it returns
>> Build branch "4.2.3.3 ".
>>
>>> if you have 2 inodes and you know the exact address of where they are
>>> stored on disk, one could 'dd' them off the disk and compare if they are
>>> really equal.
>> ok, i can try that later. are you suggesting that the "tsdbfs comp"
>> might give wrong results? because we ran that and got e.g.
>>
>>> # tsdbfs somefs comp 7:5137408 25:221785088 1024
>>> Comparing 1024 sectors at 7:5137408 = 0x7:4E6400 and 25:221785088 = 0x19:D382C00:
>>>   All sectors identical
>>
>>
>>> we only support checksums when you use GNR-based systems; they cover the
>>> network as well as the disk side for that.
>>> the nsdchecksum code you refer to is the one i mentioned above that's only
>>> supported with GNR; at least i am not aware that we ever claimed it to be
>>> supported outside of it, but i can check that.
>> ok, maybe i'm a bit confused. we have a GNR too, but it's not this one,
>> and they are not in the same gpfs cluster.
>>
>> i thought the GNR extended the checksumming to disk, and that it was
>> already there for the network part. thanks for clearing this up. but
>> that is worse than i thought...
>>
>> stijn
>>
>>>
>>> sven
>>>
>>> On Wed, Aug 2, 2017 at 12:20 PM Stijn De Weirdt <stijn.deweirdt at ugent.be>
>>> wrote:
>>>
>>>> hi sven,
>>>>
>>>> the data is not corrupted. mmfsck compares 2 inodes, says they don't
>>>> match, but checking the data with tsdbfs reveals they are equal.
>>>> (one replica has to be fetched over the network; the nsds cannot access
>>>> all disks)
>>>>
>>>> with some nsdChksum... settings we get during this mmfsck a lot of
>>>> "Encountered XYZ checksum errors on network I/O to NSD Client disk"
>>>>
>>>> ibm support says these are hardware issues, but wrt mmfsck they are
>>>> false positives.
>>>>
>>>> anyway, our current question is: if these are hardware issues, is there
>>>> anything in gpfs client->nsd (on the network side) that would detect
>>>> such errors? i.e. can we trust the data (and metadata)?
>>>> i was under the impression that client to disk is not covered, but i
>>>> assumed that at least client to nsd (the network part) was checksummed.
>>>>
>>>> stijn
>>>>
>>>>
>>>> On 08/02/2017 09:10 PM, Sven Oehme wrote:
>>>>> ok, i think i understand now, the data was already corrupted. the
>>>>> config change i proposed only prevents a potential known future
>>>>> on-the-wire corruption; this will not fix something that made it to the
>>>>> disk already.
>>>>>
>>>>> Sven
>>>>>
>>>>>
>>>>>
>>>>> On Wed, Aug 2, 2017 at 11:53 AM Stijn De Weirdt <stijn.deweirdt at ugent.be>
>>>>> wrote:
>>>>>
>>>>>> yes ;)
>>>>>>
>>>>>> the system is in preproduction, so nothing that can't be stopped/started
>>>>>> in a few minutes (current setup has only 4 nsds, and no clients).
>>>>>> mmfsck triggers the errors very early during inode replica compare.
>>>>>>
>>>>>>
>>>>>> stijn
>>>>>>
>>>>>> On 08/02/2017 08:47 PM, Sven Oehme wrote:
>>>>>>> How can you reproduce this so quickly?
>>>>>>> Did you restart all daemons after that?
>>>>>>>
>>>>>>> On Wed, Aug 2, 2017, 11:43 AM Stijn De Weirdt <stijn.deweirdt at ugent.be>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> hi sven,
>>>>>>>>
>>>>>>>>
>>>>>>>>> the very first thing you should check is if you have this setting
>>>>>>>>> set:
>>>>>>>> maybe the very first thing to check should be the faq/wiki that has
>>>>>>>> this documented?
>>>>>>>>
>>>>>>>>>
>>>>>>>>> mmlsconfig envVar
>>>>>>>>>
>>>>>>>>> envVar MLX4_POST_SEND_PREFER_BF 0 MLX4_USE_MUTEX 1 MLX5_SHUT_UP_BF 1
>>>>>>>>> MLX5_USE_MUTEX 1
>>>>>>>>>
>>>>>>>>> if that doesn't come back the way above you need to set it :
>>>>>>>>>
>>>>>>>>> mmchconfig envVar="MLX4_POST_SEND_PREFER_BF=0 MLX5_SHUT_UP_BF=1
>>>>>>>>> MLX5_USE_MUTEX=1 MLX4_USE_MUTEX=1"
>>>>>>>> i just set this (wasn't set before), but problem is still present.
>>>>>>>>
>>>>>>>>>
>>>>>>>>> there was a problem in the Mellanox FW in various versions that was
>>>>>>>>> never completely addressed (bugs were found and fixed, but it was
>>>>>>>>> never fully proven to be addressed). the above environment variables
>>>>>>>>> turn code on in the mellanox driver that prevents this potential
>>>>>>>>> code path from being used to begin with.
>>>>>>>>>
>>>>>>>>> in Spectrum Scale 4.2.4 (not yet released) we added a workaround in
>>>>>>>>> Scale so that even if you don't set these variables the problem
>>>>>>>>> can't happen anymore. until then the only choice you have is the
>>>>>>>>> envVar above (which btw ships as default on all ESS systems).
>>>>>>>>>
>>>>>>>>> you also should be on the latest available Mellanox FW & Drivers, as
>>>>>>>>> not all versions even have the code that is activated by the
>>>>>>>>> environment variables above. i think at a minimum you need to be at
>>>>>>>>> 3.4, but i don't remember the exact version. there had been multiple
>>>>>>>>> defects opened around this area, the last one i remember was:
>>>>>>>> we run mlnx ofed 4.1, fw is not the latest, but we have edr cards
>>>>>>>> from dell, and the fw is a bit behind. i'm trying to convince dell to
>>>>>>>> make a new one. mellanox used to allow you to make your own, but they
>>>>>>>> don't anymore.
>>>>>>>>
>>>>>>>>>
>>>>>>>>> 00154843 : ESS ConnectX-3 performance issue - spinning on
>>>>>>>>> pthread_spin_lock
>>>>>>>>>
>>>>>>>>> you may ask your mellanox representative if they can get you access
>>>>>>>>> to this defect. while it was found on ESS, meaning on PPC64 and with
>>>>>>>>> ConnectX-3 cards, it's a general issue that affects all cards, on
>>>>>>>>> intel as well as Power.
>>>>>>>> ok, thanks for this. maybe such a reference is enough for dell to
>>>>>>>> update their firmware.
>>>>>>>>
>>>>>>>> stijn
>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Wed, Aug 2, 2017 at 8:58 AM Stijn De Weirdt <stijn.deweirdt at ugent.be>
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>>> hi all,
>>>>>>>>>>
>>>>>>>>>> is there any documentation wrt data integrity in spectrum scale:
>>>>>>>>>> assuming a crappy network, does gpfs guarantee somehow that data
>>>>>>>>>> written by the client ends up safe in the nsd gpfs daemon; and
>>>>>>>>>> similarly from the nsd gpfs daemon to disk?
>>>>>>>>>>
>>>>>>>>>> and wrt crappy network, what about rdma on crappy network? is it
>>>>>>>>>> the same?
>>>>>>>>>>
>>>>>>>>>> (we are hunting down a crappy infiniband issue; ibm support says
>>>>>>>>>> it's a network issue; and we see no errors anywhere...)
>>>>>>>>>>
>>>>>>>>>> thanks a lot,
>>>>>>>>>>
>>>>>>>>>> stijn