[gpfsug-discuss] Odd behavior with cat followed by grep.

valdis.kletnieks at vt.edu
Wed Feb 14 15:08:10 GMT 2018


On Wed, 14 Feb 2018 06:20:32 -0800, John Hanks said:

> #  ls -aln /srv/gsfs0/projects/pipetest.tmp.txt $HOME/pipetest.tmp.txt
> -rw-r--r-- 1 39073 3953 530721 Feb 14 06:10 /home/griznog/pipetest.tmp.txt
> -rw-r--r-- 1 39073 3001 530721 Feb 14 06:10 /srv/gsfs0/projects/pipetest.tmp.txt
>
> We can "fix" the user case that exposed this by not using a temp file or
> inserting a sleep, but I'd still like to know why GPFS is behaving this way
> and make it stop.

May be related to replication, or other behind-the-scenes behavior.

Consider this example - 4.2.3.6, with data and metadata replication both
set to 2, two sites 95 cable miles apart, each with 3 Dell servers in a full
Fibre Channel mesh to 3 Dell MD34something arrays.

% dd if=/dev/zero bs=1k count=4096 of=sync.test; ls -ls sync.test; sleep 5; ls -ls sync.test; sleep 5; ls -ls sync.test
4096+0 records in
4096+0 records out
4194304 bytes (4.2 MB) copied, 0.0342852 s, 122 MB/s
2048 -rw-r--r-- 1 root root 4194304 Feb 14 09:35 sync.test
8192 -rw-r--r-- 1 root root 4194304 Feb 14 09:35 sync.test
8192 -rw-r--r-- 1 root root 4194304 Feb 14 09:35 sync.test

Notice that the first /bin/ls can't start until the dd has completed - and at
that point the file has only been allocated half the blocks needed to hold the
4 MB of data at one site.  5 seconds later, blocks have been allocated at both
sites, so ls shows the full 8 MB needed for 2 copies.

I've also seen (but haven't replicated it as I write this) a small file (4-8K
or so) show first one full-sized block, then a second full-sized block, and
then drop back to what's needed for two 1/32nd fragments.  That had me
scratching my head.
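
If anyone wants to poke at that one, something along these lines should show the
allocation settling over time (just a rough sketch - the file name, size, and
sleep intervals are arbitrary):

% dd if=/dev/urandom bs=4k count=1 of=frag.test
% ls -ls frag.test; sleep 5; ls -ls frag.test; sleep 5; ls -ls frag.test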

Having said that, that's all metadata fun and games, while your case
appears to have some problems with data integrity (which is a whole lot
scarier).  It would be *really* nice if we understood the problem here.

The scariest part is:

> The first grep | wc -l returns 1, because grep outputs  "Binary file /path/to/
> gpfs/mount/test matches"

which seems to imply that we're failing on semantic consistency.  Basically,
your 'cat' command is completing and closing the file, but a temporally later
open of the same file is reading something other than just the written data.
My first guess is a race condition along these lines: the cat causes a write
through one NSD server, and the first grep results in a read from a *different*
NSD server, which returns the data that *used* to be in the block because the
read happens before the first NSD server has completed the write.
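
If that's the shape of it, something along these lines ought to reproduce it (a
rough sketch - the paths are the ones from your ls output, the direction of the
copy is a guess, and PATTERN stands in for whatever your user was grepping for):

% cat $HOME/pipetest.tmp.txt > /srv/gsfs0/projects/pipetest.tmp.txt
% grep PATTERN /srv/gsfs0/projects/pipetest.tmp.txt | wc -l   # immediate read - the suspect one
% sleep 5
% grep PATTERN /srv/gsfs0/projects/pipetest.tmp.txt | wc -l   # and again after things have settled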

It may be interesting to replace the greps with pairs of 'ls -ls' / 'dd' commands to grab the
raw data and its size, and check the following (there's a rough sketch after the list):

1) Does the size (both blocks allocated and logical length) reported by
ls match the amount of data actually read by the dd?

2) Is the file length as actually read equal to the written length, or does it
overshoot and read all the way to the next block boundary?

3) If the length is correct, what's wrong with the data that's telling grep that
it's a binary file?  (od -cx is your friend here.)

4) If it overshoots, is the remainder all-zeros (good) or does it return semi-random
"what used to be there" data (bad, due to data exposure issues)?

(It's certainly not the most perplexing data consistency issue I've hit in 4 decades - the
winner *has* to be an intermittent data read corruption on a GPFS 3.5 cluster that
had us, IBM, SGI, DDN, and at least one vendor of networking gear all chasing our
tails for 18 months before we finally tracked it down. :)