[gpfsug-discuss] Bizarre fcntl locking behavior

Aaron Knister aaron.s.knister at nasa.gov
Thu Dec 6 18:47:05 GMT 2018


I've been trying to chase down an error one of our users periodically 
sees with Intel MPI. The body of the error is this:

This requires fcntl(2) to be implemented. As of 8/25/2011 it is not. 
Generic MPICH Message: File locking failed in ADIOI_Set_lock(fd F,cmd 
F_SETLKW/7,type F_RDLCK/0,whence 0) with return value FFFFFFFF and errno 25.
- If the file system is NFS, you need to use NFS version 3, ensure that 
the lockd daemon is running on all the machines, and mount the directory 
with the 'noac' option (no attribute caching).
- If the file system is LUSTRE, ensure that the directory is mounted 
with the 'flock' option.
ADIOI_Set_lock:: No locks available
ADIOI_Set_lock:offset 0, length 8

When this happens, a new job is reading back in the checkpoint files that a 
previous job wrote. It is consistently the reading of previously 
written files that triggers this, although the occurrence is sporadic, and 
if the job retries enough times the error goes away.

The really curious thing is that there is only one byte-range lock per file 
per node open at any time, so error 37 (I know it says 25, but that's 
actually hex even though it's not prefixed with 0x), which means we are out 
of byte-range locks, is a little odd to me. The default limit is 200, and we 
should be nowhere near that.

I've been frantically trying to chase this down with various MPI 
reproducers but came up short, until this morning, when I gave up 
on the MPI approach and tried something a little simpler. I've 
found that the following sequence reproduces it:

- A file is opened by node A (a key requirement to reproduce seems to be 
that node A is *also* the metanode for the file; I've not been able to 
reproduce it if node A is *not* the metanode)
- Node A acquires a bunch of write locks in the file
- Node B then also acquires a bunch of write locks in the file
- Node B then acquires a bunch of read locks in the file
- Node A then also acquires a bunch of read locks in the file

At that last step, node A gets errno 37 when attempting to 
acquire read locks.

Here are the actual commands to reproduce this (source code for 
fcntl_stress.c is attached):

Node A: rm /gpfs/aaronFS/testFile; dd if=/dev/zero of=/gpfs/aaronFS/testFile bs=1M count=4000
Node A: ./fcntl_stress /gpfs/aaronFS/testFile $((1024*1024)) $((256*1024)) 1
Node B: ./fcntl_stress /gpfs/aaronFS/testFile $((1024*1024)) $((256*1024)) 1
Node B: ./fcntl_stress /gpfs/aaronFS/testFile $((1024*1024)) $((256*1024))
Node A: ./fcntl_stress /gpfs/aaronFS/testFile $((1024*1024)) $((256*1024))

Now that I've typed this out, I realize this really should be a PMR not 
a post to the mailing list :) but I thought it was interesting and 
wanted to share.

-Aaron

-- 
Aaron Knister
NASA Center for Climate Simulation (Code 606.2)
Goddard Space Flight Center
(301) 286-2776
-------------- next part --------------
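/*
Aaron Knister <aaron.s.knister at nasa.gov>
Program to acquire a bunch of byte range locks in a file.

Usage: fcntl_stress <file> <stride> <l_len> [1]
  Walks the file from offset 0 in steps of <stride> bytes, taking and then
  releasing an <l_len>-byte lock at each step. With no 4th argument the file
  is opened read-only and read locks (F_RDLCK) are used; any 4th argument
  opens it write-only and uses write locks (F_WRLCK).
*/
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

int main(int argc, char **argv) {
	const char *filename;
	int fd;
	struct stat statBuf;
	off_t l_start;
	off_t l_len;
	off_t stride;
	int openMode;
	short lockType;
	struct flock lock;

	if (argc < 4) {
		fprintf(stderr, "usage: %s <file> <stride> <l_len> [1]\n", argv[0]);
		return EXIT_FAILURE;
	}

	filename = argv[1];
	stride = (off_t)strtoll(argv[2], NULL, 10);
	l_len = (off_t)strtoll(argv[3], NULL, 10);
	if (stride <= 0 || l_len < 0) {
		fprintf(stderr, "stride must be > 0 and l_len must be >= 0\n");
		return EXIT_FAILURE;
	}

	/* Any 4th argument selects write mode; otherwise read mode. */
	if (argc > 4) {
		openMode = O_WRONLY;
		lockType = F_WRLCK;
	} else {
		openMode = O_RDONLY;
		lockType = F_RDLCK;
	}

	printf("Opening file '%s' in %s mode. stride = %lld. l_len = %lld\n",
	       filename, (openMode == O_WRONLY) ? "write" : "read",
	       (long long)stride, (long long)l_len);

	/* Real error checks: an assert() with side effects would be
	   compiled out entirely under -DNDEBUG. */
	fd = open(filename, openMode);
	if (fd < 0) {
		fprintf(stderr, "open failed: %s\n", strerror(errno));
		return EXIT_FAILURE;
	}
	if (fstat(fd, &statBuf) != 0) {
		fprintf(stderr, "fstat failed: %s\n", strerror(errno));
		close(fd);
		return EXIT_FAILURE;
	}

	/* Lock and immediately unlock one range per stride, so at most one
	   range is held by this process at any time. */
	for (l_start = 0; l_start < statBuf.st_size; l_start += stride) {
		memset(&lock, 0, sizeof(lock));
		lock.l_type = lockType;
		lock.l_whence = SEEK_SET;
		lock.l_start = l_start;
		lock.l_len = l_len;

		if (fcntl(fd, F_SETLKW, &lock) != 0) {
			fprintf(stderr, "Non-zero return from fcntl. errno = %d (%s)\n", errno, strerror(errno));
			abort();
		}

		lock.l_type = F_UNLCK;
		if (fcntl(fd, F_SETLKW, &lock) != 0) {
			fprintf(stderr, "Unlock failed. errno = %d (%s)\n", errno, strerror(errno));
			abort();
		}
	}

	close(fd);
	return EXIT_SUCCESS;
}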

