[gpfsug-discuss] Change uidNumber and gidNumber for billions of files

Jonathan Buzzard jonathan.buzzard at strath.ac.uk
Tue Jun 9 12:20:45 BST 2020


On 08/06/2020 18:44, Lohit Valleru wrote:
> Hello Everyone,
> 
> We are planning to migrate from LDAP to AD, and one of the best solutions 
> was to change the uidNumber and gidNumber to what SSSD or Centrify would 
> resolve.
> 
> May I know, if anyone has come across a tool/tools that can change the 
> uidNumbers and gidNumbers of billions of files efficiently and in a 
> reliable manner?

Not to my knowledge.

> We could spend some time to write a custom script, but wanted to know if 
> a tool already exists.
> 

If you can be sure that all files under a specific directory belong to a 
specific user and you have no ACLs, then a whole bunch of "chown -R" 
commands would be reasonable. That is the case if, for example, you have 
a lot of user home directories.

What I do in these scenarios is use a small SQLite database. In this 
case it would hold the directory I want to run chown on, the target UID 
and GID, and a status field. Initially I set the status field to -1, 
which indicates the entry has not been processed. The script sets the 
status field to -2 when it starts processing an entry, and on completion 
sets it to the exit code of the command it ran. This way, when the 
script has finished you can see any directory hierarchies that had a 
problem, and if it dies early you can see where it got up to (that -2).
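
A rough sketch of that bookkeeping in Python, for illustration only. The 
table name "jobs", its column names and the function names are all 
hypothetical, and it assumes a chown that accepts "-R" and "--". One 
wrinkle: Python reports a command killed by a signal as a negative 
return code, which could collide with the -1/-2 markers, so the sketch 
stores 128 + signal number instead, as a shell would.

```python
import sqlite3
import subprocess


def init_db(path, jobs):
    """Create the (hypothetical) jobs table; jobs yields (directory, uid, gid)."""
    db = sqlite3.connect(path)
    db.execute(
        "CREATE TABLE IF NOT EXISTS jobs ("
        "directory TEXT PRIMARY KEY, uid INTEGER, gid INTEGER, "
        "status INTEGER NOT NULL DEFAULT -1)"  # -1 means not yet processed
    )
    db.executemany(
        "INSERT OR IGNORE INTO jobs (directory, uid, gid) VALUES (?, ?, ?)",
        jobs,
    )
    db.commit()
    return db


def run_pending(db, command=("chown", "-R")):
    """Run the command on every row still at -1 and record the outcome."""
    rows = db.execute(
        "SELECT directory, uid, gid FROM jobs WHERE status = -1"
    ).fetchall()
    for directory, uid, gid in rows:
        # Mark the entry as in progress before starting, so a crash is visible.
        db.execute("UPDATE jobs SET status = -2 WHERE directory = ?", (directory,))
        db.commit()
        result = subprocess.run([*command, f"{uid}:{gid}", "--", directory])
        code = result.returncode
        if code < 0:
            # Killed by a signal: store 128 + signal number, not a negative value.
            code = 128 - code
        db.execute("UPDATE jobs SET status = ? WHERE directory = ?", (code, directory))
        db.commit()
```

Committing after every status change costs a little speed but means the 
database always reflects where the run actually got to.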

You can also do things like set all non-zero status codes back to -1 
with a simple SQL update on the database from the sqlite CLI, and then 
run the script again.
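
As an illustration, the reset is a one-line UPDATE. The sketch below 
wraps it in Python and assumes a table called "jobs" with an integer 
"status" column; both names are hypothetical. The same statement can be 
typed directly into the sqlite CLI.

```python
import sqlite3


def reset_failed(db):
    """Put every entry that did not finish cleanly back to -1 (unprocessed).

    Assumes a hypothetical table "jobs" with an integer "status" column.
    Anything other than 0 is reset, which includes entries left at -2 by
    an interrupted run. Returns the number of rows that were changed.
    """
    cur = db.execute("UPDATE jobs SET status = -1 WHERE status != 0")
    db.commit()
    return cur.rowcount
```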

If you don't need to modify ACLs but have mixed ownership under 
directory hierarchies, then a script is reasonable, but not a shell 
script. The overhead of exec'ing chown billions of times on individual 
files will be astronomical. You need something like Perl or Python, 
making use of the built-in chown facilities of the language to avoid all 
those execs. That said, I suspect you will see a significant speed-up 
from using C.
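
A minimal Python sketch of the in-process approach, using os.lstat and 
os.lchown so that nothing is exec'ed per file. The function name and the 
two mapping dictionaries (old ID to new ID) are hypothetical, it is 
single-threaded, and it does not touch ACLs.

```python
import os


def remap_tree(root, uid_map, gid_map):
    """lchown everything under root whose UID or GID appears in the maps.

    uid_map and gid_map are dicts of old ID -> new ID (hypothetical inputs).
    Returns (number of entries changed, list of (path, error) failures).
    """
    failures = []
    changed = 0

    def fix(path):
        nonlocal changed
        try:
            st = os.lstat(path)  # lstat/lchown act on a symlink itself
            new_uid = uid_map.get(st.st_uid, -1)  # -1 leaves that ID unchanged
            new_gid = gid_map.get(st.st_gid, -1)
            if new_uid != -1 or new_gid != -1:
                os.lchown(path, new_uid, new_gid)
                changed += 1
        except OSError as exc:
            failures.append((path, exc))

    def on_walk_error(exc):
        # Called by os.walk when a directory cannot be listed.
        failures.append((exc.filename, exc))

    fix(root)
    # os.walk does not descend into symlinked directories by default.
    for dirpath, dirnames, filenames in os.walk(root, onerror=on_walk_error):
        for name in dirnames + filenames:
            fix(os.path.join(dirpath, name))
    return changed, failures
```

Collecting failures rather than stopping at the first one makes it 
possible to review them after the run.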

If you have ACLs to contend with, then I would definitely spend some 
time and write some C code using the GPFS library. It will be a *LOT* 
faster than any script ever will be. Dealing with mmgetacl and mmputacl 
in any script is horrendous, and you will have billions of execs of each 
command.

As I understand it, GPFS stores each ACL once and each file then points 
to the ACL. Theoretically it would be possible to just modify the stored 
ACLs for a very speedy update of all the ACLs on the 
files/directories. However, I would imagine you need to engage IBM and 
bend over while they empty your wallet for that option :-)

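The biggest issue to take care of, IMHO, is whether any of the input 
UID/GID numbers also exist in the output set. If so, life just got a lot 
harder, as you don't get a second chance to run the script/program if 
there is a problem: a file that already carries one of the new numbers 
cannot be told apart from a file still waiting to be converted.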
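
Checking for such clashes up front is cheap. A small sketch, assuming 
the planned change is available as a dictionary of old ID to new ID (the 
function name and the example numbers are made up):

```python
def find_clashes(id_map):
    """Return the new IDs that are also the old ID of a different mapping.

    id_map is a dict of old ID -> new ID. An entry that maps an ID to
    itself is not a clash, since nothing changes for it.
    """
    sources = set(id_map)
    return sorted({new for old, new in id_map.items()
                   if new in sources and new != old})


# 2000 is both a target (of 1000) and a source (mapped to 3000), so after
# a partial run a file owned by 2000 is ambiguous.
print(find_clashes({1000: 2000, 2000: 3000, 1500: 1500}))  # → [2000]
```

Run it once for the UID mapping and once for the GID mapping; an empty 
list means the change can safely be re-run.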

In this case I would be very tempted to remove such clashes prior to the 
main change. You might be able to do that incrementally before the main 
switch and update your LDAP to match.

Finally, be aware that if you are using TSM for backup you will, as far 
as I am aware, probably need to back every file up again after the 
change of ownership.

JAB.

-- 
Jonathan A. Buzzard                         Tel: +44141-5483420
HPC System Administrator, ARCHIE-WeSt.
University of Strathclyde, John Anderson Building, Glasgow. G4 0NG


