[gpfsug-discuss] Change uidNumber and gidNumber for billions of files

Stephen Ulmer ulmer at ulmer.org
Tue Jun 9 14:07:32 BST 2020


Jonathan brings up a good point that you’ll only get one shot at this — if you’re using the file system as your record of who owns what.

You might want to use the policy engine to record the existing file names and ownership (and then provide updates using the same policy engine for the things that changed after the last time you ran it). At that point, you’ve got the list of who should own what from before you started.

You could even do some things to see how complex your problem is, like “how many directories have files owned by more than one UID?”

With respect to that, it is surprising how easy the sqlite C API is to use (though I would still recommend Perl or Python), and equally surprising how *bad* the JOIN performance is. If you go with sqlite, denormalize *everything* as it’s collected. If that is too dirty for you, then just use MariaDB or something else.
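
As a rough illustration of the denormalized approach, here is a minimal Python sketch that answers the "more than one UID per directory" question with no JOIN at all. The table name `files`, its columns, and the function `mixed_owner_dirs` are all made up for the example. It assumes you have already turned your ownership listing into (directory, name, uid, gid) tuples.

```python
import sqlite3

def mixed_owner_dirs(rows):
    """Count directories whose entries are owned by more than one UID.

    rows: iterable of (directory, name, uid, gid) tuples (hypothetical layout).
    The directory is stored on every row, so no JOIN is needed.
    """
    con = sqlite3.connect(":memory:")  # use a file path for real data sets
    con.execute(
        "CREATE TABLE files (dir TEXT, name TEXT, uid INTEGER, gid INTEGER)"
    )
    con.executemany("INSERT INTO files VALUES (?, ?, ?, ?)", rows)
    # Index after the bulk load so the GROUP BY can walk the index.
    con.execute("CREATE INDEX files_dir_uid ON files (dir, uid)")
    (count,) = con.execute(
        "SELECT COUNT(*) FROM ("
        "  SELECT dir FROM files GROUP BY dir HAVING COUNT(DISTINCT uid) > 1"
        ")"
    ).fetchone()
    con.close()
    return count
```

With billions of rows you would want an on-disk database and batched inserts, but the query itself stays the same.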


-- 
Stephen



> On Jun 9, 2020, at 7:20 AM, Jonathan Buzzard <jonathan.buzzard at strath.ac.uk> wrote:
> 
> On 08/06/2020 18:44, Lohit Valleru wrote:
>> Hello Everyone,
>> We are planning to migrate from LDAP to AD, and one of the best solutions was to change the uidNumber and gidNumber to what SSSD or Centrify would resolve.
>> May I know, if anyone has come across a tool/tools that can change the uidNumbers and gidNumbers of billions of files efficiently and in a reliable manner?
> 
> Not to my knowledge.
> 
>> We could spend some time to write a custom script, but wanted to know if a tool already exists.
> 
> If you can be sure that all files under a specific directory belong to a specific user and you have no ACL's then a whole bunch of "chown -R" would be reasonable. That is, if you have a lot of user home directories, for example.
> 
> What I do in these scenarios is use a small sqlite database which, in this case, holds the directory that I want to chown, the target UID and GID, and a status field. Initially I set the status field to -1 which indicates they have not been processed. The script sets the status field to -2 when it starts processing an entry and on completion sets the status field to the exit code of the command you are running. This way when the script is finished you can see any directory hierarchies that had a problem and if it dies early you can see where it got up to (that -2).
> 
> You can also do things like set all non-zero status codes back to -1 and run again with a simple SQL update on the database from the sqlite CLI.
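
The status-field scheme described above could be sketched in Python roughly as follows. This is only an illustration under assumptions: the table name `jobs`, its columns, and the helper names are invented here, and Jonathan's actual script may look nothing like this. The -1 / -2 / exit-code convention is the one from the text.

```python
import sqlite3
import subprocess

def init_db(con, jobs):
    """jobs: iterable of (directory, target_uid, target_gid)."""
    con.execute(
        "CREATE TABLE IF NOT EXISTS jobs ("
        " dir TEXT PRIMARY KEY, uid INTEGER, gid INTEGER,"
        " status INTEGER DEFAULT -1)"  # -1 means not processed yet
    )
    con.executemany(
        "INSERT OR IGNORE INTO jobs (dir, uid, gid) VALUES (?, ?, ?)", jobs
    )
    con.commit()

def chown_cmd(directory, uid, gid):
    """Default command: one recursive chown per directory."""
    return ["chown", "-R", f"{uid}:{gid}", "--", directory]

def run_pending(con, make_cmd=chown_cmd):
    pending = con.execute(
        "SELECT dir, uid, gid FROM jobs WHERE status = -1"
    ).fetchall()
    for directory, uid, gid in pending:
        # -2 means in progress; commit so a crash leaves a visible marker.
        con.execute("UPDATE jobs SET status = -2 WHERE dir = ?", (directory,))
        con.commit()
        rc = subprocess.run(make_cmd(directory, uid, gid)).returncode
        if rc < 0:
            # subprocess reports death by signal N as -N, which could
            # collide with the -1/-2 markers; store it shell-style instead.
            rc = 128 - rc
        con.execute("UPDATE jobs SET status = ? WHERE dir = ?", (rc, directory))
        con.commit()

def reset_failed(con):
    """Put every non-zero entry back to 'not processed' for another run."""
    con.execute("UPDATE jobs SET status = -1 WHERE status <> 0")
    con.commit()
```

Committing after each status change is slow for tiny jobs but cheap next to a recursive chown, and it is what makes the "where did it get up to" check reliable.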
> 
> If you don't need to modify ACL's but have mixed ownership under directory hierarchies then a script is reasonable but not a shell script. The overhead of execing chown billions of times on individual files will be astronomical. You need something like Perl or Python and make use of the builtin chown facilities of the language to avoid all those exec's. That said I suspect you will see a significant speed up from using C.
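
For the mixed-ownership case, a minimal Python sketch of the "use the language's builtin chown" idea might look like this. The function name `remap_tree` and the dictionary-based mapping are assumptions for illustration. It deliberately ignores ACLs and defaults to a dry run that only counts what would change.

```python
import os

def remap_tree(root, uid_map, gid_map, dry_run=True):
    """Walk root and remap owners in-process, with no exec per file.

    uid_map / gid_map: dicts of old numeric ID -> new numeric ID
    (hypothetical inputs). IDs not in a map are left alone.
    Returns the number of entries that need (or received) a change.
    """
    changed = 0

    def fix(path):
        nonlocal changed
        st = os.lstat(path)  # lstat: look at a symlink itself, not its target
        new_uid = uid_map.get(st.st_uid, st.st_uid)
        new_gid = gid_map.get(st.st_gid, st.st_gid)
        if (new_uid, new_gid) != (st.st_uid, st.st_gid):
            if not dry_run:
                os.lchown(path, new_uid, new_gid)  # does not follow symlinks
            changed += 1

    fix(root)
    # os.walk does not descend into symlinked directories by default.
    for dirpath, dirnames, filenames in os.walk(root):
        for name in dirnames + filenames:
            fix(os.path.join(dirpath, name))
    return changed
```

A single-threaded walk like this will still take a long time over billions of files; in practice you would probably split the namespace and run one walker per subtree, for instance driven by a job table like the one described earlier.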
> 
> If you have ACL's to contend with then I would definitely spend some time and write some C code using the GPFS library. It will be a *LOT* faster than any script ever will be. Dealing with mmgetacl and mmputacl in any script is horrendous and you will have billions of exec's of each command.
> 
> As I understand it GPFS stores each ACL once and each file then points to the ACL. Theoretically it would be possible to just modify the stored ACL's for a very speedy update of all the ACL's on the files/directories. However I would imagine you need to engage IBM and bend over while they empty your wallet for that option :-)
> 
> The biggest issue to take care of IMHO is do any of the input UID/GID numbers exist in the output set??? If so life just got a lot harder as you don't get a second chance to run the script/program if there is a problem.
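
To make the clash check concrete: once files owned by 1000 have been changed to 2000, they can no longer be told apart from files that were owned by 2000 all along, so a re-run would remap them a second time. A small Python sketch of the check, with a made-up function name and assuming you have the old-to-new mapping as a dict plus the set of IDs currently in use on the file system:

```python
def find_clashes(id_map, in_use):
    """Return target IDs that are already in use on the file system.

    id_map: dict of old numeric ID -> new numeric ID (hypothetical input).
    in_use: set of numeric IDs that currently own at least one file.
    Identity mappings (old == new) are not counted as clashes.
    """
    return sorted(
        {new for old, new in id_map.items() if new != old and new in in_use}
    )
```

An empty result means the input and output sets are disjoint, so a partially completed run can be repeated safely. The same check applies separately to the GID mapping.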
> 
> In this case I would be very tempted to remove such clashes prior to the main change. You might be able to do that incrementally before the main switch and update your LDAP to match.
> 
> Finally be aware that if you are using TSM for backup you will probably need to back every file up again after the change of ownership as far as I am aware.
> 
> JAB.
> 
> -- 
> Jonathan A. Buzzard                         Tel: +44141-5483420
> HPC System Administrator, ARCHIE-WeSt.
> University of Strathclyde, John Anderson Building, Glasgow. G4 0NG
> _______________________________________________
> gpfsug-discuss mailing list
> gpfsug-discuss at spectrumscale.org
> http://gpfsug.org/mailman/listinfo/gpfsug-discuss
