[gpfsug-discuss] 4K sector NSD support (was: Hardware refresh)

Yuri L Volobuev volobuev at us.ibm.com
Wed Oct 12 06:44:40 BST 2016


Yes, it is possible to add a 4KN dataOnly NSD to a non-4K-aligned file
system, as you figured out.  This is something we didn't plan on doing
originally, but then had to implement based on the feedback from the field.
There's clearly a need for this.  However, the operative word in that
statement is dataOnly: the only way to put metadata on a 4KN disk is to use
a 4K-aligned file system.  Several kinds of metadata in a non-4K-aligned
file system generate non-4K IOs (512-byte inodes being the biggest
problem); there's no way to work around this short of using the new format,
and there's no way to perform an in-place conversion to the new format.
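
To make the dataOnly path concrete, here is a rough sketch (file system,
server, and disk names are placeholders for your own):

  # Where does the file system stand?
  mmlsfs fs1 --is4KAligned        # is the file system 4K-aligned?
  mmlsfs fs1 -V                   # file system format version
  mmlsconfig minReleaseLevel      # committed cluster level

  # nsd4k.stanza -- NSD stanza for the 4KN disk; usage=dataOnly is what
  # allows it into a non-4K-aligned file system
  %nsd:
    device=/dev/sdx
    nsd=data4k_001
    servers=nsdserver1,nsdserver2
    usage=dataOnly
    failureGroup=101

  # Create the NSD and add it to the existing file system
  mmcrnsd -F nsd4k.stanza
  mmadddisk fs1 -F nsd4k.stanza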

You're welcome to submit an RFE, of course, but I'd recommend being
pragmatic about the chances of such an RFE being implemented.  As you can
imagine, the main reason an all-encompassing file system conversion
tool doesn't exist is not that GPFS developers are unaware such a tool
is wanted.  There are several considerations that conspire to make this an
unlikely candidate to ever be implemented:
1) The task is hard and has no finish line.  In most GPFS releases
something changes, adding another piece of work for the hypothetical
conversion tool, and the matrix of from-to format version combinations
grows big very quickly.
2) A file system conversion is something that is needed very infrequently,
but when this code does run, it absolutely has to run and run perfectly,
else the result would be a half-converted file system, i.e. a royal mess.
This is a tester's nightmare.
3) The failure scenarios are all unpalatable.  What should the conversion
tool do if it runs out of space replacing smaller metadata structures with
bigger ones?  Undoing a partially finished conversion is even harder than
doing it in the first place.
4) Doing an on-disk conversion on-line is simply very difficult.  Consider
the task of converting an inode file to use a different inode size.  The
file can be huge (billions of records), and it would take a fair chunk of
time to rewrite it, but the file is changing while it's being converted
(can't simply lock the whole thing down for so long), simultaneously on
multiple nodes.  Orchestrating the processing of updates in the presence of
two inode files, with proper atomicity guarantees (to guard against a node
failure), is a task of considerable complexity.

None of this means the task is impossible, of course.  It is, however, a
very big chunk of very complex work, all towards a tool that on an average
cluster may run somewhere between zero and one times, not something that
benefits day-to-day operations.  Where the complexity of the task allows
for a reasonably affordable implementation, e.g. conversion from an
old-style EA file to the FASTEA format, a conversion tool has been
implemented (mmmigratefs).  However, doing this for every single changed
aspect of the file system format is simply too expensive to justify, given
other tasks in front of us.
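
For reference, that particular conversion is a one-shot command run against
an unmounted file system; roughly (file system name is a placeholder):

  mmumount fs1 -a           # unmount on all nodes
  mmchfs fs1 -V full        # bring the format version up to date first, if needed
  mmmigratefs fs1 --fastea  # migrate old-style EA files to the FASTEA format
  mmmount fs1 -a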

On the other hand, a well-implemented migration mechanism solves the file
system reformatting scenario (which covers all aspects of file system
format changes) as well as a number of other scenarios.  This is a cleaner,
more general solution.  Migration doesn't have to mean an outage.  A simple
rsync-based migration requires downtime for a cutover, while an AFM-based
migration doesn't necessarily require one.  I'm not saying that GPFS has a
particularly strong migration story at the moment, but this is a much more
productive direction for applying resources than a mythical
all-encompassing conversion tool.
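
To illustrate the rsync flavour of this, a minimal sketch, assuming the old
and new file systems are mounted at /gpfs0 and /gpfs2 (note that plain rsync
won't carry over GPFS-specific bits such as NFSv4 ACLs or fileset
boundaries):

  # First pass, done while users keep working
  rsync -aHAXS --numeric-ids /gpfs0/ /gpfs2/

  # Cutover: quiesce access to /gpfs0, then run a final incremental pass
  rsync -aHAXS --numeric-ids --delete /gpfs0/ /gpfs2/

  # Repoint users at the new file system (symlinks, automounter, remounts)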

yuri



From:	Aaron Knister <aaron.s.knister at nasa.gov>
To:	<gpfsug-discuss at spectrumscale.org>
Date:	10/11/2016 05:59 PM
Subject:	Re: [gpfsug-discuss] 4K sector NSD support (was: Hardware refresh)
Sent by:	gpfsug-discuss-bounces at spectrumscale.org



Yuri,

(Sorry for being somewhat spammy) I now understand the limitation after
some more testing (I'm a hands-on learner, can you tell?). Given the
right code/cluster/fs version levels I can add 4K dataOnly NSDv2 NSDs to
a filesystem created with NSDv1 NSDs. What I seemingly can't do is add
any metadataOnly or dataAndMetadata 4K LUNs to an fs that is not 4K
aligned, which I assume would be any fs originally created with NSDv1
LUNs. It seems possible to move all data away from the NSDv1 LUNs in a
filesystem behind the scenes using GPFS migration tools, and move the
data to NSDv2 LUNs. In that case I believe what's missing is a tool to
convert just the metadata structures to be 4K aligned, since the data
would already be on 4K-based NSDv2 LUNs -- is that the case? I'm trying
to figure out what exactly I'm asking for in an RFE.
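
The behind-the-scenes move I'm picturing would presumably look something
like this (disk names are made up):

  # Stop new allocations on the old NSDv1 LUNs
  mmchdisk fs1 suspend -d "nsdv1_001;nsdv1_002"

  # Migrate existing data off the suspended disks onto the NSDv2 LUNs
  mmrestripefs fs1 -r

  # Remove the now-empty disks from the file system
  mmdeldisk fs1 "nsdv1_001;nsdv1_002"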

-Aaron

On 10/11/16 7:57 PM, Aaron Knister wrote:
> I think I was a little quick to the trigger. I re-read your last mail
> after doing some testing and understand it differently. I was wrong
> about my interpretation-- you can add 4K NSDv2 formatted NSDs to a
> filesystem previously created with NSDv1 NSDs assuming, as you say, the
> minReleaseLevel and filesystem version are high enough. That negates
> about half of my last e-mail. The fs still doesn't show as 4K aligned:
>
> loressd01:~ # /usr/lpp/mmfs/bin/mmlsfs tnb4k --is4KAligned
> flag                value                    description
> ------------------- ------------------------ -----------------------------------
>  --is4KAligned      No                       is4KAligned?
>
> but *shrug* most of the I/O to these disks should be 1MB anyway. If
> somebody is pounding the FS with smaller than 4K I/O they're gonna get a
> talkin' to.
>
> -Aaron
>
> On 10/11/16 6:41 PM, Aaron Knister wrote:
>> Thanks Yuri.
>>
>> I'm asking for my own purposes but I think it's still relevant here:
>> we're still at GPFS 3.5 and will be adding dataOnly NSDs with 4K sectors
>> in the near future. We're planning to update to 4.1 before we format
>> these NSDs, though. If I understand you correctly we can't bring these
>> 4K NSDv2 NSDs into a filesystem with 512b-based NSDv1 NSDs? That's a
>> pretty big deal :(
>>
>> Reformatting every few years with 10's of petabytes of data is not
>> realistic for us (it would take years to move the data around). It also
>> goes against my personal preachings about GPFS's storage virtualization
>> capabilities: the ability to perform upgrades/make underlying storage
>> infrastructure changes with behind-the-scenes data migration,
>> eliminating much of the manual hassle of storage administrators doing
>> rsync dances. I guess it's RFE time? It also seems as though AFM could
>> help with automating the migration, although many of our filesystems do
>> not have filesets on them so we would have to re-think how we lay out
>> our filesystems.
>>
>> This is also curious to me with IBM pitching GPFS as a filesystem for
>> cloud services (the cloud *never* goes down, right?). Granted I believe
>> this pitch started after the NSDv2 format was defined, but if somebody
>> is building a large cloud with GPFS as the underlying filesystem for an
>> object or an image store one might think the idea of having to re-format
>> the filesystem to gain access to critical new features is inconsistent
>> with this pitch. It would be hugely impactful. Just my $.02.
>>
>> As you can tell, I'm frustrated there's no online conversion tool :) Not
>> that there couldn't be... you all are brilliant developers.
>>
>> -Aaron
>>
>> On 10/11/16 1:22 PM, Yuri L Volobuev wrote:
>>> This depends on the committed cluster version level (minReleaseLevel)
>>> and file system format. Since NSDv2 is an on-disk format change, older
>>> code wouldn't be able to understand what it is, and thus if there's a
>>> possibility of a downlevel node looking at the NSD, the NSDv1 format is
>>> going to be used. The code does NSDv1<->NSDv2 conversions under the
>>> covers as needed when adding an empty NSD to a file system.
>>>
>>> I'd strongly recommend getting a fresh start by formatting a new file
>>> system. Many things have changed over the course of the last few years.
>>> In particular, having a 4K-aligned file system can be a pretty big deal,
>>> depending on what hardware one is going to deploy in the future, and
>>> this is something that can't be bolted onto an existing file system.
>>> Having 4K inodes is very handy for many reasons. New directory format
>>> and NSD format changes are attractive, too. And disks generally tend to
>>> get larger with time, and at some point you may want to add a disk to an
>>> existing storage pool that's larger than the existing allocation map
>>> format allows. Obviously, it's more hassle to migrate data to a new file
>>> system, as opposed to extending an existing one. In a perfect world,
>>> GPFS would offer a conversion tool that seamlessly and robustly converts
>>> old file systems, making them as good as new, but in the real world such
>>> a tool doesn't exist. Getting a clean slate by formatting a new file
>>> system every few years is a good long-term investment of time, although
>>> it comes front-loaded with extra work.
>>>
>>> yuri
>>>
>>>
>>> From: Aaron Knister <aaron.s.knister at nasa.gov>
>>> To: <gpfsug-discuss at spectrumscale.org>,
>>> Date: 10/10/2016 04:45 PM
>>> Subject: Re: [gpfsug-discuss] Hardware refresh
>>> Sent by: gpfsug-discuss-bounces at spectrumscale.org
>>>
>>>
>>>
>>>
>>>
>>> Can one format NSDv2 NSDs and put them in a filesystem with NSDv1 NSDs?
>>>
>>> -Aaron
>>>
>>> On 10/10/16 7:40 PM, Luis Bolinches wrote:
>>>> Hi
>>>>
>>>> Creating a new FS sounds like the best way to go, NSDv2 being a very
>>>> good reason to do so.
>>>>
>>>> AFM for migrations is quite good; the latest versions allow using the
>>>> NSD protocol for mounts as well. Olaf did a great job explaining this
>>>> scenario in chapter 6 of the redbook:
>>>>
>>>> http://www.redbooks.ibm.com/abstracts/sg248254.html?Open
>>>>
>>>> --
>>>> Cheers
>>>>
>>>> On 10 Oct 2016, at 23.05, Buterbaugh, Kevin L
>>>> <Kevin.Buterbaugh at Vanderbilt.Edu
>>>> <mailto:Kevin.Buterbaugh at Vanderbilt.Edu>> wrote:
>>>>
>>>>> Hi Mark,
>>>>>
>>>>> The last time we did something like this was 2010 (we’re doing
>>>>> rolling refreshes now), so there are probably lots of better ways to
>>>>> do this than what we did, but we:
>>>>>
>>>>> 1) set up the new hardware
>>>>> 2) created new filesystems (so that we could make adjustments we
>>>>> wanted to make that can only be made at FS creation time)
>>>>> 3) used rsync to make a 1st pass copy of everything
>>>>> 4) coordinated a time with users / groups to do a 2nd rsync when they
>>>>> weren’t active
>>>>> 5) used symbolic links during the transition (i.e. rm -rvf
>>>>> /gpfs0/home/joeuser; ln -s /gpfs2/home/joeuser /gpfs0/home/joeuser)
>>>>> 6) once everybody was migrated, updated the symlinks (i.e. /home
>>>>> became a symlink to /gpfs2/home)
>>>>>
>>>>> HTHAL…
>>>>>
>>>>> Kevin
>>>>>
>>>>>> On Oct 10, 2016, at 2:56 PM, Mark.Bush at siriuscom.com
>>>>>> <mailto:Mark.Bush at siriuscom.com> wrote:
>>>>>>
>>>>>> Have a very old cluster built on IBM X3650’s and DS3500.  Need to
>>>>>> refresh hardware.  Any lessons learned in this process?  Is it
>>>>>> easiest to just build new cluster and then use AFM?  Add to existing
>>>>>> cluster then decommission nodes?  What is the recommended process
>>>>>> for this?
>>>>>>
>>>>>>
>>>>>> Mark
>>>>>>
>>>>>> _______________________________________________
>>>>>> gpfsug-discuss mailing list
>>>>>> gpfsug-discuss at spectrumscale.org <http://spectrumscale.org/>
>>>>>> http://gpfsug.org/mailman/listinfo/gpfsug-discuss
>>>>>
>>>>> Kevin Buterbaugh - Senior System Administrator
>>>>> Vanderbilt University - Advanced Computing Center for Research and
>>>>> Education
>>>>> Kevin.Buterbaugh at vanderbilt.edu
>>>>> <mailto:Kevin.Buterbaugh at vanderbilt.edu> - (615)875-9633
>>>>>
>>>>>
>>>>>
>>>>
>>>>
>>>>
>>>>
>>>> _______________________________________________
>>>> gpfsug-discuss mailing list
>>>> gpfsug-discuss at spectrumscale.org
>>>> http://gpfsug.org/mailman/listinfo/gpfsug-discuss
>>>>
>>> _______________________________________________
>>> gpfsug-discuss mailing list
>>> gpfsug-discuss at spectrumscale.org
>>> http://gpfsug.org/mailman/listinfo/gpfsug-discuss
>>>
>> _______________________________________________
>> gpfsug-discuss mailing list
>> gpfsug-discuss at spectrumscale.org
>> http://gpfsug.org/mailman/listinfo/gpfsug-discuss
> _______________________________________________
> gpfsug-discuss mailing list
> gpfsug-discuss at spectrumscale.org
> http://gpfsug.org/mailman/listinfo/gpfsug-discuss
_______________________________________________
gpfsug-discuss mailing list
gpfsug-discuss at spectrumscale.org
http://gpfsug.org/mailman/listinfo/gpfsug-discuss


