[gpfsug-discuss] Preferred NSD

Jeffrey R. Lang JRLang at uwyo.edu
Wed Mar 14 14:11:35 GMT 2018


Something I haven't heard in this discussion is the licensing of GPFS.

I believe that once a node exports disks it becomes an NSD server node, and its license may need to be changed from client to server.  There goes the budget.
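If you want to sanity-check that before committing to a design, the
designations are easy to inspect and change with the standard Spectrum
Scale admin commands, along these lines (node names are placeholders):

  # show the current license designation of each node
  mmlslicense -L

  # designate the nodes that will serve NSDs as server-licensed
  mmchlicense server --accept -N node01,node02

With ~60 NSD-serving nodes that's a lot of server licenses, so it's worth
pricing out before going any further.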



-----Original Message-----
From: gpfsug-discuss-bounces at spectrumscale.org <gpfsug-discuss-bounces at spectrumscale.org> On Behalf Of Lukas Hejtmanek
Sent: Wednesday, March 14, 2018 4:28 AM
To: gpfsug main discussion list <gpfsug-discuss at spectrumscale.org>
Subject: Re: [gpfsug-discuss] Preferred NSD

Hello,

thank you for the insight. Well, the point is that I will get ~60 nodes with 120 NVMe disks in them, each about 2 TB in size. That means I will have 240 TB of NVMe SSD that could make a nice shared scratch area. Moreover, I have no other hardware or place to put these SSDs into; they have to be in the compute nodes.
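Roughly, ignoring metadata and filesystem overhead, that raw capacity
works out to:

  60 nodes x 2 NVMe x 2 TB = 240 TB raw
  with 2 data replicas     ~ 120 TB usable
  with 3 data replicas     ~  80 TB usable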

On Tue, Mar 13, 2018 at 10:48:21AM -0700, Alex Chekholko wrote:
> I would like to discourage you from building a large distributed 
> clustered filesystem made of many unreliable components.  You will 
> need to overprovision your interconnect and will also spend a lot of 
> time in "healing" or "degraded" state.
> 
> It is typically cheaper to centralize the storage into a subset of 
> nodes and configure those to be more highly available.  E.g. of your 
> 60 nodes, take 8 and put all the storage into those and make that a 
> dedicated GPFS cluster with no compute jobs on those nodes.  Again, 
> you'll still need really beefy and reliable interconnect to make this work.
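> As a rough sketch of what that looks like on the GPFS side (device
> paths, node and NSD names are made up), each storage node exports its
> NVMe devices via an NSD stanza file, with one failure group per node so
> that replicas never land on the same server:
> 
>   %nsd: device=/dev/nvme0n1 nsd=nsd_node01_0 servers=node01 usage=dataAndMetadata failureGroup=1 pool=system
>   %nsd: device=/dev/nvme1n1 nsd=nsd_node01_1 servers=node01 usage=dataAndMetadata failureGroup=1 pool=system
>   %nsd: device=/dev/nvme0n1 nsd=nsd_node02_0 servers=node02 usage=dataAndMetadata failureGroup=2 pool=system
>   # ...and so on for the remaining storage nodes
> 
>   mmcrnsd -F nsd_stanzas.txt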
> 
> Stepping back; what is the actual problem you're trying to solve?  I 
> have certainly been in that situation before, where the problem is 
> more like: "I have a fixed hardware configuration that I can't change, 
> and I want to try to shoehorn a parallel filesystem onto that."
> 
> I would recommend looking closer at your actual workloads.  If this is 
> a "scratch" filesystem and file access is mostly from one node at a 
> time, it's not very useful to make two additional copies of that data 
> on other nodes, and it will only slow you down.
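> If it really is scratch, you could even create the filesystem with a
> single data replica and only replicate the metadata, e.g. something
> like (mmcrfs flags, exact values to taste):
> 
>   mmcrfs scratchfs -F nsd_stanzas.txt -m 2 -M 2 -r 1 -R 2
> 
> i.e. two metadata replicas, one data replica, with the option to raise
> data replication later if you change your mind.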
> 
> Regards,
> Alex
> 
> On Tue, Mar 13, 2018 at 7:16 AM, Lukas Hejtmanek <xhejtman at ics.muni.cz> wrote:
> 
> > On Tue, Mar 13, 2018 at 10:37:43AM +0000, John Hearns wrote:
> > > Lukas,
> > > It looks like you are proposing a setup which uses your compute
> > > servers as storage servers also?
> >
> > yes, exactly. I would like to utilise the NVMe SSDs that are in every
> > compute server. Using them as a shared scratch area with GPFS is one
> > of the options.
> >
> > >
> > >   *   I'm thinking about the following setup:
> > > ~ 60 nodes, each with two enterprise NVMe SSDs, FDR IB 
> > > interconnected
> > >
> > > There is nothing wrong with this concept, for instance see 
> > > https://www.beegfs.io/wiki/BeeOND
> > >
> > > I have an NVMe filesystem which uses 60 drives, but there are 10 servers.
> > > You should look at "failure zones" also.
> >
> > so you still need the storage servers, and the local SSDs are used
> > only for caching, do I understand correctly?
> >
> > >
> > > From: gpfsug-discuss-bounces at spectrumscale.org [mailto:gpfsug-discuss-bounces at spectrumscale.org] On Behalf Of Knister, Aaron S. (GSFC-606.2)[COMPUTER SCIENCE CORP]
> > > Sent: Monday, March 12, 2018 4:14 PM
> > > To: gpfsug main discussion list <gpfsug-discuss at spectrumscale.org>
> > > Subject: Re: [gpfsug-discuss] Preferred NSD
> > >
> > > Hi Lukas,
> > >
> > > Check out FPO mode. That mimics Hadoop's data placement features. You
> > > can have up to 3 replicas of both data and metadata, but the downside,
> > > as you say, is that the wrong node failures will take your cluster down.
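> > > A minimal sketch of what the FPO-style placement looks like (pool name,
> > > block size and the other values are illustrative, not a recommendation):
> > > the NVMe NSDs go into a pool created with write affinity enabled, via
> > > stanzas along the lines of
> > >
> > >   %pool: pool=fpodata blockSize=1M usage=dataOnly layoutMap=cluster allowWriteAffinity=yes writeAffinityDepth=1 blockGroupFactor=128
> > >   %nsd: device=/dev/nvme0n1 nsd=nsd_node01_0 servers=node01 usage=dataOnly failureGroup=1 pool=fpodata
> > >
> > > With writeAffinityDepth=1 the first copy of whatever a node writes stays
> > > on that node's own NSDs, which is roughly the "local NSD is preferred"
> > > behaviour being asked about; the other replicas still go to other
> > > failure groups, so the wrong combination of node failures can still take
> > > data offline.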
> > >
> > > You might want to check out something like Excelero's NVMesh (note: not
> > > an endorsement since I can't give such things) which can create logical
> > > volumes across all your NVMe drives. The product has erasure coding on
> > > its roadmap. I'm not sure if they've released that feature yet, but in
> > > theory it will give better fault tolerance *and* you'll get more
> > > efficient usage of your SSDs.
> > >
> > > I'm sure there are other ways to skin this cat too.
> > >
> > > -Aaron
> > >
> > >
> > >
> > > On March 12, 2018 at 10:59:35 EDT, Lukas Hejtmanek <xhejtman at ics.muni.cz> wrote:
> > > Hello,
> > >
> > > I'm thinking about the following setup:
> > > ~ 60 nodes, each with two enterprise NVMe SSDs, FDR IB 
> > > interconnected
> > >
> > > I would like to set up a shared scratch area using GPFS and those NVMe
> > > SSDs, each SSD as one NSD.
> > >
> > > I don't think that something like 5 or more data/metadata replicas is
> > > practical here. On the other hand, multiple node failures are something
> > > to be expected.
> > >
> > > Is there a way to ensure that the local NSD is strongly preferred for
> > > storing data? I.e., so that a node failure most probably does not result
> > > in data being unavailable to the other nodes?
> > >
> > > Or is there any other recommendation/solution to build shared scratch
> > > with GPFS in such a setup? (Including "do not do it.")
> > >
> > > --
> > > Lukáš Hejtmánek
> >
> >
> >
> > --
> > Lukáš Hejtmánek
> >
> > Linux Administrator only because
> >   Full Time Multitasking Ninja
> >   is not an official job title
> >



--
Lukáš Hejtmánek

Linux Administrator only because
  Full Time Multitasking Ninja
  is not an official job title
_______________________________________________
gpfsug-discuss mailing list
gpfsug-discuss at spectrumscale.org
http://gpfsug.org/mailman/listinfo/gpfsug-discuss


