[gpfsug-discuss] Preferred NSD

Lukas Hejtmanek xhejtman at ics.muni.cz
Wed Mar 14 09:28:15 GMT 2018


Hello,

thank you for the insight. Well, the point is that I will get ~60 nodes with 120 NVMe
disks in them, each disk about 2TB in size. That means 240TB of NVMe SSD that
could build a nice shared scratch. Moreover, I have no other hardware or place
to put these SSDs into. They have to be in the compute nodes.
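
Just to make the idea concrete, the NSD stanzas I have in mind would look
roughly like the sketch below (hostnames and device paths are only placeholders;
one failure group per node so that replicas of a block never end up on the same
machine):

  # rough sketch of a stanza file for mmcrnsd -F nsd.stanza
  %nsd: device=/dev/nvme0n1 nsd=node01_nvme0 servers=node01 usage=dataAndMetadata failureGroup=1
  %nsd: device=/dev/nvme1n1 nsd=node01_nvme1 servers=node01 usage=dataAndMetadata failureGroup=1
  %nsd: device=/dev/nvme0n1 nsd=node02_nvme0 servers=node02 usage=dataAndMetadata failureGroup=2
  %nsd: device=/dev/nvme1n1 nsd=node02_nvme1 servers=node02 usage=dataAndMetadata failureGroup=2
  # ...and so on for all ~60 nodes, ~120 NSDs in total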

On Tue, Mar 13, 2018 at 10:48:21AM -0700, Alex Chekholko wrote:
> I would like to discourage you from building a large distributed clustered
> filesystem made of many unreliable components.  You will need to
> overprovision your interconnect and will also spend a lot of time in
> "healing" or "degraded" state.
> 
> It is typically cheaper to centralize the storage into a subset of nodes
> and configure those to be more highly available.  E.g. of your 60 nodes,
> take 8 and put all the storage into those and make that a dedicated GPFS
> cluster with no compute jobs on those nodes.  Again, you'll still need
> really beefy and reliable interconnect to make this work.
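> 
> To sketch what I mean (untested, hostnames invented): each of the 8 storage
> nodes exports its local NVMe drives as NSDs in its own failure group, and the
> filesystem is created with two data and metadata replicas so that losing one
> storage node does not take data offline:
> 
>   %nsd: device=/dev/nvme0n1 nsd=stor01_nvme0 servers=stor01 usage=dataAndMetadata failureGroup=101
>   %nsd: device=/dev/nvme0n1 nsd=stor02_nvme0 servers=stor02 usage=dataAndMetadata failureGroup=102
>   # ... one stanza per drive on each of the 8 storage nodes
>   mmcrfs scratch -F nsd.stanza -m 2 -M 2 -r 2 -R 2 -T /gpfs/scratch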
> 
> Stepping back; what is the actual problem you're trying to solve?  I have
> certainly been in that situation before, where the problem is more like: "I
> have a fixed hardware configuration that I can't change, and I want to try
> to shoehorn a parallel filesystem onto that."
> 
> I would recommend looking closer at your actual workloads.  If this is a
> "scratch" filesystem and file access is mostly from one node at a time,
> it's not very useful to make two additional copies of that data on other
> nodes, and it will only slow you down.
> 
> Regards,
> Alex
> 
> On Tue, Mar 13, 2018 at 7:16 AM, Lukas Hejtmanek <xhejtman at ics.muni.cz>
> wrote:
> 
> > On Tue, Mar 13, 2018 at 10:37:43AM +0000, John Hearns wrote:
> > > Lukas,
> > > It looks like you are proposing a setup which uses your compute servers
> > as storage servers also?
> >
> > Yes, exactly. I would like to utilise the NVMe SSDs that are in every compute
> > server. Using them as a shared scratch area with GPFS is one of the
> > options.
> >
> > >
> > >   *   I'm thinking about the following setup:
> > > ~ 60 nodes, each with two enterprise NVMe SSDs, FDR IB interconnected
> > >
> > > There is nothing wrong with this concept, for instance see
> > > https://www.beegfs.io/wiki/BeeOND
> > >
> > > I have an NVMe filesystem which uses 60 drives, but there are 10 servers.
> > > You should look at "failure zones" also.
> >
> > So you still need dedicated storage servers, and the local SSDs are used
> > only for caching, do I understand correctly?
> >
> > >
> > > From: gpfsug-discuss-bounces at spectrumscale.org On Behalf Of Knister,
> > > Aaron S. (GSFC-606.2) [COMPUTER SCIENCE CORP]
> > > Sent: Monday, March 12, 2018 4:14 PM
> > > To: gpfsug main discussion list <gpfsug-discuss at spectrumscale.org>
> > > Subject: Re: [gpfsug-discuss] Preferred NSD
> > >
> > > Hi Lukas,
> > >
> > > Check out FPO mode. That mimics Hadoop's data placement features. You
> > can have up to 3 replicas of both data and metadata, but the downside,
> > as you say, is that the wrong node failures will take your cluster down.
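> >
> > A rough stanza-level sketch of the FPO idea (untested, every name below is
> > made up; writeAffinityDepth=1 is the bit that keeps the first data replica
> > on the node that writes it, i.e. the 'local NSD preferred' behaviour):
> >
> >   %pool:
> >     pool=fpodata
> >     blockSize=1M
> >     layoutMap=cluster
> >     allowWriteAffinity=yes
> >     writeAffinityDepth=1
> >     blockGroupFactor=128
> >   %nsd:
> >     device=/dev/nvme0n1
> >     nsd=node01_nvme0
> >     servers=node01
> >     usage=dataOnly
> >     failureGroup=1
> >     pool=fpodata
> >   # ... one %nsd stanza per NVMe drive; metadata NSDs stay in the system pool
> >   mmcrfs scratch -F fpo.stanza -m 3 -M 3 -r 3 -R 3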
> > >
> > > You might want to check out something like Excelero's NVMesh (note: not
> > an endorsement since I can't give such things) which can create logical
> > volumes across all your NVMe drives. The product has erasure coding on
> > its roadmap. I'm not sure if they've released that feature yet, but in
> > theory it will give better fault tolerance *and* you'll get more efficient
> > usage of your SSDs.
> > >
> > > I'm sure there are other ways to skin this cat too.
> > >
> > > -Aaron
> > >
> > >
> > >
> > > On March 12, 2018 at 10:59:35 EDT, Lukas Hejtmanek <xhejtman at ics.muni.cz> wrote:
> > > Hello,
> > >
> > > I'm thinking about the following setup:
> > > ~ 60 nodes, each with two enterprise NVMe SSDs, FDR IB interconnected
> > >
> > > I would like to set up a shared scratch area using GPFS and those NVMe
> > > SSDs, with each SSD as one NSD.
> > >
> > > I don't think 5 or more data/metadata replicas are practical here. On the
> > > other hand, multiple node failures are definitely to be expected.
> > >
> > > Is there a way to arrange that the local NSD is strongly preferred for
> > > storing data? I.e., so that a node failure most probably does not result in
> > > data being unavailable to the other nodes?
> > >
> > > Or is there any other recommendation/solution for building shared scratch
> > > with GPFS in such a setup? ('Do not do it' is an acceptable answer, too.)
> > >
> > > --
> > > Lukáš Hejtmánek
> >
> > --
> > Lukáš Hejtmánek
> >
> > Linux Administrator only because
> >   Full Time Multitasking Ninja
> >   is not an official job title
> >

> _______________________________________________
> gpfsug-discuss mailing list
> gpfsug-discuss at spectrumscale.org
> http://gpfsug.org/mailman/listinfo/gpfsug-discuss


-- 
Lukáš Hejtmánek

Linux Administrator only because
  Full Time Multitasking Ninja 
  is not an official job title


