[gpfsug-discuss] Preferred NSD

Alex Chekholko alex at calicolabs.com
Tue Mar 13 17:48:21 GMT 2018


Hi Lukas,

I would like to discourage you from building a large distributed clustered
filesystem made of many unreliable components.  You will need to
overprovision your interconnect and will also spend a lot of time in a
"healing" or "degraded" state.

It is typically cheaper to centralize the storage into a subset of nodes
and configure those to be more highly available.  E.g., of your 60 nodes,
take 8, put all the storage into those, and make that a dedicated GPFS
cluster with no compute jobs on those nodes.  Again, you'll still need a
really beefy and reliable interconnect to make this work.
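
To make the dedicated-server variant concrete, here is a minimal sketch of how
it might be described to GPFS; all server names, device paths, NSD names and
the filesystem name are invented for illustration:

    # nsd.stanza -- one %nsd stanza per NVMe device in each of the 8 storage
    # nodes, with each server in its own failure group so that replicas land
    # on different servers
    %nsd: device=/dev/nvme0n1 nsd=nsd_srv01_0 servers=srv01 usage=dataAndMetadata failureGroup=1 pool=system
    %nsd: device=/dev/nvme1n1 nsd=nsd_srv01_1 servers=srv01 usage=dataAndMetadata failureGroup=1 pool=system
    %nsd: device=/dev/nvme0n1 nsd=nsd_srv02_0 servers=srv02 usage=dataAndMetadata failureGroup=2 pool=system
    # ...and so on for the remaining servers and devices

    mmcrnsd -F nsd.stanza
    # two copies of data and metadata across failure groups, so that a single
    # storage server can fail without the filesystem going down
    mmcrfs scratchfs -F nsd.stanza -m 2 -M 2 -r 2 -R 2 -A yes -T /gpfs/scratch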

Stepping back: what is the actual problem you're trying to solve?  I have
certainly been in that situation before, where the problem is more like: "I
have a fixed hardware configuration that I can't change, and I want to try
to shoehorn a parallel filesystem onto that."

I would recommend looking closer at your actual workloads.  If this is a
"scratch" filesystem and file access is mostly from one node at a time,
it's not very useful to make two additional copies of that data on other
nodes, and it will only slow you down.
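
If you do end up wanting a scratch filesystem with only a single copy of the
data, that is just a replication setting; as a rough sketch (filesystem name
invented, and worth checking the man pages for your release):

    # check the current default data (-r) and metadata (-m) replication
    mmlsfs scratchfs -r -m
    # drop the default data replication to a single copy for scratch-style data
    mmchfs scratchfs -r 1
    # rewrite existing files so they match the new replication settings
    mmrestripefs scratchfs -R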

Regards,
Alex

On Tue, Mar 13, 2018 at 7:16 AM, Lukas Hejtmanek <xhejtman at ics.muni.cz>
wrote:

> On Tue, Mar 13, 2018 at 10:37:43AM +0000, John Hearns wrote:
> > Lukas,
> > It looks like you are proposing a setup which uses your compute servers
> as storage servers also?
>
> yes, exactly. I would like to utilise the NVMe SSDs that are in every compute
> server. Using them as a shared scratch area with GPFS is one of the
> options.
>
> >
> >   *   I'm thinking about the following setup:
> > ~ 60 nodes, each with two enterprise NVMe SSDs, FDR IB interconnected
> >
> > There is nothing wrong with this concept, for instance see
> > https://www.beegfs.io/wiki/BeeOND
> >
> > I have an NVMe filesystem which uses 60 drives, but there are 10 servers.
> > You should look at "failure zones" also.
>
> so you still need dedicated storage servers, and the local SSDs are used only
> for caching, do I understand correctly?
>
> >
> > From: gpfsug-discuss-bounces at spectrumscale.org On Behalf Of Knister,
> > Aaron S. (GSFC-606.2) [COMPUTER SCIENCE CORP]
> > Sent: Monday, March 12, 2018 4:14 PM
> > To: gpfsug main discussion list <gpfsug-discuss at spectrumscale.org>
> > Subject: Re: [gpfsug-discuss] Preferred NSD
> >
> > Hi Lukas,
> >
> > Check out FPO mode. That mimics Hadoop's data placement features. You can
> > have up to 3 replicas of both data and metadata, but the downside, as you
> > say, is that the wrong combination of node failures will take your cluster
> > down.
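
(For reference, the placement behaviour FPO mode provides is configured per
storage pool via stanza options; a very rough sketch, with all pool, NSD,
server and failure-group values invented:

    # storage pool with FPO-style write affinity enabled
    %pool: pool=fpodata blockGroupFactor=128 layoutMap=cluster allowWriteAffinity=yes writeAffinityDepth=1
    # FPO NSDs typically use topology-vector failure groups (rack,position,node)
    %nsd: device=/dev/nvme0n1 nsd=nsd_node01_0 servers=node01 usage=dataOnly failureGroup=1,0,1 pool=fpodata
    %nsd: device=/dev/nvme1n1 nsd=nsd_node01_1 servers=node01 usage=dataOnly failureGroup=1,0,1 pool=fpodata

writeAffinityDepth=1 is the setting that asks GPFS to keep the first copy of
newly written data on the writing node.)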
> >
> > You might want to check out something like Excelero's NVMesh (note: not an
> > endorsement, since I can't give such things), which can create logical
> > volumes across all your NVMe drives. The product has erasure coding on its
> > roadmap. I'm not sure if they've released that feature yet, but in theory it
> > will give better fault tolerance *and* more efficient usage of your SSDs.
> >
> > I'm sure there are other ways to skin this cat too.
> >
> > -Aaron
> >
> >
> >
> > On March 12, 2018 at 10:59:35 EDT, Lukas Hejtmanek <xhejtman at ics.muni.cz> wrote:
> > Hello,
> >
> > I'm thinking about the following setup:
> > ~ 60 nodes, each with two enterprise NVMe SSDs, FDR IB interconnected
> >
> > I would like to set up a shared scratch area using GPFS and those NVMe
> > SSDs, each SSD as one NSD.
> >
> > I don't think that 5 or more data/metadata replicas are practical here. On
> > the other hand, multiple node failures are really to be expected.
> >
> > Is there a way to arrange that the local NSD is strongly preferred for
> > storing data? I.e., so that a node failure most probably does not result in
> > data being unavailable to the other nodes?
> >
> > Or is there any other recommendation/solution to build a shared scratch
> > with GPFS in such a setup? ("Do not do it" included.)
> >
> > --
> > Lukáš Hejtmánek
>
>
> --
> Lukáš Hejtmánek
>
> Linux Administrator only because
>   Full Time Multitasking Ninja
>   is not an official job title
> _______________________________________________
> gpfsug-discuss mailing list
> gpfsug-discuss at spectrumscale.org
> http://gpfsug.org/mailman/listinfo/gpfsug-discuss
>