[gpfsug-discuss] Preferred NSD

Simon Thompson (IT Research Support) S.J.Thompson at bham.ac.uk
Wed Mar 14 10:24:39 GMT 2018


I would look at using LROC and possibly using HAWC ...

Note that you need to be a bit careful with HAWC on the client side and with failure group placement.
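
Very roughly, and only as a sketch (device, node and filesystem names below are
placeholders, so check the LROC/HAWC sections of the docs for your Scale release):

    # LROC: define a local NVMe partition as a localCache NSD on the client.
    # It is not added to any filesystem; GPFS uses it as a local read cache.
    # lroc.stanza (node01 / nvme0n1p1 are placeholder names):
    %nsd:
      nsd=lroc_node01
      device=/dev/nvme0n1p1
      servers=node01
      usage=localCache

    mmcrnsd -F lroc.stanza
    mmchconfig lrocData=yes -N node01

    # HAWC: let small synchronous writes be hardened in the recovery log first;
    # the logs themselves need to sit on fast storage (e.g. a system.log pool).
    mmchfs scratchfs --write-cache-threshold 64K

The failure group caveat is about where those recovery log copies end up if the
fast storage is client side.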

Simon

On 14/03/2018, 09:28, "gpfsug-discuss-bounces at spectrumscale.org on behalf of xhejtman at ics.muni.cz" wrote:

    Hello,
    
    thank you for the insight. Well, the point is that I will get ~60 nodes with 120 NVMe
    disks in them, each about 2TB in size. That means I will have 240TB of NVMe SSD
    that could make a nice shared scratch. Moreover, I have no other HW or place
    to put these SSDs into. They have to be in the compute nodes.
    
    On Tue, Mar 13, 2018 at 10:48:21AM -0700, Alex Chekholko wrote:
    > I would like to discourage you from building a large distributed clustered
    > filesystem made of many unreliable components.  You will need to
    > overprovision your interconnect and will also spend a lot of time in
    > "healing" or "degraded" state.
    > 
    > It is typically cheaper to centralize the storage into a subset of nodes
    > and configure those to be more highly available.  E.g. of your 60 nodes,
    > take 8 and put all the storage into those and make that a dedicated GPFS
    > cluster with no compute jobs on those nodes.  Again, you'll still need
    > really beefy and reliable interconnect to make this work.
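    >
    > As a rough sketch only (server and device names are placeholders): give each of
    > those storage servers its own failure group in the NSD stanzas and create the
    > filesystem with two copies of data and metadata, so losing one server does not
    > lose data:
    >
    >   # nsd.stanza (one line per disk; srv01/srv02 are placeholders)
    >   %nsd: nsd=nsd_srv01_1 device=/dev/nvme0n1 servers=srv01 usage=dataAndMetadata failureGroup=1
    >   %nsd: nsd=nsd_srv02_1 device=/dev/nvme0n1 servers=srv02 usage=dataAndMetadata failureGroup=2
    >
    >   mmcrnsd -F nsd.stanza
    >   mmcrfs scratchfs -F nsd.stanza -m 2 -r 2 -M 2 -R 2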
    > 
    > Stepping back; what is the actual problem you're trying to solve?  I have
    > certainly been in that situation before, where the problem is more like: "I
    > have a fixed hardware configuration that I can't change, and I want to try
    > to shoehorn a parallel filesystem onto that."
    > 
    > I would recommend looking closer at your actual workloads.  If this is a
    > "scratch" filesystem and file access is mostly from one node at a time,
    > it's not very useful to make two additional copies of that data on other
    > nodes, and it will only slow you down.
    > 
    > Regards,
    > Alex
    > 
    > On Tue, Mar 13, 2018 at 7:16 AM, Lukas Hejtmanek <xhejtman at ics.muni.cz>
    > wrote:
    > 
    > > On Tue, Mar 13, 2018 at 10:37:43AM +0000, John Hearns wrote:
    > > > Lukas,
    > > > It looks like you are proposing a setup which uses your compute servers
    > > > as storage servers also?
    > >
    > > yes, exactly. I would like to utilise the NVMe SSDs that are in every compute
    > > server. Using them as a shared scratch area with GPFS is one of the
    > > options.
    > >
    > > >
    > > >   *   I'm thinking about the following setup:
    > > > ~ 60 nodes, each with two enterprise NVMe SSDs, FDR IB interconnected
    > > >
    > > > There is nothing wrong with this concept, for instance see
    > > > https://www.beegfs.io/wiki/BeeOND
    > > >
    > > > I have an NVMe filesystem which uses 60 drives, but there are 10 servers.
    > > > You should look at "failure zones" also.
    > >
    > > so you still need dedicated storage servers, and the local SSDs are used only
    > > for caching, do I understand correctly?
    > >
    > > >
    > > > From: gpfsug-discuss-bounces at spectrumscale.org [mailto:gpfsug-discuss-bounces at spectrumscale.org] On Behalf Of Knister, Aaron S. (GSFC-606.2) [COMPUTER SCIENCE CORP]
    > > > Sent: Monday, March 12, 2018 4:14 PM
    > > > To: gpfsug main discussion list <gpfsug-discuss at spectrumscale.org>
    > > > Subject: Re: [gpfsug-discuss] Preferred NSD
    > > >
    > > > Hi Lukas,
    > > >
    > > > Check out FPO mode. That mimics Hadoop's data placement features. You
    > > > can have up to 3 replicas of both data and metadata, but the downside,
    > > > as you say, is that the wrong node failures will take your cluster down.
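    > > >
    > > > Very roughly (a sketch only, pool and node names are placeholders), FPO is
    > > > switched on per storage pool in the stanza file you feed to mmcrfs, and the
    > > > write affinity is what puts the first copy on the node doing the writing:
    > > >
    > > >   # %pool enables write affinity; each node's NVMe NSD goes into that pool
    > > >   %pool: pool=fpodata blockSize=2M layoutMap=cluster allowWriteAffinity=yes writeAffinityDepth=1 blockGroupFactor=128
    > > >   %nsd: nsd=nsd_node01_1 device=/dev/nvme0n1 servers=node01 usage=dataOnly failureGroup=1 pool=fpodata
    > > >
    > > > Combined with readReplicaPolicy=local, reads prefer the local copy as well.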
    > > >
    > > > You might want to check out something like Excelero's NVMesh (note: not
    > > > an endorsement since I can't give such things), which can create logical
    > > > volumes across all your NVMe drives. The product has erasure coding on
    > > > its roadmap. I'm not sure if they've released that feature yet, but in
    > > > theory it will give better fault tolerance *and* more efficient
    > > > usage of your SSDs.
    > > >
    > > > I'm sure there are other ways to skin this cat too.
    > > >
    > > > -Aaron
    > > >
    > > >
    > > >
    > > > On March 12, 2018 at 10:59:35 EDT, Lukas Hejtmanek <xhejtman at ics.muni.cz> wrote:
    > > > Hello,
    > > >
    > > > I'm thinking about the following setup:
    > > > ~ 60 nodes, each with two enterprise NVMe SSDs, FDR IB interconnected
    > > >
    > > > I would like to set up a shared scratch area using GPFS and those NVMe
    > > > SSDs, each SSD as one NSD.
    > > >
    > > > I don't think 5 or more data/metadata replicas are practical here. On the
    > > > other hand, multiple node failures are something to be expected.
    > > >
    > > > Is there a way to arrange that the local NSD is strongly preferred for storing
    > > > data, i.e. so that a node failure most probably does not result in data being
    > > > unavailable to the other nodes?
    > > >
    > > > Or is there any other recommendation/solution for building shared scratch with
    > > > GPFS in such a setup? ("Do not do it" included.)
    > > >
    > > > --
    > > > Lukáš Hejtmánek
    > >
    > >
    > >
    > > --
    > > Lukáš Hejtmánek
    > >
    > > Linux Administrator only because
    > >   Full Time Multitasking Ninja
    > >   is not an official job title
    > >
    
    
    
    -- 
    Lukáš Hejtmánek
    
    Linux Administrator only because
      Full Time Multitasking Ninja 
      is not an official job title
    _______________________________________________
    gpfsug-discuss mailing list
    gpfsug-discuss at spectrumscale.org
    http://gpfsug.org/mailman/listinfo/gpfsug-discuss
    


