[gpfsug-discuss] Strategies - servers with local SAS disks

Thu Dec 1 18:22:36 GMT 2016

Hi Bob,

If you mean #4 with 2x data replication, then I would be very wary, as the
chance of data loss would be very high given local disk failure rates.  So
I think it's really #4 with 3x replication vs #3 with 2x replication (and
RAID5/6 in node) (with maybe 3x for metadata).  The space overhead is
somewhat similar, but the rebuild times should be much faster for #3, given
that a failed disk will not place any load on the storage network (and
less data will be placed on the network overall).
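
As a rough illustration of the data-loss concern with 2x replication and no
RAID, the sketch below estimates the chance that a second disk in another
failure group fails while the first failure is still being restriped. The
12 x 70 disk layout comes from the thread; the 2% annual failure rate and
the 24-hour restripe window are assumptions chosen only for illustration,
and the model assumes independent failures at a constant rate.

```python
import math

def p_second_failure(other_disks, afr, window_hours):
    """Probability that at least one of `other_disks` fails within the window,
    assuming independent failures at a constant annual failure rate (AFR)."""
    hours_per_year = 24 * 365
    expected_failures = other_disks * afr * window_hours / hours_per_year
    return 1.0 - math.exp(-expected_failures)

# Layout from the thread: 12 servers x 70 disks, one failure group per server.
servers, disks_per_server = 12, 70
# Disks outside the failed disk's server, i.e. where the surviving replica lives.
other_disks = (servers - 1) * disks_per_server

# ASSUMPTIONS (not from the thread): 2% AFR and a 24-hour restripe window.
p = p_second_failure(other_disks, afr=0.02, window_hours=24)
print(f"P(second failure during one restripe window) ~ {p:.1%}")
```

With blocks striped widely, a second failure in that window would likely
hit some block whose only remaining copy was on the second disk; a third
replica means a third overlapping failure is needed before data is lost.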

Dean

From:	"Oesterlin, Robert" <Robert.Oesterlin at nuance.com>
To:	gpfsug main discussion list <gpfsug-discuss at spectrumscale.org>
Date:	12/01/2016 04:48 AM
Subject:	Re: [gpfsug-discuss] Strategies - servers with local SAS disks
Sent by:	gpfsug-discuss-bounces at spectrumscale.org

Some interesting discussion here. Perhaps I should have been a bit clearer
on what I’m looking at here:

I have 12 servers with 70*4TB drives each – so the hardware is free. What’s
the best strategy for using these as GPFS NSD servers, given that I don’t
want to rely on any “bleeding edge” technologies?

1) My first choice would be GNR on commodity hardware – if IBM would give
that to us. :-)
2) Use standard RAID groups with no replication – downside is loss of data
availability if you lose an NSD, plus RAID group rebuild time with large
disks
3) RAID groups with replication – but I lose a LOT of space (20% for RAID +
50% of what’s left for replication)
4) No RAID groups, single NSD per disk, single failure group per server,
replication. Downside here is I need to restripe every time a disk fails to
get the filesystem back to a good state. Might be OK using QoS to get the
IO impact down
5) FPO doesn’t seem to buy me anything, as these are straight NSD servers
and no computation is going on on these servers, and I still must live with
the re-stripe.

Option (4) seems the best of the “no great options” I have in front of me.
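
For a sense of scale, the space cost of options 2 to 4 can be worked out
from the figures above (12 servers, 70 x 4TB drives, 20% RAID overhead).
This is a minimal sketch: `usable_tb` is a hypothetical helper, the 3x row
is added only for comparison, and metadata, spare space and TB/TiB
differences are ignored.

```python
def usable_tb(raw_tb, raid_overhead=0.0, replicas=1):
    """Usable capacity after RAID parity overhead and GPFS replication."""
    return raw_tb * (1.0 - raid_overhead) / replicas

raw_tb = 12 * 70 * 4  # 12 servers x 70 drives x 4 TB, from the thread

options = {
    "2) RAID, no replication": usable_tb(raw_tb, raid_overhead=0.20),
    "3) RAID + 2x replication": usable_tb(raw_tb, raid_overhead=0.20, replicas=2),
    "4) no RAID, 2x replication": usable_tb(raw_tb, replicas=2),
    "4) no RAID, 3x replication": usable_tb(raw_tb, replicas=3),
}
for name, tb in options.items():
    print(f"{name:28s} {tb:7.0f} TB usable ({tb / raw_tb:.0%} of raw)")
```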

Bob Oesterlin
Sr Principal Storage Engineer, Nuance

From: <gpfsug-discuss-bounces at spectrumscale.org> on behalf of Zachary Giles
<zgiles at gmail.com>
Reply-To: gpfsug main discussion list <gpfsug-discuss at spectrumscale.org>
Date: Wednesday, November 30, 2016 at 10:27 PM
To: gpfsug main discussion list <gpfsug-discuss at spectrumscale.org>
Subject: [EXTERNAL] Re: [gpfsug-discuss] Strategies - servers with local
SAS disks

Aaron, Thanks for jumping onboard. It's nice to see others confirming this.
Sometimes I feel alone on this topic.

It should also be possible to use ZFS with ZVOLs presented as block
devices for a backing store for NSDs. I'm not claiming it's stable, nor a
good idea, nor performant, but it should be possible. :) There are various
reports about it. It might at least be worth looking into, compared to Linux
"md raid", if one truly needs an all-software solution that already exists.
Something to think about and test.

On Wed, Nov 30, 2016 at 11:15 PM, Aaron Knister <aaron.s.knister at nasa.gov>
wrote:
 Thanks Zach, I was about to echo similar sentiments and you saved me a ton
 of typing :)

 Bob, I know this doesn't help you today since I'm pretty sure it's not yet
 available, but if one scours the interwebs one can find mention of
 something called Mestor.

 There's very very limited information here:

 - https://indico.cern.ch/event/531810/contributions/2306222/attachments/1357265/2053960/Spectrum_Scale-HEPIX_V1a.pdf

 - https://www.yumpu.com/en/document/view/5544551/ibm-system-x-gpfs-storage-server-stfc
   (slide 20)

 Sounds like if it were available it would fit this use case very well.

 I also had preliminary success with using sheepdog (
 https://sheepdog.github.io/sheepdog/) as a backing store for GPFS in a
 similar situation. At a very high conceptual level, it's perhaps similar
 to Mestor. You erasure code your data across the nodes w/ the SAS disks
 and then present those block devices to your NSD servers. I proved it
 could work but never tried to do much with it because the requirements
 changed.

 My money would be on your first option-- creating local RAIDs and then
 replicating to give you availability in the event a node goes offline.

 -Aaron

 On 11/30/16 10:59 PM, Zachary Giles wrote:
  Just remember that replication protects data availability, not
  integrity. GPFS still requires the underlying block device to return
  good data.

  If you're using it on plain disks (SAS or SSD), and the drive returns
  corrupt data, GPFS won't know any better and will just deliver it to the
  client. Further, if you do a partial read followed by a write, both
  replicas could be destroyed. There's also no efficient way to force use
  of a second replica if you realize the first is bad, short of taking the
  first entirely offline. In that case, while restriping off the faulty
  drive, there's no good way to prevent read-rewrite of other corrupt data
  on the drive that holds the "good copy".
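
The availability-versus-integrity distinction can be shown with a toy
model. This is only a sketch of the general idea of keeping a checksum with
each block; it is not how GPFS stores or reads replicas, and `Replica`,
`read_unchecked` and `read_verified` are hypothetical names made up for the
example.

```python
import zlib

class Replica:
    """A toy block store that keeps a CRC32 alongside each block."""
    def __init__(self):
        self.blocks = {}  # block id -> (payload bytes, crc32 recorded at write time)

    def write(self, block_id, payload):
        self.blocks[block_id] = (payload, zlib.crc32(payload))

    def read(self, block_id):
        return self.blocks[block_id]

def read_unchecked(replicas, block_id):
    """Return whatever the first replica hands back (no integrity check)."""
    payload, _ = replicas[0].read(block_id)
    return payload

def read_verified(replicas, block_id):
    """Return the first replica copy whose payload still matches its checksum."""
    for replica in replicas:
        payload, crc = replica.read(block_id)
        if zlib.crc32(payload) == crc:
            return payload
    raise IOError(f"no intact copy of block {block_id}")

a, b = Replica(), Replica()
for r in (a, b):
    r.write(7, b"good data")

# Simulate silent corruption on the first replica: payload changes, stored CRC does not.
_, crc = a.blocks[7]
a.blocks[7] = (b"g00d data", crc)

print(read_unchecked([a, b], 7))  # the corrupt payload is returned as-is
print(read_verified([a, b], 7))   # falls back to the intact second copy
```

Without something playing the role of the checksum, having a second copy
does not help, because nothing tells the reader that the first copy is bad.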

  Ideally, RAID would only return data that passed the RAID algorithm,
  so it shouldn't be corrupt, or it would be made good by recreating it from
  parity. However, as we all know, RAID controllers are definitely prone to
  failures as well, for many reasons, but at least a drive can go bad in
  various ways (bad sectors, slow, just dead, poor SSD cell wear, etc.)
  without (hopefully) silent corruption.

  Just something to think about while considering replication.

  On Wed, Nov 30, 2016 at 11:28 AM, Uwe Falke <UWEFALKE at de.ibm.com> wrote:

      I once set up a small system with just a few SSDs in two NSD servers,
      providing a scratch file system in a computing cluster.
      No RAID, two replicas.
      It works, as long as the admins do not do silly things (like rebooting
      servers in sequence without checking for disks being up in between).
      Going for RAIDs without GPFS replication protects you against single
      disk failures, but you're lost if just one of your NSD servers goes off.
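
The "check that disks are up in between" step can be sketched as a gate in
a rolling-reboot loop. Everything here is hypothetical scaffolding: `reboot`
and `get_disk_availability` stand in for whatever site tooling is used (the
availability map might, for instance, be built from `mmlsdisk` output), and
the server and disk names are invented.

```python
def disks_not_up(disk_availability):
    """Return the names of disks whose availability is anything other than 'up'."""
    return sorted(name for name, state in disk_availability.items() if state != "up")

def rolling_reboot(servers, reboot, get_disk_availability):
    """Reboot servers one at a time, stopping if any disk is not 'up' beforehand.

    `reboot` and `get_disk_availability` are hypothetical callables supplied by
    the caller. Returns the list of servers actually rebooted.
    """
    done = []
    for server in servers:
        down = disks_not_up(get_disk_availability())
        if down:
            print(f"stopping before {server}: disks not up: {', '.join(down)}")
            break
        reboot(server)
        done.append(server)
    return done

# Toy run: nsd01's disk stays down after its reboot, so nsd02 is never touched.
state = {"nsd01_d1": "up", "nsd02_d1": "up"}

def fake_reboot(server):
    state[f"{server}_d1"] = "down"  # simulate a disk that did not come back

rebooted = rolling_reboot(["nsd01", "nsd02"], fake_reboot, lambda: dict(state))
print(rebooted)
```

With only two replicas, rebooting the second server while the first
server's disks are still down would take both copies offline at once, which
is the situation the gate is meant to prevent.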

      FPO only makes sense, IMHO, if your NSD servers are also processing
      the data (and then you need to control that somehow).

      Other ideas? What else can you do with GPFS and local disks other than
      what you considered? I suppose nothing reasonable ...

      Mit freundlichen Grüßen / Kind regards

      Dr. Uwe Falke

      IT Specialist
      High Performance Computing Services / Integrated Technology Services /
      Data Center Services

      From:    "Oesterlin, Robert" <Robert.Oesterlin at nuance.com>
      To:      gpfsug main discussion list <gpfsug-discuss at spectrumscale.org>
      Date:    11/30/2016 03:34 PM
      Subject: [gpfsug-discuss] Strategies - servers with local SAS disks
      Sent by: gpfsug-discuss-bounces at spectrumscale.org

      Looking for feedback/strategies in setting up several GPFS servers with
      local SAS. They would all be part of the same file system. The systems
      are all similar in configuration - 70 4TB drives.

      Options I'm considering:

      - Create RAID arrays of the disks on each server (worried about the
        RAID rebuild time when a drive fails with 4, 6, 8TB drives)
      - No RAID with 2 replicas, single drive per NSD. When a drive fails,
        recreate the NSD - but then I need to fix up the data replication
        via restripe
      - FPO - with multiple failure groups - letting the system manage
        replica placement and then have GPFS do the restripe on disk
        failure automatically

      Comments or other ideas welcome.

      Bob Oesterlin
      Sr Principal Storage Engineer, Nuance
      507-269-0413

      _______________________________________________
      gpfsug-discuss mailing list
      gpfsug-discuss at spectrumscale.org
      http://gpfsug.org/mailman/listinfo/gpfsug-discuss

  --
  Zach Giles
  zgiles at gmail.com

 --
 Aaron Knister
 NASA Center for Climate Simulation (Code 606.2)
 Goddard Space Flight Center
 (301) 286-2776

--
Zach Giles
zgiles at gmail.com
