[gpfsug-discuss] Again! Using IBM Spectrum Scale could lead to data loss

Aaron Knister aaron.s.knister at nasa.gov
Thu Aug 24 13:56:49 BST 2017


Thanks Felipe. Everything you said makes sense and holds true in my
experience of how different workloads affect the likelihood of hitting
various problems (especially as one of only a handful of sites that hit
that 301 SGpanic error from several years back). Perhaps language as
subtle as "internal testing revealed" vs. "based on reports from
customer sites" could be used? Then again, you could discover something
in testing that a customer site subsequently experiences, which might
limit the usefulness of that wording. I still think it's useful to know
whether an issue has been exacerbated or triggered by in-the-wild
workloads, as opposed to what I imagine is quite rigorous lab testing
designed to shake out certain bugs.

-Aaron

On 8/23/17 12:40 AM, Felipe Knop wrote:
> Aaron,
>
> IBM's policy is to issue a flash when such a data corruption/loss
> problem has been identified, even if the problem has never been
> encountered by any customer. In fact, most of the flashes have been
> the result of internal test activity, even though the discovery took
> place after the affected versions/PTFs had already been released.
> This was the case for two of the recent flashes:
>
> http://www-01.ibm.com/support/docview.wss?uid=ssg1S1010293
>
> http://www-01.ibm.com/support/docview.wss?uid=ssg1S1010487
>
> The flashes normally do not indicate the risk level that a given
> problem has of being hit, since there are just too many variables at
> play, given that clusters and workloads vary significantly.
>
> The first issue above appears to be uncommon (and potentially rare).
>  The second issue seems to have a higher probability of occurring --
> and as described in the flash, the problem is triggered by failures
> being encountered while running one of the commands listed in the
> "Users Affected" section of the writeup.
>
> I don't think precise recommendations could be given on
>
>  if the bugs fall in the category of "drop everything and patch *now*"
> or "this is a theoretically nasty bug but we've yet to see it in the wild"
>
> since different clusters, configurations, or workloads may drastically
> affect the likelihood of hitting the problem. On the other hand, when
> coming up with the text for the flash, the team attempts to provide as
> much information as possible/available on the known triggers and
> mitigating circumstances.
>
>   Felipe
>
> ----
> Felipe Knop                                     knop at us.ibm.com
> GPFS Development and Security
> IBM Systems
> IBM Building 008
> 2455 South Rd, Poughkeepsie, NY 12601
> (845) 433-9314  T/L 293-9314
>
>
>
>
>
> From:        Aaron Knister <aaron.knister at gmail.com>
> To:        gpfsug main discussion list <gpfsug-discuss at spectrumscale.org>
> Date:        08/22/2017 10:37 AM
> Subject:        Re: [gpfsug-discuss] Again! Using IBM Spectrum Scale
> could lead to data loss
> Sent by:        gpfsug-discuss-bounces at spectrumscale.org
> ------------------------------------------------------------------------
>
>
>
> Hi Jochen,
>
> I share your concern about data loss bugs, and I too have found them
> troubling, especially since the 4.2 stream is in my immediate future
> (although I would rather have stayed on 4.1 due to my perception of
> stability/integrity issues in 4.2). By and large, 4.1 has been
> *extremely* stable for me.
>
> While not directly related to the stability concerns, I'm curious as
> to why your customer sites require downtime to do the upgrades. Of
> course, individual servers need to be taken offline to update GPFS,
> but the cluster as a whole should be able to stay up. Perhaps your
> customer environments just don't lend themselves to that.
>
> It occurs to me that some of these bugs sound serious (and indeed I
> believe this one is). I recently found myself jumping prematurely into
> an update for the metanode filesize corruption bug, which, as it turns
> out, while very scary sounding, is not necessarily a particularly
> common bug (if I understand correctly). Perhaps it would be helpful if
> IBM could clarify the believed risk of these bugs, or give us some
> indication of whether they fall in the category of "drop everything
> and patch *now*" or "this is a theoretically nasty bug but we've yet
> to see it in the wild". I could imagine IBM legal wanting to avoid a
> situation where IBM indicates something is low risk but someone hits
> it and it eats data, although many companies do this with security
> patches, so perhaps it's a non-issue.
>
> From my perspective, I don't think existing customers are being
> "forgotten". I think IBM is pushing hard to help Spectrum Scale adapt
> to an ever-changing world, and I think these features are necessary
> and useful. Perhaps Scale would benefit from more resources being
> dedicated to QA/testing, which isn't a particularly sexy thing -- it
> doesn't result in any new shiny features for customers (although "not
> eating your data" is a feature I find really attractive).
>
> Anyway, I hope IBM can find a way to minimize the frequency of these
> bugs. Personally speaking, I'm pretty convinced it's not for lack of
> capability or dedication on the part of the great folks actually
> writing the code.
>
> -Aaron
>
> On Tue, Aug 22, 2017 at 7:09 AM, Zeller, Jochen
> <Jochen.Zeller at sva.de> wrote:
> Dear community,
>  
> this morning I started in a good mood, until I checked my mailbox.
> Again, a reported bug in Spectrum Scale that could lead to data loss.
> During the last year I have been looking for a stable Scale version,
> and each time I thought, "Yes, this one is stable and without serious
> data loss bugs," a few days later IBM announced a new APAR with
> possible data loss for that version.
>  
> I support many clients in central Europe. They store databases,
> backup data, life science data, video data, and the results of
> technical computing on their file systems, run HPC workloads on them,
> and so on. Some of them had to change their Scale version nearly
> monthly during the last year to avoid running into one of the serious
> data loss bugs in Scale. From my perspective, it was and is a shame to
> have to inform clients about newly reported bugs right after their
> last update. From the clients' perspective, it is a lot of work and
> planning to arrange a new downtime for updates, and their internal
> customers are not satisfied with so many downtimes of the clusters
> and applications.
>  
> To me, it seems that Scale development is working on features for
> specific projects or clients, to meet special requirements, and has
> forgotten the existing clients who use Scale to store important data
> or run important workloads.
>  
> To make us more visible, I've used the IBM-recommended way to request
> mandatory enhancements, the less-favored RFE:
>  
> http://www.ibm.com/developerworks/rfe/execute?use_case=viewRfe&CR_ID=109334
>  
> If you like, vote for more reliability in Scale.
>  
> I hope this is a good way to show development and the responsible
> people that we are having trouble and are not satisfied with the
> quality of the releases.
>  
>  
> Regards,
>  
> Jochen
>
>
> _______________________________________________
> gpfsug-discuss mailing list
> gpfsug-discuss at spectrumscale.org
> http://gpfsug.org/mailman/listinfo/gpfsug-discuss


