Spectrum Scale Users Group Argonne National Lab – June 2016

MidWest Users Group Event

Many thanks to Argonne National Lab (ANL) for hosting our recent Spectrum Scale Users Group Event. It’s nice to have an event in the MidWest given the geographic spread of users in the US. Bob Oesterlin kicked things off with a social event Thursday night so some of us could share stories prior to the actual UG day. Friday morning was focused on IBM presentations and the majority of the afternoon went to user presentations.

To start, we received updates on both Spectrum Scale and ESS from Scott McFadden. Some notable priorities for 2016 include making sure that US customers have the opportunity and channel to give feedback to development in the early phase of the process and shape the result. Scott noted that there’s been great feedback in this respect from the UK users, so you heard it: let your voice be heard, US users! There should be some follow-up traffic on the mailing lists about that, so watch that space if you are interested. There will also be news about more open betas that are accessible in a downloadable VM.

Additionally, input from the PMR and field teams is being leveraged more effectively. IBM is big and there’s recognition that in the past PMR and field information may not have been getting back to the development team effectively.

Security is another focus, and an internal security audit is ongoing, as mentioned at SC. Easier configuration of key management with ISKLM is coming in 4.2.1.

During the Problem Determination session many comments were put on screen about how tricky it is to know if a GPFS cluster is healthy, and how tricky problem determination is when it’s not. To that end, an mmhealth command is being developed to report on all the key components in a cluster. This should answer the questions of what components to monitor, what command to use to do so and how to interpret the results. The tool takes into consideration all of the interdependencies to report a high-level status of healthy, degraded or unhealthy. mmhealth is being reviewed with user input as it is being developed.
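To make the idea of a high-level status concrete, here is a minimal Python sketch of a "worst state wins" roll-up. It is not how mmhealth is implemented: the component names, node names and the roll-up rule are all assumptions made for illustration, and real interdependencies between components would need more than a simple maximum.

```python
from enum import IntEnum


class State(IntEnum):
    # Ordered so that a larger value means a worse state.
    HEALTHY = 0
    DEGRADED = 1
    UNHEALTHY = 2


def rollup(states):
    """Return the worst state in a name -> State mapping (HEALTHY if empty)."""
    return max(states.values(), default=State.HEALTHY)


def cluster_health(nodes):
    """Roll component states up to each node, then node states up to the cluster."""
    per_node = {name: rollup(components) for name, components in nodes.items()}
    return rollup(per_node), per_node


# Hypothetical cluster: node and component names are invented for this example.
nodes = {
    "nsd01": {"network": State.HEALTHY, "disk": State.DEGRADED},
    "client07": {"network": State.HEALTHY, "filesystem": State.HEALTHY},
}
overall, per_node = cluster_health(nodes)
print(overall.name)
```

A roll-up like this answers "is the cluster healthy?" with a single value while keeping the per-node detail available for drilling down.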

For the GUI tools there were both screen shots and a live demo. The question was asked, “Raise your hand if you monitor waiters?” Lots of hands shot up. The follow-up question was “Keep your hand up if you fully understand them.” I think Sven was the only one who kept his hand up. The GUI tool is building up a long waiters component to categorize and document waiters.

Sven Oehme gave us an overview of some features coming with 4.2.1. On Scale 4.2 there are more than 700 parameters for tuning, many of them undocumented but still used in production, so in many cases customers have a lot to figure out. It’s difficult for IBM to come up with default settings when the range of hardware and networking capabilities for any given site varies wildly. Still, to make it easier on the customer some auto-tuning capabilities are being added. For example, there will be a new worker thread setting that will auto-tune about 20 other related settings. Care is being taken to make sure that those who want to retain manual settings can do so, and there will be information about this in the documentation. Long term there is a goal to let the admin describe the system and let that inform parameter choices automatically.

The user presentations were interesting and included campus Active Directory Integration, Using GPFS on ZFS, GPFS-HPSS-Integration (GHI) and using AFM as a Burst Buffer.

All presentations are available here: http://www.spectrumscale.org/presentations/

Q&A:

Here are some of the questions from the audience. There were many more; people weren’t shy about interjecting questions:

Q: While there is appreciation for quick development changes, there is a concern for the quality of the releases.
A: This is an area that’s being actively reviewed for improvement and better regression testing to make sure changes don’t negatively impact performance.

Q: How much is compression being used in the wild?
A: Not much production use, but people are interested for future implementation. Generally it takes a year for new features to be adopted in deployment.

Q: With mmbackup, how much data can you back up?
A: It depends largely upon your infrastructure and how much you can parallelize. Multiple TSM servers can be used for Spectrum Scale now, but a discussion of your architecture would be required to answer with a numeric value.

Q: In the monitoring tool, can detailed tracking be seen?
A: Yes, at the granularity of individual filesystem calls such as setattr, mkdir, vget, getxattr, etc.

Q: What is the retention of the data that is behind the monitoring tools?
A: It’s configurable. The default is something like 1s resolution for 24 hours, and then it starts getting aggregated and resolution is reduced.
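As a rough illustration of that kind of retention policy, the Python sketch below keeps raw samples for a recent window and averages older ones into coarser buckets. It is only a sketch under stated assumptions: the 60-second bucket width, the use of averaging and the function names are invented here, and the answer above does not say how the monitoring tool actually aggregates.

```python
from collections import defaultdict


def downsample(samples, bucket_seconds):
    """Average (timestamp, value) samples into fixed-width time buckets."""
    buckets = defaultdict(list)
    for ts, value in samples:
        # Align each timestamp to the start of its bucket.
        buckets[ts - ts % bucket_seconds].append(value)
    return [(start, sum(vals) / len(vals)) for start, vals in sorted(buckets.items())]


def apply_retention(samples, now, raw_window=24 * 3600, bucket_seconds=60):
    """Keep samples newer than raw_window as-is; aggregate everything older.

    Timestamps are integer seconds. The defaults are illustrative assumptions.
    """
    cutoff = now - raw_window
    old = [s for s in samples if s[0] < cutoff]
    recent = [s for s in samples if s[0] >= cutoff]
    return downsample(old, bucket_seconds) + recent


# Three old samples collapse into two one-minute averages; the recent one is kept.
samples = [(0, 1.0), (1, 3.0), (60, 5.0), (100000, 7.0)]
result = apply_retention(samples, now=100000)
print(result)
```

The trade-off is the usual one: older data costs less to store but can no longer show second-by-second spikes.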

The next in-person event in the US is still being planned. Stay tuned.