[gpfsug-discuss] RKM resilience questions testing and best practice

Alec anacreo at gmail.com
Fri Aug 18 10:09:38 BST 2023


Hmm... IBM mentions in the 5.1.2 documentation that, for performance, we could
rotate the order of the key servers to load-balance key retrieval. However,
because of server maintenance, I would imagine all the nodes end up on the
same server eventually.

But I think I see a solution. If I define four additional RKM stanzas, each
with a single key server, and don't do anything else with them, I am guessing
that GPFS will monitor them and complain if they go down. And that is easy to
test...


So, RKM.conf with something like:
RKM_PROD {
  kmipServerUri  = node1
  kmipServerUri2 = node2
  kmipServerUri3 = node3
  kmipServerUri4 = node4
}
RKM_PROD_T1 {
  kmipServerUri = node1
}
RKM_PROD_T2 {
  kmipServerUri = node2
}
RKM_PROD_T3 {
  kmipServerUri = node3
}
RKM_PROD_T4 {
  kmipServerUri = node4
}

I could then create four files, each encrypted with a key reachable only
through one of the test RKM_PROD_T? stanzas, and use them to monitor the
availability of the individual key servers.

Call it Alec's trust-but-verify HA.
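
A rough sketch of that check, meant for cron, assuming the four canary files
already exist and each is encrypted only with the key from its RKM_PROD_T?
stanza (the paths are hypothetical; encKeyCachePurge is the cache-purge
command Ed mentions further down the thread):

#!/bin/sh
# Canary check: one file per single-server test stanza. Purging the MEK
# cache forces each read to fetch its key from that stanza's key server.
CANARIES="/gpfs/prod/.rkmtest/t1 /gpfs/prod/.rkmtest/t2 \
          /gpfs/prod/.rkmtest/t3 /gpfs/prod/.rkmtest/t4"

/usr/lpp/mmfs/bin/tsctl encKeyCachePurge all

rc=0
for f in $CANARIES; do
  if ! cat "$f" > /dev/null 2>&1; then
    echo "RKM canary FAILED: $f (its key server may be down)"
    rc=1
  fi
done
exit $rc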

On Fri, Aug 18, 2023, 1:51 AM Alec <anacreo at gmail.com> wrote:

> Okay, so how do you know the backup key servers are actually functioning
> until you try to fail over to them?  We need a way to know they are
> actually working.
>
> Setting encryptionKeyCacheExpiration to 0 would actually help, in that we
> shouldn't go down once we are up.  But it would suck if we bounce and then
> find out none of the key servers are working; then we have the same
> disaster, just at a different time.
>
> Spectrum Scale honestly needs an option to probe the backup RKM servers and
> complain when they are unreachable.  Or, if we could run a command to
> validate that all keys are visible on all key servers, that would work as
> well.
>
> Alec
>
> On Fri, Aug 18, 2023, 12:22 AM Jan-Frode Myklebust <janfrode at tanso.net>
> wrote:
>
>> If a key server goes offline, Scale will just go to the next one in the
>> list -- and give a warning/error about it in mmhealth. Nothing should
>> happen to file system access. Also, you can tune how often Scale needs to
>> refresh the keys from the key server with encryptionKeyCacheExpiration.
>> Setting it to 0 means that your nodes will only need to fetch the key when
>> they mount the file system, or when you change policy.
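>>
>> For example, roughly (it's a cluster-wide configuration option; check the
>> mmchconfig documentation on your release for exactly how and when it takes
>> effect):
>>
>>   mmchconfig encryptionKeyCacheExpiration=0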
>>
>>
>>   -jf
>>
>> On Thu, Aug 17, 2023 at 5:54 PM Alec <anacreo at gmail.com> wrote:
>>
>>> Yesterday I proposed treating the replicated key servers as two different
>>> sets of servers: having Scale address two of the RKM servers by one
>>> rkmid/tenant/devicegrp/client name, and having a second
>>> rkmid/tenant/devicegrp/client name for the second set of servers.
>>>
>>> So define the same cluster of key management servers in two separate
>>> stanzas of RKM.conf, an upper and lower half.
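>>>
>>> Roughly like this in RKM.conf -- a sketch only; the server names are
>>> placeholders, and the keyStore/passphrase/clientCertLabel/tenantName
>>> values would come from whatever the existing single stanza already uses:
>>>
>>> RKM_PROD_A {
>>>   type = ISKLM
>>>   kmipServerUri  = tls://keysrv1:5696
>>>   kmipServerUri2 = tls://keysrv2:5696
>>>   keyStore = /var/mmfs/etc/RKMcerts/prod.p12
>>>   passphrase = xxxxxx
>>>   clientCertLabel = gpfs_prod
>>>   tenantName = PROD
>>> }
>>> RKM_PROD_B {
>>>   type = ISKLM
>>>   kmipServerUri  = tls://keysrv3:5696
>>>   kmipServerUri2 = tls://keysrv4:5696
>>>   keyStore = /var/mmfs/etc/RKMcerts/prod.p12
>>>   passphrase = xxxxxx
>>>   clientCertLabel = gpfs_prod
>>>   tenantName = PROD
>>> }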
>>>
>>> If we do that and the key management team takes one set offline,
>>> everything should keep working, but Scale would think one set of keys is
>>> offline and scream.
>>>
>>> I think we need an IBM ticket to help vet all that out.
>>>
>>> Alec
>>>
>>> On Thu, Aug 17, 2023, 8:11 AM Jan-Frode Myklebust <janfrode at tanso.net>
>>> wrote:
>>>
>>>>
>>>> Your second KMIP server doesn't need to have an active replication
>>>> relationship with the first one -- it just needs to contain the same MEK.
>>>> So you could do a one-time replication or copy between them, and they
>>>> would not have to see each other anymore.
>>>>
>>>> I don't think having them host different keys will work: you won't be
>>>> able to fetch the second key from the one server your client is
>>>> connected to, and will then be unable to encrypt with that key.
>>>>
>>>> From what I've seen of KMIP setups with Scale, it's a stupidly trivial
>>>> service. It's just a server that will tell you the key when asked, plus
>>>> some access control to make sure no one else gets it. Also, MEKs never
>>>> change unless you actively change them in the file system policy, and
>>>> then you could just post the new key to all/both of your independent key
>>>> servers when you make the change.
>>>>
>>>>
>>>>  -jf
>>>>
>>>> ons. 16. aug. 2023 kl. 23:25 skrev Alec <anacreo at gmail.com>:
>>>>
>>>>> Ed,
>>>>>   Thanks for the response; I wasn't aware of those two commands.  I
>>>>> will see if that unlocks a solution.  I kind of need the test to work
>>>>> in a production environment, so I can't just be adding spare nodes onto
>>>>> the cluster and fiddling with file systems.
>>>>>
>>>>> Unfortunately, the logs don't indicate when a node has returned to
>>>>> health, only that it's in trouble; and since we patch often, we see
>>>>> those messages regularly.
>>>>>
>>>>>
>>>>> For the second question: we would add a second MEK to each file so
>>>>> that two independent keys from two different RKM pools would each be
>>>>> able to unlock any file.  This would give us two wholly independent
>>>>> paths to encrypt and decrypt a file.
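>>>>>
>>>>> In encryption-policy terms, something along these lines -- the key IDs
>>>>> and RKM stanza names are placeholders, and the exact ALGO string should
>>>>> be taken from the encryption documentation for your release:
>>>>>
>>>>> RULE 'dualMEK' ENCRYPTION 'E1' IS
>>>>>     ALGO 'DEFAULTNISTSP800131A'
>>>>>     KEYS('KEY-prod-1:RKM_PROD', 'KEY-dr-1:RKM_DR')
>>>>> RULE 'encryptAll' SET ENCRYPTION 'E1' WHERE NAME LIKE '%'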
>>>>>
>>>>> So I'm looking for a best-practice example from IBM that recommends
>>>>> this, so we don't have a dependency on a single RKM environment.
>>>>>
>>>>> Alec
>>>>>
>>>>>
>>>>>
>>>>> On Wed, Aug 16, 2023, 2:02 PM Wahl, Edward <ewahl at osc.edu> wrote:
>>>>>
>>>>>> > How can we verify that a key server is up and running when there
>>>>>> are multiple key servers in an RKM pool serving a single key?
>>>>>>
>>>>>>
>>>>>>
>>>>>> Pretty simple.
>>>>>>
>>>>>> -Grab a compute node/client (and mark it offline if needed), then
>>>>>> unmount all encrypted file systems.
>>>>>>
>>>>>> -Hack the RKM.conf to point to JUST the server you want to test (and
>>>>>> maybe a backup)
>>>>>>
>>>>>> -Clear all keys:   ‘/usr/lpp/mmfs/bin/tsctl encKeyCachePurge all ‘
>>>>>>
>>>>>> -Reload the RKM.conf:  ‘/usr/lpp/mmfs/bin/tsloadikm run’   (this is a
>>>>>> great command if you need to load new Certificates too)
>>>>>>
>>>>>> -Attempt to mount the encrypted FS, and then cat a few files.
>>>>>>
>>>>>>
>>>>>>
>>>>>> If you've not set up a 2nd server in your test, you will see
>>>>>> quarantine messages in the logs for a bad KMIP server.  If it works,
>>>>>> you can clear the keys again and see how many were retrieved.
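>>>>>>
>>>>>> Condensed into commands -- the file system name "encfs" and the
>>>>>> RKM.conf location are just for illustration; adjust to your setup:
>>>>>>
>>>>>>   mmumount encfs                        # on the test node only
>>>>>>   vi /var/mmfs/etc/RKM.conf             # point at just the server under test
>>>>>>   /usr/lpp/mmfs/bin/tsctl encKeyCachePurge all
>>>>>>   /usr/lpp/mmfs/bin/tsloadikm run
>>>>>>   mmmount encfs && cat /encfs/somefile  # fails if that key server is down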
>>>>>>
>>>>>>
>>>>>>
>>>>>> >Is there any documentation or diagram officially from IBM that
>>>>>> recommends having 2 keys from independent RKM environments for high
>>>>>> availability as best practice that I could refer to?
>>>>>>
>>>>>>
>>>>>>
>>>>>> I am not an IBM-er... but I'm also not 100% sure what you are asking
>>>>>> here.  Two unrelated SKLM setups?  How would you sync the keys?  How
>>>>>> would this be better than multiple replicated servers?
>>>>>>
>>>>>>
>>>>>>
>>>>>> Ed Wahl
>>>>>>
>>>>>> Ohio Supercomputer Center
>>>>>>
>>>>>>
>>>>>>
>>>>>> From: gpfsug-discuss <gpfsug-discuss-bounces at gpfsug.org> On Behalf Of Alec
>>>>>> Sent: Wednesday, August 16, 2023 3:33 PM
>>>>>> To: gpfsug main discussion list <gpfsug-discuss at gpfsug.org>
>>>>>> Subject: [gpfsug-discuss] RKM resilience questions testing and best practice
>>>>>>
>>>>>>
>>>>>>
>>>>>> Hello, we are using a remote key server with GPFS, and I have two
>>>>>> questions:
>>>>>>
>>>>>>
>>>>>>
>>>>>> First question:
>>>>>>
>>>>>> How can we verify that a key server is up and running when there are
>>>>>> multiple key servers in an RKM pool serving a single key?
>>>>>>
>>>>>>
>>>>>>
>>>>>> The scenario is that, after maintenance or periodically, we want to
>>>>>> verify that all members of the pool are in service.
>>>>>>
>>>>>>
>>>>>>
>>>>>> Second question is:
>>>>>>
>>>>>> Is there any official IBM documentation or diagram that recommends
>>>>>> having two keys from independent RKM environments for high
>>>>>> availability as a best practice, which I could refer to?
>>>>>>
>>>>>>
>>>>>>
>>>>>> Alec
>>>>>>
>>>>>>
>>>>>>
>>>>>>