[gpfsug-discuss] sequential I/O write - performance

Indulis Bernsteins INDULISB at uk.ibm.com
Mon Feb 26 16:58:24 GMT 2024


> > Using iohist we found out that gpfs is overloading one dm-device (it
> > took about 500 ms to finish IOs). We replaced the "problematic"
> > dm-device (as we have enough drives to play with) for a new one but the
> > overloading issue just jumped to another dm-device.
> >
> > We believe that this behaviour is caused by the gpfs but we are
> > unable to locate the root cause of it.
>
> Hello,
> this behaviour could be caused by an asymmetry in the data paths of your
> storage; a relatively small imbalance can make the request queue of a
> slightly slower disk grow seemingly out of proportion.

This problem is a real "blast from the past" for me. I saw similar behaviour a LONG time ago, and I think it is very possible that you have a "preferred paths" issue between your NSD servers and your target drives. If you are using Scale talking to a Storage System which has multiple paths to the device, and multiple Scale NSD servers can see the same LUN (which is correct from an availability point of view), then in some cases you can get exactly this sort of behaviour.
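
A quick way to see this on each NSD server is to look at how the paths for one of the affected LUNs are grouped and which path group is actually active (just a sketch, assuming Linux dm-multipath under your dm-devices; "mpathX" is a placeholder for your device name):

  # On each NSD server, inspect path grouping for the suspect LUN
  multipath -ll mpathX
  # Compare the "status=active" vs "status=enabled" path groups and their
  # "prio=" values across servers: if different NSD servers show a different
  # group as active for the same LUN, they are pushing I/O through different
  # storage controllers.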

I am guessing you are running a "LAN free" architecture, with many servers doing direct I/O to the NSDs/LUNs rather than going Scale Client -> Scale NSD server -> NSD/LUN.

I'll bet you see low I/O rates and long latencies to the "problem" NSD/LUN/drive.

The 500 ms I/O delay can be caused by the target NSD/LUN being switched from being "owned" by one of the controllers in the storage system to the other.

I can't see how Scale can do anything to make a device take 500 ms to complete an I/O when tracked by IOHIST at the OS level. You are clearly not able to drive a lot of throughput to the devices, so it can't be that device overloading is causing a long queue on the device. There is something else happening: not at Scale, not at the device, but somewhere in whatever network or SAN sits between the Scale NSD server and the NSD device. Something is trying to do recovery.
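
To confirm where the time goes, it is worth comparing what Scale sees with what the OS sees for the same device (a sketch; mmdiag --iohist and iostat are the standard tools, the device name is just an example):

  # On the NSD server: Scale's view of recent I/O completion times (in ms)
  mmdiag --iohist
  # The OS view of the devices: request rates, waits and utilisation
  iostat -xm 5
  # Watch the row for the suspect dm device (e.g. dm-12): ~500 ms waits
  # combined with low request rates and low %util point at something below
  # the device, not at device overload.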

Say one of your Scale NSD servers sends an I/O to a target NSD and it goes to Storage controller A. Then another Scale NSD server sends an I/O to the same target NSD, and instead of going via a path that leads to Storage controller A it goes to Storage controller B. At that point the storage system says "Oh, it looks like future I/O will be coming into Storage controller B, let's switch the internal ownership to B. OK, we need to flush write caches and do some other things. That will take about 500 ms."

Then an I/O goes to Storage System controller A, and you get another switch back of the LUN from B to A. Another 500 ms.

The drive is being "ping-ponged" from one Storage System controller to the other, because I/Os for it are arriving randomly at one controller or the other.

You need to make sure that all NSD servers access each LUN using the same default path, i.e. through the same Storage System controller. There is a way to do this: choose a "preferred path" that is always used unless that path is down. Could it be that some of your servers can't use the "preferred path"?
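
On Linux dm-multipath this is normally done by grouping paths by ALUA priority, so every server prefers the paths that lead to the owning controller and fails back to them as soon as they recover. A minimal sketch of the relevant multipath.conf stanza follows; the vendor/product strings are placeholders and your storage vendor's recommended settings should take precedence:

  # /etc/multipath.conf (illustrative excerpt, not vendor-validated)
  devices {
      device {
          vendor                 "IBM"             # placeholder, match your array
          product                "2145"            # placeholder, match your array
          path_grouping_policy   "group_by_prio"   # one path group per controller
          prio                   "alua"            # rank the owning controller's paths highest
          path_selector          "round-robin 0"
          failback               "immediate"       # return to the preferred group when it comes back
          no_path_retry          "queue"
      }
  }

With group_by_prio plus ALUA, every NSD server drives I/O to the same (owning) controller, so the LUN ownership stops bouncing between controllers.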

This will probably only happen if you have something running on the actual Scale NSD servers that is accessing the filesystem; otherwise Scale Clients will always go across the Scale network to the current Primary NSD server to get to an NSD.
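
You can check which nodes are defined as NSD servers for each NSD, and which nodes also see the LUN as a local block device (and so may be doing direct I/O to it), with the standard Scale commands (exact output varies by release):

  # NSD name -> NSD server list
  mmlsnsd
  # NSD name -> local device name on each node that can see the disk
  mmlsnsd -M

Any node that shows a local device name for the NSD can, depending on configuration, do I/O straight down the SAN rather than via the NSD server, and so can take part in the "ping pong".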

Or there is some other problem causing a "ping pong" effect. But it sounds like a "ping pong" to me, especially because the problem moved elsewhere when you replaced the dm-device.

Regards,

Indulis Bernsteins
Storage Architect, IBM Worldwide Technical Sales


