[gpfsug-discuss] gpfsug-discuss Digest, Vol 81, Issue 43

Tushit Shukla tshukla at NYGENOME.ORG
Wed Nov 20 15:36:16 GMT 2019


Lohit,

Did you make any progress on the file system performance issue with large block size?

Apparently we are also hitting the same issue after we migrated to a 16M block-size file system (from a 4M one). The only thing that gives some relief is increasing the pagepool to 32G. We also tried playing with the prefetchAggressivenessRead parameter by reducing it to 1, but we are not sure it is doing anything.
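For reference, this kind of per-node tuning is typically done with mmchconfig; the node names and exact values below are illustrative, and on some releases a pagepool change only takes effect after the GPFS daemon is restarted on the node:

    # raise the pagepool on the affected client nodes (value illustrative);
    # -i applies the change immediately and persistently where the release
    # allows it, otherwise plan a daemon restart on those nodes
    mmchconfig pagepool=32G -i -N client01,client02

    # lower read prefetch aggressiveness on one node, non-persistently (-I)
    mmchconfig prefetchAggressivenessRead=1 -I -N client01

    # confirm what the daemon on that node is actually using
    mmdiag --config | grep -iE 'pagepool|prefetchaggressiveness'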


thanks

________________________________
From: gpfsug-discuss-bounces at spectrumscale.org <gpfsug-discuss-bounces at spectrumscale.org> on behalf of gpfsug-discuss-request at spectrumscale.org <gpfsug-discuss-request at spectrumscale.org>
Sent: Monday, October 22, 2018 12:18:58 PM
To: gpfsug-discuss at spectrumscale.org
Subject: gpfsug-discuss Digest, Vol 81, Issue 43

Send gpfsug-discuss mailing list submissions to
        gpfsug-discuss at spectrumscale.org

To subscribe or unsubscribe via the World Wide Web, visit
        https://urldefense.proofpoint.com/v2/url?u=http-3A__gpfsug.org_mailman_listinfo_gpfsug-2Ddiscuss&d=DwICAg&c=C9X8xNkG_lwP_-eFHTGejw&r=dOGWbX02X0lU3_xWy7sXmdsAk4pvqANuwZ0j-sV6OEo&m=ZkmRYnq8ro8C7ccXHnRvIVN4PzFuuQz-VKI8RiuM0ow&s=kDfJQ7W_JgLnvD_F6kwpIEFg_9j-Ain9f1uvMKFJD6s&e=
or, via email, send a message with subject or body 'help' to
        gpfsug-discuss-request at spectrumscale.org

You can reach the person managing the list at
        gpfsug-discuss-owner at spectrumscale.org

When replying, please edit your Subject line so it is more specific
than "Re: Contents of gpfsug-discuss digest..."


Today's Topics:

   1. Re: GPFS, Pagepool and Block size -> Performance reduces with
      larger block size (Sven Oehme)


----------------------------------------------------------------------

Message: 1
Date: Mon, 22 Oct 2018 09:18:43 -0700
From: Sven Oehme <oehmes at gmail.com>
To: gpfsug main discussion list <gpfsug-discuss at spectrumscale.org>
Subject: Re: [gpfsug-discuss] GPFS, Pagepool and Block size ->
        Performance reduces with larger block size
Message-ID:
        <CALssuR18E=vLq_4V9eCeqbnqCNnwLi9JafrKXW45K-yS30DWrA at mail.gmail.com>
Content-Type: text/plain; charset="utf-8"

oops, somehow that slipped my inbox, i only saw that reply right now.

it's really hard to tell from a trace snippet whether the lock is the issue, as
the lower-level locks don't show up in default traces. without access to the
source code and a detailed trace you won't make much progress here.

sven





On Thu, Sep 27, 2018 at 12:31 PM <valleru at cbio.mskcc.org> wrote:

> Thank you Sven,
>
> Turning off prefetching did not improve the performance; it actually
> degraded it a bit.
>
> I have set prefetching back to the default and taken a trace dump with
> tracectl and trace=io. Let me know if you want me to paste/attach it here.
>
> May i know how i could confirm whether the below is true?
>
> 1. this could be serialization around buffer locks. the larger your
>>> blocksize gets, the larger the amount of data one of these pagepool buffers
>>> will maintain. if there is a lot of concurrency on a smaller amount of data,
>>> more threads potentially compete for the same buffer lock to copy stuff in
>>> and out of a particular buffer, hence things go slower compared to the same
>>> amount of data spread across more buffers, each of smaller size.
>>>
>>>
> Will the above trace help in understanding if it is a serialization issue?
>
> I have been discussing the same with GPFS support for the past few months,
> and it seems that most of the time is being spent in cxiUXfer. They could
> not understand why it is spending so much time in cxiUXfer. I was seeing
> the same from perf top and from the page-fault counts.
>
> Below is a snippet of what support said:
>
> ----------------------------
>
> I searched all of the gpfsRead entries in the trace and sorted them by time
> spent. Except for 2 reads which needed to fetch data from an NSD server, the
> slowest read is in thread 72170. It took 112470.362 us.
>
>
> trcrpt.2018-08-06_12.27.39.55538.lt15.trsum:   72165       6.860911319
> rdwr                   141857.076 us + NSDIO
>
> trcrpt.2018-08-06_12.26.28.39794.lt15.trsum:   72170       1.483947593
> rdwr                   112470.362 us + cxiUXfer
>
> trcrpt.2018-08-06_12.27.39.55538.lt15.trsum:   72165       6.949042593
> rdwr                    88126.278 us + NSDIO
>
> trcrpt.2018-08-06_12.27.03.47706.lt15.trsum:   72156       2.919334474
> rdwr                    81057.657 us + cxiUXfer
>
> trcrpt.2018-08-06_12.23.30.72745.lt15.trsum:   72154       1.167484466
> rdwr                    76033.488 us + cxiUXfer
>
> trcrpt.2018-08-06_12.24.06.7508.lt15.trsum:   72187       0.685237501
> rdwr                    70772.326 us + cxiUXfer
>
> trcrpt.2018-08-06_12.25.17.23989.lt15.trsum:   72193       4.757996530
> rdwr                    70447.838 us + cxiUXfer
>
>
> I checked each of the slow I/Os as above, and found that they all spend much
> time in the function cxiUXfer. This function is used to copy data from a
> kernel buffer to a user buffer. I am not sure why it took so much time. This
> should be related to the pagefaults and pgfree you observed. Below is the
> trace data for thread 72170.
>
>
>                    1.371477231  72170 TRACE_VNODE: gpfs_f_rdwr enter: fP
> 0xFFFF882541649400 f_flags 0x8000 flags 0x8001 op 0 iovec
> 0xFFFF881F2AFB3E70 count 1 offset 0x168F30D dentry 0xFFFF887C0CC298C0
> private 0xFFFF883F607175C0 iP 0xFFFF8823AA3CBFC0 name '410513.svs'
>
>               ....
>
>                    1.371483547  72170 TRACE_KSVFS: cachedReadFast exit:
> uio_resid 16777216 code 1 err 11
>
>               ....
>
>                    1.371498780  72170 TRACE_KSVFS: kSFSReadFast: oiP
> 0xFFFFC90060B46740 offset 0x168F30D dataBufP FFFFC9003645A5A8 nDesc 64 buf
> 200043C0000 valid words 64 dirty words 0 blkOff 0
>
>                    1.371499035  72170 TRACE_LOG:
> UpdateLogger::beginDataUpdate begin ul 0xFFFFC900333F1A40 holdCount 0
> ioType 0x2 inProg 0x15
>
>                    1.371500157  72170 TRACE_LOG:
> UpdateLogger::beginDataUpdate ul 0xFFFFC900333F1A40 holdCount 0 ioType 0x2
> inProg 0x16 err 0
>
>                    1.371500606  72170 TRACE_KSVFS: cxiUXfer: nDesc 64 1st
> dataPtr 0x200043C0000 plP 0xFFFF887F7B90D600 toIOBuf 0 offset 6877965 len
> 9899251
>
>                    1.371500793  72170 TRACE_KSVFS: cxiUXfer: ndesc 0 skip
> dataAddrP 0x200043C0000 currOffset 0 currLen 262144 bufOffset 6877965
>
>               ....
>
>                    1.371505949  72170 TRACE_KSVFS: cxiUXfer: ndesc 25 skip
> dataAddrP 0x2001AF80000 currOffset 6553600 currLen 262144 bufOffset 6877965
>
>                    1.371506236  72170 TRACE_KSVFS: cxiUXfer: nDesc 26
> currOffset 6815744 tmpLen 262144 dataAddrP 0x2001AFCF30D currLen 199923
> pageOffset 781 pageLen 3315 plP 0xFFFF887F7B90D600
>
>                    1.373649823  72170 TRACE_KSVFS: cxiUXfer: nDesc 27
> currOffset 7077888 tmpLen 262144 dataAddrP 0x20027400000 currLen 262144
> pageOffset 0 pageLen 4096 plP 0xFFFF887F7B90D600
>
>                    1.375158799  72170 TRACE_KSVFS: cxiUXfer: nDesc 28
> currOffset 7340032 tmpLen 262144 dataAddrP 0x20027440000 currLen 262144
> pageOffset 0 pageLen 4096 plP 0xFFFF887F7B90D600
>
>                    1.376661566  72170 TRACE_KSVFS: cxiUXfer: nDesc 29
> currOffset 7602176 tmpLen 262144 dataAddrP 0x2002C180000 currLen 262144
> pageOffset 0 pageLen 4096 plP 0xFFFF887F7B90D600
>
>                    1.377892653  72170 TRACE_KSVFS: cxiUXfer: nDesc 30
> currOffset 7864320 tmpLen 262144 dataAddrP 0x2002C1C0000 currLen 262144
> pageOffset 0 pageLen 4096 plP 0xFFFF887F7B90D600
>
>               ....
>
>                    1.471389843  72170 TRACE_KSVFS: cxiUXfer: nDesc 62
> currOffset 16252928 tmpLen 262144 dataAddrP 0x2001D2C0000 currLen 262144
> pageOffset 0 pageLen 4096 plP 0xFFFF887F7B90D600
>
>                    1.471845629  72170 TRACE_KSVFS: cxiUXfer: nDesc 63
> currOffset 16515072 tmpLen 262144 dataAddrP 0x2003EC80000 currLen 262144
> pageOffset 0 pageLen 4096 plP 0xFFFF887F7B90D600
>
>                    1.472417149  72170 TRACE_KSVFS: cxiDetachIOBuffer:
> dataPtr 0x200043C0000 plP 0xFFFF887F7B90D600
>
>                    1.472417775  72170 TRACE_LOCK: unlock_vfs: type Data,
> key 0000000000000004:000000001B1F24BF:0000000000000001 lock_mode have ro
> token xw lock_state old [ ro:27 ] new [ ro:26 ] holdCount now 27
>
>                    1.472418427  72170 TRACE_LOCK: hash tab lookup vfs:
> found cP 0xFFFFC9005FC0CDE0 holdCount now 14
>
>                    1.472418592  72170 TRACE_LOCK: lock_vfs: type Data key
> 0000000000000004:000000001B1F24BF:0000000000000002 lock_mode want ro status
> valid token xw/xw lock_state [ ro:12 ] flags 0x0 holdCount 14
>
>                    1.472419842  72170 TRACE_KSVFS: kSFSReadFast: oiP
> 0xFFFFC90060B46740 offset 0x2000000 dataBufP FFFFC9003643C908 nDesc 64 buf
> 38033480000 valid words 64 dirty words 0 blkOff 0
>
>                    1.472420029  72170 TRACE_LOG:
> UpdateLogger::beginDataUpdate begin ul 0xFFFFC9005FC0CF98 holdCount 0
> ioType 0x2 inProg 0xC
>
>                    1.472420187  72170 TRACE_LOG:
> UpdateLogger::beginDataUpdate ul 0xFFFFC9005FC0CF98 holdCount 0 ioType 0x2
> inProg 0xD err 0
>
>                    1.472420652  72170 TRACE_KSVFS: cxiUXfer: nDesc 64 1st
> dataPtr 0x38033480000 plP 0xFFFF887F7B934320 toIOBuf 0 offset 0 len 6877965
>
>                    1.472420936  72170 TRACE_KSVFS: cxiUXfer: nDesc 0
> currOffset 0 tmpLen 262144 dataAddrP 0x38033480000 currLen 262144
> pageOffset 0 pageLen 4096 plP 0xFFFF887F7B934320
>
>                    1.472824790  72170 TRACE_KSVFS: cxiUXfer: nDesc 1
> currOffset 262144 tmpLen 262144 dataAddrP 0x380334C0000 currLen 262144
> pageOffset 0 pageLen 4096 plP 0xFFFF887F7B934320
>
>                    1.473243905  72170 TRACE_KSVFS: cxiUXfer: nDesc 2
> currOffset 524288 tmpLen 262144 dataAddrP 0x38024280000 currLen 262144
> pageOffset 0 pageLen 4096 plP 0xFFFF887F7B934320
>
>               ....
>
>                    1.482949347  72170 TRACE_KSVFS: cxiUXfer: nDesc 24
> currOffset 6291456 tmpLen 262144 dataAddrP 0x38025E80000 currLen 262144
> pageOffset 0 pageLen 4096 plP 0xFFFF887F7B934320
>
>                    1.483354265  72170 TRACE_KSVFS: cxiUXfer: nDesc 25
> currOffset 6553600 tmpLen 262144 dataAddrP 0x38025EC0000 currLen 262144
> pageOffset 0 pageLen 4096 plP 0xFFFF887F7B934320
>
>                    1.483766631  72170 TRACE_KSVFS: cxiUXfer: nDesc 26
> currOffset 6815744 tmpLen 262144 dataAddrP 0x38003B00000 currLen 262144
> pageOffset 0 pageLen 4096 plP 0xFFFF887F7B934320
>
>                    1.483943894  72170 TRACE_KSVFS: cxiDetachIOBuffer:
> dataPtr 0x38033480000 plP 0xFFFF887F7B934320
>
>                    1.483944339  72170 TRACE_LOCK: unlock_vfs: type Data,
> key 0000000000000004:000000001B1F24BF:0000000000000002 lock_mode have ro
> token xw lock_state old [ ro:14 ] new [ ro:13 ] holdCount now 14
>
>                    1.483944683  72170 TRACE_BRL: brUnlockM: ofP
> 0xFFFFC90069346B68 inode 455025855 snap 0 handle 0xFFFFC9003637D020 range
> 0x168F30D-0x268F30C mode ro
>
>                    1.483944985  72170 TRACE_KSVFS: kSFSReadFast exit:
> uio_resid 0 err 0
>
>                    1.483945264  72170 TRACE_LOCK: unlock_vfs_m: type
> Inode, key 305F105B9701E60A:000000001B1F24BF:0000000000000000 lock_mode
> have ro status valid token rs lock_state old [ ro:25 ] new [ ro:24 ]
>
>                    1.483945423  72170 TRACE_LOCK: unlock_vfs_m: cP
> 0xFFFFC90069346B68 holdCount 25
>
>                    1.483945624  72170 TRACE_VNODE: gpfsRead exit: fast err
> 0
>
>                    1.483946831  72170 TRACE_KSVFS: ReleSG: sli 38 sgP
> 0xFFFFC90035E52F78 NotQuiesced vfsOp 2
>
>                    1.483946975  72170 TRACE_KSVFS: ReleSG: sli 38 sgP
> 0xFFFFC90035E52F78 vfsOp 2 users 1-1
>
>                    1.483947116  72170 TRACE_KSVFS: ReleaseDaemonSegAndSG:
> sli 38 count 2 needCleanup 0
>
>                    1.483947593  72170 TRACE_VNODE: gpfs_f_rdwr exit: fP
> 0xFFFF882541649400 total_len 16777216 uio_resid 0 offset 0x268F30D rc 0
>
>
> -------------------------------------------
>
>
>
> Regards,
> Lohit
>
> On Sep 19, 2018, 3:11 PM -0400, Sven Oehme <oehmes at gmail.com>, wrote:
>
> the document primarily explains all performance-specific knobs. general
> advice would be to no longer set anything besides workerThreads, pagepool and
> filecache on 5.X systems, as most other settings are no longer relevant
> (that's a client-side statement). that is true until you hit strange
> workloads, which is why all the knobs are still there :-)
>
> sven
>
>
> On Wed, Sep 19, 2018 at 11:17 AM <valleru at cbio.mskcc.org> wrote:
>
>> Thanks Sven.
>> I will disable it completely and see how it behaves.
>>
>> Is this the presentation?
>>
>> https://urldefense.proofpoint.com/v2/url?u=http-3A__files.gpfsug.org_presentations_2014_UG10-5FGPFS-5FPerformance-5FSession-5Fv10.pdf&d=DwICAg&c=C9X8xNkG_lwP_-eFHTGejw&r=dOGWbX02X0lU3_xWy7sXmdsAk4pvqANuwZ0j-sV6OEo&m=ZkmRYnq8ro8C7ccXHnRvIVN4PzFuuQz-VKI8RiuM0ow&s=-Y1ncRDmgOCfWDQaXj-OcbiM5gcs0LcllAShfNapHtI&e=
>>
>> I guess i read it, but it did not come to mind for this situation. I will
>> try to read it again and see if i can make use of it.
>>
>> Regards,
>> Lohit
>>
>> On Sep 19, 2018, 2:12 PM -0400, Sven Oehme <oehmes at gmail.com>, wrote:
>>
>> seems like you never read my performance presentation from a few years ago
>> ;-)
>>
>> you can control this on a per node basis , either for all i/o :
>>
>>    prefetchAggressiveness = X
>>
>> or individual for reads or writes :
>>
>>    prefetchAggressivenessRead = X
>>    prefetchAggressivenessWrite = X
>>
>> for a start i would turn it off completely via :
>>
>> mmchconfig prefetchAggressiveness=0 -I -N nodename
>>
>> that will turn it off only for that node and only until you restart the
>> node.
>> then see what happens
>>
>> sven
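A quick way to check that the temporary change took effect on that node (a sketch; mmlsconfig reports only committed values, so an -I override is easier to see via mmdiag on the node itself):

    # 0 here means prefetch is disabled on this node until the daemon restarts
    mmdiag --config | grep -i prefetchAggressiveness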
>>
>>
>> On Wed, Sep 19, 2018 at 11:07 AM <valleru at cbio.mskcc.org> wrote:
>>
>>> Thank you Sven.
>>>
>>> I mostly think it could be 1. or some other issue.
>>> I don't think it could be 2., because i can replicate this issue no
>>> matter what the size of the dataset is. It happens for a few files that
>>> could easily fit in the page pool too.
>>>
>>> I do see a lot more page faults for 16M compared to 1M, so it could be
>>> related to many threads trying to compete for the same buffer space.
>>>
>>> I will try to take the trace with the trace=io option and see if i can
>>> find something.
>>>
>>> How do i turn off prefetching? Can i turn it off for a single
>>> node/client?
>>>
>>> Regards,
>>> Lohit
>>>
>>> On Sep 18, 2018, 5:23 PM -0400, Sven Oehme <oehmes at gmail.com>, wrote:
>>>
>>> Hi,
>>>
>>> taking a trace would tell for sure, but i suspect you might be hitting
>>> one or even multiple issues which have similar negative performance
>>> impacts but different root causes.
>>>
>>> 1. this could be serialization around buffer locks. the larger your
>>> blocksize gets, the larger the amount of data one of these pagepool buffers
>>> will maintain. if there is a lot of concurrency on a smaller amount of data,
>>> more threads potentially compete for the same buffer lock to copy stuff in
>>> and out of a particular buffer, hence things go slower compared to the same
>>> amount of data spread across more buffers, each of smaller size.
>>>
>>> 2. your data set is small'ish, let's say a couple of times bigger than the
>>> pagepool, and you randomly access it with multiple threads. what will happen
>>> is that because it doesn't fit into the cache it will be read from the
>>> backend. if multiple threads hit the same 16 MB block at once with multiple
>>> 4k random reads, it will read the whole 16 MB block because it thinks it
>>> will benefit from it later on out of the cache, but because the access is
>>> fully random the same happens with the next block and the next and so on,
>>> and before you get back to this block it has been pushed out of the cache
>>> for lack of enough pagepool.
>>>
>>> i could think of multiple other scenarios, which is why it's so hard to
>>> accurately benchmark an application: you design a benchmark to test an
>>> application, but it almost always behaves differently than you think it
>>> does :-)
>>>
>>> so best is to run the real application and see under which configuration
>>> it works best.
>>>
>>> you could also take a trace with trace=io and then look at
>>>
>>> TRACE_VNOP: READ:
>>> TRACE_VNOP: WRITE:
>>>
>>> and compare them to
>>>
>>> TRACE_IO: QIO: read
>>> TRACE_IO: QIO: write
>>>
>>> and see if the numbers summed up for both are somewhat equal. if
>>> TRACE_VNOP is significantly smaller than TRACE_IO you most likely do more
>>> i/o than you should, and turning prefetching off might actually make things
>>> faster.
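A rough sketch of how such a trace could be captured on a single node and the two record types compared; the exact mmtracectl options and the trace-report file name below are illustrative and can vary by release:

    # enable the io trace class on one node, run the workload, then stop
    mmtracectl --set --trace=io -N client01
    mmtracectl --start -N client01
    # ... run the read workload here ...
    mmtracectl --stop -N client01

    # count application-level reads vs. queued back-end I/Os in the
    # formatted trace report (file name illustrative)
    grep -c 'TRACE_VNOP: READ:'   trcrpt.client01.txt
    grep -c 'TRACE_IO: QIO: read' trcrpt.client01.txt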
>>>
>>> keep in mind i am no longer working for IBM so all i say might be
>>> obsolete by now, i no longer have access to the one and only truth aka the
>>> source code ... but if i am wrong i am sure somebody will point this out
>>> soon ;-)
>>>
>>> sven
>>>
>>>
>>>
>>>
>>> On Tue, Sep 18, 2018 at 10:31 AM <valleru at cbio.mskcc.org> wrote:
>>>
>>>> Hello All,
>>>>
>>>> This is a continuation of the previous discussion that i had with Sven.
>>>> However, contrary to what i had mentioned previously, i realize that this
>>>> is "not" related to mmap, and i see it when doing random freads.
>>>>
>>>> I see that the block-size of the filesystem matters when reading from the
>>>> pagepool.
>>>> I see a major difference in performance when comparing 1M to 16M, when
>>>> doing a lot of random small freads with all of the data in the pagepool.
>>>>
>>>> Performance for 1M is an order of magnitude better than the performance
>>>> that i see for 16M.
>>>>
>>>> The GPFS that we have currently is :
>>>> Version : 5.0.1-0.5
>>>> Filesystem version: 19.01 (5.0.1.0)
>>>> Block-size : 16M
>>>>
>>>> I had made the filesystem block-size 16M, thinking that i would get the
>>>> best performance for both random and sequential reads from 16M compared to
>>>> the smaller block-sizes.
>>>> With GPFS 5.0, i made use of the 1024 sub-blocks instead of 32 and thus do
>>>> not lose a lot of storage space even with 16M.
>>>> I had run a few benchmarks and i did see that 16M was performing better
>>>> "when hitting storage/disks" with respect to bandwidth for
>>>> random/sequential on small/large reads.
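For reference, the block-size and subblock (fragment) geometry of an existing file system can be checked with mmlsfs; the file-system name below is illustrative:

    # -B reports the block size, -f the subblock size; a GPFS 5.x file
    # system created with a 16M block size should show a much smaller
    # subblock than the 1/32nd-of-a-block layout of older releases
    mmlsfs gpfs01 -B -f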
>>>>
>>>> However, with this particular workload - where it freads a chunk of
>>>> data randomly from hundreds of files - I see that the number of
>>>> page-faults increases with block-size and actually reduces the performance.
>>>> 1M performs a lot better than 16M, and maybe i will get better
>>>> performance with less than 1M.
>>>> It gives the best performance when reading from local disk with a 4K
>>>> block-size filesystem.
>>>>
>>>> What i mean by performance when it comes to this workload - is not the
>>>> bandwidth but the amount of time that it takes to do each iteration/read
>>>> batch of data.
>>>>
>>>> I figure what is happening is:
>>>> fread is trying to read a full block size of 16M - which is good in a
>>>> way when it hits the hard disk.
>>>> But the application could be using just a small part of that 16M. Thus
>>>> when randomly reading (freads) a lot of data in 16M chunks - it is page
>>>> faulting a lot more and causing the performance to drop.
>>>> I could try to make the application do read instead of freads, but i
>>>> fear that could be bad too, since it might then hit the disk with a very
>>>> small block size and that is not good.
>>>>
>>>> With the way i see things now -
>>>> I believe it could be best if the application does random reads of
>>>> 4k/1M from the pagepool but somehow does 16M reads from the rotating disks.
>>>>
>>>> I don't see any way of doing the above other than following a different
>>>> approach, where i create a filesystem with a smaller block size (1M or
>>>> less) on SSDs as a tier.
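Purely as an illustration of that approach; the device name, NSD stanza file and replication settings below are assumptions, not something from this thread:

    # a separate small-block file system on SSD-backed NSDs, used for the
    # randomly-read data while the 16M file system keeps the large files
    mmcrfs fs1m -F ssd_nsd_stanzas.txt -B 1M -m 1 -r 1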
>>>>
>>>> May i please ask for advice on whether what i am understanding/seeing is
>>>> right, and what the best possible solution for the above scenario would be.
>>>>
>>>> Regards,
>>>> Lohit
>>>>
>>>> On Apr 11, 2018, 10:36 AM -0400, Lohit Valleru <valleru at cbio.mskcc.org>,
>>>> wrote:
>>>>
>>>> Hey Sven,
>>>>
>>>> This is regarding mmap issues and GPFS.
>>>> We had previously discussed experimenting with GPFS 5.
>>>>
>>>> I have now upgraded all of the compute nodes and NSD nodes to GPFS 5.0.0.2
>>>>
>>>> I am yet to experiment with mmap performance, but before that - I am
>>>> seeing weird hangs with GPFS 5 and I think it could be related to mmap.
>>>>
>>>> Have you seen GPFS ever hang on this syscall?
>>>> [Tue Apr 10 04:20:13 2018] [<ffffffffa0a92155>]
>>>> _ZN10gpfsNode_t8mmapLockEiiPKj+0xb5/0x140 [mmfs26]
>>>>
>>>> I see the above when the kernel hangs and throws out a series of trace
>>>> calls.
>>>>
>>>> I somehow think the above trace is related to processes hanging on GPFS
>>>> forever. There are no errors in GPFS however.
>>>>
>>>> Also, I think the above happens only when the mmap threads go above a
>>>> particular number.
>>>>
>>>> We had faced a similar issue in 4.2.3, and it was resolved in a patch in
>>>> 4.2.3.2. At that time, the issue happened when the number of mmap threads
>>>> went above worker1Threads. According to the ticket, it was an mmap race
>>>> condition that GPFS was not handling well.
>>>>
>>>> I am not sure if this issue is a repeat and I am yet to isolate the
>>>> incident and test with increasing number of mmap threads.
>>>>
>>>> I am not 100 percent sure if this is related to mmap yet but just
>>>> wanted to ask you if you have seen anything like above.
>>>>
>>>> Thanks,
>>>>
>>>> Lohit
>>>>
>>>> On Feb 22, 2018, 3:59 PM -0500, Sven Oehme <oehmes at gmail.com>, wrote:
>>>>
>>>> Hi Lohit,
>>>>
>>>> i am working with ray on an mmap performance improvement right now,
>>>> which most likely has the same root cause as yours, see -->
>>>> https://urldefense.proofpoint.com/v2/url?u=http-3A__gpfsug.org_pipermail_gpfsug-2Ddiscuss_2018-2DJanuary_004411.html&d=DwICAg&c=C9X8xNkG_lwP_-eFHTGejw&r=dOGWbX02X0lU3_xWy7sXmdsAk4pvqANuwZ0j-sV6OEo&m=ZkmRYnq8ro8C7ccXHnRvIVN4PzFuuQz-VKI8RiuM0ow&s=AUEL847F_a-j6Y7t1fDMj4j33vLqvI6XrrNCVS5pUyA&e=
>>>> the thread above has been silent after a couple of rounds of back and
>>>> forth, but ray and i have active communication in the background and will
>>>> repost as soon as there is something new to share.
>>>> i am happy to look at this issue after we finish with ray's workload if
>>>> something is still missing, but first let's finish his, get you to try the
>>>> same fix, and see if anything is still missing.
>>>>
>>>> btw. if people would share their use of mmap - what applications they
>>>> use (home grown, something that just uses lmdb which uses mmap under the
>>>> covers, etc.) - please let me know so i get a better picture of how wide
>>>> the usage is with GPFS. i know a lot of the ML/DL workloads are using it,
>>>> but i would like to know what else is out there that i might not think
>>>> about. feel free to drop me a personal note; i might not reply to it right
>>>> away, but eventually i will.
>>>>
>>>> thx. sven
>>>>
>>>>
>>>> On Thu, Feb 22, 2018 at 12:33 PM <valleru at cbio.mskcc.org> wrote:
>>>>
>>>>> Hi all,
>>>>>
>>>>> I wanted to know how mmap interacts with the GPFS pagepool with
>>>>> respect to filesystem block-size.
>>>>> Does the efficiency depend on the mmap read size and the block-size of
>>>>> the filesystem even if all the data is cached in the pagepool?
>>>>>
>>>>> GPFS 4.2.3.2 and CentOS7.
>>>>>
>>>>> Here is what i observed:
>>>>>
>>>>> I was testing a user script that uses mmap to read from files of 100MB
>>>>> to 500MB.
>>>>>
>>>>> The above files are stored on 3 different filesystems.
>>>>>
>>>>> Compute nodes - 10G pagepool and 5G seqdiscardthreshold.
>>>>>
>>>>> 1. 4M block size GPFS filesystem, with separate metadata and data.
>>>>> Data on Near line and metadata on SSDs
>>>>> 2. 1M block size GPFS filesystem as an AFM cache cluster, "with all the
>>>>> required files fully cached" from the above GPFS cluster as home. Data and
>>>>> Metadata together on SSDs
>>>>> 3. 16M block size GPFS filesystem, with separate metadata and data.
>>>>> Data on Near line and metadata on SSDs
>>>>>
>>>>> When i run the script the first time for "each" filesystem:
>>>>> I see that GPFS reads from the files and caches the data into the
>>>>> pagepool as it reads, per mmdiag --iohist.
>>>>>
>>>>> When i run it the second time, i see that there are no IO requests from
>>>>> the compute node to the GPFS NSD servers, which is expected since all the
>>>>> data from the 3 filesystems is cached.
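One illustrative way to see this from the client is to compare the recent I/O history after each pass (this assumes the iohist buffer has not wrapped between runs):

    # after the first pass this shows the physical reads that filled the
    # cache; after the second, fully cached pass no new read entries appear
    mmdiag --iohist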
>>>>>
>>>>> However - the time taken for the script to run for the files in the 3
>>>>> different filesystems is different - although i know that they are just
>>>>> "mmapping"/reading from pagepool/cache and not from disk.
>>>>>
>>>>> Here is the difference in time, for IO just from pagepool:
>>>>>
>>>>> 20s 4M block size
>>>>> 15s 1M block size
>>>>> 40s 16M block size.
>>>>>
>>>>> Why do i see a difference when doing mmap reads from different
>>>>> block-size filesystems, although i see that the IO requests are not
>>>>> hitting the disks, just the pagepool?
>>>>>
>>>>> I am willing to share the strace output and mmdiag outputs if needed.
>>>>>
>>>>> Thanks,
>>>>> Lohit
>>>>>
>>>>> _______________________________________________
>>>>> gpfsug-discuss mailing list
>>>>> gpfsug-discuss at spectrumscale.org
>>>>> https://urldefense.proofpoint.com/v2/url?u=http-3A__gpfsug.org_mailman_listinfo_gpfsug-2Ddiscuss&d=DwICAg&c=C9X8xNkG_lwP_-eFHTGejw&r=dOGWbX02X0lU3_xWy7sXmdsAk4pvqANuwZ0j-sV6OEo&m=ZkmRYnq8ro8C7ccXHnRvIVN4PzFuuQz-VKI8RiuM0ow&s=kDfJQ7W_JgLnvD_F6kwpIEFg_9j-Ain9f1uvMKFJD6s&e=
>>>>>
>>>> _______________________________________________
>>>> gpfsug-discuss mailing list
>>>> gpfsug-discuss at spectrumscale.org
>>>> https://urldefense.proofpoint.com/v2/url?u=http-3A__gpfsug.org_mailman_listinfo_gpfsug-2Ddiscuss&d=DwICAg&c=C9X8xNkG_lwP_-eFHTGejw&r=dOGWbX02X0lU3_xWy7sXmdsAk4pvqANuwZ0j-sV6OEo&m=ZkmRYnq8ro8C7ccXHnRvIVN4PzFuuQz-VKI8RiuM0ow&s=kDfJQ7W_JgLnvD_F6kwpIEFg_9j-Ain9f1uvMKFJD6s&e=
>>>>
>>>> _______________________________________________
>>>> gpfsug-discuss mailing list
>>>> gpfsug-discuss at spectrumscale.org
>>>> https://urldefense.proofpoint.com/v2/url?u=http-3A__gpfsug.org_mailman_listinfo_gpfsug-2Ddiscuss&d=DwICAg&c=C9X8xNkG_lwP_-eFHTGejw&r=dOGWbX02X0lU3_xWy7sXmdsAk4pvqANuwZ0j-sV6OEo&m=ZkmRYnq8ro8C7ccXHnRvIVN4PzFuuQz-VKI8RiuM0ow&s=kDfJQ7W_JgLnvD_F6kwpIEFg_9j-Ain9f1uvMKFJD6s&e=
>>>>
>>>> _______________________________________________
>>>> gpfsug-discuss mailing list
>>>> gpfsug-discuss at spectrumscale.org
>>>> https://urldefense.proofpoint.com/v2/url?u=http-3A__gpfsug.org_mailman_listinfo_gpfsug-2Ddiscuss&d=DwICAg&c=C9X8xNkG_lwP_-eFHTGejw&r=dOGWbX02X0lU3_xWy7sXmdsAk4pvqANuwZ0j-sV6OEo&m=ZkmRYnq8ro8C7ccXHnRvIVN4PzFuuQz-VKI8RiuM0ow&s=kDfJQ7W_JgLnvD_F6kwpIEFg_9j-Ain9f1uvMKFJD6s&e=
>>>>
>>> _______________________________________________
>>> gpfsug-discuss mailing list
>>> gpfsug-discuss at spectrumscale.org
>>> https://urldefense.proofpoint.com/v2/url?u=http-3A__gpfsug.org_mailman_listinfo_gpfsug-2Ddiscuss&d=DwICAg&c=C9X8xNkG_lwP_-eFHTGejw&r=dOGWbX02X0lU3_xWy7sXmdsAk4pvqANuwZ0j-sV6OEo&m=ZkmRYnq8ro8C7ccXHnRvIVN4PzFuuQz-VKI8RiuM0ow&s=kDfJQ7W_JgLnvD_F6kwpIEFg_9j-Ain9f1uvMKFJD6s&e=
>>>
>>> _______________________________________________
>>> gpfsug-discuss mailing list
>>> gpfsug-discuss at spectrumscale.org
>>> https://urldefense.proofpoint.com/v2/url?u=http-3A__gpfsug.org_mailman_listinfo_gpfsug-2Ddiscuss&d=DwICAg&c=C9X8xNkG_lwP_-eFHTGejw&r=dOGWbX02X0lU3_xWy7sXmdsAk4pvqANuwZ0j-sV6OEo&m=ZkmRYnq8ro8C7ccXHnRvIVN4PzFuuQz-VKI8RiuM0ow&s=kDfJQ7W_JgLnvD_F6kwpIEFg_9j-Ain9f1uvMKFJD6s&e=
>>>
>> _______________________________________________
>> gpfsug-discuss mailing list
>> gpfsug-discuss at spectrumscale.org
>> https://urldefense.proofpoint.com/v2/url?u=http-3A__gpfsug.org_mailman_listinfo_gpfsug-2Ddiscuss&d=DwICAg&c=C9X8xNkG_lwP_-eFHTGejw&r=dOGWbX02X0lU3_xWy7sXmdsAk4pvqANuwZ0j-sV6OEo&m=ZkmRYnq8ro8C7ccXHnRvIVN4PzFuuQz-VKI8RiuM0ow&s=kDfJQ7W_JgLnvD_F6kwpIEFg_9j-Ain9f1uvMKFJD6s&e=
>>
>> _______________________________________________
>> gpfsug-discuss mailing list
>> gpfsug-discuss at spectrumscale.org
>> https://urldefense.proofpoint.com/v2/url?u=http-3A__gpfsug.org_mailman_listinfo_gpfsug-2Ddiscuss&d=DwICAg&c=C9X8xNkG_lwP_-eFHTGejw&r=dOGWbX02X0lU3_xWy7sXmdsAk4pvqANuwZ0j-sV6OEo&m=ZkmRYnq8ro8C7ccXHnRvIVN4PzFuuQz-VKI8RiuM0ow&s=kDfJQ7W_JgLnvD_F6kwpIEFg_9j-Ain9f1uvMKFJD6s&e=
>>
> _______________________________________________
> gpfsug-discuss mailing list
> gpfsug-discuss at spectrumscale.org
> https://urldefense.proofpoint.com/v2/url?u=http-3A__gpfsug.org_mailman_listinfo_gpfsug-2Ddiscuss&d=DwICAg&c=C9X8xNkG_lwP_-eFHTGejw&r=dOGWbX02X0lU3_xWy7sXmdsAk4pvqANuwZ0j-sV6OEo&m=ZkmRYnq8ro8C7ccXHnRvIVN4PzFuuQz-VKI8RiuM0ow&s=kDfJQ7W_JgLnvD_F6kwpIEFg_9j-Ain9f1uvMKFJD6s&e=
>
> _______________________________________________
> gpfsug-discuss mailing list
> gpfsug-discuss at spectrumscale.org
> https://urldefense.proofpoint.com/v2/url?u=http-3A__gpfsug.org_mailman_listinfo_gpfsug-2Ddiscuss&d=DwICAg&c=C9X8xNkG_lwP_-eFHTGejw&r=dOGWbX02X0lU3_xWy7sXmdsAk4pvqANuwZ0j-sV6OEo&m=ZkmRYnq8ro8C7ccXHnRvIVN4PzFuuQz-VKI8RiuM0ow&s=kDfJQ7W_JgLnvD_F6kwpIEFg_9j-Ain9f1uvMKFJD6s&e=
>

------------------------------

_______________________________________________
gpfsug-discuss mailing list
gpfsug-discuss at spectrumscale.org
https://urldefense.proofpoint.com/v2/url?u=http-3A__gpfsug.org_mailman_listinfo_gpfsug-2Ddiscuss&d=DwICAg&c=C9X8xNkG_lwP_-eFHTGejw&r=dOGWbX02X0lU3_xWy7sXmdsAk4pvqANuwZ0j-sV6OEo&m=ZkmRYnq8ro8C7ccXHnRvIVN4PzFuuQz-VKI8RiuM0ow&s=kDfJQ7W_JgLnvD_F6kwpIEFg_9j-Ain9f1uvMKFJD6s&e=


End of gpfsug-discuss Digest, Vol 81, Issue 43
**********************************************

