[gpfsug-discuss] pagepool shrink doesn't release all memory
Aaron Knister
aaron.s.knister at nasa.gov
Sun Feb 25 16:54:06 GMT 2018
Hi Stijn,
Thanks for sharing your experiences -- I'm glad I'm not the only one
who's had the idea (and come up empty-handed).
About the pagepool and NUMA awareness, I remembered seeing something
about that somewhere, and some googling turned up a parameter called
numaMemoryInterleave that "starts mmfsd with numactl --interleave=all".
Do you think that provides the kind of NUMA awareness you're looking for?
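One hedged way to check whether that interleaving actually took effect: the second field of each line in /proc/&lt;pid&gt;/numa_maps is the NUMA policy for that mapping, so with numaMemoryInterleave=yes the pagepool's large anonymous mappings should report "interleave" rather than the first-touch "default" policy. A small sketch (the sample lines below are illustrative, not real mmfsd output):

```python
from collections import Counter

def numa_policy_counts(numa_maps_lines):
    """Count mappings per NUMA policy from /proc/<pid>/numa_maps lines.

    The policy is the second whitespace-separated field, e.g. 'default'
    or 'interleave:0-3'; we keep only the policy name before the colon.
    """
    counts = Counter()
    for line in numa_maps_lines:
        fields = line.split()
        if len(fields) >= 2:
            counts[fields[1].split(":")[0]] += 1
    return counts

# Illustrative (hypothetical) lines in the numa_maps format:
sample = [
    "0000020000000000 interleave:0-1 anon=262144 dirty=262144 N0=131072 N1=131072",
    "0000000000400000 default file=/usr/lpp/mmfs/bin/mmfsd mapped=1024",
]
```

On a NUMA-enabled kernel you would feed it `open("/proc/<mmfsd pid>/numa_maps")` instead of the sample.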
-Aaron
On 2/23/18 9:44 AM, Stijn De Weirdt wrote:
> hi all,
>
> we had the same idea long ago, afaik the issue we had was due to the
> pinned memory the pagepool uses when RDMA is enabled.
>
> at some point we restarted gpfs on the compute nodes for each job,
> similar to the way we do swapoff/swapon; but in certain scenarios gpfs
> really did not like it; so we gave up on it.
>
> the other issue that needs to be resolved is that the pagepool needs to
> be numa aware, so the pagepool is nicely allocated across all numa
> domains instead of just the first ones available; otherwise compute
> jobs might end up doing only non-local domain memory access.
>
> stijn
>
> On 02/23/2018 03:35 PM, IBM Spectrum Scale wrote:
>> AFAIK you can increase the pagepool size dynamically but you cannot shrink
>> it dynamically. To shrink it you must restart the GPFS daemon. Also,
>> could you please provide the actual pmap commands you executed?
>>
>> Regards, The Spectrum Scale (GPFS) team
>>
>> ------------------------------------------------------------------------------------------------------------------
>> If you feel that your question can benefit other users of Spectrum Scale
>> (GPFS), then please post it to the public IBM developerWorks Forum at
>> https://www.ibm.com/developerworks/community/forums/html/forum?id=11111111-0000-0000-0000-000000000479
>> .
>>
>> If your query concerns a potential software error in Spectrum Scale (GPFS)
>> and you have an IBM software maintenance contract please contact
>> 1-800-237-5511 in the United States or your local IBM Service Center in
>> other countries.
>>
>> The forum is informally monitored as time permits and should not be used
>> for priority messages to the Spectrum Scale (GPFS) team.
>>
>>
>>
>> From: Aaron Knister <aaron.s.knister at nasa.gov>
>> To: <gpfsug-discuss at spectrumscale.org>
>> Date: 02/22/2018 10:30 PM
>> Subject: Re: [gpfsug-discuss] pagepool shrink doesn't release all memory
>> Sent by: gpfsug-discuss-bounces at spectrumscale.org
>>
>>
>>
>> This is also interesting (although I don't know what it really means).
>> Looking at pmap run against mmfsd I can see what happens after each step:
>>
>> # baseline
>> 00007fffe4639000 59164K 0K 0K 0K 0K ---p [anon]
>> 00007fffd837e000 61960K 0K 0K 0K 0K ---p [anon]
>> 0000020000000000 1048576K 1048576K 1048576K 1048576K 0K rwxp [anon]
>> Total: 1613580K 1191020K 1189650K 1171836K 0K
>>
>> # tschpool 64G
>> 00007fffe4639000 59164K 0K 0K 0K 0K ---p [anon]
>> 00007fffd837e000 61960K 0K 0K 0K 0K ---p [anon]
>> 0000020000000000 67108864K 67108864K 67108864K 67108864K 0K rwxp [anon]
>> Total: 67706636K 67284108K 67282625K 67264920K 0K
>>
>> # tschpool 1G
>> 00007fffe4639000 59164K 0K 0K 0K 0K ---p [anon]
>> 00007fffd837e000 61960K 0K 0K 0K 0K ---p [anon]
>> 0000020001400000 139264K 139264K 139264K 139264K 0K rwxp [anon]
>> 0000020fc9400000 897024K 897024K 897024K 897024K 0K rwxp [anon]
>> 0000020009c00000 66052096K 0K 0K 0K 0K rwxp [anon]
>> Total: 67706636K 1223820K 1222451K 1204632K 0K
>>
>> Even though mmfsd has that 64G chunk allocated there's none of it
>> *used*. I wonder why Linux seems to be accounting it as allocated.
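The pmap snapshots above can be summarized with a few lines of Python: summing the virtual and resident columns of the [anon] mappings shows how much of the old pagepool region is still mapped but no longer resident. Using the "tschpool 1G" snapshot quoted above as input:

```python
def sum_anon(pmap_lines):
    """Sum virtual (field 2) and resident (field 3) KiB of [anon] mappings."""
    virt_kb = res_kb = 0
    for line in pmap_lines:
        if "anon" not in line:
            continue
        fields = line.split()
        virt_kb += int(fields[1].rstrip("K"))
        res_kb += int(fields[2].rstrip("K"))
    return virt_kb, res_kb

# The "tschpool 1G" snapshot from the message above:
snapshot = """\
00007fffe4639000 59164K 0K 0K 0K 0K ---p [anon]
00007fffd837e000 61960K 0K 0K 0K 0K ---p [anon]
0000020001400000 139264K 139264K 139264K 139264K 0K rwxp [anon]
0000020fc9400000 897024K 897024K 897024K 897024K 0K rwxp [anon]
0000020009c00000 66052096K 0K 0K 0K 0K rwxp [anon]
""".splitlines()

virt, res = sum_anon(snapshot)
# virt: ~64 GiB still mapped; res: ~1 GiB actually resident
```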
>>
>> -Aaron
>>
>> On 2/22/18 10:17 PM, Aaron Knister wrote:
>>> I've been exploring the idea for a while of writing a SLURM SPANK plugin
>>> to allow users to dynamically change the pagepool size on a node. Every
>>> now and then we have some users who would benefit significantly from a
>>> much larger pagepool on compute nodes but by default keep it on the
>>> smaller side to make as much physmem available as possible to batch work.
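A minimal sketch of what such per-job resizing logic might invoke, assuming the `tschpool` helper used throughout this thread (on current releases `mmchconfig pagepool=<size> -i` is the documented route; the wrapper name here is hypothetical, and `dry_run` keeps the sketch runnable without GPFS installed):

```python
import subprocess

def resize_pagepool(size, dry_run=False):
    """Grow/shrink the GPFS pagepool for a job prolog or epilog (sketch).

    'tschpool' is the helper used in this thread; dry_run returns the
    command instead of executing it, for testing without GPFS.
    """
    cmd = ["tschpool", size]
    if dry_run:
        return cmd
    subprocess.run(cmd, check=True)
    return cmd

# The job lifecycle simulated later in the message:
for step in ["64G", "1G", "32G", "1G", "32G"]:
    resize_pagepool(step, dry_run=True)
```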
>>>
>>> In testing, though, it seems as though reducing the pagepool doesn't
>>> quite release all of the memory. I don't really understand it because
>>> I've never before seen memory that was previously resident become
>>> un-resident but still maintain the virtual memory allocation.
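For what it's worth, a process can end up in exactly this state if it gives pages back with madvise(MADV_DONTNEED) instead of unmapping them: the RSS drops but the virtual mapping (VSZ) stays. Whether mmfsd does precisely this is an assumption, but the effect is easy to reproduce on Linux:

```python
import mmap

def rss_kb():
    # Current resident set size of this process, in KiB, from /proc.
    with open("/proc/self/status") as f:
        for line in f:
            if line.startswith("VmRSS:"):
                return int(line.split()[1])

SIZE = 256 * 1024 * 1024          # 256 MiB anonymous mapping
m = mmap.mmap(-1, SIZE)
base = rss_kb()

# Touch one byte per page: every page becomes resident, RSS grows ~256 MiB.
for off in range(0, SIZE, mmap.PAGESIZE):
    m[off] = 1
touched = rss_kb()

# Give the pages back WITHOUT unmapping: RSS drops, virtual size stays.
m.madvise(mmap.MADV_DONTNEED)
dropped = rss_kb()
```

After the madvise call, `ps` would show the full mapping in VSZ but almost none of it in RSS, which matches the pmap/ps numbers below.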
>>>
>>> Here's what I mean. Let's take a node with 128G and a 1G pagepool.
>>>
>>> If I do the following to simulate what might happen as various jobs
>>> tweak the pagepool:
>>>
>>> - tschpool 64G
>>> - tschpool 1G
>>> - tschpool 32G
>>> - tschpool 1G
>>> - tschpool 32G
>>>
>>> I end up with this:
>>>
>>> mmfsd thinks there's 32G resident but 64G virt
>>> # ps -o vsz,rss,comm -p 24397
>>> VSZ RSS COMMAND
>>> 67589400 33723236 mmfsd
>>>
>>> however, linux thinks there's ~100G used
>>>
>>> # free -g
>>>              total   used   free  shared  buffers  cached
>>> Mem:           125    100     25       0        0       0
>>> -/+ buffers/cache:    98     26
>>> Swap:            7      0      7
>>>
>>> I can jump back and forth between 1G and 32G *after* allocating 64G
>>> pagepool and the overall amount of memory in use doesn't balloon but I
>>> can't seem to shed that original 64G.
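Cross-checking the ps numbers quoted above: the difference between mmfsd's virtual and resident size is about 32 GiB, which lines up with the mapped-but-not-resident remainder of the original 64G pagepool. A quick arithmetic sketch:

```python
def vsz_rss_gap_gib(vsz_kb, rss_kb):
    """Gap between mapped (VSZ) and resident (RSS) memory, in GiB."""
    return (vsz_kb - rss_kb) / (1024 * 1024)

# mmfsd's VSZ and RSS from the ps output quoted above:
gap = vsz_rss_gap_gib(67589400, 33723236)   # ~32.3 GiB mapped but not resident
```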
>>>
>>> I don't understand what's going on... :) Any ideas? This is with Scale
>>> 4.2.3.6.
>>>
>>> -Aaron
>>>
>>
>>
>>
>> _______________________________________________
>> gpfsug-discuss mailing list
>> gpfsug-discuss at spectrumscale.org
>> http://gpfsug.org/mailman/listinfo/gpfsug-discuss
>>
--
Aaron Knister
NASA Center for Climate Simulation (Code 606.2)
Goddard Space Flight Center
(301) 286-2776