[gpfsug-discuss] pagepool

Wahl, Edward ewahl at osc.edu
Fri Mar 8 16:32:57 GMT 2024


Yikes!  Those must be some mighty large memory compute nodes!   That is an OK setting for a large-memory ESS/DSS server, but NOT for the compute nodes at my site, as that value is in bytes.
(so ~324 GB)  Even on our 1 TB+ memory machines we do not tune it that high.
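To sanity-check what a node is actually running with versus what is configured, a quick sketch using standard commands (run on the node in question):

# mmdiag --config | grep -i pagepool      (value in effect right now; mmfsd must be running)
# mmlsconfig pagepool                     (values stored in the cluster config, including per-nodeclass overrides)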

You can set pagepool per nodeclass, for example a class covering all your compute nodes, but pagepool is one of those settings where you have to restart the clients for it to take effect (the same goes for most of the RDMA settings, etc.).
If you have not already, look into creating a “nodeclass” for each of your “node types”, so you can avoid OOM issues from the pagepool alone and tune other settings per node type (RDMA/network settings, etc.); see the sketch below.
I would address this here rather than on the Slurm side.   Then you can present (total memory minus the pagepool) as the memory addressable by Slurm for user jobs.  Leave some spare memory for the system itself, or you will see more memory issues and whatnot when users get close to OOM, even in their cgroup.
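A minimal sketch of how that could look, assuming made-up node names and a nodeclass called "compute" (pick a pagepool that fits your memory sizes):

# mmcrnodeclass compute -N node001,node002,node003
# mmchconfig pagepool=4G -N compute
# mmshutdown -N compute          (pagepool only takes effect after mmfsd is restarted on those nodes)
# mmstartup -N compute
# mmlsconfig pagepool            (confirm the per-class override)

On the Slurm side, the usual knobs for "total memory minus pagepool minus OS headroom" are RealMemory and MemSpecLimit on the node lines in slurm.conf, but the GPFS tuning above is where I would start.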

Example from a cross-mounted compute-side cluster.  The default is 1 GB:
[root at nostorage-manager1 ~]# mmlsconfig pagepool
pagepool 1024M
pagepool 4G [k8,pitzer]
pagepool 64G [ascend]
pagepool 16G [ib-spire-login,owenslogin,pitzerlogin]
pagepool 48G [dm]
pagepool 4G [cardinal]
pagepool 64G [cardinal_quadport]

Example from the ESS/DSS server side.  Later ESS versions set things by mmvdisk groups rather than by server type.
# mmlsconfig pagepool
pagepool 32G
pagepool 358G [gss_ppc64]
pagepool 16384M [ibmems11-hs,ems]
pagepool 324383477760 [ess3200_mmvdisk_ibmessio13_hs_ibmessio14_hs,ess3200_mmvdisk_ibmessio15_hs_ibmessio16_hs,ess3200_mmvdisk_ibmessio17_hs_ibmessio18_hs]
pagepool 64G [sp]
pagepool 384399572992 [ibmgssio1_hsibmgssio2_hs,ibmgssio3_hsibmgssio4_hs,ibmgssio5_hsibmgssio6_hs]
pagepool 573475966156 [ess5k_mmvdisk_ibmessio11_hs_ibmessio12_hs]
pagepool 96G [ces]

Example of nodeclasses used to address other settings, such as which InfiniBand port(s) to use (a sketch for setting this per class follows the listing).
# mmlsconfig verbsports
verbsPorts mlx5_0
verbsPorts mlx5_0 mlx5_2 [pitzer_dualport]
verbsPorts mlx4_1/1 mlx4_1/2 [dm]
verbsPorts mlx5_0 mlx5_2 [k8_dualport]
verbsPorts mlx5_0 mlx5_1 mlx5_2 mlx5_3 [cardinal_quadport]
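If you wanted the same kind of per-class override for a new class, a minimal sketch (the class name here is a placeholder, and like pagepool this needs a restart of the clients to take effect):

# mmchconfig verbsPorts="mlx5_0 mlx5_2" -N mynodes_dualport
# mmlsconfig verbsPorts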

Ed Wahl
Ohio Supercomputer Center
From: gpfsug-discuss <gpfsug-discuss-bounces at gpfsug.org> On Behalf Of Iban Cabrillo
Sent: Friday, March 8, 2024 9:40 AM
To: gpfsug-discuss <gpfsug-discuss at spectrumscale.org>
Subject: [gpfsug-discuss] pagepool

Good afternoon,
   We are new to DSS system configuration. Reviewing the configuration, I have seen that the default pagepool is set to this value:

    pagepool 323908133683

But it is set not only on the DSS servers but also on the rest of the HPC nodes, and I don't know if it is an excessive value. We are noticing that some jobs are dying with "Memory cgroup out of memory: Killed process XXX", and my doubt is whether this pagepool is reserving too much memory for the mmfs process to the detriment of the execution of jobs.

Any advice is welcomed,

Regards, I
--

================================================================
  Ibán Cabrillo Bartolomé
  Instituto de Física de Cantabria (IFCA-CSIC)
  Santander, Spain
  Tel: +34942200969/+34669930421
  Responsible for advanced computing service (RSC)
=========================================================================================
