<html><head></head><body><div style="color:#000; background-color:#fff; font-family:HelveticaNeue, Helvetica Neue, Helvetica, Arial, Lucida Grande, sans-serif;font-size:16px"><div id="yiv7318490157"><div id="yui_3_16_0_ym19_1_1484899594252_25178"><div style="color:#000;background-color:#fff;font-family:HelveticaNeue, Helvetica Neue, Helvetica, Arial, Lucida Grande, sans-serif;font-size:16px;" id="yui_3_16_0_ym19_1_1484899594252_25177"><div dir="ltr" id="yiv7318490157yui_3_16_0_ym19_1_1484899594252_17145"><font size="2"><br id="yiv7318490157yui_3_16_0_ym19_1_1484899594252_17278" clear="none"></font><span id="yiv7318490157yui_3_16_0_ym19_1_1484899594252_17280"><font id="yiv7318490157yui_3_16_0_ym19_1_1484899594252_17495" size="2"><span id="yiv7318490157yui_3_16_0_ym19_1_1484899594252_17416">Hi Mathias,<br id="yiv7318490157yui_3_16_0_ym19_1_1484899594252_17417" clear="none"></span></font></span><span id="yiv7318490157yui_3_16_0_ym19_1_1484899594252_17280"><font id="yiv7318490157yui_3_16_0_ym19_1_1484899594252_17495" size="2"><span id="yiv7318490157yui_3_16_0_ym19_1_1484899594252_17416"></span></font></span><div id="yui_3_16_0_ym19_1_1484899594252_27130"><br></div><div id="yui_3_16_0_ym19_1_1484899594252_27131"><font id="yui_3_16_0_ym19_1_1484899594252_27132" size="2">It's OK when we remove the configuration file, the process doens't start.</font></div><div id="yui_3_16_0_ym19_1_1484899594252_27158"><font id="yui_3_16_0_ym19_1_1484899594252_27132" size="2"></font><br></div><span id="yiv7318490157yui_3_16_0_ym19_1_1484899594252_17280"><font id="yiv7318490157yui_3_16_0_ym19_1_1484899594252_17495" size="2"><span id="yiv7318490157yui_3_16_0_ym19_1_1484899594252_17416">The problem occurs mainly with our compute nodes (all of them) and we don't use GUI and CES.<br id="yiv7318490157yui_3_16_0_ym19_1_1484899594252_17420" clear="none"><br id="yiv7318490157yui_3_16_0_ym19_1_1484899594252_17421" clear="none">Ideed, I confirm we don't see performance impact with Linpack running on more than hundred nodes, it appears especially when there is a lot of communications wich is the case of our applications, our high speed network is based on Intel OmniPath Fabric.<br id="yiv7318490157yui_3_16_0_ym19_1_1484899594252_17424" clear="none"><br id="yiv7318490157yui_3_16_0_ym19_1_1484899594252_17425" clear="none">We

 are seeing irregular iteration time every 30 sec. By Enabling 

HyperThreading, the issue is a little bit hidden but still there.<br id="yiv7318490157yui_3_16_0_ym19_1_1484899594252_17426" clear="none"><br id="yiv7318490157yui_3_16_0_ym19_1_1484899594252_17427" clear="none">By using less cores per nodes (26 instead of 28), we don't see this behavior as if it needs one core for mmsysmon process.<br id="yiv7318490157yui_3_16_0_ym19_1_1484899594252_17428" clear="none"></span></font></span></div><div dir="ltr" id="yiv7318490157yui_3_16_0_ym19_1_1484899594252_20451"><span id="yiv7318490157yui_3_16_0_ym19_1_1484899594252_17280"><font id="yiv7318490157yui_3_16_0_ym19_1_1484899594252_17495" size="2"><span id="yiv7318490157yui_3_16_0_ym19_1_1484899594252_17416"><br clear="none"></span></font></span></div><div dir="ltr" id="yiv7318490157yui_3_16_0_ym19_1_1484899594252_20450"><span id="yiv7318490157yui_3_16_0_ym19_1_1484899594252_17280"><font id="yiv7318490157yui_3_16_0_ym19_1_1484899594252_17495" size="2"><span id="yiv7318490157yui_3_16_0_ym19_1_1484899594252_17416">I agree with you, might be good idea to open a PMR...<br id="yiv7318490157yui_3_16_0_ym19_1_1484899594252_17429" clear="none"></span></font></span></div><div dir="ltr" id="yiv7318490157yui_3_16_0_ym19_1_1484899594252_20449"><span id="yiv7318490157yui_3_16_0_ym19_1_1484899594252_17280"><font id="yiv7318490157yui_3_16_0_ym19_1_1484899594252_17495" size="2"><span id="yiv7318490157yui_3_16_0_ym19_1_1484899594252_17416"><br clear="none"></span></font></span></div><div dir="ltr" id="yui_3_16_0_ym19_1_1484899594252_25295"><span id="yiv7318490157yui_3_16_0_ym19_1_1484899594252_17280"><font id="yiv7318490157yui_3_16_0_ym19_1_1484899594252_17495" size="2"><span id="yiv7318490157yui_3_16_0_ym19_1_1484899594252_17416">Please find below the output of mmhealth node show --verbose<br id="yiv7318490157yui_3_16_0_ym19_1_1484899594252_17430" clear="none"><br id="yiv7318490157yui_3_16_0_ym19_1_1484899594252_17431" clear="none">Node status:             HEALTHY<br id="yiv7318490157yui_3_16_0_ym19_1_1484899594252_17432" clear="none"><br id="yiv7318490157yui_3_16_0_ym19_1_1484899594252_17433" clear="none">Component                Status                   Reasons<br id="yiv7318490157yui_3_16_0_ym19_1_1484899594252_17434" clear="none">-------------------------------------------------------------------<br id="yiv7318490157yui_3_16_0_ym19_1_1484899594252_17435" clear="none">GPFS                     HEALTHY                  -<br id="yiv7318490157yui_3_16_0_ym19_1_1484899594252_17436" clear="none">NETWORK                  HEALTHY                  -<br id="yiv7318490157yui_3_16_0_ym19_1_1484899594252_17437" clear="none">  ib0                      HEALTHY                  -<br id="yiv7318490157yui_3_16_0_ym19_1_1484899594252_17438" clear="none">FILESYSTEM               HEALTHY                  -<br id="yiv7318490157yui_3_16_0_ym19_1_1484899594252_17439" clear="none">  gpfs1                    HEALTHY                  -<br id="yiv7318490157yui_3_16_0_ym19_1_1484899594252_17440" clear="none">  gpfs2                    HEALTHY                  -<br id="yiv7318490157yui_3_16_0_ym19_1_1484899594252_17441" clear="none">DISK                     HEALTHY                  -<br id="yiv7318490157yui_3_16_0_ym19_1_1484899594252_17442" clear="none"></span></font></span></div><div dir="ltr" id="yui_3_16_0_ym19_1_1484899594252_25340"><span id="yiv7318490157yui_3_16_0_ym19_1_1484899594252_17280"><font id="yiv7318490157yui_3_16_0_ym19_1_1484899594252_17495" size="2"><span id="yiv7318490157yui_3_16_0_ym19_1_1484899594252_17416"><br id="yiv7318490157yui_3_16_0_ym19_1_1484899594252_17443" clear="none">Thanks<br id="yiv7318490157yui_3_16_0_ym19_1_1484899594252_17444" clear="none">Farid</span></font></span></div> <div class="yiv7318490157qtdSeparateBR" id="yui_3_16_0_ym19_1_1484899594252_25341"><br clear="none"><br clear="none"></div><div class="yiv7318490157yqt1741576323" id="yiv7318490157yqt08921"></div></div></div></div><div class=".yiv7318490157yahoo_quoted"> <div style="font-family:HelveticaNeue, Helvetica Neue, Helvetica, Arial, Lucida Grande, sans-serif;font-size:16px;"> <div style="font-family:HelveticaNeue, Helvetica Neue, Helvetica, Arial, Lucida Grande, sans-serif;font-size:16px;"> <div dir="ltr"><font size="2" face="Arial"> Le Jeudi 19 janvier 2017 19h21, Simon Thompson (Research Computing - IT Services) <S.J.Thompson@bham.ac.uk> a écrit :<br clear="none"></font></div>  <br clear="none"><br clear="none"> <div class="yiv7318490157y_msg_container">On some of our nodes we were regularly seeing procees hung timeouts in dmesg from a python process, which I vaguely thought was related to the monitoring process (though we have other python bits from openstack running on these boxes). These are all running 4.2.2.0 code<br clear="none"><br clear="none">Simon<br clear="none">________________________________________<br clear="none">From: <a rel="nofollow" shape="rect" ymailto="mailto:gpfsug-discuss-bounces@spectrumscale.org" target="_blank" href="mailto:gpfsug-discuss-bounces@spectrumscale.org">gpfsug-discuss-bounces@spectrumscale.org</a> [<a rel="nofollow" shape="rect" ymailto="mailto:gpfsug-discuss-bounces@spectrumscale.org" target="_blank" href="mailto:gpfsug-discuss-bounces@spectrumscale.org">gpfsug-discuss-bounces@spectrumscale.org</a>] on behalf of Mathias Dietz [<a rel="nofollow" shape="rect" ymailto="mailto:MDIETZ@de.ibm.com" target="_blank" href="mailto:MDIETZ@de.ibm.com">MDIETZ@de.ibm.com</a>]<br clear="none">Sent: 19 January 2017 18:07<br clear="none">To: FC; gpfsug main discussion list<br clear="none">Subject: Re: [gpfsug-discuss] Bad performance with GPFS system monitoring (mmsysmon) in GPFS 4.2.1.1<br clear="none"><br clear="none">Hi Farid,<br clear="none"><br clear="none">there is no official way for disabling the system health monitoring because other components rely on it (e.g. GUI, CES, Install Toolkit,..)<br clear="none">If you are fine with the consequences you can just delete the mmsysmonitor.conf, which will prevent the monitor from starting.<br clear="none"><br clear="none">During our testing we did not see a significant performance impact caused by the monitoring.<br clear="none">In 4.2.2 some component monitors (e.g. disk) have been further improved to reduce polling and use notifications instead.<br clear="none"><br clear="none">Nevertheless, I would like to better understand what the issue is.<br clear="none">What kind of workload do you run ?<br clear="none">Do you see spikes in CPU usage every 30 seconds ?<br clear="none">Is it the same on all cluster nodes or just on some of them ?<br clear="none">Could you send us the output of "mmhealth node show -v" to see which monitors are active.<br clear="none"><br clear="none">It might make sense to open a PMR to get this issue fixed.<br clear="none"><br clear="none">Thanks.<br clear="none"><br clear="none"><br clear="none">Mit freundlichen Grüßen / Kind regards<br clear="none"><br clear="none">Mathias Dietz<br clear="none"><br clear="none">Spectrum Scale - Release Lead Architect (4.2.X Release)<br clear="none">System Health and Problem Determination Architect<br clear="none">IBM Certified Software Engineer<br clear="none"><br clear="none">----------------------------------------------------------------------------------------------------------<br clear="none">IBM Deutschland<br clear="none">Hechtsheimer Str. 2<br clear="none">55131 Mainz<br clear="none">Mobile: +49-15152801035<br clear="none">E-Mail: <a rel="nofollow" shape="rect" ymailto="mailto:mdietz@de.ibm.com" target="_blank" href="mailto:mdietz@de.ibm.com">mdietz@de.ibm.com</a><br clear="none">----------------------------------------------------------------------------------------------------------<br clear="none">IBM Deutschland Research & Development GmbH<br clear="none">Vorsitzender des Aufsichtsrats: Martina Koederitz, Geschäftsführung: Dirk Wittkopp<br clear="none">Sitz der Gesellschaft: Böblingen / Registergericht: Amtsgericht Stuttgart, HRB 243294<br clear="none"><br clear="none"><br clear="none"><br clear="none"><br clear="none"><br clear="none">From:        FC <<a rel="nofollow" shape="rect" ymailto="mailto:farid.chabane@ymail.com" target="_blank" href="mailto:farid.chabane@ymail.com">farid.chabane@ymail.com</a>><br clear="none">To:        "<a rel="nofollow" shape="rect" ymailto="mailto:gpfsug-discuss@spectrumscale.org" target="_blank" href="mailto:gpfsug-discuss@spectrumscale.org">gpfsug-discuss@spectrumscale.org</a>" <<a rel="nofollow" shape="rect" ymailto="mailto:gpfsug-discuss@spectrumscale.org" target="_blank" href="mailto:gpfsug-discuss@spectrumscale.org">gpfsug-discuss@spectrumscale.org</a>><br clear="none">Date:        01/19/2017 07:06 AM<br clear="none">Subject:        [gpfsug-discuss] Bad performance with GPFS system monitoring (mmsysmon) in GPFS 4.2.1.1<br clear="none">Sent by:        <a rel="nofollow" shape="rect" ymailto="mailto:gpfsug-discuss-bounces@spectrumscale.org" target="_blank" href="mailto:gpfsug-discuss-bounces@spectrumscale.org">gpfsug-discuss-bounces@spectrumscale.org</a><br clear="none">________________________________<br clear="none"><br clear="none"><br clear="none"><br clear="none">Hi all,<br clear="none"><br clear="none">We are facing performance issues with some of our applications due to the GPFS system monitoring (mmsysmon) on CentOS 7.2.<br clear="none"><br clear="none">Bad performances (increase of iteration time) are seen every 30s exactly as the occurence frequency of mmsysmon ; the default monitor interval set to 30s in /var/mmfs/mmsysmon/mmsysmonitor.conf<br clear="none"><br clear="none">Shutting down GPFS with mmshutdown doesnt stop this process, we stopped it with the command mmsysmoncontrol and we get a stable iteration time.<br clear="none"><br clear="none">What are the impacts of disabling this process except losing access to mmhealth commands ?<br clear="none">Do you have an idea of a proper way to disable it for good without doing it in rc.local or increasing the monitoring interval in the configuration file ?<br clear="none"><br clear="none">Thanks,<br clear="none">Farid _______________________________________________<br clear="none">gpfsug-discuss mailing list<br clear="none">gpfsug-discuss at spectrumscale.org<br clear="none"><a rel="nofollow" shape="rect" target="_blank" href="http://gpfsug.org/mailman/listinfo/gpfsug-discuss">http://gpfsug.org/mailman/listinfo/gpfsug-discuss</a><div class="yiv7318490157yqt2187318107" id="yiv7318490157yqtfd76357"><br clear="none"><br clear="none"><br clear="none"><br clear="none">_______________________________________________<br clear="none">gpfsug-discuss mailing list<br clear="none">gpfsug-discuss at spectrumscale.org<br clear="none"><a rel="nofollow" shape="rect" target="_blank" href="http://gpfsug.org/mailman/listinfo/gpfsug-discuss">http://gpfsug.org/mailman/listinfo/gpfsug-discuss</a></div><br clear="none"><br clear="none"></div>  </div> </div>  </div></div></body></html>