<div dir="ltr"><div><div><div><div><div><div>Hello All,<br><br></div>In a recent 

Spectrum Scale performance study, we used zimon/mmperfmon to gather 

metrics. Over a period of 2 months, we lost data from

 the zimon database twice: once after the virtual disk holding the OS 

files, the zimon collector, and the DB storage was resized, and a second time 

after an unknown event (the loss was discovered when plotting in Grafana

 only went back to a certain date and time; likewise, mmperfmon query 

output only went back to the same time).<br><br></div>Details:<br></div>- Spectrum Scale 4.2.1.1 (on NSD servers); 4.2.1.2 on the zimon collector node and other clients<br></div><div>-

 Data retention in the "raw" stratum was set to 2 months. The "domains" 

settings were as follows; note that we did not hit the ceiling of 60GB 

(1GB/file * 60 files):</div><div><br></div><div>domains = {<br>        # this is the raw domain<br>        aggregation = 0         # aggregation factor for the raw domain is always 0.<br>        ram = "12g"             # amount of RAM to be used<br>        duration = "2m"         # amount of time that data with the highest precision is kept.<br>        filesize = "1g"         # maximum file size<br>        files = 60              # number of files.<br>},<br>{<br>        # this is the first aggregation domain that aggregates to 10 seconds<br>        aggregation = 10<br>        ram = "800m"            # amount of RAM to be used<br>        duration = "6m"         # keep aggregates for 1 week.<br>        filesize = "1g"         # maximum file size<br>        files = 10              # number of files.<br>},<br>{<br>        # this is the second aggregation domain that aggregates to 30*10 seconds == 5 minutes<br>        aggregation = 30<br>        ram = "800m"            # amount of RAM to be used<br>        duration = "1y"         # keep averages for 2 months.<br>        filesize = "1g"         # maximum file size<br>        files = 5               # number of files.<br>},<br>{<br>        # this is the third aggregation domain that aggregates to 24*30*10 seconds == 2 hours<br>        aggregation = 24<br>        ram = "800m"            # amount of RAM to be used<br>        duration = "2y"         #<br>        filesize = "1g"         # maximum file size<br>        files = 5               # number of files.<br>}<br><br></div><div><br></div>Questions:<br><br></div>1.) Has anyone had similar issues with losing data from zimon?</div><div><br></div><div>2.)

 Are there known circumstances where data could be lost, e.g. changing 

the aggregation domain definitions, or even simply restarting the zimon 

collector?</div><div><br></div><div>3.) Does anyone have any "best 

practices" for backing up the zimon database? We were taking weekly 

"snapshots" by shutting down the collector, and making a tarball copy of

 the /opt/ibm/zimon directory (but the database corruption/data loss 

still crept through for various reasons).</div><div><br></div><div><br></div>In

 terms of debugging, we do not have Scale or zimon logs going back to 

the suspected dates of data loss; we do have a gpfs.snap from about a 

month after the last data loss - would it have any useful clues? Opening

 a PMR could be tricky, as it is the customer who holds the support 

entitlement, and the environment (specifically the old cluster 

definition and the zimon collector VM) was torn down.<br clear="all"><div><br></div><div><br></div><div>Many Thanks,</div><div>  Keith</div><div><br></div>-- <br>Keith D. Ball, PhD<br><div><div>RedLine Performance Solutions, LLC</div><div>web:  <a href="http://www.redlineperf.com/" target="_blank">http://www.redlineperf.com/</a><br><div>email: <a href="mailto:aqualkenbush@redlineperf.com" target="_blank">kball@redlineperf.com</a></div></div></div>cell: <a href="tel:%28540%29%20557-7851" value="+15405577851" target="_blank">540-557-7851</a></div>