<div dir="ltr"><div><div><div><div><div><div>Hello All,<br><br></div>In a recent
Spectrum Scale performance study, we used zimon/mmperfmon to gather
metrics. During a period of 2 months, we ended up losing data twice from
the zimon database; once after the virtual disk serving both the OS
files and zimon collector and DB storage was resized, and a second time
after an unknown event (the loss was discovered when plotting in Grafana
only went back to a certain date and time; likewise, mmperfmon query
output only went back to the same time).<br><br></div>Details:<br></div>- Spectrum Scale 4.2.1.1 (on NSD servers); 4.2.1.2 on the zimon collector node and other clients<br></div><div>-
Data retention in the "raw" stratum was set to 2 months; the "domains"
settings were as follows; note that we did not hit the ceiling of 60GB
(1GB/file * 60 files):</div><div><br></div><div>domains = {<br> # this is the raw domain<br> aggregation = 0 # aggregation factor for the raw domain is always 0.<br> ram = "12g" # amount of RAM to be used<br> duration = "2m" # amount of time that data with the highest precision is kept.<br> filesize = "1g" # maximum file size<br> files = 60 # number of files.<br>},<br>{<br> # this is the first aggregation domain that aggregates to 10 seconds<br> aggregation = 10<br> ram = "800m" # amount of RAM to be used<br> duration = "6m" # keep aggregates for 1 week.<br> filesize = "1g" # maximum file size<br> files = 10 # number of files.<br>},<br>{<br> # this is the second aggregation domain that aggregates to 30*10 seconds == 5 minutes<br> aggregation = 30<br> ram = "800m" # amount of RAM to be used<br> duration = "1y" # keep averages for 2 months.<br> filesize = "1g" # maximum file size<br> files = 5 # number of files.<br>},<br>{<br> # this is the third aggregation domain that aggregates to 24*30*10 seconds == 2 hours<br> aggregation = 24<br> ram = "800m" # amount of RAM to be used<br> duration = "2y" #<br> filesize = "1g" # maximum file size<br> files = 5 # number of files.<br>}<br><br></div><div><br></div>Questions:<br><br></div>1.) Has anyone had similar issues with losing data from zimon?</div><div><br></div><div>2.)
Are there known circumstances where data could be lost, e.g. changing
the aggregation domain definitions, or even simply restarting the zimon
collector?</div><div><br></div><div>3.) Does anyone have any "best
practices" for backing up the zimon database? We were taking weekly
"snapshots" by shutting down the collector, and making a tarball copy of
the /opt/ibm/zimon directory (but the database corruption/data loss
still crept through for various reasons).</div><div><br></div><div><br></div>In
terms of debugging, we do not have Scale or zimon logs going back to
the suspected dates of data loss; we do have a gpfs.snap from about a
month after the last data loss - would it have any useful clues? Opening
a PMR could be tricky, as it was the customer who held the support
entitlement, and the environment (specifically the old cluster
definition and the zimon collector VM) was torn down.<br clear="all"><div><br></div><div><br></div><div>Many Thanks,</div><div> Keith</div><div><br></div>-- <br>Keith D. Ball, PhD<br><div><div>RedLine Performance Solutions, LLC</div><div>web: <a href="http://www.redlineperf.com/" target="_blank">http://www.redlineperf.com/</a><br><div>email: <a href="mailto:aqualkenbush@redlineperf.com" target="_blank">kball@redlineperf.com</a></div></div></div>cell: <a href="tel:%28540%29%20557-7851" value="+15405577851" target="_blank">540-557-7851</a></div>