Module: check_mk
Branch: master
Commit: e3bfda90344c1e65c91c6ef583739a7020451da7
URL:
http://git.mathias-kettner.de/git/?p=check_mk.git;a=commit;h=e3bfda90344c1e…
Author: Mathias Kettner <mk(a)mathias-kettner.de>
Date: Mon Dec 3 18:39:54 2012 +0100
Added draft of predictive monitoring
---
doc/drafts/README.predictive | 225 ++++++++++++++++++++++++++++++++++++++++++
1 files changed, 225 insertions(+), 0 deletions(-)
diff --git a/doc/drafts/README.predictive b/doc/drafts/README.predictive
new file mode 100644
index 0000000..d9dea78
--- /dev/null
+++ b/doc/drafts/README.predictive
@@ -0,0 +1,225 @@
+PREDICTIVE MONITORING
+---------------------
+
+1) Introduction
+
+Some people call the following concept "predictive monitoring": Let's assume
+that you have problems assigning levels for your CPU load on a specific server,
+because at certain times of the week an important CPU-intensive job is running.
+This job produces a high load - which is completely OK. At other times,
+however, a high load could be a problem. If you now set the warn/crit levels
+such that the check does not trigger during the runs of the job, you make the
+monitoring blind to CPU load problems in the other periods.
+
+What we need are levels that change dynamically with time, so that e.g. a load
+of 10 on Monday at 10:00 is OK while the same load at 11:00 should raise an
+alert. If those levels are then *automatically computed* from the values that
+have been measured in the past, then we get an intelligent monitoring that
+"predicts" what is OK and what is not. Other people also call this "anomaly
+detection".
+
+2) An idea for a solution
+
+Our idea is to use the data that are kept in RRDs in order to compute sensible
+dynamic levels for certain check parameters. Let's stay with the CPU load as an
+example. In any default Check_MK or OMD installation, PNP4Nagios and thus RRDs
+are used for storing historic performance data such as the CPU load for up to
+four years. This will be our basis. We will analyse this data from time to time
+and compute a forecast for the future.
+
+Before we can do this we need to understand that the whole prediction idea
+is based on *periodic intervals* that repeat again and again. For example,
+if we had a high CPU load on each of the last 50 Mondays from 10:00 to 10:05,
+then we assume that on the next Monday we will see a similar development.
+But the day of the week might not always be the way to go. Here are some
+possible periods:
+
+* The day of the week
+* The day of the month (1st, 2nd, ...)
+* The day of the month reverse (last, 2nd last, ...)
+* The hour
+* Just whether it's a work day or a holiday
+
+In general we need to make two decisions:
+* The slicing (for example one day)
+* The grouping (for example group by the day of the week)
+
+If we slice into days and group by the day of the week we get seven different
+groups. For each group we separately compute a prediction by fetching the
+relevant data from a certain time horizon of the past - for example the
+data of the last 20 Mondays. Then we overlay these 20 graphs and compute
+for each time of day the
+
+- Maximum value
+- Minimum value
+- Average value
+- Standard deviation
+
+When doing this we could impose a larger weight on more recent Mondays
+and a smaller weight on the other ones. The result is condensed information
+about the past and at the same time a prediction of the future.
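+
+To illustrate the aggregation step, here is a minimal sketch in plain Python
+(function and variable names are made up for this draft): it takes one list
+of samples per past Monday - newest slice first - and computes per time of
+day the minimum, maximum, weighted average and weighted standard deviation:
+
+import math
+
+def aggregate_slices(slices, weight=0.9):
+    # slices: equally long lists of samples, newest slice first,
+    # one value per time step of the day (e.g. one per 5 minutes)
+    result = []
+    for t in range(len(slices[0])):
+        values  = [s[t] for s in slices]
+        weights = [weight ** i for i in range(len(values))]
+        total   = sum(weights)
+        avg = sum(v * w for v, w in zip(values, weights)) / total
+        var = sum(w * (v - avg) ** 2
+                  for v, w in zip(values, weights)) / total
+        result.append((min(values), max(values), avg, math.sqrt(var)))
+    return result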
+
+Based on that prediction we can now construct dynamic levels by creating
+a "corridor". An example could be: "Raise a warning if the CPU load is
+more than 10% higher than the predicted value". A percentage is not the
+only way to go. We can use:
+
+- A percentage +/- of the predicted value (average, min or max)
+- An absolute difference +/- of the predicted value (e.g. +/- 5)
+- A difference in relation to the standard deviation
+
+Working with the standard deviation takes into account the difference
+between situations where the historic values show a greater or smaller
+variance. In other words: the smaller the standard deviation, the more
+precise the prediction.
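+
+As a sketch, the resulting corridor could be computed like this (the three
+modes mirror the list above; all names are made up for this draft):
+
+def corridor_upper_levels(reference, stddev, how, levels):
+    # reference: predicted value (avg, min or max) for the current time
+    # how:       "relative", "absolute" or "stddev"
+    # levels:    (warn, crit) in the unit selected by 'how'
+    warn, crit = levels
+    if how == "relative":   # e.g. (0.05, 0.10) -> 5% / 10% above prediction
+        return reference * (1 + warn), reference * (1 + crit)
+    elif how == "absolute": # e.g. (2.0, 4.0) -> fixed offsets
+        return reference + warn, reference + crit
+    else:                   # multiples of the standard deviation
+        return reference + warn * stddev, reference + crit * stddev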
+
+It is not only possible to set upper levels - lower levels are possible
+as well. In other words: "Warn me if the CPU load is too *low*!"
+This could be a hint that an important job is *not* running.
+
+3) Implementation within Check_MK
+
+When trying to find a good architecture for an implementation, several
+aspects have to be taken into account:
+
+- performance (used CPU/IO/disk resources)
+- flexibility for the user
+- code complexity - and thus cost of implementation and code maintenance
+- transparency to the user
+- possibility of later improvements
+
+The implementation that we suggest tries to maximize all these aspects - while
+conceding that even better ideas might exist...
+
+The implementation touches various areas of Check_MK and consists of the following
+tasks:
+
+a) A script/program that analyses RRD data and creates predicted dynamic levels
+b) A helper function for checks that makes use of that data
+c) Implementing dynamic levels in several checks by using that function
+d) Enhancing the WATO rules of those checks such that the user can configure
+ the dynamic levels
+e) Adapting the PNP templates of those checks such that the dynamic levels
+ are being displayed in the graphs.
+
+Implementation details:
+
+a) Analyser script
+
+This script needs the following input parameters:
+
+* Hostname and RRD and variable therein to analyse
+ (e.g. srvabc012 / CPU load / load1)
+
+* Slicing
+ (e.g. 24 hours, aligned to UTC)
+
+* Slice to compute [1]
+ (e.g. "monday")
+
+* Grouping [2]
+ (e.g. group by day of week)
+
+* Time horizon
+ (e.g. 20 weeks into the past)
+
+* Weight function
+ (e.g. the weight of each week is just 90% of the weight
+ of the succeeding week, or: weight all weeks identically)
+
+Notes:
+[1] If we just compute one slice at a time we can cut down the running time of
+ the script and can do this right within the check on an on-demand basis.
+[2] The grouping can be implemented as a Python function that maps a time stamp
+ (the beginning of a slice) to a string that represents the group.
+ E.g. 1763747600 -> "monday".
+
+The result is a binary encoded file that is stored below var, e.g.:
+
+var/check_mk/prediction/srvabc012/CPU load/load1/monday
+
+A second file (Python repr() syntax) contains the input parameters including
+the time of the youngest contained slice:
+
+var/check_mk/prediction/srvabc012/CPU load/load1/monday.info
+
+This info file allows re-running the prediction if the check parameters have
+changed or if a new slice is needed.
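+
+A sketch of how the check side could use the info file (the layout of the
+file is an assumption of this draft, not a final format):
+
+import os
+
+def prediction_is_current(info_path, params, newest_slice_start):
+    # The .info file stores the analyser parameters and the start time
+    # of the youngest slice that went into the prediction, in repr() syntax.
+    if not os.path.exists(info_path):
+        return False
+    info = eval(open(info_path).read())
+    return info["params"] == params and \
+           info["last_slice"] >= newest_slice_start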
+
+The implementation of this program will be in Python, if that is fast enough
+(which I hope), or in C++ otherwise. If it is in Python then we do not need an
+external program but can put this into a module (just like currently snmp.py
+or automation.py).
+
+
+b) Helper function for checks
+
+When a check (e.g. cpu.loads) wants to apply dynamic levels it can call a helper
+function that encapsulates all of the intelligent stuff to 100%. An example
+call could be (the current hostname and checktype are implicitly known, the
+service description is being computed from the checktype and the item):
+
+analyser_params = {
+ "slicing" : (24 * 3600, 0), # duration in sec, offset from UTC
+ "slice" : "monday",
+ "grouping" : "weekday",
+ "horizon" : 24 * 3600 * 150, # 150 days back
+ "weight" : 0.95,
+}
+warn, crit = predict_levels("load1", "avg", "relative", (0.05, 0.10),
+                            analyser_params)
+
+The function prototype looks like this:
+
+def predict_levels(ds_name, levels_rel, levels_op, levels,
+                   prediction_parameters, item=None):
+ # Get current slice
+ # Check if prediction file is up-to-date
+ # if not, (re-)create prediction file
+    # get min, max, avg, stddev from prediction file for current time
+ # compute levels from that by applying levels_rel, levels_op and levels
+ # return levels
+
+
+c) Implementing dynamic levels in several checks
+
+From the variety of different checks it should be clear that there can be
+no generic way of enabling dynamic levels for *all* checks. So we need to
+concentrate on those checks where dynamic levels make sense. Examples:
+
+CPU Load
+CPU Utilization
+Memory usage (?)
+Used bandwidth on network ports
+Disk IO
+Kernel counters like context switches and process creations
+
+Those checks should get *additional* parameters for dynamic levels. That way the
+current configuration of those checks stays compatible and you can still impose
+an ultimate upper limit - regardless of dynamic computations.
+
+Some of those checks need to be converted from a simple tuple to a dictionary-based
+configuration in order to do this. Here we must make sure that old tuple-based
+configurations are still supported.
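+
+The usual pattern for that conversion could look like this (a sketch; the
+key names are made up):
+
+def transform_load_params(params):
+    # Old style: a plain tuple (warn, crit).
+    # New style: a dictionary that can additionally hold the
+    # settings for dynamic levels.
+    if type(params) == tuple:
+        return {"levels": params}
+    return params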
+
+
+d) WATO rules for dynamic levels
+
+When the user is using dynamic levels for a check he needs to (or better: can)
+specify many parameters, as we have seen. Most of those parameters are the same
+for all the different checks that support dynamic levels. We can create a helper
+function that makes it easier to declare such parameters in WATO rules.
+
+Checks that support upper *and* lower levels will get two sets of parameters,
+because the logic for upper and lower levels might differ. But they need to
+share the same parameters for the prediction generation (slicing, etc.) so
+that one prediction file per check is sufficient.
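+
+Such a helper could bundle the shared prediction parameters, for example as
+a WATO valuespec (a rough sketch; the element names and choices are made up
+for this draft, using the valuespec classes available in WATO plugins):
+
+def PredictiveLevels(title):
+    # Shared declaration for all checks that support dynamic levels
+    return Dictionary(
+        title = title,
+        elements = [
+            ("horizon", Integer(title = "Time horizon",
+                                unit = "days", default_value = 90)),
+            ("period", DropdownChoice(title = "Base period",
+                choices = [("wday", "Day of the week"),
+                           ("day",  "Day of the month"),
+                           ("hour", "Hour of the day")])),
+            ("levels", Tuple(title = "Levels above prediction",
+                elements = [Percentage(title = "Warning at"),
+                            Percentage(title = "Critical at")])),
+        ])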
+
+
+e) PNP Templates
+
+Making the actually predicted levels transparent to the user is a crucial point
+in the implementation. An easy way to do this is to simply add the predicted
+levels as additional performance values. The PNP templates of the affected
+checks need to detect the availability of those values and add nice lines to
+the graph.
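+
+On the check side this could be as simple as appending extra performance
+values (a sketch with made-up names):
+
+def check_cpu_load_predictive(load, warn, crit, predicted):
+    # Output the predicted reference value as an additional performance
+    # value so that the PNP template can draw it as an extra line.
+    perfdata = [("load1", load, warn, crit),
+                ("predict_load1", predicted)]
+    if load >= crit:
+        state = 2
+    elif load >= warn:
+        state = 1
+    else:
+        state = 0
+    return state, "load is %.2f (predicted: %.2f)" % (load, predicted), perfdata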
+
+This is easy - while having one drawback: the user cannot see predicted levels
+before they are actually applied. The advantage on the other hand is that
+if parameters for the prediction are changed, the graph still correctly shows
+the levels that were valid at each point of time in the past.