Module: check_mk
Branch: master
Commit: 6753c3c70ceda0225572022871fbb97d8be5aea9
URL: http://git.mathias-kettner.de/git/?p=check_mk.git;a=commit;h=6753c3c70ceda0…
Author: Mathias Kettner <mk(a)mathias-kettner.de>
Date: Tue Dec 4 09:37:05 2012 +0100
Added draft for server inventory (German)
---
doc/drafts/LIESMICH.inventur | 58 ++++++++++++++++++++++++++++++++++++++++++
1 files changed, 58 insertions(+), 0 deletions(-)
diff --git a/doc/drafts/LIESMICH.inventur b/doc/drafts/LIESMICH.inventur
new file mode 100644
index 0000000..7981f1a
--- /dev/null
+++ b/doc/drafts/LIESMICH.inventur
@@ -0,0 +1,58 @@
+SERVER INVENTORY
+----------------
+
+(from the reply to an email)
+
+We have indeed discussed this topic, and I have given it some thought
+myself. Not least for that reason, the ideas in LIESMICH.interval came
+about.
+
+The concept I have in mind, however, is not simply to use Check_MK as a
+transport mechanism for an existing inventory script, but to make full use
+of the means we already have. The example you mailed me shows why: the
+current agent already sends a large part of the data today. If, for
+example, on Linux you additionally add:
+
+/proc/cpuinfo
+rpm -qa
+lspci
+dmidecode
+
+then you have almost everything you need. The idea, as with monitoring in
+Check_MK, is that the agent does not pre-evaluate the data - i.e. it does
+not itself search the lspci output for a sound card - instead that is done
+centrally in Check_MK. This is more efficient, more flexible, easier to
+change and above all results in a much simpler agent.
+
+So what would need to be done is:
+
+* Extend the agents with a handful of plugins, cushioning potential
+long-running commands with a larger interval. In the example above this
+may not even be necessary.
+
+* Write inventory-based parsers for the existing and the new relevant
+sections. These extract data and arrange it into a structured tree. The
+tree is written to a file per host.
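As a rough illustration, such a section parser could look like this. This is a sketch only: the function name and the "key: value" section layout are invented for the example and are not actual Check_MK code.

```python
def parse_inventory_section(lines):
    """Hypothetical sketch: turn "key: value" lines from an agent section
    into one flat branch of the structured inventory tree."""
    branch = {}
    for line in lines:
        if ":" not in line:
            continue  # skip lines without a key/value separator
        key, value = line.split(":", 1)
        branch[key.strip()] = value.strip()
    return branch

# Example: a fragment resembling /proc/cpuinfo output
section = [
    "model name: Intel(R) Xeon(R) CPU",
    "cpu MHz: 2400.000",
]
```

The central site would merge branches like this one into a per-host tree and write it to a file, as described above.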
+
+* Extend the Check_MK command line with commands for the inventory.
+
+* Also make the inventory operable from WATO - e.g. triggering it
+directly. Possibly the inventory is simply implemented as an active
+check. That way the Nagios functions (e.g. reschedule) could be used
+directly.
+
+* Pages in the Multisite GUI that display all of this nicely. Possibly
+the table functions that already exist can be used. That way everything
+could be reused: filters/search functions, sorting, grouping, free column
+selection, export to JSON, etc., and the data could immediately be linked
+with monitoring data. Possibly also a web service for an XML export.
+
+* Implement all of this for distributed monitoring as well, i.e. access
+via Multisite to inventory data that resides on another host - or
+synchronization of the data.
+
+* And - hardest of all - find a good term for the whole thing, because
+Check_MK already uses the term "inventory" for the automatic
+detection of services...
+
Module: check_mk
Branch: master
Commit: e3bfda90344c1e65c91c6ef583739a7020451da7
URL: http://git.mathias-kettner.de/git/?p=check_mk.git;a=commit;h=e3bfda90344c1e…
Author: Mathias Kettner <mk(a)mathias-kettner.de>
Date: Mon Dec 3 18:39:54 2012 +0100
Added draft of predictive monitoring
---
doc/drafts/README.predictive | 225 ++++++++++++++++++++++++++++++++++++++++++
1 files changed, 225 insertions(+), 0 deletions(-)
diff --git a/doc/drafts/README.predictive b/doc/drafts/README.predictive
new file mode 100644
index 0000000..d9dea78
--- /dev/null
+++ b/doc/drafts/README.predictive
@@ -0,0 +1,225 @@
+PREDICTIVE MONITORING
+---------------------
+
+1) Introduction
+
+Some people call the following concept "predictive monitoring": Let's assume
+that you have problems assigning levels for the CPU load on a specific server,
+because at certain times of the week an important CPU-intensive job is running.
+These jobs produce a high load - which is completely OK. At other times,
+however, a high load could be a problem. If you now set the warn/crit levels
+such that the check does not trigger during runs of the job, you make the
+monitoring blind to CPU load problems in the other periods.
+
+What we need are levels that change dynamically with time, so that e.g. a load
+of 10 on Monday at 10:00 is OK while the same load at 11:00 should raise an alert.
+If those levels are then *automatically computed* from the values that have
+been measured in the past, then we get an intelligent monitoring that "predicts"
+what is OK and what is not. Other people also call this "anomaly detection".
+
+2) An idea for a solution
+
+Our idea is to use the data kept in RRDs in order to compute sensible
+dynamic levels for certain check parameters. Let's stay with the CPU load as
+an example. In any default Check_MK or OMD installation, PNP4Nagios and thus
+RRDs are used for storing historic performance data such as the CPU load for
+up to four years. This will be our basis. We will analyse this data from time
+to time and compute a forecast for the future.
+
+Before we can do this we need to understand that the whole prediction idea
+is based on *periodic intervals* that repeat again and again. For example,
+if we had a high CPU load on each of the last 50 Mondays from 10:00 to 10:05,
+then we assume that on the next Monday we will see a similar development.
+But the day of the week might not always be the right period. Here are some
+possible periods:
+
+* The day of the week
+* The day of the month (1st, 2nd, ...)
+* The day of the month reverse (last, 2nd last, ...)
+* The hour
+* Just whether it's a work day or a holiday
+
+In general we need to make two decisions:
+* The slicing (for example one day)
+* The grouping (for example group by the day of the week)
+
+If we slice into days and group by the day of the week, we get seven different
+groups. For each group we separately compute a prediction by fetching the
+relevant data from a certain time horizon in the past - for example the
+data of the last 20 Mondays. Then we overlay these 20 graphs and compute for
+each time of day the
+
+- Maximum value
+- Minimum value
+- Average value
+- Standard deviation
+
+When doing this we could impose a larger weight on more recent Mondays
+and a smaller weight on the others. The result is condensed information
+about the past and at the same time a prediction of the future.
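The overlay described above can be sketched in a few lines. This is a minimal illustration of the idea, not the actual implementation; the function name, the decaying weight scheme and the tuple layout are assumptions.

```python
import math

def condense_slices(slices, weight=0.9):
    """Overlay N historic slices (lists of equal length, oldest first) and
    compute, per point in time, (min, max, weighted average, weighted
    standard deviation).  The most recent slice gets weight 1.0, the one
    before it weight 0.9, and so on - a sketch of the weighting idea."""
    result = []
    for values in zip(*slices):
        n = len(values)
        weights = [weight ** (n - 1 - i) for i in range(n)]
        total = sum(weights)
        avg = sum(w * v for w, v in zip(weights, values)) / total
        var = sum(w * (v - avg) ** 2 for w, v in zip(weights, values)) / total
        result.append((min(values), max(values), avg, math.sqrt(var)))
    return result
```

For 20 Mondays one would pass 20 per-day value lists fetched from the RRD; the result is one condensed curve per group.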
+
+Based on that prediction we can now construct dynamic levels by creating
+a "corridor". An example could be: "Raise a warning if the CPU load is
+more than 10% higher than the predicted value". A percentage is not the
+only way to go. We can use:
+
+- A percentage +/- of the predicted value (average, min or max)
+- An absolute difference +/- of the predicted value (e.g. +/- 5)
+- A difference in relation to the standard deviation
+
+Working with the standard deviation takes into account the difference
+between situations where the historic values show a greater or smaller
+variance. In other words: the smaller the standard deviation, the more
+precise the prediction.
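The three corridor schemes above can be condensed into one small helper. This is a sketch under assumed names (the scheme identifiers "relative", "absolute" and "stddev" are invented here, not the real Check_MK API):

```python
def dynamic_levels(reference, stddev, levels_op, levels):
    """Turn a predicted reference value (avg, min or max) into a
    (warn, crit) pair, using one of the three corridor schemes."""
    warn_p, crit_p = levels
    if levels_op == "relative":    # +/- a percentage of the reference
        return reference * (1 + warn_p), reference * (1 + crit_p)
    elif levels_op == "absolute":  # +/- an absolute difference
        return reference + warn_p, reference + crit_p
    elif levels_op == "stddev":    # difference in standard deviations
        return reference + warn_p * stddev, reference + crit_p * stddev
    raise ValueError("unknown levels_op: %r" % levels_op)
```

Lower levels would work the same way with negated offsets.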
+
+It is not only possible to set upper levels - lower levels are possible
+as well. In other words: "Warn me if the CPU load is too *low*!"
+This could hint at an important job that is *not* running.
+
+3) Implementation within Check_MK
+
+When trying to find a good architecture for an implementation, several
+aspects have to be taken into account:
+
+- performance (used CPU/IO/disk resources)
+- flexibility for the user
+- code complexity - and thus cost of implementation and code maintenance
+- transparency to the user
+- possibility of later improvements
+
+The implementation that we suggest tries to maximize all these aspects - while
+admitting that even better ideas might exist...
+
+The implementation touches various areas of Check_MK and consists of the following
+tasks:
+
+a) A script/program that analyses RRD data and creates predicted dynamic levels
+b) A helper function for checks that makes use of that data
+c) Implementing dynamic levels in several checks by using that function
+d) Enhancing the WATO rules of those checks such that the user can configure
+ the dynamic levels
+e) Adapting the PNP templates of those checks such that the dynamic levels
+ are being displayed in the graphs.
+
+Implementation details:
+
+a) Analyser script
+
+This script needs the following input parameters:
+
+* Host name, RRD and variable therein to analyse
+ (e.g. srvabc012 / CPU load / load1)
+
+* Slicing
+ (e.g. 24 hours, aligned to UTC)
+
+* Slice to compute [1]
+ (e.g. "monday")
+
+* Grouping [2]
+ (e.g. group by day of week)
+
+* Time horizon
+ (e.g. 20 weeks into the past)
+
+* Weight function
+ (e.g. the weight of each week is just 90% of the weight
+ of the succeeding week, or: weight all weeks identically)
+
+Notes:
+[1] If we just compute one slice at a time, we can cut down the running time of
+    the script and can do this right within the check on an on-demand basis.
+[2] The grouping can be implemented as a Python function that maps a time stamp (the
+ beginning of a slice) to a string that represents the group. E.g. 1763747600 -> "monday".
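A minimal sketch of such a grouping function, assuming daily slices aligned to UTC (the function name is an assumption for this illustration):

```python
import datetime

WEEKDAYS = ["monday", "tuesday", "wednesday", "thursday",
            "friday", "saturday", "sunday"]

def group_by_weekday(timestamp):
    """Map the UTC start of a daily slice to its group name,
    e.g. a Monday timestamp -> "monday"."""
    day = datetime.datetime.utcfromtimestamp(timestamp).weekday()
    return WEEKDAYS[day]
```

Other groupings (day of month, work day vs. holiday, ...) would just be different functions with the same signature.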
+
+The result is a binary encoded file that is stored below var, e.g.:
+
+var/check_mk/prediction/srvabc012/CPU load/load1/monday
+
+A second file (Python repr() syntax) contains the input parameters including
+the time of youngest contained slice:
+
+var/check_mk/prediction/srvabc012/CPU load/load1/monday.info
+
+This info file allows re-running the prediction if the check parameters have
+changed or if a new slice is needed.
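A sketch of how such an .info file could be checked. The draft does not specify the file layout, so the key names ("params", "last_slice") are assumptions made for this example:

```python
import ast
import os

def prediction_up_to_date(info_path, current_params, newest_slice_start):
    """Return True if the stored prediction can be reused.  The info file
    is assumed to contain the repr() of a dict with the analyser
    parameters and the start time of the youngest contained slice."""
    if not os.path.exists(info_path):
        return False
    info = ast.literal_eval(open(info_path).read())
    return (info.get("params") == current_params
            and info.get("last_slice") >= newest_slice_start)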
+
+The implementation of this program will be in Python, if that is fast enough
+(which I hope), or in C++ otherwise. If it is in Python, then we do not need an
+external program but can put this into a module (just like snmp.py or
+automation.py currently).
+
+
+b) Helper function for checks
+
+When a check (e.g. cpu.loads) wants to apply dynamic levels, it can call a helper
+function that encapsulates all of the intelligent stuff. An example
+call could be (the current hostname and checktype are implicitly known; the
+service description is computed from the checktype and the item):
+
+analyser_params = {
+ "slicing" : (24 * 3600, 0), # duration in sec, offset from UTC
+ "slice" : "monday",
+ "grouping" : "weekday",
+ "horizon" : 24 * 3600 * 150, # 150 days back
+ "weight" : 0.95,
+}
+warn, crit = predict_levels("load1", "avg", "relative", (0.05, 0.10), analyser_params)
+
+The function prototype looks like this:
+
+def predict_levels(ds_name, levels_rel, levels_op, levels, prediction_parameters, item=None):
+ # Get current slice
+ # Check if prediction file is up-to-date
+ # if not, (re-)create prediction file
+ # get min, max, avg, stddev from prediction file for current time
+ # compute levels from that by applying levels_rel, levels_op and levels
+ # return levels
+
+
+c) Implementing dynamic levels in several checks
+
+From the variety of different checks it should be clear that there can be
+no generic way of enabling dynamic levels for *all* checks. So we need to
+concentrate on those checks where dynamic levels make sense, for example:
+
+CPU Load
+CPU Utilization
+Memory usage (?)
+Used bandwidth on network ports
+Disk IO
+Kernel counters like context switches and process creations
+
+Those checks should get *additional* parameters for dynamic levels. That way the
+current configuration of those checks remains compatible and you can still impose
+an ultimate upper limit - regardless of dynamic computations.
+
+Some of those checks need to be converted from a simple tuple to a dictionary-based
+configuration in order to do this. Here we must make sure that old tuple-based
+configurations are still supported.
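The compatibility layer for that conversion is typically a small transform applied to the parameters before use. A sketch (the dict key "levels" is an assumption for this example):

```python
def transform_params(params):
    """Accept both the old tuple form (warn, crit) and the new
    dictionary form, returning the dictionary form in both cases."""
    if isinstance(params, tuple):
        warn, crit = params
        return {"levels": (warn, crit)}
    return params
```

Old rules keep working unchanged, and new keys (e.g. for dynamic levels) can be added to the dictionary without breaking them.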
+
+
+d) WATO rules for dynamic levels
+
+When the user is using dynamic levels for a check he needs to (or better: can) specify
+many parameters, as we have seen. All those parameters are mostly the same for
+all the different checks that support dynamic levels. We can create a helper function
+that makes it easier to declare such parameters in WATO rules.
+
+Checks that support upper *and* lower levels will get two sets of parameters, because
+the logic for upper and lower levels might differ. But they need to share the same
+parameters for the prediction generation (slicing, etc.) so that one prediction file
+per check is sufficient.
+
+
+e) PNP Templates
+
+Making the actually predicted levels transparent to the user is a crucial point
+of the implementation. An easy way to do this is to simply add the predicted
+levels as additional performance values. The PNP templates of the affected
+checks need to detect the availability of those values and add nice lines to
+the graph.
+
+This is easy - while having one drawback: the user cannot see predicted levels
+before they are actually applied. The advantage, on the other hand, is that
+if the parameters for the prediction are changed, the graph still correctly shows
+the levels that were valid at each point in time in the past.
Module: check_mk
Branch: master
Commit: 2d3b19d642167537440b700ecc8c07d31af40489
URL: http://git.mathias-kettner.de/git/?p=check_mk.git;a=commit;h=2d3b19d6421675…
Author: Lars Michelsen <lm(a)mathias-kettner.de>
Date: Mon Dec 3 17:26:23 2012 +0100
FIX: Allowing ":" in application field (e.g. needed for windows logfiles)
---
ChangeLog | 1 +
mkeventd/bin/mkeventd | 5 ++++-
2 files changed, 5 insertions(+), 1 deletions(-)
diff --git a/ChangeLog b/ChangeLog
index 3e2b0cb..3f2d44b 100644
--- a/ChangeLog
+++ b/ChangeLog
@@ -43,6 +43,7 @@
Event Console:
* FIX: fix exception in rules that use facility local7
* FIX: fix event icon in case of using TCP access to EC
+ * FIX: Allowing ":" in application field (e.g. needed for windows logfiles)
* Replication slave can now copy rules from master into local configuration
via a new button in WATO.
* Speedup access to event history by earlier filtering and prefiltering with grep
diff --git a/mkeventd/bin/mkeventd b/mkeventd/bin/mkeventd
index 04e3209..b95dc49 100755
--- a/mkeventd/bin/mkeventd
+++ b/mkeventd/bin/mkeventd
@@ -1294,7 +1294,10 @@ class EventServer:
# Variant 1, 2
else:
- tag, message = rest.split(":", 1)
+ # Replaced ":" by ": " here to make tags with ":" possible. This
+ # is needed to process logs generated by windows agent logfiles
+ # like "c://test.log".
+ tag, message = rest.split(": ", 1)
event["text"] = message.strip()
if '[' in tag:
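The effect of the changed separator can be illustrated with a quick sketch (the sample log line below is invented for the illustration):

```python
# The EC parses syslog-style lines of the form "tag: message".  A Windows
# logfile path used as the tag contains ":" itself, so splitting on the
# first bare ":" cut the tag in half; splitting on ": " keeps it intact.
rest = "C://test.log: Error found in log file"

bad_tag = rest.split(":", 1)[0]     # old behaviour: just "C"
tag, message = rest.split(": ", 1)  # new behaviour

print(bad_tag)  # C
print(tag)      # C://test.log
```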
Module: check_mk
Branch: master
Commit: 22bed1d2d004a5093588d5ed88da555583f10976
URL: http://git.mathias-kettner.de/git/?p=check_mk.git;a=commit;h=22bed1d2d004a5…
Author: Mathias Kettner <mk(a)mathias-kettner.de>
Date: Mon Dec 3 15:58:13 2012 +0100
New script for setting/removing downtimes: doc/treasures/downtime
---
ChangeLog | 1 +
doc/treasures/downtime | 192 ++++++++++++++++++++++++++++++++++++++++++++++++
2 files changed, 193 insertions(+), 0 deletions(-)
diff --git a/ChangeLog b/ChangeLog
index 1bcdf66..3e2b0cb 100644
--- a/ChangeLog
+++ b/ChangeLog
@@ -34,6 +34,7 @@
* if perfometer now differs between byte and bit output
* FIX: warn / crit levels in if-check when using "bit" as unit
* Use pprint when writing global settings (makes files more readable)
+ * New script for setting/removing downtimes: doc/treasures/downtime
WATO:
* FIX: Fixed generated manual check definitions for checks without items
diff --git a/doc/treasures/downtime b/doc/treasures/downtime
new file mode 100755
index 0000000..1cffd1b
--- /dev/null
+++ b/doc/treasures/downtime
@@ -0,0 +1,192 @@
+#!/usr/bin/python
+# encoding: utf-8
+
+# Sets/Removes downtimes via Check_MK Multisite Webservice
+# Before you can use this script, please read:
+# http://mathias-kettner.de/checkmk_multisite_automation.html
+# And create an automation user - best with the name 'automation'
+# And make sure that this user either has the admin role or is
+# contact for all relevant objects.
+
+# Restrictions / Bugs
+# - When removing host downtimes, always *all* services downtimes
+# are also removed
+# - When removing service downtimes the service names are interpreted
+# as regular expressions, but not when setting
+# -> We need a specialized view for the downtimes. Or even better
+# implement the "Remove all downtimes" button in a normal hosts/
+# services views.
+
+import os, sys, getopt, time, urllib
+
+omd_site = os.getenv("OMD_SITE")
+omd_root = os.getenv("OMD_ROOT")
+
+def bail_out(reason):
+ sys.stderr.write(reason + "\n")
+ sys.exit(1)
+
+def verbose(text):
+ if opt_verbose:
+ sys.stdout.write(text + "\n")
+
+
+def usage():
+ sys.stdout.write("""Usage: downtime [-r] [OPTIONS] HOST [SERVICE1] [SERVICE2...]
+
+This program sets and removes downtimes on hosts and services
+via command line. If you run this script from within an OMD
+site then most options will be guessed automatically. Currently
+the script only supports cookie based login - no HTTP basic
+authentication.
+
+Before you use this script, please read:
+http://mathias-kettner.de/checkmk_multisite_automation.html
+You need to create an automation user - best with the name 'automation'
+- and make sure that this user either has the admin role or is contact
+for all relevant objects.
+
+Options:
+ -v, --verbose Show what's going on
+ -s, --set Set downtime (this is the default and thus optional)
+ -r, --remove Remove all downtimes from that host/service
+ -c, --comment Comment for the downtime (otherwise "Automatic downtime")
+ -d, --duration Duration of the downtime in minutes (default: 120)
+ -h, --help Show this help and exit
+ -u, --user Name of automation user (default: "automation")
+ -S, --secret Automation secret (default: read from user settings)
+ -U, --url Base-URL of Multisite (default: guess local OMD site)
+ -a, --all Include all services when setting/removing host downtime
+""")
+
+
+short_options = 'vhrsc:d:u:S:U:a'
+long_options = [ "verbose", "help", "set", "remove", "comment=",
+ "duration=", "user=", "secret=", "url=", "all" ]
+
+opt_all = False
+opt_verbose = False
+opt_mode = 'set'
+opt_comment = "Automatic downtime"
+opt_user = "automation"
+opt_secret = None
+opt_url = None
+opt_duration = 120
+if omd_site:
+ opt_url = "http://localhost/" + omd_site + "/check_mk/"
+
+try:
+ opts, args = getopt.getopt(sys.argv[1:], short_options, long_options)
+except getopt.GetoptError, err:
+ sys.stderr.write("%s\n\n" % err)
+ usage()
+ sys.exit(1)
+
+for o,a in opts:
+ # Docu modes
+ if o in [ '-h', '--help' ]:
+ usage()
+ sys.exit(0)
+
+ # Modifiers
+ elif o in [ '-v', '--verbose']:
+ opt_verbose = True
+ elif o in [ '-a', '--all']:
+ opt_all = True
+ elif o in [ '-r', '--remove']:
+ opt_mode = 'remove'
+ elif o in [ '-c', '--comment']:
+ opt_comment = a
+ elif o in [ '-d', '--duration']:
+ opt_duration = int(a)
+ elif o in [ '-u', '--user']:
+ opt_user = a
+ elif o in [ '-S', '--secret']:
+ opt_secret = a
+ elif o in [ '-U', '--url']:
+ opt_url = a
+
+if omd_site and not opt_secret:
+ try:
+ opt_secret = file(omd_root + "/var/check_mk/web/" + opt_user
+ + "/automation.secret").read().strip()
+ except Exception, e:
+ bail_out("Cannot read automation secret from user %s: %s" %
+ (opt_user, e))
+
+if not opt_url:
+ bail_out("Please specify the URL to Check_MK Multisite with -U.")
+
+if not opt_url.endswith("/check_mk/"):
+ bail_out("The automation URL must end with /check_mk/")
+
+if not args:
+ bail_out("Please specify the host to set a downtime for.")
+
+arg_host = args[0]
+arg_services = args[1:]
+
+if opt_mode == "set":
+ verbose("Mode: set downtime")
+ verbose("Duration: %ds" % opt_duration)
+else:
+ verbose("Mode: remove downtimes")
+verbose("Host: " + arg_host)
+if arg_services:
+ verbose("Services: " + " ".join(arg_services))
+verbose("Multisite-URL: " + opt_url)
+verbose("User: " + opt_user)
+verbose("Secret: " + opt_secret)
+
+def make_url(base, variables):
+ # URL-encode the values so that e.g. spaces in the comment survive
+ vartext = "&".join([ "%s=%s" % (name, urllib.quote(str(value))) for name, value in variables ])
+ return base + "?" + vartext
+
+variables = [
+ ( "_username", opt_user ),
+ ( "_secret", opt_secret ),
+ ( "_transid", "-1" ),
+ ( "_do_confirm", "yes" ),
+ ( "_do_actions", "yes" ),
+ ( "host", arg_host ),
+]
+
+if opt_mode == 'remove':
+ variables += [
+ ("view_name", "downtimes"),
+ ("_remove_downtimes", "Remove"),
+ ]
+else:
+ variables += [
+ ( "_down_from_now", "yes" ),
+ ( "_down_minutes", opt_duration ),
+ ( "_down_comment", opt_comment ),
+ ]
+ if arg_services:
+ variables.append(("view_name", "service"))
+ else:
+ variables.append(("view_name", "hoststatus"))
+
+def set_downtime(variables, add_vars):
+ url = make_url(opt_url + "view.py", variables + add_vars)
+ verbose("URL: " + url)
+ try:
+ pipe = urllib.urlopen(url)
+ l = len(pipe.readlines())
+ verbose(" --> Got %d lines of response" % l)
+ except Exception, e:
+ bail_out("Cannot call Multisite URL: %s" % e)
+
+
+if arg_services:
+ for service in arg_services:
+ set_downtime(variables, [("service", service + "$")])
+else:
+ set_downtime(variables, [])
+ if opt_all:
+ if opt_mode == 'set':
+ set_downtime(variables, [("view_name", "service")])
+
+
Module: check_mk
Branch: master
Commit: 1dd5a1e70717c4fcd1a67efd3cac1866a2df77ed
URL: http://git.mathias-kettner.de/git/?p=check_mk.git;a=commit;h=1dd5a1e70717c4…
Author: Mathias Kettner <mk(a)mathias-kettner.de>
Date: Fri Nov 30 16:07:09 2012 +0100
FIX: fix event icon in case of using TCP access to EC
---
ChangeLog | 1 +
mkeventd/web/plugins/icons/mkeventd.py | 21 +++++++++++++--------
2 files changed, 14 insertions(+), 8 deletions(-)
diff --git a/ChangeLog b/ChangeLog
index 4a46188..1bcdf66 100644
--- a/ChangeLog
+++ b/ChangeLog
@@ -41,6 +41,7 @@
Event Console:
* FIX: fix exception in rules that use facility local7
+ * FIX: fix event icon in case of using TCP access to EC
* Replication slave can now copy rules from master into local configuration
via a new button in WATO.
* Speedup access to event history by earlier filtering and prefiltering with grep
diff --git a/mkeventd/web/plugins/icons/mkeventd.py b/mkeventd/web/plugins/icons/mkeventd.py
index fff175f..9487d6b 100644
--- a/mkeventd/web/plugins/icons/mkeventd.py
+++ b/mkeventd/web/plugins/icons/mkeventd.py
@@ -45,18 +45,18 @@ def paint_mkeventd(what, row, tags, custom_vars):
app = None
# Extract parameters from check_command:
- args = command.split('!')[1].split(' ', 1)
+ args = command.split('!')[1].split()
if not args:
return
- if len(args) >= 1:
- # Handle -a and -H options. Sorry for the hack. We currently
- # have no better idea
- if args[0] == '-H':
- args = args[2:] # skip two arguments
- if args[0] == '-a':
- args = args[1:]
+ # Handle -a and -H options. Sorry for the hack. We currently
+ # have no better idea
+ if len(args) >= 2 and args[0] == '-H':
+ args = args[2:] # skip two arguments
+ if len(args) >= 1 and args[0] == '-a':
+ args = args[1:]
+ if len(args) >= 1:
if args[0] == '$HOSTNAME$':
host = row['host_name']
elif args[0] == '$HOSTADDRESS$':
@@ -64,6 +64,11 @@ def paint_mkeventd(what, row, tags, custom_vars):
else:
host = args[0]
+ # If we have no host then the command line from the check_command seems
+ # to be garbled. Better show nothing in this case.
+ if not host:
+ return
+
# It is possible to have a central event console, this is the default case.
# Another possible architecture is to have an event console in each site in
# a distributed environment. For the later case the base url need to be