Module: check_mk
Branch: master
Commit: a50bbb6f450a8c5f79837be45ff9f99dccb60851
URL:
http://git.mathias-kettner.de/git/?p=check_mk.git;a=commit;h=a50bbb6f450a8c…
Author: Lars Michelsen <lm(a)mathias-kettner.de>
Date: Mon Jul 18 14:50:51 2016 +0200
3672 FIX Agent sections cached by the agent could cause stale services
In some situations, when the agent caches sections and the check interval
of a check is set to a particular value, services could become stale even
though the agent still had cached data it could send.

For example, the ntp section is cached for 30 seconds. With a check
interval of 120 seconds, its service could become stale after 180 seconds
if the update of the agent cache took more than 60 seconds. The agent then
killed the running cache update, discarded its partial output and returned
no ntp section, which resulted in a stale service.

With this fix, the agent outputs the old, already cached section in such
a situation and restarts the cache update immediately (instead of in the
next agent run, as before).
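The timing in the ntp example can be checked with plain shell arithmetic. This is a sketch under one assumption not stated in the commit: Check_MK marks a service as stale after 1.5 check intervals without a result, which reproduces the 180 seconds quoted above. The timeline (last data at t=0, the run at t=120 returns no section) is one possible scenario, not a trace from a real system.

```shell
#!/bin/bash
# Numbers taken from the ntp example in this commit message.
MAXAGE=30           # how long the agent caches the ntp section (seconds)
CHECK_INTERVAL=120  # configured check interval (seconds)

# The agent kills a cache update once $CACHEFILE.new is at least 2*MAXAGE old.
KILL_AFTER=$(( MAXAGE * 2 ))

# Assumption: a service is flagged stale after 1.5 check intervals
# without a result; this yields the 180 s mentioned above.
STALE_AFTER=$(( CHECK_INTERVAL * 3 / 2 ))

# Before the fix: data at t=0, the run at t=120 returns no ntp section,
# so the next possible result arrives one interval later, at t=240.
NEXT_RESULT=$(( 2 * CHECK_INTERVAL ))

echo "cache update killed after ${KILL_AFTER}s"
echo "service stale after ${STALE_AFTER}s, next result at ${NEXT_RESULT}s"
if [ "$NEXT_RESULT" -gt "$STALE_AFTER" ]; then
    echo "=> service goes stale before new data arrives"
fi
```

Because the next result can only arrive one full check interval after the empty run, any single run without the section is enough to push the service past the staleness threshold.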
---
.werks/3672 | 22 ++++++++++++++++++++++
ChangeLog | 1 +
agents/check_mk_agent.linux | 7 ++++---
3 files changed, 27 insertions(+), 3 deletions(-)
diff --git a/.werks/3672 b/.werks/3672
new file mode 100644
index 0000000..4fee24f
--- /dev/null
+++ b/.werks/3672
@@ -0,0 +1,22 @@
+Title: Agent sections cached by the agent could cause stale services
+Level: 1
+Component: checks
+Class: fix
+Compatible: compat
+State: unknown
+Version: 1.4.0i1
+Date: 1468845997
+
+In some situations, when the agent caches sections and the check interval
+of a check is set to a particular value, services could become stale even
+though the agent still had cached data it could send.
+
+For example, the ntp section is cached for 30 seconds. With a check
+interval of 120 seconds, its service could become stale after 180 seconds
+if the update of the agent cache took more than 60 seconds. The agent then
+killed the running cache update, discarded its partial output and returned
+no ntp section, which resulted in a stale service.
+
+With this fix, the agent outputs the old, already cached section in such
+a situation and restarts the cache update immediately (instead of in the
+next agent run, as before).
diff --git a/ChangeLog b/ChangeLog
index 219c4a7..cea99fd 100644
--- a/ChangeLog
+++ b/ChangeLog
@@ -313,6 +313,7 @@
* 3708 FIX: cisco_vpn_tunnel: fixed missing phase 2 data
* 3556 FIX: agent_vsphere.pysphere: The ESX 4.1 compatible agent version no longer
validates the ssl certificate
* 3709 FIX: cisco_wlc, cisco_wlc_clients: fixed scan function and incomplete listing
of interfaces
+ * 3672 FIX: Agent sections cached by the agent could cause stale services...
Multisite:
* 3187 notification view: new filter for log command via regex
diff --git a/agents/check_mk_agent.linux b/agents/check_mk_agent.linux
index 30afede..fa4e8db 100755
--- a/agents/check_mk_agent.linux
+++ b/agents/check_mk_agent.linux
@@ -145,8 +145,10 @@ function run_cached () {
if [ ! -d $MK_VARDIR/cache ]; then mkdir -p $MK_VARDIR/cache ; fi
CACHEFILE="$MK_VARDIR/cache/$NAME.cache"
- # Check if the creation of the cache takes suspiciously long and return
- # nothing if the age (access time) of $CACHEFILE.new is twice the MAXAGE
+ # Check if the creation of the cache takes suspiciously long and kill the
+ # process if the age (access time) of $CACHEFILE.new is at least twice MAXAGE.
+ # Output the already cached section (if any) anyway and start the cache
+ # update again.
if [ -e "$CACHEFILE.new" ] ; then
local CF_ATIME=$(stat -c %X "$CACHEFILE.new")
if [ $((NOW - CF_ATIME)) -ge $((MAXAGE * 2)) ] ; then
@@ -154,7 +156,6 @@ function run_cached () {
# it is still running. This avoids overlapping processes!
fuser -k -9 "$CACHEFILE.new" >/dev/null 2>&1
rm -f "$CACHEFILE.new"
- return
fi
fi
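The diff only shows the top of run_cached(). The sketch below is a hypothetical, simplified bash model of the patched flow: the kill check mirrors the diff, while the rest (printing a possibly outdated cache file, then restarting the update in the background) is an assumption based on the new comment, not the agent's actual code. The function name run_cached_sketch and its CACHEDIR argument are invented for illustration. It assumes GNU stat; fuser is only called if it is installed.

```shell
#!/bin/bash
# Hypothetical, simplified model of run_cached() after this fix.
# Usage: run_cached_sketch NAME MAXAGE CACHEDIR COMMAND [ARGS...]
run_cached_sketch () {
    local NAME=$1 MAXAGE=$2 CACHEDIR=$3
    shift 3
    local NOW CACHEFILE CF_ATIME MTIME FRESH=
    NOW=$(date +%s)
    CACHEFILE="$CACHEDIR/$NAME.cache"
    mkdir -p "$CACHEDIR"

    # Same check as in the diff: an update whose .new file is at least
    # 2*MAXAGE old (by access time) is considered hung and is killed.
    if [ -e "$CACHEFILE.new" ]; then
        CF_ATIME=$(stat -c %X "$CACHEFILE.new")
        if [ $((NOW - CF_ATIME)) -ge $((MAXAGE * 2)) ]; then
            if command -v fuser >/dev/null 2>&1; then
                fuser -k -9 "$CACHEFILE.new" >/dev/null 2>&1
            fi
            rm -f "$CACHEFILE.new"
            # No "return" any more: fall through to the code below.
        fi
    fi

    # Assumed behaviour: print the cached section even if it is outdated.
    if [ -s "$CACHEFILE" ]; then
        MTIME=$(stat -c %Y "$CACHEFILE")
        if [ $((NOW - MTIME)) -le "$MAXAGE" ]; then FRESH=1; fi
        cat "$CACHEFILE"
    fi

    # Assumed behaviour: cache outdated and no update running? Start one.
    if [ -z "$FRESH" ] && [ ! -e "$CACHEFILE.new" ]; then
        ( "$@" > "$CACHEFILE.new" && mv "$CACHEFILE.new" "$CACHEFILE" \
            || rm -f "$CACHEFILE.new" ) >/dev/null 2>&1 &
    fi
}
```

Called as `run_cached_sketch ntp 30 /tmp/mk-cache ntpq -np` while a hung update has left a stale ntp.cache.new behind, the sketch prints the old cached ntp output and starts a fresh update in the same call. With the removed `return` still in place, the same call would have printed nothing for the section.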