ID: 8350
Title: Real-time checks: Introducing checking in one second resolution
Component: The Check_MK Micro Core
Level: 2
Class: New Feature
Version: 1.2.7i4
This release introduces a new feature named Real-time checks. With the new
Real-time checks it is possible to monitor specific things a lot shorter
intervals than the normal interval of 60 seconds.
This feature has mainly been developed to get detailed graphs for some values
which change often like for example the memory usage or CPU utilization. But
not only the performance data is updated in this interval. The complete service
state is updated which may also result in faster notifications.
The Real-time checks are working like this: The core is listening on the network
for incoming Real-time check results which are basically UDP packets sent by
the agents in an interval of one second. This needds to be enabled using the
configuration option <i>Global Settings > Enable handling of Real-Time
Checks</i>.
You need to configure the UDP port to listen on (6559 by default) and the secret
which is used to decrypt the Real-time checks. This secret needs to be equal for
the Check_MK server and all agents which are sending Real-time check results.
The agents need to be configured to send Real-time check results. This can currently
be done for the Linux and Windows Agents. On linux you need to create a file
/etc/check_mk/real_time_checks.cfg with the following contents:
F+:/etc/check_mk/real_time_checks.cfg
RTC_TIMEOUT=90
RTC_PORT=6559
RTC_SECRET='hallo123'
RTC_SECTIONS=""
RTC_SECTIONS+="mem "
RTC_SECTIONS+="cpu "
F-:
It is a good idea to reduce permissions on this file because it contains the
real-time check secret which is shared between the Check_MK agent and the server
to encrypt the transfered data. For example <tt>chmod 640
/etc/check_mk/real_time_checks.cfg</tt>
is a good idea.
On windows you need to add the following to the <tt>[global]</tt> section
of your check_mk.ini and restart the Check_MK service:
F+:check_mk.ini
realtime_port = 6559
realtime_sections = mem winperf
realtime_timeout = 90
passphrase = hallo123
F-:
The agent is working as usual, waiting for connections from the Check_MK server.
Once a Check_MK server is contacting the agent, the agent is responding with
it's regular response. Now, when Real-time checks are enabled, the agent is
sending one UDP packet for each enabled section per second to the host address
which had queried the Check_MK agent, which is normally the Check_MK servers
address.
The data which can be processed as Real-time check is limited, so we limit the
sections which can be send as Real-time checks. Currently you can enable only
the <tt>mem</tt> and <tt>cpu</tt> sections on linux and
<tt>mem</tt> and
<tt>winperf</tt> on windows systems. This might be extended in the future.
To get detailed graphs, you now need to configure your RRD databases to be able
to store these detailed information. This can be done via the ruleset
<i>Host & Service Parameters > Monitoring Configuration > Configuration of
RRD databases of services</i>.
You need to create a new rule and first need to ensure that you only apply the
rule to checks which get real-time check information as the RRDs of these
services need more disk space. So you should only select the CPU/Memory services
of hosts which are sending Real-time check results.
Then you need to configure this rule to have a 1 second precision for a duration
of your choice.
Just one example configuration for having:
a) 1 second resolution for 4 hours
b) 1 minute resolution for 2 days
c) 5 minute resolution for 10 days
d) 30 minute resolution for 90 days
e) 6 hour resolution for 4 years
You need to configure these numbers:
C:+
Step (precision): 1 sec.
RRA configuration:
50.0%, 1, 14400
50.0%, 60, 2880
50.0%, 300, 2880
50.0%, 1800, 4320
50.0%, 21600, 5840
C:-
After you configured this, you need to run <tt>cmk --convert-rrds -v</tt> to
convert
the existing RRDs.
After the conversion has finished and processing of the Real-time checks works
correctly, you should see the service state, output and graphs e.g. of the "CPU
utilisation"
service updating in an interval of one second.