This repository has been archived by the owner on Jan 15, 2019. It is now read-only.

[dev.icinga.com #1978] read last_program_stop from retention.dat and use that for freshness calculations on startup instead of event_time #755

Closed
icinga-migration opened this issue Oct 6, 2011 · 4 comments

Comments

@icinga-migration

This issue has been migrated from Redmine: https://dev.icinga.com/issues/1978

Created by mfriedrich on 2011-10-06 11:23:50 +00:00

Assignee: mfriedrich
Status: Closed (closed on 2012-08-22 16:31:06 +00:00)
Target Version: (none)
Last Update: 2012-08-22 16:31:06 +00:00 (in Redmine)


this is a pretty epic idea, because Icinga cores that were shut down for a long time have the problem that the freshness check on startup depends on the expiration time.

is_service_result_fresh 

if(temp_service->has_been_checked == FALSE)
      expiration_time = (time_t)(event_start + freshness_threshold);

which then results in

/* the results for the last check of this service are stale */
if(expiration_time < current_time) {

the main problem with this approach: if there is no retention.dat, the logic would fail once it is changed in this way. an inexact workaround would be to always write retention.dat, since we currently need it, or to introduce a token that indicates the program stop either way. in any case, the docs should state that retained state information now also contains the last program stop and is therefore mandatory for freshness checks on passive checks (i.e. on passive slaves in distributed setups).
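
For illustration, the check quoted above from is_service_result_fresh can be sketched as two small standalone functions. This is a minimal sketch, not the actual Icinga code: the function names are hypothetical, and the branch for an already-checked service (last_check + freshness_threshold) is an assumption, since the snippet above only shows the has_been_checked == FALSE case.

```c
#include <stdbool.h>
#include <time.h>

/* Hypothetical helper: mirrors the quoted snippet for a service that
 * has never been checked. */
static time_t service_expiration_time(bool has_been_checked,
                                      time_t last_check,
                                      time_t event_start,
                                      int freshness_threshold)
{
    if (!has_been_checked)
        return event_start + freshness_threshold;

    /* Assumption: otherwise freshness is counted from the last check. */
    return last_check + freshness_threshold;
}

/* The result is stale once the expiration time lies in the past. */
static bool result_is_stale(time_t expiration_time, time_t current_time)
{
    return expiration_time < current_time;
}
```

With a retained last_check from long before the shutdown, the expiration time is already in the past when the core starts, which is the situation this issue describes.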

https://github.com/dnsmichi/nagios-fixed/commit/472d92ac81218f85c81571e31963545ebec7a988
https://github.com/dnsmichi/nagios-fixed/commit/8a8238f37a46f2ca73bebcf728a610385d49acd4


Relations:

@icinga-migration

Updated by mfriedrich on 2011-11-11 15:10:51 +00:00

  • Category set to Passive Checks
  • Status changed from New to Resolved
  • Assigned to set to mfriedrich
  • Target Version set to 1.6
  • Done % changed from 0 to 100

@icinga-migration

Updated by mfriedrich on 2011-12-02 15:47:02 +00:00

  • Status changed from Resolved to Feedback
  • Target Version deleted 1.6
  • Done % changed from 100 to 0

this is the cause of #2136; it needs a proper rework and a tested version.

@icinga-migration

Updated by mfriedrich on 2011-12-08 12:26:15 +00:00

as the first-glance analysis in #2136 showed, the program_stop+60 check remains the wrong assumption in that case.

possible fix below, needs deeper testing.

Revision: 1848
          http://nagios.svn.sourceforge.net/nagios/?rev=1848&view=rev
Author:   ageric
Date:     2011-12-08 12:12:02 +0000 (Thu, 08 Dec 2011)
Log Message:
-----------
core: Fix passive check result freshness test after restart

The last version of the code to avoid sending notifications about stale
checks on start confused event_start and last_check - it would trigger
whenever nagios took less than 60 seconds to start, and it had been
turned off for some time before, and would override the last check
timestamp with the nagios start time.

Signed-off-by: Robin Sonefors 

Modified Paths:
--------------
    nagioscore/trunk/base/checks.c

Modified: nagioscore/trunk/base/checks.c
===================================================================
--- nagioscore/trunk/base/checks.c  2011-12-08 11:39:34 UTC (rev 1847)
+++ nagioscore/trunk/base/checks.c  2011-12-08 12:12:02 UTC (rev 1848)
@@ -2093,15 +2093,15 @@
     * If the check was last done passively, we assume it's going
     * to continue that way and we need to handle the fact that
     * Nagios might have been shut off for quite a long time. If so,
-    * we mustn't spam freshness notifications but use program_start_time
+    * we mustn't spam freshness notifications but use event_start
     * instead of last_check to determine freshness expiration time.
     * The threshold for "long time" is determined as 61.8% of the normal
     * freshness threshold based on vast heuristical research (ie, "some
     * guy once told me the golden ratio is good for loads of stuff").
     */
    if (temp_service->check_type == SERVICE_CHECK_PASSIVE) {
-       if (event_start < program_start + 60 &&
-           event_start - last_program_stop < (freshness_threshold * 0.618))
+       if (temp_service->last_check < event_start &&
+           event_start - last_program_stop < freshness_threshold * 0.618)
        {
            expiration_time = event_start + freshness_threshold;
        }
@@ -2521,15 +2521,15 @@
     * If the check was last done passively, we assume it's going
     * to continue that way and we need to handle the fact that
     * Nagios might have been shut off for quite a long time. If so,
-    * we mustn't spam freshness notifications but use program_start_time
+    * we mustn't spam freshness notifications but use event_start
     * instead of last_check to determine freshness expiration time.
     * The threshold for "long time" is determined as 61.8% of the normal
     * freshness threshold based on vast heuristical research (ie, "some
     * guy once told me the golden ratio is good for loads of stuff").
     */
    if (temp_host->check_type == HOST_CHECK_PASSIVE) {
-       if (event_start < program_start + 60 &&
-           event_start - last_program_stop < (freshness_threshold * 0.618))
+       if (temp_host->last_check < event_start &&
+           event_start - last_program_stop > freshness_threshold * 0.618)
        {
            expiration_time = event_start + freshness_threshold;
        }
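
To make the patched condition easier to follow, here is a minimal standalone sketch of how the expiration time for a passively checked object is chosen. The function name and signature are hypothetical and not part of checks.c. Note that the two hunks above differ in the second comparison (`<` in the service hunk, `>` in the host hunk); this sketch uses `>`, which matches the comment's "shut off for quite a long time" wording, and that choice is an assumption of the sketch.

```c
#include <time.h>

/* Hypothetical helper, not part of checks.c. */
static time_t passive_expiration_time(time_t last_check,
                                      time_t event_start,
                                      time_t last_program_stop,
                                      int freshness_threshold)
{
    /* Baseline: freshness is counted from the last check result. */
    time_t expiration_time = last_check + freshness_threshold;

    /* Seconds the core was down before event processing started. */
    double downtime = difftime(event_start, last_program_stop);

    /* If the last result predates this start and the downtime exceeds
     * 61.8% of the threshold, count freshness from event_start instead. */
    if (last_check < event_start && downtime > freshness_threshold * 0.618)
        expiration_time = event_start + freshness_threshold;

    return expiration_time;
}
```

With a threshold of 600 seconds, a downtime of 3900 seconds moves the expiration time to event_start + 600, while a 10 second restart keeps it at last_check + 600.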


@icinga-migration

Updated by mfriedrich on 2012-08-22 16:31:06 +00:00

  • Status changed from Feedback to Closed

i don't see the need for that.
