Skip to content
This repository has been archived by the owner on Jan 15, 2019. It is now read-only.

[dev.icinga.com #2136] freshness checks are generating stale alerts, even if result was received in time #802

Closed
icinga-migration opened this issue Dec 1, 2011 · 9 comments

Comments

@icinga-migration
Copy link

This issue has been migrated from Redmine: https://dev.icinga.com/issues/2136

Created by mfriedrich on 2011-12-01 19:12:21 +00:00

Assignee: mfriedrich
Status: Resolved (closed on 2011-12-02 15:55:45 +00:00)
Target Version: 1.6.1
Last Update: 2011-12-02 15:55:45 +00:00 (in Redmine)


the recent patch #2027 on the freshness checking on startup is heavily affecting the normal scheduled run under weird circumstances.

[1322755814] EXTERNAL COMMAND: PROCESS_SERVICE_CHECK_RESULT;localhost;fuckup;0;connected|state=1 time=8.636009
[1322755815] PASSIVE SERVICE CHECK: localhost;fuckup;0;connected
[1322756114] EXTERNAL COMMAND: PROCESS_SERVICE_CHECK_RESULT;localhost;fuckup;0;connected|state=1 time=8.635003
[1322756115] PASSIVE SERVICE CHECK: localhost;fuckup;0;connected
[1322756414] EXTERNAL COMMAND: PROCESS_SERVICE_CHECK_RESULT;localhost;fuckup;0;connected|state=1 time=8.635272
[1322756415] PASSIVE SERVICE CHECK: localhost;fuckup;0;connected
[1322756475] Warning: The results of service 'fuckup' on host 'localhost' are stale by 0d 0h 1m 0s (threshold=0d 0h 20m 0s).  I'm forcing an immediate check of the service.
[1322756478] SERVICE ALERT: localhost;fuckup;CRITICAL;HARD;1;CRITICAL: whohooo failure
[1322756478] SERVICE NOTIFICATION: demo;localhost;fuckup;CRITICAL;service-notify-by-email;CRITICAL: whohooo failure

as remarked in #2027 description, it could most likely fail.

a tested reverted fix in OMD 0.52 nightly is working, so this actually needs to be pushed an released into 1.6.1 then.

Attachments

Changesets

2011-12-01 20:02:26 +00:00 by mfriedrich 2208b71

remove wrong expiration calculation on freshness checks #2136

this is a revert, and needs a proper fix, but will do for now.

refs #2136

Relations:

@icinga-migration
Copy link
Author

Updated by mfriedrich on 2011-12-01 19:16:32 +00:00

  • Target Version set to 1.6.1

@icinga-migration
Copy link
Author

Updated by mfriedrich on 2011-12-01 20:06:04 +00:00

seeking out into the snippet, it might be that the 60 seconds delay on the alert has something to do with the hardcoded 60 seconds in the addition (missing brackets, different and operator?).

-       if (temp_service->check_type == SERVICE_CHECK_PASSIVE) {
-               if (event_start < program_start + 60 && event_start - last_program_stop < (freshness_threshold * 0.618)) {
-                       expiration_time = event_start + freshness_threshold;
-               }
-       }

requires further investigation though. the patch for removal will be in r1.6

@icinga-migration
Copy link
Author

Updated by mfriedrich on 2011-12-01 22:16:52 +00:00

quickfix/revert here

https://git.icinga.org/?p=icinga-core.git;a=commit;h=2208b714cc3a3e9257d474fd444000c6cc3c82f1

@icinga-migration
Copy link
Author

Updated by mfriedrich on 2011-12-02 08:58:09 +00:00

  • File added 0001-remove-wrong-expiration-calculation-on-freshness-che.patch

@icinga-migration
Copy link
Author

Updated by mfriedrich on 2011-12-02 13:01:59 +00:00

  • Subject changed from freshness checks are generating stale alerts after 60 seconds, even if result was received to freshness checks are generating stale alerts, even if result was received
  • Priority changed from High to Urgent

@icinga-migration
Copy link
Author

Updated by mfriedrich on 2011-12-02 13:44:14 +00:00

  • Subject changed from freshness checks are generating stale alerts, even if result was received to freshness checks are generating stale alerts, even if result was received in time

further debug logs (cleaned private stuff)

[1322813318.089879] [016.2] [pid=17065] Checking freshness of service 'foo' on host 'bar'...
[1322813318.089883] [016.2] [pid=17065] Freshness thresholds: service=300, use=300
[1322813318.089887] [016.2] [pid=17065] HBC: 1, PS: 1322812951, ES: 1322812958, LC: 1322813213, CT: 1322813318, ET: 1322813258
[1322813318.089966] [016.1] [pid=17065] Check results for service 'foo' on host 'bar' are stale by 0d 0h 1m 0s (threshold=0d 0h 5m 0s).  Forcing an immediate check of the service...
[1322813318.090017] [016.0] [pid=17065] Scheduling a forced, active check of service 'foo' on host 'bar' @ Fri Dec  2 09:08:38 2011
[1322813318.090059] [016.2] [pid=17065] Scheduling new service check event.

has_been_checked: 1
program_start: 1322812951
event_start: 1322812958
last_check: 1322813213
current_time: 1322813318
expiration_time: 1322813258
Freshness thresholds: service=300, use=300

from the new calculation algorithm

-               if (event_start < program_start + 60 && event_start - last_program_stop < (freshness_threshold * 0.618)) {
-                       expiration_time = event_start + freshness_threshold;
-               }

if (1322812958 < 1322812951 + 60 && 1322812958 - x < 300 * 0.618)
expiration_time = 1322812958 + 300

if (expiration_time < current_time)

if (1322813*258* < 1322813*318*)
this is true

analze the above

if (1322812958 < 1322813011 && 1322812958-x < 185,4)
=>
1322812958-x < 185,4
=>
1322812958-185,4 < x
=>
x > 1322812772,6
=>
x > Fri, 02 Dec 2011 07:59:32 +0000

the overall conclusion is that the freshness threshold with the retention.dat creation date calculation between the event_time and a 61,8 percent freshness threshold will cause an error if

event_start is within

  • program_start+60
  • last_program_stop happened below 300s*61,8%, so not so long ago

actually the problem is not the last_program_stop, but the condition on the program_start and the 60 seconds. this value is just an assumption and therefore not correct for further usage.

@icinga-migration
Copy link
Author

Updated by ABauer on 2011-12-02 14:14:08 +00:00

Hi michi,

thanks for figuring this out, in detail!

I ran into the same problem today, after updating to 1.6.

I applied your patch, now everything is fine and running smoothly.

@icinga-migration
Copy link
Author

Updated by mfriedrich on 2011-12-02 14:49:30 +00:00

thanks for the feedback!

1.6.1 has been put onto sourceforge, and packagers got a seperate mail for that.

@icinga-migration
Copy link
Author

Updated by mfriedrich on 2011-12-02 15:55:45 +00:00

  • Status changed from Assigned to Resolved
  • Done % changed from 0 to 100

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

No branches or pull requests

1 participant