[dev.icinga.com #2136] freshness checks are generating stale alerts, even if result was received in time #802

icinga-migration · 2011-12-01T19:12:21Z

This issue has been migrated from Redmine: https://dev.icinga.com/issues/2136

Created by mfriedrich on 2011-12-01 19:12:21 +00:00

Assignee: mfriedrich
Status: Resolved (closed on 2011-12-02 15:55:45 +00:00)
Target Version: 1.6.1
Last Update: 2011-12-02 15:55:45 +00:00 (in Redmine)

the recent patch #2027 on the freshness checking on startup is heavily affecting the normal scheduled run under weird circumstances.

[1322755814] EXTERNAL COMMAND: PROCESS_SERVICE_CHECK_RESULT;localhost;fuckup;0;connected|state=1 time=8.636009
[1322755815] PASSIVE SERVICE CHECK: localhost;fuckup;0;connected
[1322756114] EXTERNAL COMMAND: PROCESS_SERVICE_CHECK_RESULT;localhost;fuckup;0;connected|state=1 time=8.635003
[1322756115] PASSIVE SERVICE CHECK: localhost;fuckup;0;connected
[1322756414] EXTERNAL COMMAND: PROCESS_SERVICE_CHECK_RESULT;localhost;fuckup;0;connected|state=1 time=8.635272
[1322756415] PASSIVE SERVICE CHECK: localhost;fuckup;0;connected
[1322756475] Warning: The results of service 'fuckup' on host 'localhost' are stale by 0d 0h 1m 0s (threshold=0d 0h 20m 0s).  I'm forcing an immediate check of the service.
[1322756478] SERVICE ALERT: localhost;fuckup;CRITICAL;HARD;1;CRITICAL: whohooo failure
[1322756478] SERVICE NOTIFICATION: demo;localhost;fuckup;CRITICAL;service-notify-by-email;CRITICAL: whohooo failure

as remarked in #2027 description, it could most likely fail.

a tested reverted fix in OMD 0.52 nightly is working, so this actually needs to be pushed an released into 1.6.1 then.

Attachments

0001-remove-wrong-expiration-calculation-on-freshness-che.patch mfriedrich - 2011-12-02 08:58:09 +00:00

Changesets

2011-12-01 20:02:26 +00:00 by mfriedrich 2208b71

remove wrong expiration calculation on freshness checks #2136

this is a revert, and needs a proper fix, but will do for now.

refs #2136

Relations:

relates #2136

The text was updated successfully, but these errors were encountered:

icinga-migration · 2011-12-01T19:16:32Z

Updated by mfriedrich on 2011-12-01 19:16:32 +00:00

Target Version set to 1.6.1

icinga-migration · 2011-12-01T20:06:04Z

Updated by mfriedrich on 2011-12-01 20:06:04 +00:00

seeking out into the snippet, it might be that the 60 seconds delay on the alert has something to do with the hardcoded 60 seconds in the addition (missing brackets, different and operator?).

-       if (temp_service->check_type == SERVICE_CHECK_PASSIVE) {
-               if (event_start < program_start + 60 && event_start - last_program_stop < (freshness_threshold * 0.618)) {
-                       expiration_time = event_start + freshness_threshold;
-               }
-       }

requires further investigation though. the patch for removal will be in r1.6

icinga-migration · 2011-12-01T22:16:52Z

Updated by mfriedrich on 2011-12-01 22:16:52 +00:00

quickfix/revert here

https://git.icinga.org/?p=icinga-core.git;a=commit;h=2208b714cc3a3e9257d474fd444000c6cc3c82f1

icinga-migration · 2011-12-02T08:58:09Z

Updated by mfriedrich on 2011-12-02 08:58:09 +00:00

File added 0001-remove-wrong-expiration-calculation-on-freshness-che.patch

icinga-migration · 2011-12-02T13:01:59Z

Updated by mfriedrich on 2011-12-02 13:01:59 +00:00

Subject changed from freshness checks are generating stale alerts after 60 seconds, even if result was received to freshness checks are generating stale alerts, even if result was received
Priority changed from High to Urgent

icinga-migration · 2011-12-02T13:44:14Z

Updated by mfriedrich on 2011-12-02 13:44:14 +00:00

Subject changed from freshness checks are generating stale alerts, even if result was received to freshness checks are generating stale alerts, even if result was received in time

further debug logs (cleaned private stuff)

[1322813318.089879] [016.2] [pid=17065] Checking freshness of service 'foo' on host 'bar'...
[1322813318.089883] [016.2] [pid=17065] Freshness thresholds: service=300, use=300
[1322813318.089887] [016.2] [pid=17065] HBC: 1, PS: 1322812951, ES: 1322812958, LC: 1322813213, CT: 1322813318, ET: 1322813258
[1322813318.089966] [016.1] [pid=17065] Check results for service 'foo' on host 'bar' are stale by 0d 0h 1m 0s (threshold=0d 0h 5m 0s).  Forcing an immediate check of the service...
[1322813318.090017] [016.0] [pid=17065] Scheduling a forced, active check of service 'foo' on host 'bar' @ Fri Dec  2 09:08:38 2011
[1322813318.090059] [016.2] [pid=17065] Scheduling new service check event.

has_been_checked: 1
program_start: 1322812951
event_start: 1322812958
last_check: 1322813213
current_time: 1322813318
expiration_time: 1322813258
Freshness thresholds: service=300, use=300

from the new calculation algorithm

-               if (event_start < program_start + 60 && event_start - last_program_stop < (freshness_threshold * 0.618)) {
-                       expiration_time = event_start + freshness_threshold;
-               }

if (1322812958 < 1322812951 + 60 && 1322812958 - x < 300 * 0.618)
expiration_time = 1322812958 + 300

if (expiration_time < current_time)

if (1322813*258* < 1322813*318*)
this is true

analze the above

if (1322812958 < 1322813011 && 1322812958-x < 185,4)
=>
1322812958-x < 185,4
=>
1322812958-185,4 < x
=>
x > 1322812772,6
=>
x > Fri, 02 Dec 2011 07:59:32 +0000

the overall conclusion is that the freshness threshold with the retention.dat creation date calculation between the event_time and a 61,8 percent freshness threshold will cause an error if

event_start is within

program_start+60
last_program_stop happened below 300s*61,8%, so not so long ago

actually the problem is not the last_program_stop, but the condition on the program_start and the 60 seconds. this value is just an assumption and therefore not correct for further usage.

icinga-migration · 2011-12-02T14:14:08Z

Updated by ABauer on 2011-12-02 14:14:08 +00:00

Hi michi,

thanks for figuring this out, in detail!

I ran into the same problem today, after updating to 1.6.

I applied your patch, now everything is fine and running smoothly.

icinga-migration · 2011-12-02T14:49:30Z

Updated by mfriedrich on 2011-12-02 14:49:30 +00:00

thanks for the feedback!

1.6.1 has been put onto sourceforge, and packagers got a seperate mail for that.

icinga-migration · 2011-12-02T15:55:45Z

Updated by mfriedrich on 2011-12-02 15:55:45 +00:00

Status changed from Assigned to Resolved
Done % changed from 0 to 100

icinga-migration closed this as completed Dec 2, 2011

icinga-migration added Urgent bug Passive Checks labels Jan 17, 2017

icinga-migration added this to the 1.6.1 milestone Jan 17, 2017

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[dev.icinga.com #2136] freshness checks are generating stale alerts, even if result was received in time #802

[dev.icinga.com #2136] freshness checks are generating stale alerts, even if result was received in time #802

icinga-migration commented Dec 1, 2011

icinga-migration commented Dec 1, 2011

icinga-migration commented Dec 1, 2011

icinga-migration commented Dec 1, 2011

icinga-migration commented Dec 2, 2011

icinga-migration commented Dec 2, 2011

icinga-migration commented Dec 2, 2011

icinga-migration commented Dec 2, 2011

icinga-migration commented Dec 2, 2011

icinga-migration commented Dec 2, 2011

[dev.icinga.com #2136] freshness checks are generating stale alerts, even if result was received in time #802

[dev.icinga.com #2136] freshness checks are generating stale alerts, even if result was received in time #802

Comments

icinga-migration commented Dec 1, 2011

icinga-migration commented Dec 1, 2011

icinga-migration commented Dec 1, 2011

icinga-migration commented Dec 1, 2011

icinga-migration commented Dec 2, 2011

icinga-migration commented Dec 2, 2011

icinga-migration commented Dec 2, 2011

icinga-migration commented Dec 2, 2011

icinga-migration commented Dec 2, 2011

icinga-migration commented Dec 2, 2011