[dev.icinga.com #7287] Re-checks scheduling w/ retry_interval #2009

Closed
icinga-migration opened this issue Sep 20, 2014 · 31 comments
Labels
blocker (Blocks a release or needs immediate attention), bug (Something isn't working)

This issue has been migrated from Redmine: https://dev.icinga.com/issues/7287

Created by wimmer on 2014-09-20 15:16:39 +00:00

Assignee: mfriedrich
Status: Resolved (closed on 2016-01-22 18:12:39 +00:00)
Target Version: 2.4.2
Last Update: 2016-03-09 10:36:02 +00:00 (in Redmine)

Icinga Version: 2.1.1

I found strange behavior in re-check scheduling.
I use the Icinga 1 and Icinga 2 Debian packages, and I use the same attributes for checks:

check_interval = 5m
retry_interval = 1m
max_check_attempts = 3

When some service changes its status, Icinga1 works as expected:

Current Status   Current Attempt     Last Check Time        Next Scheduled Active Check
OK               1/3 (HARD state)    20-09-2014 15:00:00    20-09-2014 15:05:00
CRITICAL         1/3 (SOFT state)    20-09-2014 15:05:00    20-09-2014 15:06:00
CRITICAL         2/3 (SOFT state)    20-09-2014 15:06:00    20-09-2014 15:07:00
CRITICAL         3/3 (HARD state)    20-09-2014 15:07:00    20-09-2014 15:12:00
CRITICAL         3/3 (HARD state)    20-09-2014 15:12:00    20-09-2014 15:17:00

But Icinga 2 uses strange attempt numbering [there is no 1/3 (SOFT state)]
and schedules the first re-check at the wrong time (15:10 instead of 15:06).
It looks like Icinga 2 skips the "1/3 SOFT state" phase.
The scheduling in CRITICAL/HARD states is wrong too (compare with the Icinga 1 times):

Current Status   Current Attempt     Last Check Time        Next Scheduled Active Check
OK               1/3 (HARD state)    09-20-2014 15:00:10    09-20-2014 15:05:10
CRITICAL         2/3 (SOFT state)    09-20-2014 15:05:10    09-20-2014 15:10:10
CRITICAL         3/3 (SOFT state)    09-20-2014 15:10:10    09-20-2014 15:11:10
CRITICAL         1/3 (HARD state)    09-20-2014 15:11:10    09-20-2014 15:12:10
CRITICAL         1/3 (HARD state)    09-20-2014 15:12:10    09-20-2014 15:15:10
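
For reference, the scheduling rule Icinga 1 applies above can be summarized in a short standalone sketch (a hypothetical helper, not Icinga source): while the state is SOFT the next check is due after retry_interval; once it is HARD, check_interval applies.

#include <iostream>

// Hypothetical helper, not Icinga source: the expected scheduling rule.
double NextCheck(double now, bool softState, double checkInterval, double retryInterval)
{
    // SOFT states re-check after retry_interval; HARD states after check_interval.
    return now + (softState ? retryInterval : checkInterval);
}

int main()
{
    std::cout << NextCheck(0, true, 300, 60) << "\n";  // SOFT: 60s  (retry_interval = 1m)
    std::cout << NextCheck(0, false, 300, 60) << "\n"; // HARD: 300s (check_interval = 5m)
    return 0;
}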

Changesets

2015-03-11 15:33:36 +00:00 by mfriedrich 05c237c

Don't increment check attempt counter on OK->NOT-OK transition

refs #7287

Signed-off-by: Michael Friedrich <michael.friedrich@netways.de>

2015-08-21 08:24:49 +00:00 by mfriedrich 6f252bb

Don't increment check attempt counter on OK->NOT-OK transition

This fixes the problem that the first SOFT state is actually considered
the second state.

refs #7287
fixes #9897

Signed-off-by: Michael Friedrich <michael.friedrich@netways.de>

2016-01-20 15:29:01 +00:00 by mfriedrich a51e647

Fix check scheduling w/ retry_interval

fixes #7287

2016-01-22 17:40:14 +00:00 by mfriedrich 2a11b27

Properly set the next check time for active and passive checks

fixes #7287
refs #11019

2016-02-23 08:24:57 +00:00 by mfriedrich 9ca7245

Properly set the next check time for active and passive checks

fixes #7287
refs #11019

2016-03-05 17:15:03 +00:00 by mfriedrich b8e3d61

Revert "Properly set the next check time for active and passive checks"

This reverts commit 2a11b27972e4325bf80e9abc9017eab7dd03e712.

This patch does not properly work and breaks the check_interval setting
for passive checks. Requires a proper patch.

refs #11248
refs #11257
refs #11273

(the old issue)
refs #7287

2016-03-05 17:16:49 +00:00 by mfriedrich ef532f2

Revert "Fix check scheduling w/ retry_interval"

This reverts commit a51e647cc760bd5f7c4de6182961a477478c11a9.

This patch causes trouble with check results received
1) passively 2) throughout the cluster. A proper patch
for setting the retry_interval on NOT-OK state changes
is required.

refs #11248
refs #11257
refs #11273

(the old issue)
refs #7287

2016-03-11 14:55:03 +00:00 by mfriedrich 8344f74

Revert "Properly set the next check time for active and passive checks"

This reverts commit 2a11b27972e4325bf80e9abc9017eab7dd03e712.

This patch does not properly work and breaks the check_interval setting
for passive checks. Requires a proper patch.

refs #11248
refs #11257
refs #11273

(the old issue)
refs #7287

2016-03-11 14:55:14 +00:00 by mfriedrich f99feab

Revert "Fix check scheduling w/ retry_interval"

This reverts commit a51e647cc760bd5f7c4de6182961a477478c11a9.

This patch causes trouble with check results received
1) passively 2) throughout the cluster. A proper patch
for setting the retry_interval on NOT-OK state changes
is required.

refs #11248
refs #11257
refs #11273

(the old issue)
refs #7287

Updated by mfriedrich on 2014-10-11 17:18:26 +00:00

  • Category set to libicinga

Updated by wimmer on 2014-11-23 13:33:59 +00:00

After two months I tried the scheduling again, now with the Debian packages of version 2.2.
The scheduler is still confused.
Here are my findings:

Current Status   Current Attempt     Last Check Time        Next Scheduled Active Check
OK               1/3 (HARD state)    11-23-2014 10:00:40    11-23-2014 10:05:40
CRITICAL         2/3 (SOFT state)    11-23-2014 10:05:40    11-23-2014 10:10:40
CRITICAL         3/3 (SOFT state)    11-23-2014 10:10:40    11-23-2014 10:11:40
CRITICAL         1/3 (HARD state)    11-23-2014 10:11:40    11-23-2014 10:12:40
CRITICAL         1/3 (HARD state)    11-23-2014 10:12:40    11-23-2014 10:15:40

Can somebody please confirm this bug?

I would like to use Icinga 2 on my production systems, but that is impossible with this strange behavior.
Thanks for your work.

Updated by mfriedrich on 2014-11-24 13:38:29 +00:00

  • Target Version set to 2.3.0

Updated by mfriedrich on 2015-01-26 15:05:56 +00:00

  • Target Version changed from 2.3.0 to 2.4.0

Updated by mfriedrich on 2015-02-20 20:32:16 +00:00

  • File added 0001-Don-t-increment-check-attempt-counter-on-OK-NOT-OK-t.patch

Please test the attached patch; it is untested.

Updated by schirrmie on 2015-03-05 07:25:22 +00:00

Hi,
there are two issues here. The first one is that the check attempt is counted wrongly when the state changes from soft to hard.
Your attached patch worked fine for us and solves this issue -> thank you!
The second issue is that the retry_interval is not handled correctly; we see the same bug with Icinga 2.2.4.
Here is an example from us:

max_check_attempts = 3
check_interval = 2m
retry_interval = 1m


state       last        next        attempt
ok          07:49       07:51       1/3 H
critical    07:51       07:53       1/3 S   # wrong
critical    07:53       07:54       2/3 S
critical    07:54       07:55       3/3 S
critical    07:55       07:56       1/3 H   # wrong
critical    07:56       07:57       1/3 H   # wrong
critical    07:57       07:59       1/3 H
ok          07:59       08:01       1/3 H

Your patch does not change the behaviour of the retry_interval.
I hope this helps.

Regards
schirrmie

Updated by mfriedrich on 2015-03-07 15:35:40 +00:00

Ok, thanks for testing. Then I'll leave this issue open for 2.4 and won't apply the patch for 2.3 for now. I don't want to break the release.

Updated by mfriedrich on 2015-03-11 15:34:03 +00:00

  • File deleted 0001-Don-t-increment-check-attempt-counter-on-OK-NOT-OK-t.patch

Updated by mfriedrich on 2015-04-08 14:28:41 +00:00

  • Status changed from New to Assigned
  • Assigned to set to mfriedrich

Ok, found the issue we've been discussing yesterday evening before the SIG NOC meeting :)

Updated by mfriedrich on 2015-05-18 13:11:00 +00:00

  • Subject changed from Re-checks scheduling to Re-checks scheduling w/ retry_interval
  • Target Version deleted 2.4.0

I'll need to dig into this a little deeper. I'm not sure whether this will make it into 2.4.0 - feel free to submit any patches that help resolve the issue.

Updated by zearan on 2015-05-26 12:03:02 +00:00

  • File added fix_scheduling.patch

The problem seems to lie in the file libs/icinga/checkable-check.cpp. The function Checkable::ExecuteCheck() (line 521) calculates the next check before running the check itself:

void Checkable::ExecuteCheck()
{
	[...]
	UpdateNextCheck();	/* next check time is computed here, before the result exists */
	[...]
	GetCheckCommand()->Execute(this, cr, NULL, false);	/* the check itself runs afterwards */
	[...]
}

The time of the next check does not get re-evaluated by Checkable::ProcessCheckResult() (line 237) when the result becomes available. I have tried to fix it with a one-liner, but I cannot say whether this causes any unwanted side effects. At least it did not crash anything when I tried it.
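
A minimal standalone model of the ordering problem and of the one-liner idea (hypothetical names and simplified logic, not the actual code in fix_scheduling.patch):

#include <iostream>

// Toy model, not Icinga source: ExecuteCheck() schedules the next check
// before the result exists; re-evaluating in ProcessCheckResult() fixes it.
struct Checkable {
    double nextCheck = 0;
    bool softState = false;
    double checkInterval = 300, retryInterval = 60;

    void UpdateNextCheck(double now) {
        nextCheck = now + (softState ? retryInterval : checkInterval);
    }

    void ProcessCheckResult(double now, bool bad) {
        softState = bad;        // the state type only changes here
        UpdateNextCheck(now);   // the "one-liner": re-evaluate afterwards
    }

    void ExecuteCheck(double now, bool resultIsBad) {
        UpdateNextCheck(now);   // scheduled BEFORE the result is known
        ProcessCheckResult(now, resultIsBad);
    }
};

int main() {
    Checkable c;
    c.ExecuteCheck(0, true);          // OK -> CRITICAL (SOFT)
    std::cout << c.nextCheck << "\n"; // 60 with the re-evaluation; 300 without it
    return 0;
}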

Updated by wimmer on 2015-06-19 17:34:18 +00:00

I applied the patch to 2.3.5 and it works great.
Please merge it into the master branch.
Thank you very much for your work.

Here is my setup:

object Host "hhh" {
  import "generic-host"
    check_interval = 5m
    retry_interval = 1m
    max_check_attempts = 3
  address = "hhh"
}

object Service "SMTP" {
  import "generic-service"
    check_interval = 5m
    retry_interval = 1m
    max_check_attempts = 3
  host_name = "hhh"
  check_command = "smtp"
}

and results:

Service SMTP:
==============
Current Status   Current Attempt     Last Check Time        Next Scheduled Active Check
OK               1/3 (HARD state)    06-19-2015 18:27:43    06-19-2015 18:32:43
CRITICAL         2/3 (SOFT state)    06-19-2015 18:32:43    06-19-2015 18:33:43
CRITICAL         3/3 (SOFT state)    06-19-2015 18:33:43    06-19-2015 18:34:43
CRITICAL         1/3 (HARD state)    06-19-2015 18:34:43    06-19-2015 18:37:43
CRITICAL         1/3 (HARD state)    06-19-2015 18:37:43    06-19-2015 18:42:43
OK               1/3 (HARD state)    06-19-2015 18:42:43    06-19-2015 18:47:43



Host:
==========
Current Status   Current Attempt     Last Check Time        Next Scheduled Active Check
UP               1/3 (HARD state)    06-19-2015 18:54:35    06-19-2015 18:59:31
DOWN             2/3 (SOFT state)    06-19-2015 18:59:34    06-19-2015 19:00:31
DOWN             3/3 (SOFT state)    06-19-2015 19:00:34    06-19-2015 19:01:31
DOWN             1/3 (HARD state)    06-19-2015 19:01:34    06-19-2015 19:04:31
DOWN             1/3 (HARD state)    06-19-2015 19:04:34    06-19-2015 19:09:31
UP               1/3 (HARD state)    06-19-2015 19:09:35    06-19-2015 19:14:31

Updated by mfriedrich on 2015-06-23 13:26:15 +00:00

  • Target Version set to Backlog

Updated by mfriedrich on 2015-08-17 18:47:40 +00:00

  • Relates set to 9897

Updated by mfriedrich on 2015-08-18 15:01:14 +00:00

Generally speaking I'd like to split this issue, as it consists of two issues: the wrong check attempt counting on OK -> NOT-OK transitions, and the retry_interval not being applied when rescheduling.

Updated by mfriedrich on 2015-08-18 16:14:00 +00:00

  • Target Version changed from Backlog to 2.4.0
  • Estimated Hours set to 2

Updated by mfriedrich on 2015-08-18 16:48:33 +00:00

object Host NodeName + "3" {
  import NodeName
  check_command = "dummy"
  vars.dummy_text = "Last check: $host.last_check$ Last state change: $host.last_state_change$"
  vars.dummy_state = 0

  check_interval = 20m
  retry_interval = 5s
}

Apparently the fix does not work this way; host.next_check still stays at the old scheduled timestamp.

Updated by mfriedrich on 2015-09-28 13:09:10 +00:00

  • Target Version changed from 2.4.0 to Backlog

Updated by akrus on 2015-12-22 07:21:51 +00:00

I'm also experiencing the same issue here.

apply Service "WHOIS" {
  import "generic-service"

  zone = "master"

  check_command = "check_whois"

  groups += [ "whois" ]
  check_interval = 24h
  retry_interval = 1m

  vars.notification.mail.groups = [ "admins" ]
  vars.notification.line.groups = [ "line" ]

  assign where host.vars.domain
}

This results in checks being executed every 24 hours (the retry interval doesn't work, while Icinga Web 2 shows it correctly).

Updated by Peter_Story on 2016-01-14 16:02:47 +00:00

I'm also encountering this bug. I have a service with:

check_interval = 10m
retry_interval = 30s

After the first soft failure, the subsequent check is scheduled to run after 10 minutes, instead of 30 seconds.

Updated by arothauf on 2016-01-20 09:34:21 +00:00

I can confirm this is still happening.
Shouldn't this bug have a higher priority, given that notifications depend on the check turning to a HARD state?

Updated by mfriedrich on 2016-01-20 15:22:12 +00:00

  • Target Version changed from Backlog to 2.4.2

It fairly should, but we're all normal people, not machines.

I found another issue while testing the original patch; it took me quite a while to figure out why it was happening.

Tests using the external command pipe

object Host "7287-host" {

  address = "1.2.3.4"
  check_command = "ping4"
  check_interval = 30m
  retry_interval = 10s
}

Now process check results using Icinga Web 2, from DOWN to UP and vice versa.

Calling UpdateNextCheck() works as it should: it computes the correct next check time from GetRetryInterval() and also sends that value to the database.

There's one subsequent SetNextCheck() call afterwards which then overrides the value inside the core (and thus in the API and external interfaces).

[2016-01-20 15:43:42 +0100] warning/Checkable: Using next_check '1453301030.300000' for checkable '7287-host' in state type '0' and state '2'.
[2016-01-20 15:43:43 +0100] debug/IdoMysqlConnection: Query: UPDATE icinga_hoststatus SET next_check = FROM_UNIXTIME(1453301030) WHERE host_object_id = 132255
[2016-01-20 15:43:43 +0100] debug/IdoMysqlConnection: Query: UPDATE icinga_hoststatus SET acknowledgement_type = '0',  active_checks_enabled = '1',  check_command = 'ping4',  check_source = 'mbmif.int.netways.de',  check_type = '0',  current_check_attempt = '1',  current_notification_number = '0',  current_state = '1',  endpoint_object_id = 1,  event_handler = '',  event_handler_enabled = '1',  execution_time = '0',  flap_detection_enabled = '0',  has_been_checked = '1',  host_object_id = 132255,  instance_id = 1,  is_flapping = '0',  is_reachable = '1',  last_check = FROM_UNIXTIME(1453301022),  last_hard_state = '0',  last_hard_state_change = FROM_UNIXTIME(1453298041),  last_state_change = FROM_UNIXTIME(1453301022),  last_time_down = FROM_UNIXTIME(1453301022),  last_time_up = FROM_UNIXTIME(1453300989),  latency = '0',  long_output = '',  max_check_attempts = '3',  next_check = FROM_UNIXTIME(1453301030),  normal_check_interval = '30',  notifications_enabled = '1',  original_attributes = 'null',  output = 'sfsdfsfs',  passive_checks_enabled = '1',  percent_state_change = '0',  perfdata = '',  problem_has_been_acknowledged = '0',  process_performance_data = '1',  retry_check_interval = '0.166667',  scheduled_downtime_depth = '0',  should_be_scheduled = '1',  state_type = '0',  status_update_time = FROM_UNIXTIME(1453301022) WHERE host_object_id = 132255
[2016-01-20 15:43:43 +0100] debug/IdoMysqlConnection: Query: UPDATE icinga_hoststatus SET next_check = FROM_UNIXTIME(1453302822) WHERE host_object_id = 132255

We deliberately reschedule a check when processing a check result from the external command pipe or API actions. The reason is that we require freshness checks based on the check_interval (see #7071 for better proposals) for incoming passive check results.

So this is not a good method for testing anyway. Temporarily removing the code blocks from externalcommandlistener.cpp proves that the patch works.
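
A toy model of the override described above (hypothetical names, not the real externalcommandlistener.cpp logic):

#include <iostream>

// Toy model, not Icinga source: the retry schedule is computed correctly,
// then the freshness rescheduling for passive results overrides it.
struct Checkable {
    double nextCheck = 0;
    bool softState = false;
    double checkInterval = 1800, retryInterval = 10; // 30m / 10s as configured above

    void UpdateNextCheck(double now) {
        nextCheck = now + (softState ? retryInterval : checkInterval);
    }
    void SetNextCheck(double t) { nextCheck = t; }
};

int main() {
    Checkable host;

    // ProcessCheckResult() correctly schedules the retry...
    host.softState = true;
    host.UpdateNextCheck(0);
    std::cout << host.nextCheck << "\n"; // 10 (retry_interval)

    // ...but the external command handling then enforces freshness and
    // overrides the value with check_interval.
    host.SetNextCheck(0 + host.checkInterval);
    std::cout << host.nextCheck << "\n"; // 1800 (check_interval wins)
    return 0;
}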

API tests

[2016-01-20 16:18:46 +0100] debug/IdoMysqlConnection: Query: UPDATE icinga_hoststatus SET acknowledgement_type = '0',  active_checks_enabled = '1',  check_command = 'ping4',  check_source = 'mbmif.int.netways.de',  check_type = '0',  current_check_attempt = '1',  current_notification_number = '0',  current_state = '1',  endpoint_object_id = 1,  event_handler = '',  event_handler_enabled = '1',  execution_time = '10.018226861953735',  flap_detection_enabled = '0',  has_been_checked = '1',  host_object_id = 132255,  instance_id = 1,  is_flapping = '0',  is_reachable = '1',  last_check = FROM_UNIXTIME(1453303125),  last_hard_state = '0',  last_hard_state_change = FROM_UNIXTIME(1453303093),  last_state_change = FROM_UNIXTIME(1453303125),  last_time_down = FROM_UNIXTIME(1453303125),  last_time_up = FROM_UNIXTIME(1453303093),  latency = '0',  long_output = '',  max_check_attempts = '3',  next_check = FROM_UNIXTIME(1453303132),  normal_check_interval = '30',  notifications_enabled = '1',  original_attributes = '{\"address\":\"8.8.8.8\"}',  output = 'CRITICAL - Plugin timed out after 10 seconds',  passive_checks_enabled = '1',  percent_state_change = '8',  perfdata = '',  problem_has_been_acknowledged = '0',  process_performance_data = '1',  retry_check_interval = '0.166667',  scheduled_downtime_depth = '0',  should_be_scheduled = '1',  state_type = '0',  status_update_time = FROM_UNIXTIME(1453303125) WHERE host_object_id = 132255
[2016-01-20 16:18:52 +0100] warning/Checkable: Using next_check '1453303142.940000' for checkable '7287-host' in state type '0' and state '2'.
[2016-01-20 16:18:53 +0100] debug/IdoMysqlConnection: Query: UPDATE icinga_hoststatus SET next_check = FROM_UNIXTIME(1453303142) WHERE host_object_id = 132255
[2016-01-20 16:19:02 +0100] warning/Checkable: Using next_check '1453303152.940000' for checkable '7287-host' in state type '0' and state '2'.
[2016-01-20 16:19:02 +0100] warning/Checkable: Using next_check '1453303152.940000' for checkable '7287-host' in state type '0' and state '2'.
[2016-01-20 16:19:03 +0100] debug/IdoMysqlConnection: Query: UPDATE icinga_hoststatus SET next_check = FROM_UNIXTIME(1453303152) WHERE host_object_id = 132255
[2016-01-20 16:19:03 +0100] debug/IdoMysqlConnection: Query: UPDATE icinga_hoststatus SET next_check = FROM_UNIXTIME(1453303152) WHERE host_object_id = 132255
[2016-01-20 16:19:03 +0100] debug/IdoMysqlConnection: Query: UPDATE icinga_hoststatus SET acknowledgement_type = '0',  active_checks_enabled = '1',  check_command = 'ping4',  check_source = 'mbmif.int.netways.de',  check_type = '0',  current_check_attempt = '2',  current_notification_number = '0',  current_state = '1',  endpoint_object_id = 1,  event_handler = '',  event_handler_enabled = '1',  execution_time = '10.010746002197266',  flap_detection_enabled = '0',  has_been_checked = '1',  host_object_id = 132255,  instance_id = 1,  is_flapping = '0',  is_reachable = '1',  last_check = FROM_UNIXTIME(1453303142),  last_hard_state = '0',  last_hard_state_change = FROM_UNIXTIME(1453303093),  last_state_change = FROM_UNIXTIME(1453303125),  last_time_down = FROM_UNIXTIME(1453303142),  last_time_up = FROM_UNIXTIME(1453303093),  latency = '0',  long_output = '',  max_check_attempts = '3',  next_check = FROM_UNIXTIME(1453303152),  normal_check_interval = '30',  notifications_enabled = '1',  original_attributes = '{\"address\":\"8.8.8.8\"}',  output = 'CRITICAL - Plugin timed out after 10 seconds',  passive_checks_enabled = '1',  percent_state_change = '8',  perfdata = '',  problem_has_been_acknowledged = '0',  process_performance_data = '1',  retry_check_interval = '0.166667',  scheduled_downtime_depth = '0',  should_be_scheduled = '1',  state_type = '0',  status_update_time = FROM_UNIXTIME(1453303142) WHERE host_object_id = 132255

$ curl -k -s -u root:icinga 'https://localhost:5665/v1/objects/hosts/7287-host' -H "accept: application/json" -X POST -d '{ "attrs" : { "address": "1.2.3.4" }}'

A manual recheck will cause ProcessCheckResult() to immediately update the next check using the retry interval.

Updated by mfriedrich on 2016-01-20 15:30:03 +00:00

  • Status changed from Assigned to Resolved
  • Done % changed from 0 to 100

Applied in changeset a51e647.

Updated by mfriedrich on 2016-01-22 16:12:08 +00:00

  • Blocks set to 11019

Updated by mfriedrich on 2016-01-22 16:31:21 +00:00

  • Relates set to 10963

Updated by mfriedrich on 2016-01-22 17:37:02 +00:00

  • Status changed from Resolved to Assigned
  • Priority changed from Normal to High

There's still one issue with passive check results: calling UpdateNextCheck() first sets a (probably smaller) next check time, and after invoking ProcessCheckResult() a separate SetNextCheck() call is required to ensure the check_interval is used for the freshness of passively received checks.

This affects all checks where we do not care whether the result arrived actively or passively.

It is even worse for active checks: the next check time is updated to check_interval if a result is sent in as a passive check result via the API / UI.

Moving that task inside ProcessCheckResult() ensures that subsequent SetNextCheck() calls use the correct check interval for passive check results. Active checks are then rescheduled based on their current state, especially when a submitted check result enforces a state change, e.g. HARD OK, or SOFT NOT-OK where the retry_interval is required.

Furthermore, updating the next check time before sending a DB IDO update for OnNewCheckResult also helps reduce duplicate next_check queries (see the linked #11019).
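
A minimal standalone sketch of the intended ordering (hypothetical names and simplified logic, not the actual patch):

#include <iostream>

// Toy sketch, not Icinga source: decide the next check time inside
// ProcessCheckResult() based on the new state type and on whether the
// result was active or passive, then notify downstream consumers once.
struct Checkable {
    double nextCheck = 0;
    double checkInterval = 600, retryInterval = 30;

    void ProcessCheckResult(double now, bool softState, bool activeResult) {
        if (!activeResult)
            nextCheck = now + checkInterval; // passive: freshness interval
        else
            nextCheck = now + (softState ? retryInterval : checkInterval);

        OnNewCheckResult(); // single downstream update (e.g. DB IDO)
    }

    void OnNewCheckResult() {
        std::cout << "next_check = " << nextCheck << "\n";
    }
};

int main() {
    Checkable svc;
    svc.ProcessCheckResult(0, true, true);   // active SOFT -> 30 (retry_interval)
    svc.ProcessCheckResult(0, false, false); // passive     -> 600 (check_interval)
    return 0;
}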

Updated by mfriedrich on 2016-01-22 18:12:39 +00:00

  • Status changed from Assigned to Resolved

Applied in changeset 2a11b27.

Updated by mfriedrich on 2016-02-18 10:24:53 +00:00

  • Duplicated set to 11174

Updated by mfriedrich on 2016-02-25 13:59:06 +00:00

  • Relates set to 11229

Updated by mfriedrich on 2016-03-09 10:34:46 +00:00

  • Relates set to 11336

Updated by mfriedrich on 2016-03-09 10:36:02 +00:00

The patch was broken in many ways, causing problems with actively scheduled checks, passive checks, etc. The original fix has been reverted, and a proper fix is discussed in #11336.

@icinga-migration added the blocker, bug and libicinga labels on Jan 17, 2017
@icinga-migration added this to the 2.4.2 milestone on Jan 17, 2017