New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[dev.icinga.com #7287] Re-checks scheduling w/ retry_interval #2009
Comments
Updated by mfriedrich on 2014-10-11 17:18:26 +00:00
|
Updated by wimmer on 2014-11-23 13:33:59 +00:00 After 2 months I tried scheduling again. I use Debian packages version 2.2.
Please can somebody confirm this bug? I would like to use Icinga2 on my production systems, but is impossible with this strange behavior. |
Updated by mfriedrich on 2014-11-24 13:38:29 +00:00
|
Updated by mfriedrich on 2015-01-26 15:05:56 +00:00
|
Updated by mfriedrich on 2015-02-20 20:32:16 +00:00
Please test the attached patch, untested. |
Updated by schirrmie on 2015-03-05 07:25:22 +00:00 Hi,
Your patch does not change the behaviour of the retry_interval. Regards |
Updated by mfriedrich on 2015-03-07 15:35:40 +00:00 Ok, thanks for testing. Then I'll leave this issue open for 2.4 and won't apply the patch for 2.3 for now. I don't want to break the release. |
Updated by mfriedrich on 2015-03-11 15:34:03 +00:00
|
Updated by mfriedrich on 2015-04-08 14:28:41 +00:00
Ok, found the issue we've been discussing yesterday evening before the SIG NOC meeting :) |
Updated by mfriedrich on 2015-05-18 13:11:00 +00:00
I'll need to dig into this a little deeper. I'm not sure if this might happen for 2.4.0 - feel free to submit any patches helping resolve the issue. |
Updated by zearan on 2015-05-26 12:03:02 +00:00
The problem seems to lie in file libs/icinga/checkable-check.cpp. The function Checkable::ExecuteCheck() (line 521) executes the calculation of the next check before running the check itself:
The time of the next check does not get re-evaluated by Checkable::ProcessCheckResult() (line 237) when the result is available. I have tried to fix it by a one-liner but I cannot say if this causes any unwanted side effects. At least it did not crash anything when I tried it. |
Updated by wimmer on 2015-06-19 17:34:18 +00:00 I applied patch to 2.3.5 and it works great. Here is my setup:
and results:
|
Updated by mfriedrich on 2015-06-23 13:26:15 +00:00
|
Updated by mfriedrich on 2015-08-17 18:47:40 +00:00
|
Updated by mfriedrich on 2015-08-18 15:01:14 +00:00 Generally speaking I'd like to split this issue, as it consists of two issues:
|
Updated by mfriedrich on 2015-08-18 16:14:00 +00:00
|
Updated by mfriedrich on 2015-08-18 16:48:33 +00:00
Apparently the fix does not work this way, host.next_check still stays at the old scheduled timestamp. |
Updated by mfriedrich on 2015-09-28 13:09:10 +00:00
|
Updated by akrus on 2015-12-22 07:21:51 +00:00 I'm also experiencing the same issue here.
This results in checks being executes every 24 hours (retry interval doesn't work, while Icinga Web 2 shows it correctly). |
Updated by Peter_Story on 2016-01-14 16:02:47 +00:00 I'm also encountering this bug. I have a service with:
After the first soft failure, the subsequent check is scheduled to run after 10 minutes, instead of 30 seconds. |
Updated by arothauf on 2016-01-20 09:34:21 +00:00 Can confirm this still happening. |
Updated by mfriedrich on 2016-01-20 15:22:12 +00:00
It fairly should but we're all normal people, not machines. I found another issue while testing the original patch which took me quite a while to figure out why it was happening. Tests using the external command pipe
Now process check results using Icinga Web 2, from DOWN to UP and vice versa. Calling UpdateNextCheck() works as it should, it contains the correct next GetRetryInterval() values and also sends that to the database. There's one subsequent SetNextCheck() call afterwards which then overrides the value inside the core (and so the API and external interfaces).
We do ensure to reschedule a check when processing a check result from the external command pipe or api actions. This is due to the reason we require freshness checks using the check_interval (see #7071 for better proposals) for incoming passive check results. So not a good method for tests anyways. Temporarily removing the code blocks from externalcommandlistener.cpp proves the patch working. API tests
Manual recheck will cause ProcessCheckResult() to immediately update the next check using the retry interval. |
Updated by mfriedrich on 2016-01-20 15:30:03 +00:00
Applied in changeset a51e647. |
Updated by mfriedrich on 2016-01-22 16:12:08 +00:00
|
Updated by mfriedrich on 2016-01-22 16:31:21 +00:00
|
Updated by mfriedrich on 2016-01-22 17:37:02 +00:00
There's still one issue with passive check results - calling UpdateNextCheck() will first set a probably smaller next check time, and after invoking ProcessCheckResult() a separate call to SetNextCheck() is required to ensure the check_interval is used for passively received checks and their freshness. This merely influences all checks where we do not care about them whether being active or passive. It is also worse for active checks - the next check time is updated to check_interval if sent in as passive check result via API / UI. Moving that task inside ProcessCheckResult() ensures that subsequent calls to SetNextCheck() happen with the correct check interval for passive check results. The active checks are then rescheduled based on their current state, especially if changed from sending a check result enforcing a state change, e.g. HARD OK, or SOFT NOT-OK with retry_interval being required. Furthermore updating the next check time before sending a DB IDO update for OnNewCheckResult will also help reduce duplicated next_check queries (see linked #11019). |
Updated by mfriedrich on 2016-01-22 18:12:39 +00:00
Applied in changeset 2a11b27. |
Updated by mfriedrich on 2016-02-18 10:24:53 +00:00
|
Updated by mfriedrich on 2016-02-25 13:59:06 +00:00
|
Updated by mfriedrich on 2016-03-09 10:34:46 +00:00
|
Updated by mfriedrich on 2016-03-09 10:36:02 +00:00 The patch was broken in many ways, causing problems with actively scheduled checks, passive checks, etc. The original fix has been reverted, and a proper fix is discussed in #11336. |
This issue has been migrated from Redmine: https://dev.icinga.com/issues/7287
Created by wimmer on 2014-09-20 15:16:39 +00:00
Assignee: mfriedrich
Status: Resolved (closed on 2016-01-22 18:12:39 +00:00)
Target Version: 2.4.2
Last Update: 2016-03-09 10:36:02 +00:00 (in Redmine)
I found strange behavior of re-checks scheduling.
I use Icinga1 and Icinga2 Debian packages and I use the same attributes for checks:
When some service changes its status, Icinga1 works as expected:
But Icinga2 uses strange Attempt numbering [there is no 1/3 (SOFT state)]
and bad time for scheduling of first re-check (15:10 instead of 15:06).
It looks like Icinga2 misses "1/3 SOFT state" phase.
There is wrong scheduling in CRITICAL/HARD states too (compare with Icinga1 times):
Attachments
Changesets
2015-03-11 15:33:36 +00:00 by mfriedrich 05c237c
2015-08-21 08:24:49 +00:00 by mfriedrich 6f252bb
2016-01-20 15:29:01 +00:00 by mfriedrich a51e647
2016-01-22 17:40:14 +00:00 by mfriedrich 2a11b27
2016-02-23 08:24:57 +00:00 by mfriedrich 9ca7245
2016-03-05 17:15:03 +00:00 by mfriedrich b8e3d61
2016-03-05 17:16:49 +00:00 by mfriedrich ef532f2
2016-03-11 14:55:03 +00:00 by mfriedrich 8344f74
2016-03-11 14:55:14 +00:00 by mfriedrich f99feab
Relations:
The text was updated successfully, but these errors were encountered: