[dev.icinga.com #4494] reload while check executing causes skip of remaining check attempts #1320
Comments
Updated by mckslim on 2013-07-31 23:16:05 +00:00 And btw, Nagios 3.5.0 has the same problem.
Updated by mfriedrich on 2013-08-05 17:14:06 +00:00 can you attach the configuration sample for that specific host/service, as well as the relevant part from
and some debug logs for the specific time window of that check happening, including the reload.
Updated by mckslim on 2013-08-05 20:47:09 +00:00 ok, here's a bunch of info, let me know if you need something else:
- icinga.log:
- host defn, from 'objects.cache':
- 'status.dat', after the down;hard;1 status of the check after reload:
- 'retention.dat', right after doing the reload in the middle of attempt 3:
- 'retention.dat', after the check completed after the interim reload:
- 'icinga.debug':
Updated by mckslim on 2013-08-15 19:27:13 +00:00 hello, when might you get to fixing this?
Updated by mfriedrich on 2013-08-15 21:09:05 +00:00 there are various possibilities
i for myself don't have any spare time left for working on core issues in the near future.
Updated by mfriedrich on 2013-08-15 21:09:53 +00:00 furthermore, you're still running 1.8.4 - you should try to reproduce it with 1.9.3 too.
Updated by mckslim on 2013-08-21 17:26:46 +00:00 dnsmichi wrote:
The problem occurs on 1.9.3 too:
$ cat /opt/icinga/mgd/var/icinga.log | egrep -v 'STATE|EXT' | egrep 'HOST|SIG' | egrep 'stg-dms|SIG' | format_log_ts.pl | tail
Updated by mckslim on 2013-08-22 19:17:42 +00:00 Hello again, just wondering: how long might it be before someone on your Icinga team gets to work on this problem? thank you
Updated by mfriedrich on 2013-08-22 19:51:23 +00:00 we're individuals with private lives, and everyone shares his/her time across different tasks. i for myself won't look into it soon, unless you convince me that grabbing some beer on the weekend should be exchanged for icinga coding. for the rest - i don't know. we usually don't give dates or timelines for when issues will be worked on; it's hard enough to follow the roadmap and the issues linked there given the current low resources. maybe you could ask a company providing professional icinga support to fix the issue faster and provide a patch, which we would then test and push upstream.
Updated by mckslim on 2013-08-23 15:01:35 +00:00 Understood, and thanks to you and the others for working on these projects!
Updated by mfriedrich on 2013-08-23 15:15:23 +00:00 my guess would be that the sighandler for SIGHUP invokes the config re-read, and somewhere along the way the counter (the current_attempt struct attribute) gets overridden. that might happen when the object is re-created in common/objects.c, or through some weird behaviour in base/checks.c when the checkresult of your long-lasting check is finally handled, resetting it to 1 / HARD based on the former state.
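A minimal sketch of the reload flow being described - a hypothetical simplification, not the actual icinga source; reload_configuration() and process_pending_check_results() are stand-in names:

```c
#include <signal.h>

/* stand-ins for the real core routines - hypothetical names, not icinga API */
void reload_configuration(void);
void process_pending_check_results(void);

static volatile sig_atomic_t sighup_detected = 0;

/* async-signal-safe handler: only raises a flag */
static void sighup_handler(int sig) {
    (void)sig;
    sighup_detected = 1;
}

void event_loop(void) {
    signal(SIGHUP, sighup_handler);
    for (;;) {
        if (sighup_detected) {
            sighup_detected = 0;
            /* the config re-read: objects are destroyed and re-created
             * here; a checkresult arriving *after* this point is matched
             * against a freshly initialized host object, so any SOFT
             * attempt counter accumulated before the reload can be lost */
            reload_configuration();
        }
        process_pending_check_results();
    }
}
```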
Updated by mfriedrich on 2013-09-27 20:48:24 +00:00 hm, the base/checks.c checkresult handlers only mark passive host checks with current_attempt=1 ... but most likely adjust_host_check_attempt_3x() changes current_attempt to 1 before actually rescheduling the next check - and the checkresult from the long-lasting check is then parsed against that reset state. and it does not matter whether this is a normal async host check or a triggered on-demand check - both adjust the current attempt counter by calling adjust_host_check_attempt_3x(). some more verbose debug logs would likely help shed some light on the problem (though a check lasting 3+ minutes should imho be made a cronjob passing a passive checkresult back to the core).
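A paraphrased sketch of what adjust_host_check_attempt_3x() does with the counter - simplified from the 1.x core logic, not the verbatim base/checks.c source (the real function also takes an is_active flag; the struct and macro values here are trimmed illustrations):

```c
/* trimmed host struct - only the fields that matter here */
typedef struct {
    int current_state;    /* e.g. HOST_UP / HOST_DOWN */
    int state_type;       /* SOFT_STATE / HARD_STATE */
    int current_attempt;
    int max_attempts;
} host;

/* illustrative values - the real macros live in the core headers */
enum { HOST_UP = 0, SOFT_STATE = 0, HARD_STATE = 1 };

int adjust_host_check_attempt_3x(host *hst) {
    /* a host in a HARD state, or SOFT and UP, gets its counter reset... */
    if (hst->state_type == HARD_STATE ||
        (hst->state_type == SOFT_STATE && hst->current_state == HOST_UP))
        hst->current_attempt = 1;
    /* ...otherwise the attempt climbs toward max_attempts */
    else if (hst->current_attempt < hst->max_attempts)
        hst->current_attempt++;
    return 0;
}
```

This is why the reset matters: if the counter is adjusted before the long-lasting check's result comes back, that result is evaluated against current_attempt=1 instead of attempt 3 of 5.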
Updated by mckslim on 2013-09-30 22:12:04 +00:00 I reran with (let me know if you need more debug_level):
Here's the icinga.log with my added check info:
Here's the debug output, winnowed down to the relevant lines:
thank you!
Updated by mfriedrich on 2013-09-30 22:39:00 +00:00 so the adjustment does indeed happen before the actual restart.
given the above log entry, the reset had already happened before.
that one means it just incremented the check attempt like on every other check. what's puzzling me is
that would indicate that this host does not have any active checks enabled, but only receives passive checks - and that even before a reload happens. so this could've happened anywhere between the initial startup and the reload, i.e. via modified_attributes. i quote
conclusion: someone disabled active host checks at runtime, and something then triggers the forced active check - either manually, or via a freshness check (which seems disabled from the objects.cache pov)? passive checks are treated as HARD by default, and that explains the immediate SOFT->HARD transition and, further, the reset of the check attempt, since this is now effectively a passive-only check to which max_check_attempts does not apply. i'm not sure what you're trying to do here with the passive checks (is that intended at runtime? did someone fiddle something wrong?). passive_host_checks_are_soft=1 would change the default passive state type to SOFT, but i doubt it will solve the original issue. so, why are these modified attributes set, disabling active host checks at runtime?
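For reference, the option mentioned above is set in the main configuration file - a minimal illustration, with the file location assumed to be the usual one for your install:

```
# icinga.cfg (location varies per install)
# treat passive host checkresults as SOFT states instead of the default
# HARD, so they walk through the max_check_attempts ladder as well
passive_host_checks_are_soft=1
```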
Updated by mckslim on 2013-09-30 23:07:20 +00:00 I purposely disabled active checks before running these tests, just to control what was shown in the log - disregard the fact that active checks are disabled. I'm running the checks manually as needed.
Updated by mfriedrich on 2013-09-30 23:48:23 +00:00 i cannot disregard them - having the host checks disabled at runtime is what causes the observed behaviour, with the attempt counter reset on what is now a passive check. i cannot see a bug here anymore when you change the host check's behaviour on purpose.
Updated by mckslim on 2013-10-01 01:28:53 +00:00 I really don't think it becomes a passive check. I am running the check manually using the web interface's 'Re-schedule the next check of this host' command, which is not a passive check - just an active check run on my own schedule. Looking at 'base/checks.c', this appears to be the important code area:
would changing this line: fix the issue and work ok in general?
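For context, a paraphrased sketch of the passive-result behaviour under discussion - simplified types and a trimmed branch, not the verbatim base/checks.c source:

```c
/* trimmed types for illustration - real macros live in the core headers */
enum { HOST_CHECK_ACTIVE = 0, HOST_CHECK_PASSIVE = 1,
       SOFT_STATE = 0, HARD_STATE = 1 };

typedef struct {
    int check_type;        /* HOST_CHECK_ACTIVE / HOST_CHECK_PASSIVE */
    int state_type;        /* SOFT_STATE / HARD_STATE */
    int current_attempt;
} host;

int passive_host_checks_are_soft = 0;   /* icinga.cfg option, default off */

/* paraphrased: the passive branch inside the checkresult handler */
void handle_passive_host_result(host *hst) {
    if (hst->check_type == HOST_CHECK_PASSIVE &&
        !passive_host_checks_are_soft) {
        /* passive results bypass the SOFT attempt ladder: straight to
         * HARD with the counter reset - exactly the DOWN;HARD;1 jump
         * reported in this issue */
        hst->state_type = HARD_STATE;
        hst->current_attempt = 1;
    }
}
```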
Updated by mfriedrich on 2013-10-01 07:58:46 +00:00 no, no, and no. stop here. please answer the following questions
if 1) and 2) can be answered with yes, read on:
a) look into status.dat before doing anything. you will recognize modified_attributes=0 for that host. that in turn means that your host is now a passively checked host (which may still be forced to check, i.e. manually).
d) once you force the check, it will be handled as passive. that means your active host check was converted into a passive host check, behaving totally differently from before. the reload/restart has nothing to do with your problem now. you've got 2 options
make sure to clean retention.dat from modified_attributes=2 for that host, i.e. set it to modified_attributes=0 (a hedged illustration follows below)
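A hedged illustration of the retention.dat entry in question - the host block layout is the real format, but the fields shown are trimmed and the '...' stands for the many other retained fields:

```
host {
host_name=atl-stg-dmscol-01a
modified_attributes=2
active_checks_enabled=0
...
}
```

Editing it to modified_attributes=0 (and active_checks_enabled=1) should happen while the core is stopped, since the file is rewritten on shutdown and a live edit would be overwritten.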
conclusion: by changing the host check type from active to passive, you are changing the core's behaviour in handling the checkresults and state (counters). that is not a bug in the core, but a misconfiguration on your side.
Updated by mckslim on 2013-10-01 23:34:37 +00:00 Wow, that is interesting, thanks for the explanation. So, I guess it's working as designed. I'll probably try 1.9.3 now to verify it works the same there.
Updated by mfriedrich on 2013-10-03 02:12:55 +00:00 it does. and git next with upcoming 1.10.0 does too. if there's anything else, i'll reopen the issue then.
Updated by mfriedrich on 2014-10-24 22:25:55 +00:00
This issue has been migrated from Redmine: https://dev.icinga.com/issues/4494
Created by mckslim on 2013-07-31 22:10:24 +00:00
Assignee: (none)
Status: Rejected (closed on 2013-10-03 02:12:55 +00:00)
Target Version: (none)
Last Update: 2013-10-03 02:12:55 +00:00 (in Redmine)
Happens on 1.8.4 at least, don't know about other versions.
Looking at the log snippet here (max_attempts on this host check is 5):
[2013-07-31 00:41:31] HOST ALERT: atl-stg-dmscol-01a;UP;HARD;1;sup
[2013-07-31 00:41:51] HOST ALERT: atl-stg-dmscol-01a;DOWN;SOFT;1;sean check 30
[2013-07-31 00:42:41] HOST ALERT: atl-stg-dmscol-01a;DOWN;SOFT;2;sean check 30
check begin
[2013-07-31 00:43:18] Caught SIGHUP, restarting...
check end
[2013-07-31 00:43:48] HOST ALERT: atl-stg-dmscol-01a;DOWN;HARD;1;sean check 30
Note that after attempt 2, I run the check using the web interface (which starts the check script executing); before the check script ends, I reload icinga, and only after that does the check script finish.
Note that the next HOST ALERT shows 'DOWN;HARD;1' when it should be 'DOWN;SOFT;3'; this fires a notification before we really expect one to be generated, which causes extra grind for our ops people.
Can you do something to keep the above from prematurely going DOWN;HARD?
Relations: (none)