[dev.icinga.com #7277] reload while check executing causes immediate DOWN;HARD, skipping remaining attempts #1519
Comments
Updated by mckslim on 2014-09-29 15:34:51 +00:00 Anyone out there to look at my problem?
Updated by mfriedrich on 2014-10-24 22:24:16 +00:00 Sorry, but the core dev team is currently bound to Icinga 2 and its upcoming 2.2 release. Asking when someone will look into it won't help much - currently no one will, so you might give it a go yourself in the source code. I suspect that #4494 is a similar issue, which was rejected a while ago as working as designed.
Updated by mfriedrich on 2014-10-24 22:25:54 +00:00
Updated by mfriedrich on 2014-10-24 22:26:23 +00:00
Updated by mfriedrich on 2014-10-24 22:27:26 +00:00
Updated by mckslim on 2014-10-24 23:09:05 +00:00 dnsmichi wrote:
Thanks for the reply. We'll be patient, hoping you can get to this bug in the future. I see the other bugs you linked as possibly related, but I don't know if I can figure things out from there. It would certainly be great if one of the Icinga developers could still figure it out, when possible.
Updated by mfriedrich on 2014-10-24 23:34:46 +00:00 I currently have no idea how to fix that race condition, since the 1.x reload isn't really a reload compared to how apache2 or icinga2 handle it. A SIGHUP signal stops the entire process (event queue), re-reads the configuration, and then starts executing checks again. If a currently executing check in its child process returns within that small time window, it writes the checkresult to the spooldir, from where the reaper fetches results at a defined polling interval. Meanwhile the reload operation must finish first, so at that point the Icinga core has no information that the previously scheduled check returned OK.

My guess is that it's impossible to store/control what to do with child processes still running during a reload - killing them in favor of the reload action isn't an option, and neither is waiting for them to return (imagine long-lasting checks).

Having said that, the code inherited from the Nagios fork has many flaws which are either unfixable because they are broken by design, or remain bugs for a long time because they require long-term debugging and the fixes are not trivial in scope. That's something at least I am not willing to do in my spare time. Icinga 2 is more fun to play with, and that's it for now. The reload there is not broken, and works in a reasonably parallel manner.

Since you requested feedback twice, I thought I'd give you an update (it's nearly 2am here), but in the end my todo list says otherwise for the next weeks. I won't take this bug into account for 1.12 either, unless someone provides a fully tested patch in the next 2 weeks.
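The reload sequence described in the comment above can be sketched roughly like this (an illustrative model, not the actual core code - function names and bodies here are my own simplification):

```c
#include <signal.h>

/* Rough sketch of the 1.x reload flow: SIGHUP only sets a flag; the event
 * loop notices it, stops scheduling, and the process re-reads its
 * configuration. A check child forked before the reload keeps running and
 * later writes its result file into the spool directory, where the reaper
 * picks it up only after the reload has finished. */

static volatile sig_atomic_t sigrestart = 0;

static void core_sighandler(int sig) {
    if (sig == SIGHUP)
        sigrestart = 1;    /* request a restart; acted on in the event loop */
}

/* One iteration of the event loop: returns 0 when a restart was requested,
 * 1 when it handled (or would handle) the next timed event. */
int event_execution_loop_step(void) {
    if (sigrestart)
        return 0;          /* break out: tear down, re-read config, restart */
    /* ...pop the next timed event here, possibly fork() a check child... */
    return 1;
}
```

The race window is everything that happens between the flag being set and the new event loop starting: the child process is not tracked across that boundary.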
Updated by mfriedrich on 2014-10-26 20:01:39 +00:00 One thought on the reload and checkresults:
The checkresult reaper will break out of its operation when sigrestart == true.
Generally speaking, that function calls process_check_result_queue() first to read all files into core memory. process_check_result_file() reads each file, and add_check_result_to_list(new_cr) puts it into the list. That list is then processed by the reaper. If a restart signal comes in, processing is interrupted, and the checkresult reaper will continue its operation only with the next checkresult-reaping event. This means that the list might grow longer with new checkresults in the meantime. There could be some limiting settings here:
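To make the reaper flow concrete, here is a toy model of the behaviour described above (the names follow the 1.x sources, but the bodies are illustrative):

```c
#include <stddef.h>
#include <signal.h>

/* Toy model: results read from the spool dir sit in an in-memory list; as
 * soon as sigrestart is set, processing stops and the remaining entries
 * stay queued until the reaper runs again after the reload. */

static volatile sig_atomic_t sigrestart = 0;

typedef struct check_result {
    struct check_result *next;
} check_result;

/* Head of the in-memory list, filled by process_check_result_queue(). */
static check_result *result_list = NULL;

/* Walk the list; returns the number of results handled before interruption. */
int reap_check_results(void) {
    int reaped = 0;
    while (result_list != NULL) {
        if (sigrestart)
            break;                  /* reload pending: leave the rest queued */
        check_result *cr = result_list;
        result_list = cr->next;
        /* handle_async_check_result(cr) would update host/service state here */
        reaped++;
    }
    return reaped;
}
```

Everything left on the list when the break triggers is processed only by a later reaper run, with whatever state the core has after the reload.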
I'm not sure if that's the real issue, but it could be a problem in terms of processing checkresults in the proper way. Icinga 1.x cannot be changed to read checkresults into memory and directly update the hoststatus. That's one of the reasons for Icinga 2 where this is properly implemented. For further insights, please attach your icinga.cfg and a debug log with at least the following levels enabled:
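For orientation, debug logging in 1.x is configured in icinga.cfg; the following is an illustrative excerpt (which exact debug_level bits were meant here is my assumption - -1 simply enables every category):

```
# illustrative icinga.cfg excerpt; debug_level is a bitmask, -1 = everything
debug_file=/var/log/icinga/icinga.debug
debug_level=-1
debug_verbosity=2
max_debug_file_size=1000000000
```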
and add the following details
Updated by mfriedrich on 2014-10-26 20:03:14 +00:00
Updated by mckslim on 2014-10-29 21:10:57 +00:00 OK, I enabled the debug log with this in 'icinga.log':
I have a debug log with the problem in it; what part do you want to see? and ...
System health:
Updated by mckslim on 2014-10-29 21:16:25 +00:00 Here are the relevant logs of the problem. See the attempt numbers before the reload (which occurs at 16:02), and then the numbers after the reload in the debug log - the attempts jump to the max for some reason, so the state type goes HARD and notifications fire. icinga.log: icinga.debug.log:
Updated by mfriedrich on 2014-10-29 23:42:40 +00:00 100k service checks. Wow. I wasn't aware that a single 1.x core can handle that, not even with mod_gearman (our benchmarks show that 80k is already a problem in terms of latency).

Regarding the installation - there are different ways to do it in 1.x (source, packages), so I am interested in a possible way of testing things (be it patches, or trying to reproduce the problem). I can't access your systems, so I'm trying to fetch as much information as possible. As for the debug log, I'd like to see it in general (you might want to obfuscate things; if not, send it to michael.friedrich(at)netways.de instead of attaching it here). It's hard to describe what I am looking for, or what's inside.

But mod_gearman brings a new problem into play - how the checks are fed back into the core. mod_gearman does that in-memory by merging the checkresult list (which is unsafe, as you cannot control how many checkresults are stashed into the in-memory list). A reload must not clear the core's memory - and with it the checkresult list still to be processed - but that list may grow quite a lot. I could be mistaken, and it could be a general problem - but then it would be seen in ordinary installations, which it is not. Maybe only under heavy load.

Looking at your reload time -

That's 191 seconds where the core does nothing. No check execution, no scheduling. As far as I can see, mod_gearman starts its operation directly again, but waits for the checkresult timed event and stays idle before merging back any checkresults, while the remaining old checks fill up the gearman checkresult queue in the meantime. It would be interesting to know how many are waiting during that reload window.

Other than that, I'd be interested in the mod_gearman version. If the problem is located there, the developer might want to know as well. Plus the gearmand job server version in use. Testing without mod_gearman sounds reasonable, but not realistic...

(offtopic question - when are you considering migrating to icinga2? your environment would fit a single instance running all checks by itself perfectly fine, and I could help you better with debugging it)
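Roughly what "merging the checkresult list in memory" means (illustrative names, not mod_gearman's actual code): a broker module splices its own collected results onto the core's in-memory list instead of going through spool files, and nothing bounds how long the combined list can grow.

```c
#include <stddef.h>

/* Illustrative sketch: a module's collected results are spliced onto the
 * core's in-memory checkresult list. There is no upper bound on the length
 * of the combined list. */

typedef struct check_result {
    struct check_result *next;
} check_result;

/* Append mod_list onto core_list; returns the new head of the core list. */
check_result *merge_check_result_lists(check_result *core_list,
                                       check_result *mod_list) {
    if (core_list == NULL)
        return mod_list;
    check_result *tail = core_list;
    while (tail->next != NULL)
        tail = tail->next;
    tail->next = mod_list;
    return core_list;
}
```

During a 191-second reload window, everything the workers finish keeps accumulating and is merged back in one burst afterwards.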
Updated by mckslim on 2014-10-30 20:09:42 +00:00 On our installation, we built from source. On the debug log, it is 300 MB for 30 minutes of data - what part do you want? I'll assume the 10 minutes around the reload time, and just send that for now, probably in a separate email if needed. gearman versions: $ /opt/mod_gearman/bin/gearman_top -V $ /opt/mod_gearman/bin/mod_gearman_worker -V On the reload time, that is a while, yes; we suppose it's because we have so many hosts/services in the config. On migrating to icinga2, we haven't thought much about that yet; we will consider it next year.
Updated by mckslim on 2014-10-31 16:21:06 +00:00 FYI, I sent this file to 'michael.friedrich@netways.de' yesterday:
Updated by mfriedrich on 2014-10-31 16:30:06 +00:00 It seems your company's mail server does not like the content - I got a rejection mail.
Updated by mckslim on 2014-10-31 17:09:04 +00:00 I tried sending it again.
Updated by mfriedrich on 2015-02-14 16:03:49 +00:00 Sorry, I lost track a bit here, and did not receive any debug log. Since your checks are actively scheduled, this does not compare well to the "passive host checks are soft" config option I had in mind when looking at a similar-looking customer problem lately. I'd say you really must consider looking into Icinga 2. Especially when 2.3 is out - imagine what you can do with it... Apart from that, try the Vagrant boxes. I upgraded them to CentOS 7 last weekend.
Updated by berk on 2015-05-18 12:18:09 +00:00
Updated by mfriedrich on 2015-08-04 18:59:49 +00:00
There's not much I can do about it, and further feedback possibly won't help much (other than debugging it directly on the box, obviously). I'm leaving this open for the time being in case others run into the same problem.
You might want to try this patch: NagiosEnterprises/nagioscore@d318789 |
This was fixed in d2cf765 already. |
This issue has been migrated from Redmine: https://dev.icinga.com/issues/7277
Created by mckslim on 2014-09-18 20:03:13 +00:00
Assignee: (none)
Status: New
Target Version: Backlog
Last Update: 2015-08-04 18:59:49 +00:00 (in Redmine)
Here is an example:
[2014-09-16 07:56:11] HOST ALERT: P00000008057091;DOWN;SOFT;14;CRITICAL: Connection timed out to '50.73.129.57' after 160 seconds (user 'ibm_hlth_chk'). Expected prompt not found. Last output was ''. Attempt 14 of 19. . Traceroute completed, increased max_attempts to 19.
(another check starts in here)
[2014-09-16 08:01:59] Caught SIGHUP, restarting...
(check completes in here)
[2014-09-16 08:05:17] Event loop started...
[2014-09-16 08:05:18] HOST ALERT: P00000008057091;DOWN;HARD;19;CRITICAL: Connection timed out to '50.73.129.57' after 160 seconds (user 'ibm_hlth_chk'). Expected prompt not found. Last output was ''. Attempt 15 of 19. . Traceroute completed, increased max_attempts to 19.
[2014-09-16 08:05:18] HOST NOTIFICATION: soc;P00000008057091;DOWN;cpe_outage_ticket;CRITICAL: Connection timed out to 50.73.129.57 after 160 seconds (user ibm_hlth_chk). Expected prompt not found. Last output was . Attempt 15 of 19. . Traceroute completed, increased max_attempts to 19.
See that the check before the reload (SIGHUP) was attempt 14 of 19.
And see that the check result after the reload was DOWN;HARD;19 - it went HARD on attempt 15, before ever reaching attempt 19.
Why is this happening? It causes premature notifications, which increases the number of problems we have to process.
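The SOFT/HARD decision itself is straightforward; the open question is why current_attempt lands on max_attempts after the reload. In essence (a simplified model, not the actual core code):

```c
#define SOFT_STATE 0
#define HARD_STATE 1

/* Simplified model of the core's state-type rule: a non-OK result stays
 * SOFT until the attempt counter reaches max_attempts, then goes HARD,
 * which is when notifications fire. If a reload somehow pushes
 * current_attempt up to max_attempts, the very next non-OK result is
 * immediately HARD - the premature DOWN;HARD;19 seen in the log above. */
int state_type_for_attempt(int current_attempt, int max_attempts) {
    return (current_attempt >= max_attempts) ? HARD_STATE : SOFT_STATE;
}
```

With max_attempts 19, attempt 15 should still be SOFT; the log shows the counter itself being wrong after the reload, not the rule.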
Relations: