This repository has been archived by the owner on Jan 15, 2019. It is now read-only.

[dev.icinga.com #7277] reload while check executing causes immediate DOWN;HARD, skipping remaining attempts #1519

Closed
icinga-migration opened this issue Sep 18, 2014 · 21 comments

Comments

@icinga-migration

This issue has been migrated from Redmine: https://dev.icinga.com/issues/7277

Created by mckslim on 2014-09-18 20:03:13 +00:00

Assignee: (none)
Status: New
Target Version: Backlog
Last Update: 2015-08-04 18:59:49 +00:00 (in Redmine)

Icinga Version: 1.11.4
OS Version: Red Hat Enterprise Linux Server release 5.10 (Tikanga)

Here is an example:
[2014-09-16 07:56:11] HOST ALERT: P00000008057091;DOWN;SOFT;14;CRITICAL: Connection timed out to '50.73.129.57' after 160 seconds (user 'ibm_hlth_chk'). Expected prompt not found. Last output was ''. Attempt 14 of 19. . Traceroute completed, increased max_attempts to 19.
(another check starts in here)
[2014-09-16 08:01:59] Caught SIGHUP, restarting...
(check completes in here)
[2014-09-16 08:05:17] Event loop started...
[2014-09-16 08:05:18] HOST ALERT: P00000008057091;DOWN;HARD;19;CRITICAL: Connection timed out to '50.73.129.57' after 160 seconds (user 'ibm_hlth_chk'). Expected prompt not found. Last output was ''. Attempt 15 of 19. . Traceroute completed, increased max_attempts to 19.
[2014-09-16 08:05:18] HOST NOTIFICATION: soc;P00000008057091;DOWN;cpe_outage_ticket;CRITICAL: Connection timed out to 50.73.129.57 after 160 seconds (user ibm_hlth_chk). Expected prompt not found. Last output was . Attempt 15 of 19. . Traceroute completed, increased max_attempts to 19.

See that the check before the reload (SIGHUP) was attempt 14 of 19.
And see that the check result after the reload was DOWN;HARD;19, so it went HARD on attempt 15 before ever getting to attempt 19.
Why is this happening? It causes premature notifications, which increases the number of problems we have to process.


Relations:

@icinga-migration

Updated by mckslim on 2014-09-29 15:34:51 +00:00

Is anyone out there able to look at my problem?

@icinga-migration

Updated by mfriedrich on 2014-10-24 22:24:16 +00:00

Sorry, but the core dev team is currently bound to Icinga 2 and its upcoming 2.2 release. It won't help much to ask when someone will look into it - currently no one will, so you might give it a go on your own within the source code.

I suspect that #4494 is a similar issue, which was rejected a while ago as working as designed.

@icinga-migration

Updated by mfriedrich on 2014-10-24 22:25:54 +00:00

  • Relates set to 4494

@icinga-migration

Updated by mfriedrich on 2014-10-24 22:26:23 +00:00

  • Relates set to 3899

@icinga-migration

Updated by mfriedrich on 2014-10-24 22:27:26 +00:00

  • Relates set to 6480

@icinga-migration

Updated by mckslim on 2014-10-24 23:09:05 +00:00

dnsmichi wrote:

Sorry, but the core dev team is currently bound to Icinga 2 and its upcoming 2.2 release. It won't help much to ask when someone will look into it - currently no one will, so you might give it a go on your own within the source code.

I suspect that #4494 is a similar issue, which was rejected a while ago as working as designed.

Thanks for the reply. We'll be patient and hope you can get to this bug in the future.
On #4494, that is the same basic issue. In that case I was putting the host into passive mode to try to recreate the problem on the fly more easily, and you said it was working as designed there. But this 7277 report is about the issue under normal processing, not passive checking, so it is more valid as a real bug.

I see the other bugs you linked as possibly related. I don't know if I can figure things out from there. It would certainly be great if any of you Icinga developers could still figure it out, of course, when possible.
Thanks again.

@icinga-migration

Updated by mfriedrich on 2014-10-24 23:34:46 +00:00

I currently have no idea how to fix that race condition, since the 1.x reload isn't really one compared to how apache2 or icinga2 does it. A SIGHUP signal stops the entire process (event queue), re-reads the configuration, and then starts to execute checks again. If the currently executing check in its child process returns within that window, it writes the checkresult to the spooldir, where the reaper then picks it up at a defined polling interval. Meanwhile the reload operation has to finish first, so the Icinga core has no information at that point that the previously scheduled check has already returned. My guess is that it's impossible to store/control what to do with child processes still running during a reload - killing them in favor of the reload action isn't an option, and neither is waiting for them to return (imagine long-running checks).
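
For illustration, here is a minimal sketch of that restart behaviour, assuming only what is described above; the function bodies and names are placeholders, not the actual 1.x source:

#include <signal.h>
#include <stdio.h>
#include <unistd.h>

static volatile sig_atomic_t sigrestart = 0;

static void sighandler(int sig) {
        if (sig == SIGHUP)
                sigrestart = 1;   /* only a flag is set; running check children are untouched */
}

/* placeholder stubs standing in for the real core functions */
static void read_main_config(void) {
        puts("re-reading configuration ...");   /* the multi-minute gap on large configs */
}

static void event_execution_loop(void) {
        /* the real loop runs timed events (checks, checkresult reaper, ...) and
           bails out once sigrestart is set; here we simply wait for the signal */
        while (sigrestart == 0)
                pause();
}

int main(void) {
        signal(SIGHUP, sighandler);
        for (;;) {
                sigrestart = 0;
                read_main_config();
                event_execution_loop();
                puts("Caught SIGHUP, restarting...");
                /* check children started before the SIGHUP keep running and drop
                   their results into the spool dir; the core only sees them once
                   the next event loop (and reaper event) is running again */
        }
}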

Having said that, the code inherited from Nagios as a fork has many flaws which are either unfixable because they are broken by design, or remain bugs for a long time because they require long-term debugging and the fixes are not trivial in that scope. That's something at least I am not willing to do in my spare time. Icinga 2 is more fun to play with, and that's it for now. The reload in there is not broken, and works in a reasonably parallel manner. Since you requested feedback twice, I thought I'd give you an update (it's nearly 2am here), but in the end my todo list says otherwise for the next weeks. I won't take this bug into account for 1.12 either, unless someone provides a fully tested patch in the next 2 weeks.

@icinga-migration

Updated by mfriedrich on 2014-10-26 20:01:39 +00:00

One thought on the reload and checkresults:

  • a SIGHUP in icinga 1.x (or nagios) is handled as an async signal, setting the global variable "sigrestart" to true. Basically the same as what happens when sending RESTART_PROCESS through the external command pipe.
  • that global variable is read in several places; it prevents further event execution, for instance in base/events.c
  • if the check result reaper is currently running (a cyclic event that is called from the event queue as well)
        case EVENT_CHECK_REAPER:

                log_debug_info(DEBUGL_EVENTS, 0, "** Check Result Reaper\n");

                /* reap host and service check results */
                reap_check_results();
                break;

it will break out of its operation when sigrestart == true

int reap_check_results(void) {
...
        /* read all check results that have come in... */
        while ((queued_check_result = read_check_result())) {
...
                /* bail out if we encountered a signal */
                if (sigshutdown == TRUE || sigrestart == TRUE) {
                        log_debug_info(DEBUGL_CHECKS, 0, "Breaking out of check result reaper: signal encountered\n");
                        break;
                }
        }

        log_debug_info(DEBUGL_CHECKS, 0, "Finished reaping %d check results\n", reaped_checks);
        log_debug_info(DEBUGL_FUNCTIONS, 0, "reap_check_results() end\n");

        return OK;
}

Generally speaking, that function first calls process_check_result_queue() to read all files into core memory. process_check_result_file() reads each file, and add_check_result_to_list(new_cr); puts it into the list. That list is then processed by the reaper.

If a restart signal comes in, processing is interrupted. The checkresult reaper will only continue its operation with the next reaping event, which means that the list may have grown with new checkresults by then.
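
For reference, a hypothetical and much simplified sketch of that producing side; the function names follow the ones mentioned above, while the parsing, the spool path and the error handling are made up for illustration:

#include <dirent.h>
#include <stdio.h>
#include <stdlib.h>

typedef struct check_result {
        /* the real struct carries host_name, return_code, output, check times, ... */
        struct check_result *next;
} check_result;

static check_result *check_result_list = NULL;

static void add_check_result_to_list(check_result *new_cr) {
        new_cr->next = check_result_list;       /* ordering simplified for the sketch */
        check_result_list = new_cr;
}

static void process_check_result_file(const char *path) {
        check_result *new_cr = calloc(1, sizeof(*new_cr));
        if (new_cr == NULL)
                return;
        /* the real core parses the checkresult file found at 'path' here */
        (void)path;
        add_check_result_to_list(new_cr);
}

static int process_check_result_queue(const char *dirname) {
        DIR *dir = opendir(dirname);
        struct dirent *de;
        int queued = 0;
        if (dir == NULL)
                return 0;
        while ((de = readdir(dir)) != NULL) {
                if (de->d_name[0] == '.')
                        continue;               /* skip "." and ".." */
                char path[4096];
                snprintf(path, sizeof(path), "%s/%s", dirname, de->d_name);
                process_check_result_file(path);
                queued++;                       /* limited by max_check_result_list_items in the real core */
        }
        closedir(dir);
        return queued;
}

int main(void) {
        /* example spool path only */
        printf("queued %d check results\n",
               process_check_result_queue("/var/spool/icinga/checkresults"));
        return 0;
}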

There are some settings that could limit this (a combined example follows the list):

  • check_result_reaper_frequency=10 (the default of 10 seconds could be too long; try decreasing the value to 5 or even 1 second)
  • max_check_result_file_age=3600 (if this is set to a few seconds instead of the default of 1 hour, the core will remove old checkresults from memory)
  • max_check_result_list_items=1024 (this setting should be used to read checkresults in batches, which ensures that only a certain number of checkresult files is read into core memory in process_check_result_queue())
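
Put together, an icinga.cfg excerpt along those lines (the values are only illustrative, tune them for your environment):

check_result_reaper_frequency=5
max_check_result_file_age=60
max_check_result_list_items=1024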

I'm not sure if that's the real issue, but it could be a problem in terms of processing checkresults the proper way. Icinga 1.x cannot be changed to read checkresults into memory and directly update the host status. That's one of the reasons for Icinga 2, where this is properly implemented.

For further insights, please attach your icinga.cfg and a debug log with at least the following levels enabled:

#          4 = Process information
#          8 = Scheduled events
#          16 = Host/service checks
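
Combined, that corresponds to debug_level=28 (4 + 8 + 16) in icinga.cfg.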

Please also add the following details:

  • Icinga version and how it was installed
  • how many checks (hosts/services per 5m, for example)
  • any installed addons and event broker modules
  • system health

@icinga-migration

Updated by mfriedrich on 2014-10-26 20:03:14 +00:00

  • Category set to Check Results
  • Status changed from New to Feedback
  • Assigned to set to mckslim

@icinga-migration

Updated by mckslim on 2014-10-29 21:10:57 +00:00

OK, I enabled the debug log with this in 'icinga.cfg':

# for debugging https://dev.icinga.org/issues/7277
# 4 = Process information
# 8 = Scheduled events
# 16 = Host/service checks
debug_level=28

I have a debug log with the problem in it; what do you want to see?

and ...

  • Icinga version: 1.11.4 (also see the ticket info above)
  • How installed: umm, compiled on our systems and installed using normal methods; not sure what you are looking for on this
  • 6000 host / 98000 service checks; lots happening every 5m, I'd guess 5,000-10,000 checks in this timeframe
  • modules used: idomod, mod_gearman

System health:
top - 21:09:43 up 3 days, 20:17, 1 user, load average: 0.46, 0.55, 0.70
Tasks: 429 total, 2 running, 427 sleeping, 0 stopped, 0 zombie
Cpu(s): 3.5%us, 1.2%sy, 0.0%ni, 94.9%id, 0.1%wa, 0.0%hi, 0.3%si, 0.0%st
Mem: 41157752k total, 8310168k used, 32847584k free, 1610464k buffers
Swap: 4192956k total, 0k used, 4192956k free, 4542864k cached

@icinga-migration

Updated by mckslim on 2014-10-29 21:16:25 +00:00

Here are relevant logs of the problem. See the attempt numbers before the reload (which occurs at 16:02), and then the numbers after the reload in the debug log - the attempts jump to the max for some reason, so the state type goes HARD and notifications fire.

icinga.log:
...
[2014-10-29 15:55:47] HOST ALERT: SA0000000738235;DOWN;SOFT;1;CRITICAL: Connection timed out to '77.221.97.61' after 160 seconds (user 'ibm_hlth_chk'). Expected prompt not found. Last output was ''. Attempt 1 of 4.
[2014-10-29 15:57:25] HOST ALERT: P00000008021520;DOWN;SOFT;9;CRITICAL: Connection timed out to '10.2.58.230' after 160 seconds (user 'ibm_hlth_chk'). Expected prompt not found. Last output was ''. Attempt 9 of 19. . Traceroute completed, increased max_attempts to 19.
...
(16:02:03 reload, 16:05:14 event loop started)
...
[2014-10-29 16:05:15] HOST ALERT: SA0000000738235;DOWN;HARD;4;CRITICAL: Connection timed out to '77.221.97.61' after 160 seconds (user 'ibm_hlth_chk'). Expected prompt not found. Last output was ''. Attempt 2 of 4.
[2014-10-29 16:05:15] HOST NOTIFICATION: soc;SA0000000738235;DOWN;cpe_outage_ticket;CRITICAL: Connection timed out to 77.221.97.61 after 160 seconds (user ibm_hlth_chk). Expected prompt not found. Last output was . Attempt 2 of 4.
...
[2014-10-29 16:05:17] HOST ALERT: P00000008021520;DOWN;HARD;19;CRITICAL: Connection timed out to '10.2.58.230' after 160 seconds (user 'ibm_hlth_chk'). Expected prompt not found. Last output was ''. Attempt 10 of 19. . Traceroute completed, increased max_attempts to 19.
[2014-10-29 16:05:17] HOST NOTIFICATION: soc;P00000008021520;DOWN;cpe_outage_ticket;CRITICAL: Connection timed out to 10.2.58.230 after 160 seconds (user ibm_hlth_chk). Expected prompt not found. Last output was . Attempt 10 of 19. . Traceroute completed, increased max_attempts to 19.
...

icinga.debug.log:
...
[2014-10-29 15:55:47.393591] [016.1] [pid=6205] HOST: SA0000000738235, ATTEMPT=1/4, CHECK TYPE=ACTIVE, STATE TYPE=HARD, OLD STATE=0, NEW STATE=1
[2014-10-29 15:55:47.393595] [016.1] [pid=6205] Host was UP.
[2014-10-29 15:55:47.393598] [016.1] [pid=6205] Host is now DOWN/UNREACHABLE.
[2014-10-29 15:55:47.393609] [016.1] [pid=6205] Pre-handle_host_state() Host: SA0000000738235, Attempt=1/4, Type=SOFT, Final State=1
[2014-10-29 15:55:47.393794] [016.1] [pid=6205] Post-handle_host_state() Host: SA0000000738235, Attempt=1/4, Type=SOFT, Final State=1
...
[2014-10-29 15:57:25.744547] [016.1] [pid=6205] HOST: P00000008021520, ATTEMPT=9/19, CHECK TYPE=ACTIVE, STATE TYPE=SOFT, OLD STATE=1, NEW STATE=1
[2014-10-29 15:57:25.744551] [016.1] [pid=6205] Host was DOWN/UNREACHABLE.
[2014-10-29 15:57:25.744554] [016.1] [pid=6205] Host is still DOWN/UNREACHABLE.
[2014-10-29 15:57:25.744559] [016.1] [pid=6205] Pre-handle_host_state() Host: P00000008021520, Attempt=9/19, Type=SOFT, Final State=1
[2014-10-29 15:57:25.744736] [016.1] [pid=6205] Post-handle_host_state() Host: P00000008021520, Attempt=9/19, Type=SOFT, Final State=1
...
(reload at 16:02:03)
[2014-10-29 16:02:03.403018] [004.0] [pid=6205] Original UID/GID: 611/611
...
[2014-10-29 16:05:14.102254] [008.1] [pid=6205] Event Check Loop
...
[2014-10-29 16:05:15.411340] [016.1] [pid=6205] HOST: SA0000000738235, ATTEMPT=4/4, CHECK TYPE=ACTIVE, STATE TYPE=SOFT, OLD STATE=1, NEW STATE=1
[2014-10-29 16:05:15.411343] [016.1] [pid=6205] Host was DOWN/UNREACHABLE.
[2014-10-29 16:05:15.411346] [016.1] [pid=6205] Host is still DOWN/UNREACHABLE.
[2014-10-29 16:05:15.411349] [016.1] [pid=6205] Pre-handle_host_state() Host: SA0000000738235, Attempt=4/4, Type=HARD, Final State=1
[2014-10-29 16:05:15.922831] [016.1] [pid=6205] Post-handle_host_state() Host: SA0000000738235, Attempt=4/4, Type=HARD, Final State=1
...
[2014-10-29 16:05:17.055666] [016.1] [pid=6205] HOST: P00000008021520, ATTEMPT=19/19, CHECK TYPE=ACTIVE, STATE TYPE=SOFT, OLD STATE=1, NEW STATE=1
[2014-10-29 16:05:17.055670] [016.1] [pid=6205] Host was DOWN/UNREACHABLE.
[2014-10-29 16:05:17.055673] [016.1] [pid=6205] Host is still DOWN/UNREACHABLE.
[2014-10-29 16:05:17.055676] [016.1] [pid=6205] Pre-handle_host_state() Host: P00000008021520, Attempt=19/19, Type=HARD, Final State=1
[2014-10-29 16:05:17.590007] [016.1] [pid=6205] Post-handle_host_state() Host: P00000008021520, Attempt=19/19, Type=HARD, Final State=1
...

@icinga-migration

Updated by mfriedrich on 2014-10-29 23:42:40 +00:00

100k service checks, wow. I wasn't aware that a single 1.x core could handle that, not even with mod_gearman (our benchmarks show that 80k is already a problem in terms of latency).

Regarding the installation - there are different ways to do it in 1.x (source, packages), so I am interested in a possible way of testing stuff (be it patches, or trying to reproduce things). I can't access your systems, so I'm trying to gather as much information as possible.

In terms of the debug log, I'd generally like to see it (you probably want to obfuscate things; if not, send it to michael.friedrich(at)netways.de instead of attaching it here). It's hard to describe what I am looking for, or what's inside.

But mod_gearman brings a new problem into play - how the checks are being fed into the core. mod_gearman does that in-memory by merging the checkresult list (which is unsafe, as you cannot control how many checkresults are stashed into the list in memory).

A reload must not clear the core's memory - nor drop the checkresult list which is still to be processed - but that list may grow quite a lot.

I could be mistaken and this is a general problem - but then it would be seen in normal installations too, which it isn't. Maybe only under heavy load.

Looking at your reload time -

(16:02:03 reload, 16:05:14 event loop started)

That's 191 seconds where the core does nothing. No check processing, no scheduling. As far as I can see, mod_gearman starts its operation directly again, but waits for the checkresult timed event and stays idle before merging back any checkresults, while the remaining old checks fill up the gearman checkresult queue in the meantime. It would be interesting to know how many are waiting during that reload window.

Other than that, I'd be interested in the mod_gearman version. If the problem is located there, the developer might want to know as well. Plus the gearmand job server version in use. Testing without mod_gearman sounds reasonable, but not realistic...

(off-topic question - when are you considering migrating to Icinga 2? Your environment would fit a single instance running all checks by itself perfectly fine, and I could help you better with debugging it.)

@icinga-migration

Updated by mckslim on 2014-10-30 20:09:42 +00:00

On our installation, we built from source.


On the debug.log, it is 300 MB for 30 minutes of data - what part would you want? I'll assume the 10 minutes around the reload time and just send that for now, probably in a separate email if I need to.


gearman versions:
$ /opt/gearman/sbin/gearmand -V
gearmand 1.0.2 - https://bugs.launchpad.net/gearmand

$ /opt/mod_gearman/bin/gearman_top -V
gearman_top: version 1.4.2

$ /opt/mod_gearman/bin/mod_gearman_worker -V
mod_gearman_worker: version 1.4.2 running on libgearman 1.0.2


On the reload time, that is a while, yes; we suppose it's because we have so many hosts/services in the config.


On migrating to Icinga 2, we haven't thought much about that yet; we will consider it next year.

@icinga-migration

Updated by mckslim on 2014-10-31 16:21:06 +00:00

FYI, I sent this file to 'michael.friedrich@netways.de' yesterday:
icinga.debug.201410291620.part.gz

@icinga-migration

Updated by mfriedrich on 2014-10-31 16:30:06 +00:00

Seems your company does not like the content - I got a rejection mail:

InterScan Messaging Security Suite has detected a message which violated the security settings configured in Policy > Scanning Exceptions. 

@icinga-migration

Updated by mckslim on 2014-10-31 17:09:04 +00:00

I tried sending it again.

@icinga-migration

Updated by mfriedrich on 2015-02-14 16:03:49 +00:00

Sorry, I lost track a bit here, and did not receive any debug log. Since your checks are actively scheduled, it does not compare well to the "passive host checks are soft" config option I had in mind when looking at a customer's problem lately which looked similar.

I'd say you really must consider looking into Icinga 2. Especially when 2.3 is out, imagine what you can do with it...
https://www.icinga.org/2015/01/13/new-features-in-icinga-2-3/
https://www.icinga.org/2015/02/11/icinga-2-using-functions-in-custom-attributes/

Apart from that, try the Vagrant boxes. I upgraded them to CentOS 7 last weekend.
https://github.com/icinga/icinga-vagrant

@icinga-migration

Updated by berk on 2015-05-18 12:18:09 +00:00

  • Target Version set to Backlog

@icinga-migration

Updated by mfriedrich on 2015-08-04 18:59:49 +00:00

  • Status changed from Feedback to New
  • Assigned to deleted mckslim

There's not much I can do about it, and further feedback probably won't help much (other than debugging it directly on the box, obviously). I'm leaving this open for the time being in case others run into the same problem.

@icinga-migration icinga-migration added this to the Backlog milestone Jan 17, 2017
@dnsmichi

You might want to try this patch: NagiosEnterprises/nagioscore@d318789

@dnsmichi dnsmichi removed this from the Backlog milestone Dec 19, 2017
@dnsmichi

This was fixed in d2cf765 already.
