[dev.icinga.com #8670] Reloads causing high load #2725
Comments
Updated by mfriedrich on 2015-03-10 15:25:05 +00:00
What's "levitates"? Apart from that, attach the logs from what your monitoring core is doing at that time. And any additional information which would help the user outside your environment to understand your issue. With that little information provided I would rather start a discussion in the community support channels. Furthermore please re-test that against 2.3.0. |
Updated by TheSerapher on 2015-03-11 08:13:38 +00:00 Hey *, We are also seeing this on our current test setup. We have had 2.2.4 running for a while and have been seeing these incredible load spikes on reloads of the core. Now with 2.3 this is still happening. What we discovered so far:
If you need more details, please give some ideas on what logs to attach and whether you need the debug log enabled (warning: very high volume!). We'd like to see this fixed because, more often than not, a high-load node does not recover properly and does not run any checks at all. Restarting Icinga2 can fix this if the high-load phase passes properly. EDIT: Here is an overview of a partially recovered cluster of checkers that have been reloaded. Some nodes show an expected load, but one (icinga2-checker-1.dc1) has no load at all and does not run any checks:
The log of that node also shows no more output since the process got stuck. Restarting the node causes high load again, with a lot of checks running at the same time (see attached proc_ps.txt).
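For reference, the debug log mentioned above can be toggled per feature with the standard `icinga2` CLI; a minimal sketch (log path and service-manager command are the usual packaging defaults and may differ per distribution):

```shell
# Enable the high-volume debug log (written to /var/log/icinga2/debug.log
# by default), capture one reload spike, then disable it again.
icinga2 feature enable debuglog
service icinga2 reload           # reproduce the spike while debug.log records it
# ...once the spike has been captured:
icinga2 feature disable debuglog
service icinga2 reload
```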
Updated by TheSerapher on 2015-03-11 08:36:38 +00:00
File attachment for process list.
Updated by gtorre on 2015-03-11 14:39:08 +00:00 @dnsmichi that was an auto-correct, I meant "livestatus".
Updated by mfriedrich on 2015-03-11 14:42:52 +00:00 Hm, from an idea I had during sleep: the reload spawns a new child which does the configuration validation and then takes over. If that reload period takes a while (seconds to minutes in large environments) while the "normal" checks and other events are still being fired by the parent, I could imagine that this causes load peaks. Furthermore, the configuration validation has been improved with 2.3 and is now faster (much more parallel, but also taking more resources). It would probably help to see graphs where load peaks and reload times are put together.
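The graphs suggested above need timestamped load samples to line up with reload times. A minimal POSIX shell sketch that records the 1-minute load average once per second; matching these timestamps against the reload entries in the Icinga 2 log (the default path /var/log/icinga2/icinga2.log is an assumption here) shows whether peaks follow reloads:

```shell
#!/bin/sh
# Sample the 1-minute load average with a UTC timestamp, once per second.
# Redirect the output to a file and correlate it with the "reloading"
# entries in the Icinga 2 main log to see whether load peaks follow reloads.
samples=${1:-3}                  # number of samples to take (default: 3)
i=0
while [ "$i" -lt "$samples" ]; do
    # the first whitespace-separated field of /proc/loadavg is the 1-min average
    load=$(cut -d' ' -f1 /proc/loadavg)
    printf '%s %s\n' "$(date -u +%Y-%m-%dT%H:%M:%SZ)" "$load"
    sleep 1
    i=$((i + 1))
done
```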
Updated by TheSerapher on 2015-03-11 14:55:03 +00:00
dnsmichi wrote:
That is interesting. So from what I read, the main process keeps running as before and executes the regular checks, while a new child does the validation of the config and then takes over from the parent? I have yet another checker that just seems to have stopped working (no load and no checks fired). Its log only shows:
And nothing after that. I attached the gdb.txt for this node. Now let's check the load figures:
No load at all, as expected, since this node isn't doing anything anymore. According to your idea, this node should continue to have no load: even though we are spawning a new process, the checks have stopped in the parent:
The reload indeed took a few seconds, with the new process hanging at 100% CPU usage for a few seconds. But since the parent has not run any checks, we did not see any change in load. Good! Now to restart this node. We kill the process and start it again:
Instant spike in load! Not only that, but we can also see a lot of checks being run simultaneously again. I am not sure if this is intended (flush the state cache and re-establish service and host states by checking them all?), but I'd think that a checker should only run checks when triggered by a master via the API? Or is our (single) master sending all these requests to run checks? If you need any more details, let me know. EDIT: Load is now back to normal and checks are running:
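When a checker wedges like this (no load, no log output), a backtrace of all threads is usually the most useful attachment. A sketch of how such a gdb.txt can be produced non-interactively, assuming `gdb` and the icinga2 debug symbols are installed:

```shell
# Attach gdb to the running icinga2 process in batch mode, dump a
# backtrace of every thread into gdb.txt, then detach automatically
# (the process is only paused briefly, not killed).
pid=$(pidof -s icinga2)
gdb -batch -p "$pid" -ex 'thread apply all bt' > gdb.txt 2>&1
```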
Updated by gtorre on 2015-03-11 14:55:37 +00:00
We run the icinga reload from cron every 10 minutes. I've attached logs showing what time the reload script kicked off. Note: I only provided the timestamps when Icinga reloaded and there were configuration changes in our repo.
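Since that cron job reloads on a fixed schedule, one possible refinement (a hypothetical crontab entry; the binary and service paths are distribution defaults) is to validate the configuration first and reload only on success, so a broken config never triggers the expensive reload path:

```shell
# m   h  dom mon dow  command
*/10  *  *   *   *    /usr/sbin/icinga2 daemon --validate >/dev/null 2>&1 && /usr/sbin/service icinga2 reload
```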
Updated by TheSerapher on 2015-03-11 15:10:09 +00:00 Small update: all nodes seem to be running normally right now:
Updated by TheSerapher on 2015-03-12 14:42:08 +00:00
Scratch my previous comment, it happened on the hardware box too. We stopped the Icinga2 core and started it again, and the system went to 100% CPU usage on all 24 cores. Load spiked to 70 but was, obviously, dealt with much faster than on a 4-core node. Then I tried deleting the icinga2.state file by stopping the service, removing the file and starting it again. Same behaviour (which I rather expected). Then I stopped the service, not touching the state file, and started again. Same thing. For starters, it would be good to know why Icinga spawns that many processes when starting up, even though there seems to be no reason to do that on a checker, which should wait for the masters to tell it what to check. Edit: Still seen on 2.3.1/2.3.2
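The stop/clear-state/start experiment described above can be scripted so individual runs are comparable; a sketch assuming systemd and the default state-file location /var/lib/icinga2/icinga2.state:

```shell
# Reproduce the restart spike (clearing the state file made no difference here)
systemctl stop icinga2
# rm -f /var/lib/icinga2/icinga2.state   # optional: start from a clean state
systemctl start icinga2
# Record the load and the number of icinga2 processes for one minute
for i in $(seq 12); do
    printf '%s load=%s procs=%s\n' "$(date +%T)" \
        "$(cut -d' ' -f1 /proc/loadavg)" "$(pgrep -c icinga2)"
    sleep 5
done
```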
Updated by gtorre on 2015-03-12 15:50:26 +00:00 So there might have been something external causing this high load; we have no idea. Someone might have been running something against the box which caused this. We haven't seen the crazy spikes since my last post. If it happens again I'll update the thread.
Updated by TheSerapher on 2015-03-13 08:56:18 +00:00 It's nothing external for us; these loads happen with pretty much every reload and restart.
Updated by TheSerapher on 2015-03-16 12:24:15 +00:00 I have checked whether the checker nodes are sending any of these results to the master, but it doesn't look like they do. Once startup is done and the regular checks start running, we can see results coming in to the master and being stored in the IDO. I think it would be good if there were a way for a checker node to not produce these excessive checks during startup. If you need any further help working on this, let me know; I can run snapshots with live data.
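One way to double-check whether the startup burst reaches the master is to query the IDO database directly while a checker restarts; a sketch against the IDO MySQL schema (the database name, user, and 60-second window are assumptions):

```shell
# Count service check results written to the IDO during the last minute;
# run repeatedly on the master while a checker node is restarted.
mysql -u icinga -p icinga -e \
  "SELECT COUNT(*) AS recent_results
     FROM icinga_servicestatus
    WHERE last_check > NOW() - INTERVAL 60 SECOND;"
```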
Updated by mfriedrich on 2015-03-19 10:01:04 +00:00
Updated by mfriedrich on 2015-06-23 13:26:17 +00:00
Updated by mfriedrich on 2015-08-06 20:34:37 +00:00
Updated by gtorre on 2016-01-21 15:12:58 +00:00 We've been experiencing the same issue for the last few months. Is there anything we can do to help with the investigation process?
Updated by mfriedrich on 2016-02-25 00:23:15 +00:00 Re-test with 2.4.3.
Updated by mnardin on 2016-02-29 17:27:09 +00:00 I can confirm this behavior as well running 2.4.3.
Updated by mfriedrich on 2016-03-04 15:30:34 +00:00
Updated by kunsjef on 2016-04-20 10:37:56 +00:00 We have the same issue in version r2.4.4-1. One master, 4 checkers. All checkers have 8 CPU cores and 16 GB RAM. We have 2,500 hosts and 8,000 services. All checkers spike when we reload, and the load is in the thousands before the checkers start crashing. We can't even start Icinga2 with all our hosts monitored. If we reduce the number to 600-700 hosts, the spikes last just for a short while and then the load normalizes.
Updated by gbeutner on 2016-05-11 06:46:33 +00:00
This might have been solved by fixing #8137. Please re-test with the current master.
Updated by mfriedrich on 2016-08-08 15:49:54 +00:00
Please re-test with the current snapshot packages and/or the release packages once 2.5 is finally released. Although I do believe that #8137 should already tackle your issue.
Updated by saurabh_hirani on 2016-08-10 09:14:27 +00:00 Thanks for the productive discussion on this issue, everyone. I am seeing similar scenarios: when we do an icinga2 reload, the validation processes cause a major load spike (going from 2 to 11 on a 12-core machine). I will keep this thread posted with observations and steps taken.
Updated by mfriedrich on 2016-11-09 14:59:54 +00:00
This issue has been migrated from Redmine: https://dev.icinga.com/issues/8670
Created by gtorre on 2015-03-10 15:15:02 +00:00
Assignee: (none)
Status: Closed (closed on 2016-08-08 15:49:54 +00:00)
Target Version: (none)
Last Update: 2016-11-09 14:59:54 +00:00 (in Redmine)
We recently upgraded to 2.2.4, shortly after we started seeing extremely high load (load average: 527.16, 277.33, 173.62) when Icinga2 reloads. There are about 4,200 nodes in production, with about 94,000 checks running. We have the following Icinga2 features enabled: checker, command, compatlog, icingastatus, levitates, and notification.
Attachments